Chapter 12 - Tidy Data
Load the libraries needed for these exercises.
library(tidyverse)
12.2 - Tidy Data
Problem 1
Using prose, describe how the variables and observations are organised in each of the sample tables.
table1
is the ‘tidy’ dataset: each variable has its own column, each observation has its own row, and each value has its own cell:
table1
## # A tibble: 6 x 4
## country year cases population
## <chr> <int> <int> <int>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2
combines cases
and population
into one column called type
, this means that each variable does not have its own column, and that each observation spans multiple rows:
table2
## # A tibble: 12 x 4
## country year type count
## <chr> <int> <chr> <int>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
In table3
the variable rate
violates a tidy principle, with multiple values contained in a cell, which also means that each variable does not have its own column:
table3
## # A tibble: 6 x 3
## country year rate
## * <chr> <int> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
table4a
and table4b
separate cases and population into their own tables across years, with multiple observations in each row:
table4a
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 745 2666
## 2 Brazil 37737 80488
## 3 China 212258 213766
table4b
## # A tibble: 3 x 3
## country `1999` `2000`
## * <chr> <int> <int>
## 1 Afghanistan 19987071 20595360
## 2 Brazil 172006362 174504898
## 3 China 1272915272 1280428583
Problem 2
Compute the rate for table2
, and table4a + table4b
. You will need to perform four operations:
- Extract the number of TB cases per country per year.
- Extract the matching population per country per year.
- Divide cases by population, and multiply by 10000.
- Store back in the appropriate place.
Which representation is easiest to work with? Which is hardest? Why?
NOTE these exercises demonstrate the difficulties of working with non-tidy data, the methods to come later in this chapter will greatly simplify the below code.
Create a tidy
version of table2
by filtering type
into two tables and using the dplyr
full_join()
function to recreate table1
table2a <- table2 %>%
filter(type == 'cases') %>%
select(country, year, cases = count)
table2b <- table2 %>%
filter(type == 'population') %>%
select(country, year, population = count)
full_join(table2a, table2b) %>%
mutate(rate = cases / population * 10000)
## Joining, by = c("country", "year")
## # A tibble: 6 x 5
## country year cases population rate
## <chr> <int> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 0.373
## 2 Afghanistan 2000 2666 20595360 1.29
## 3 Brazil 1999 37737 172006362 2.19
## 4 Brazil 2000 80488 174504898 4.61
## 5 China 1999 212258 1272915272 1.67
## 6 China 2000 213766 1280428583 1.67
Use similar logic on table4a
and table4b
- this ends up being a bit more work as the data are already stored across two tables:
table4a_1 <- table4a %>%
mutate(year = 1999) %>%
select(country, year, cases = `1999`)
table4a_2 <- table4a %>%
mutate(year = 2000) %>%
select(country, year, cases = `2000`)
table4b_1 <- table4b %>%
mutate(year = 1999) %>%
select(country, year, population = `1999`)
table4b_2 <- table4b %>%
mutate(year = 2000) %>%
select(country, year, population = `2000`)
bind_rows(table4a_1, table4a_2) %>%
full_join(bind_rows(table4b_1, table4b_2)) %>%
mutate(rate = cases / population * 10000)
## Joining, by = c("country", "year")
## # A tibble: 6 x 5
## country year cases population rate
## <chr> <dbl> <int> <int> <dbl>
## 1 Afghanistan 1999 745 19987071 0.373
## 2 Brazil 1999 37737 172006362 2.19
## 3 China 1999 212258 1272915272 1.67
## 4 Afghanistan 2000 2666 20595360 1.29
## 5 Brazil 2000 80488 174504898 4.61
## 6 China 2000 213766 1280428583 1.67
Problem 3
Recreate the plot showing change in cases over time using table2
instead of table1
. What do you need to do first?
Filter type
so that only cases
is plotted:
table2 %>%
filter(type == 'cases') %>%
ggplot(aes(year, count)) +
geom_line(aes(group = country), colour = "grey50") +
geom_point(aes(colour = country))
12.3 - Spreading and Gathering
Problem 1
Why are gather()
and spread()
not perfectly symmetrical? Carefully consider the following example:
stocks <- tibble(
year = c(2015, 2015, 2016, 2016),
half = c( 1, 2, 1, 2),
return = c(1.88, 0.59, 0.92, 0.17)
)
stocks
## # A tibble: 4 x 3
## year half return
## <dbl> <dbl> <dbl>
## 1 2015 1 1.88
## 2 2015 2 0.59
## 3 2016 1 0.92
## 4 2016 2 0.17
stocks %>%
spread(year, return) %>%
gather("year", "return", `2015`:`2016`)
## # A tibble: 4 x 3
## half year return
## <dbl> <chr> <dbl>
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 1 2016 0.92
## 4 2 2016 0.17
(Hint: look at the variable types and think about column names.) Both spread()
and gather()
have a convert
argument. What does it do?
In the above example gather()
and spread()
are not perfectly symmetrical as year
is converted from a numeric to a character variable. Use the convert
argument to automatically run type.convert()
on the key
column:
stocks %>%
spread(year, return, convert = TRUE) %>%
gather("year", "return", `2015`:`2016`, convert = TRUE)
## # A tibble: 4 x 3
## half year return
## <dbl> <int> <dbl>
## 1 1 2015 1.88
## 2 2 2015 0.59
## 3 1 2016 0.92
## 4 2 2016 0.17
Problem 2
Why does this code fail?
table4a %>%
gather(1999, 2000, key = "year", value = "cases")
## Error in inds_combine(.vars, ind_list): Position must be between 0 and n
Be sure to use backticks to include a nonstandard variable name within a tibble
:
table4a %>%
gather(`1999`, `2000`, key = "year", value = "cases")
## # A tibble: 6 x 3
## country year cases
## <chr> <chr> <int>
## 1 Afghanistan 1999 745
## 2 Brazil 1999 37737
## 3 China 1999 212258
## 4 Afghanistan 2000 2666
## 5 Brazil 2000 80488
## 6 China 2000 213766
Problem 3
Why does spreading this tibble
fail? How could you add a new column to fix the problem?
people <- tribble(
~name, ~key, ~value,
#-----------------|--------|------
"Phillip Woods", "age", 45,
"Phillip Woods", "height", 186,
"Phillip Woods", "age", 50,
"Jessica Cordero", "age", 37,
"Jessica Cordero", "height", 156
)
Spreading the given tibble
will fail because rows 1 and 3 are identical observations. Add a count
variable to fix the problem:
people$count <- c(1,1,2,1,1)
people %>%
spread(key, value)
## # A tibble: 3 x 4
## name count age height
## <chr> <dbl> <dbl> <dbl>
## 1 Jessica Cordero 1 37 156
## 2 Phillip Woods 1 45 186
## 3 Phillip Woods 2 50 NA
Problem 4
Tidy the simple tibble
below. Do you need to spread
or gather
it? What are the variables?
preg <- tribble(
~pregnant, ~male, ~female,
"yes", NA, 10,
"no", 20, 12
)
The given tibble
is not tidy as one variable (sex
) is spread across multiple columns. Use gather()
:
preg %>%
gather(male:female, key = sex, value = count)
## # A tibble: 4 x 3
## pregnant sex count
## <chr> <chr> <dbl>
## 1 yes male NA
## 2 no male 20
## 3 yes female 10
## 4 no female 12
12.4 - Spreading and Uniting
Problem 1
What do the extra
and fill
arguments do in separate()
? Experiment with the various options for the following two toy datasets.
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Additional pieces discarded in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f
## 3 h i j
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"))
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [2].
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e <NA>
## 3 f g i
extra
controls what happens when separate()
results in too many pieces. In the first example, the second row appears to have an extra observation, which is dropped by default. Using extra = 'merge'
will preserve the value:
tibble(x = c("a,b,c", "d,e,f,g", "h,i,j")) %>%
separate(x, c("one", "two", "three"), extra = 'merge')
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 d e f,g
## 3 h i j
fill
controls what happens when separate()
results in not enough pieces. In the second example, the second row appears to be missing an observation, which will be filled from the right be default. Using fill = 'left'
will fill from left instead.
tibble(x = c("a,b,c", "d,e", "f,g,i")) %>%
separate(x, c("one", "two", "three"), fill = 'left')
## # A tibble: 3 x 3
## one two three
## <chr> <chr> <chr>
## 1 a b c
## 2 <NA> d e
## 3 f g i
Problem 2
Both unite()
and separate()
have a remove
argument. What does it do? Why would you set it to FALSE
?
remove
will drop the original input column from the data frame. Set it to FALSE
in order to keep it in the data:
tibble(x = c("a,b,c", "d,e,f", "g,h,i")) %>%
separate(x, c("one", "two", "three"), remove = FALSE)
## # A tibble: 3 x 4
## x one two three
## <chr> <chr> <chr> <chr>
## 1 a,b,c a b c
## 2 d,e,f d e f
## 3 g,h,i g h i
Problem 3
Compare and contrast separate()
and extract()
. Why are there three variations of separation (by position, by separator, and with groups), but only one unite?
separate()
will create columns using either a position or a separator, while extract()
will create columns using a regular expression groups. Consider the differences in the following:
df <- data.frame(x = c(NA, "a-b", "a-d", "b-c", "d-e"))
df %>% separate(x, c("A", "B"))
## A B
## 1 <NA> <NA>
## 2 a b
## 3 a d
## 4 b c
## 5 d e
df %>% extract(x, c("A", "B"), "([a-d]+)-([a-d]+)")
## A B
## 1 <NA> <NA>
## 2 a b
## 3 a d
## 4 b c
## 5 <NA> <NA>
There is only one variation of unite()
since it is a many to one mapping. The arguments passed to unite()
will always be concatenated to single result.