Tools for reshaping data into tidy format.
Pivoting refers to the process of changing the interpretation of each row in a data frame. It is useful for addressing problems 1 - 2 in the previous lecture, which we repeat here for completeness.
To address (a), we can use the pivot_longer
function in tidyr
. It takes an implicitly stored variable and explicitly stores it in a column defined by the names_to
argument. In
The example below shows pivot_longer
being used to tidy one of the non-tidy tuberculosis datasets. Note that the data has doubled in length, because there are now two rows per country (one per year).
For reference, these are the original data.
table4a
# A tibble: 3 x 3
country `1999` `2000`
* <chr> <int> <int>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
This step lengthens the data,
table4a_longer <- table4a %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
table4a_longer
# A tibble: 6 x 3
country year cases
<chr> <chr> <int>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
# helper function, to avoid copying and pasting code
pivot_fun <- function(x, value_column = "cases") {
x %>%
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = value_column)
}
table4 <- left_join(
pivot_fun(table4a), # look for all country x year combinations in left table
pivot_fun(table4b, "population") # and find matching rows in right table
)
table4
# A tibble: 6 x 4
country year cases population
<chr> <chr> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
This lets us make the year vs. rate plot that we had tried to put together in the last lecture. It’s much easier to recognize trends when comparing the rates, than when looking at the raw case counts.
ggplot(table4, aes(x = year, y = cases / population, col = country)) +
geom_point() +
geom_line(aes(group = country))
To address (b), we can use the pivot_wider
function. It spreads the column in the values_from
argument across new columns specified by the names_from
argument.
The example below shows pivot_wider
being used to tidy one of the other non-tidy datasets. Note when there are more than two levels in the names_from
column, this will always be wider than the starting data frame, which is why this operation is called pivot_wider
.
For reference, here is table2
before pivoting.
table2
# A tibble: 12 x 4
country year type count
<chr> <int> <chr> <int>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
Now, we spread the cases
and population
variables into their own columns.
table2 %>%
pivot_wider(names_from = type, values_from = count)
# A tibble: 6 x 4
country year cases population
<chr> <int> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583