Tools for reshaping data into tidy format.
Pivoting refers to the process of changing the interpretation of each row in a data frame. It is useful for addressing problems 1 - 2 in the previous lecture, which we repeat here for completeness.
To address (a), we can use the pivot_longer function in tidyr. It takes an implicitly stored variable and explicitly stores it in a column defined by the names_to argument. In
The example below shows pivot_longer being used to tidy one of the non-tidy tuberculosis datasets. Note that the data has doubled in length, because there are now two rows per country (one per year).
For reference, these are the original data.
table4a# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766This step lengthens the data,
table4a_longer <- table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")
table4a_longer# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766# helper function, to avoid copying and pasting code
pivot_fun <- function(x, value_column = "cases") {
  x %>%
    pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = value_column)
}
table4 <- left_join(
  pivot_fun(table4a), # look for all country x year combinations in left table
  pivot_fun(table4b, "population") # and find matching rows in right table
)
table4# A tibble: 6 × 4
  country     year   cases population
  <chr>       <chr>  <dbl>      <dbl>
1 Afghanistan 1999     745   19987071
2 Afghanistan 2000    2666   20595360
3 Brazil      1999   37737  172006362
4 Brazil      2000   80488  174504898
5 China       1999  212258 1272915272
6 China       2000  213766 1280428583This lets us make the year vs. rate plot that we had tried to put together in the last lecture. It’s much easier to recognize trends when comparing the rates, than when looking at the raw case counts.
ggplot(table4, aes(x = year, y = cases / population, col = country)) +
  geom_point() +
  geom_line(aes(group = country))
To address (b), we can use the pivot_wider function. It spreads the column
in the values_from argument across new columns specified by the names_from
argument.
The example below shows pivot_wider being used to tidy one of the other non-tidy datasets. Note when there are more than two levels in the names_from column, this will always be wider than the starting data frame, which is why this operation is called pivot_wider.
For reference, here is table2 before pivoting.
table2# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583Now, we spread the cases and population variables into their own columns.
table2 %>%
    pivot_wider(names_from = type, values_from = count)# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583