STAT 436 (Spring 2023): Deriving Variables

Kris Sankaran

library(tidyverse)

It’s easiest to define visual encodings when the variables we want to encode are contained in their own columns. After sketching out a visualization of interest, we may find that these variables are not explicitly represented among the columns of the raw dataset. In this case, we may need to derive them based on what is available. The dplyr and tidyr packages provide functions for deriving new variables, which we review in these notes.
Sometimes a single column is used to implicitly store several variables. To make the data tidy, separate can be used to split that single column into several columns, each of which corresponds to exactly one variable.
The block below separates our earlier table3, which stored rate as a fraction in a character column. The original table was,

table3

# A tibble: 6 × 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

and the separated version is,

table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE) # try without convert, and compare the data types of the columns

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Note that this function has an inverse, called unite, which can merge several columns into one. This is sometimes useful, but not as often as separate, since it isn’t needed to tidy a dataset.
Separating a single column into several is a special case of a more general operation, mutate, which defines new columns as functions of existing ones. We have used this is in previous lectures, but now we can philosophically justify it: the variables we want to encode need to be defined in advance.
For example, we may want to create a column rate that includes cases over population,

table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE) %>%
  mutate(rate = cases / year)

# A tibble: 6 × 5
  country      year  cases population    rate
  <chr>       <int>  <int>      <int>   <dbl>
1 Afghanistan  1999    745   19987071   0.373
2 Afghanistan  2000   2666   20595360   1.33 
3 Brazil       1999  37737  172006362  18.9  
4 Brazil       2000  80488  174504898  40.2  
5 China        1999 212258 1272915272 106.   
6 China        2000 213766 1280428583 107.

Sometimes, the variables of interest are functions of several rows. For example, perhaps we want to visualize averages of a variable across age groups. In this case, we can derive a summary across groups of rows using the group_by-followed-by-summarise pattern.
For example, perhaps we want the average rate over both years.

table3 %>% 
  separate(rate, into = c("cases", "population"), convert = TRUE) %>%
  mutate(rate = cases / year) %>%
  group_by(country) %>%
  summarise(avg_rate = mean(rate))

# A tibble: 3 × 2
  country     avg_rate
  <chr>          <dbl>
1 Afghanistan    0.853
2 Brazil        29.6  
3 China        107.

Deriving Variables

Citation