Using separate
, mutate
, and summarise
to derive new variables for
downstream visualization.
It’s easiest to define visual encodings when the variables we want to encode are contained in their own columns. After sketching out a visualization of interest, we may find that these variables are not explicitly represented among the columns of the raw dataset. In this case, we may need to derive them based on what is available. The dplyr
and tidyr
packages provide functions for deriving new variables, which we review in these notes.
Sometimes a single column is used to implicitly store several variables. To make the data tidy, separate
can be used to split that single column into several columns, each of which corresponds to exactly one variable.
The block below separates our earlier table3
, which stored rate as a
fraction in a character column. The original table was,
table3
# A tibble: 6 x 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
and the separated version is,
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE) # try without convert, and compare the data types of the columns
# A tibble: 6 x 4
country year cases population
<chr> <dbl> <int> <int>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Note that this function has an inverse, called unite
, which can merge
several columns into one. This is sometimes useful, but not as often as
separate
, since it isn’t needed to tidy a dataset.
Separating a single column into several is a special case of a more general
operation, mutate
, which defines new columns as functions of existing ones. We
have used this is in previous lectures, but now we can philosophically justify
it: the variables we want to encode need to be defined in advance.
For example, we may want to create a column rate
that includes cases over
population,
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE) %>%
mutate(rate = cases / year)
# A tibble: 6 x 5
country year cases population rate
<chr> <dbl> <int> <int> <dbl>
1 Afghanistan 1999 745 19987071 0.373
2 Afghanistan 2000 2666 20595360 1.33
3 Brazil 1999 37737 172006362 18.9
4 Brazil 2000 80488 174504898 40.2
5 China 1999 212258 1272915272 106.
6 China 2000 213766 1280428583 107.
Sometimes, the variables of interest are functions of several rows. For
example, perhaps we want to visualize averages of a variable across age groups.
In this case, we can derive a summary across groups of rows using the
group_by
-followed-by-summarise
pattern.
For example, perhaps we want the average rate over both years.
table3 %>%
separate(rate, into = c("cases", "population"), convert = TRUE) %>%
mutate(rate = cases / year) %>%
group_by(country) %>%
summarise(avg_rate = mean(rate))
# A tibble: 3 x 2
country avg_rate
<chr> <dbl>
1 Afghanistan 0.853
2 Brazil 29.6
3 China 107.
For attribution, please cite this work as
Sankaran (2024, Jan. 7). STAT 436 (Spring 2024): Deriving Variables. Retrieved from https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week02-03/
BibTeX citation
@misc{sankaran2024deriving, author = {Sankaran, Kris}, title = {STAT 436 (Spring 2024): Deriving Variables}, url = {https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week02-03/}, year = {2024} }