Tidy Data Example

An extended example of tidying a real-world dataset.


Author

Affiliation

Kris Sankaran

UW Madison

Published

Jan. 7, 2024

Citation

Sankaran, 2024


Reading, Recording, Rmarkdown

library(tidyverse)
library(skimr)
  1. Let’s work through the example of tidying a WHO dataset. This was discussed in the reading and is good practice in pivoting and deriving new variables.

  2. The raw data, along with a summary from the skimr package, is shown below (notice the small multiples!).

who
# A tibble: 7,240 x 60
   country     iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534
   <chr>       <chr> <chr> <dbl>       <dbl>        <dbl>        <dbl>
 1 Afghanistan AF    AFG    1980          NA           NA           NA
 2 Afghanistan AF    AFG    1981          NA           NA           NA
 3 Afghanistan AF    AFG    1982          NA           NA           NA
 4 Afghanistan AF    AFG    1983          NA           NA           NA
 5 Afghanistan AF    AFG    1984          NA           NA           NA
 6 Afghanistan AF    AFG    1985          NA           NA           NA
 7 Afghanistan AF    AFG    1986          NA           NA           NA
 8 Afghanistan AF    AFG    1987          NA           NA           NA
 9 Afghanistan AF    AFG    1988          NA           NA           NA
10 Afghanistan AF    AFG    1989          NA           NA           NA
# i 7,230 more rows
# i 53 more variables: new_sp_m3544 <dbl>, new_sp_m4554 <dbl>,
#   new_sp_m5564 <dbl>, new_sp_m65 <dbl>, new_sp_f014 <dbl>,
#   new_sp_f1524 <dbl>, new_sp_f2534 <dbl>, new_sp_f3544 <dbl>,
#   new_sp_f4554 <dbl>, new_sp_f5564 <dbl>, new_sp_f65 <dbl>,
#   new_sn_m014 <dbl>, new_sn_m1524 <dbl>, new_sn_m2534 <dbl>,
#   new_sn_m3544 <dbl>, new_sn_m4554 <dbl>, new_sn_m5564 <dbl>, ...
skim(who)
Table 1: Data summary
Name who
Number of rows 7240
Number of columns 60
_______________________
Column type frequency:
character 3
numeric 57
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 4 52 0 219 0
iso2 34 1 2 2 0 218 0
iso3 0 1 3 3 0 219 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 1996.56 9.83 1980 1988.00 1997.0 2005.00 2013 <U+2587><U+2587><U+2587><U+2587><U+2587>
new_sp_m014 4067 0.44 83.71 316.14 0 0.00 5.0 37.00 5001 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sp_m1524 4031 0.44 1015.66 4885.38 0 9.00 90.0 502.00 78278 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sp_m2534 4034 0.44 1403.80 5718.39 0 14.00 150.0 715.50 84003 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sp_m3544 4021 0.44 1315.88 6003.26 0 13.00 130.0 583.50 90830 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sp_m4554 4017 0.45 1103.86 5441.06 0 12.00 102.0 440.00 82921 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sp_m5564 4022 0.44 800.70 4418.31 0 8.00 63.0 279.00 63814 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sp_m65 4031 0.44 682.82 4089.14 0 8.00 53.0 232.00 70376 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sp_f014 4066 0.44 114.33 504.63 0 1.00 7.0 50.75 8576 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sp_f1524 4046 0.44 826.11 3552.02 0 7.00 66.0 421.00 53975 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sp_f2534 4040 0.44 917.30 3580.15 0 9.00 84.0 476.25 49887 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sp_f3544 4041 0.44 640.43 2542.51 0 6.00 57.0 308.00 34698 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sp_f4554 4036 0.44 445.78 1799.23 0 4.00 38.0 211.00 23977 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sp_f5564 4045 0.44 313.87 1381.25 0 3.00 25.0 146.50 18203 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sp_f65 4043 0.44 283.93 1267.94 0 4.00 30.0 129.00 21339 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sn_m014 6195 0.14 308.75 1727.25 0 1.00 9.0 61.00 22355 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sn_m1524 6210 0.14 513.02 3643.27 0 2.00 15.5 102.00 60246 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sn_m2534 6218 0.14 653.69 3430.03 0 2.00 23.0 135.50 50282 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sn_m3544 6215 0.14 837.87 8524.53 0 2.00 19.0 132.00 250051 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sn_m4554 6213 0.14 520.79 3301.70 0 2.00 19.0 127.50 57181 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sn_m5564 6219 0.14 448.62 3488.68 0 2.00 16.0 101.00 64972 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sn_m65 6220 0.14 460.36 3991.90 0 2.00 20.5 111.75 74282 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sn_f014 6200 0.14 291.95 1647.30 0 1.00 8.0 58.00 21406 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sn_f1524 6218 0.14 407.90 2379.13 0 1.00 12.0 89.00 35518 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sn_f2534 6224 0.14 466.26 2272.86 0 2.00 18.0 103.25 28753 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sn_f3544 6220 0.14 506.59 5013.53 0 1.00 11.0 82.25 148811 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sn_f4554 6222 0.14 271.16 1511.72 0 1.00 10.0 76.75 23869 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sn_f5564 6223 0.14 213.39 1468.62 0 1.00 8.0 56.00 26085 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_sn_f65 6221 0.14 230.75 1597.70 0 1.00 13.0 74.00 29630 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_ep_m014 6202 0.14 128.61 460.14 0 0.00 6.0 55.00 7869 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_ep_m1524 6214 0.14 158.30 537.74 0 1.00 11.0 88.00 8558 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_ep_m2534 6220 0.14 201.23 764.05 0 1.00 13.0 124.00 11843 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_ep_m3544 6216 0.14 272.72 3381.41 0 1.00 10.5 91.25 105825 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_ep_m4554 6220 0.14 108.11 380.61 0 1.00 8.5 63.25 5875 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_ep_m5564 6225 0.14 72.17 234.55 0 1.00 7.0 46.00 3957 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_ep_m65 6222 0.14 78.94 227.34 0 1.00 10.0 55.00 3061 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_ep_f014 6208 0.14 112.89 446.55 0 0.00 5.0 50.00 6960 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_ep_f1524 6219 0.14 149.17 543.89 0 1.00 9.0 78.00 7866 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_ep_f2534 6219 0.14 189.52 761.79 0 1.00 12.0 95.00 10759 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_ep_f3544 6219 0.14 241.70 3218.50 0 1.00 9.0 77.00 101015 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_ep_f4554 6223 0.14 93.77 339.33 0 1.00 8.0 56.00 6759 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_ep_f5564 6223 0.14 63.04 212.95 0 1.00 6.0 42.00 4684 <U+2587><U+2581><U+2581><U+2581><U+2581>
new_ep_f65 6226 0.14 72.31 202.72 0 0.00 10.0 51.00 2548 <U+2587><U+2581><U+2581><U+2581><U+2581>
newrel_m014 7050 0.03 538.18 2082.18 0 5.00 32.5 210.00 18617 <U+2587><U+2581><U+2581><U+2581><U+2581>
newrel_m1524 7058 0.03 1489.51 6848.18 0 17.50 171.0 684.25 84785 <U+2587><U+2581><U+2581><U+2581><U+2581>
newrel_m2534 7057 0.03 2139.72 7539.87 0 25.00 217.0 1091.00 76917 <U+2587><U+2581><U+2581><U+2581><U+2581>
newrel_m3544 7056 0.03 2036.40 7847.94 0 24.75 208.0 851.25 84565 <U+2587><U+2581><U+2581><U+2581><U+2581>
newrel_m4554 7056 0.03 1835.07 8324.28 0 19.00 175.0 688.50 100297 <U+2587><U+2581><U+2581><U+2581><U+2581>
newrel_m5564 7055 0.03 1525.30 8760.27 0 13.00 136.0 536.00 112558 <U+2587><U+2581><U+2581><U+2581><U+2581>
newrel_m65 7058 0.03 1426.00 9431.99 0 17.00 117.0 453.50 124476 <U+2587><U+2581><U+2581><U+2581><U+2581>
newrel_f014 7050 0.03 532.84 2117.78 0 5.00 32.5 226.00 18054 <U+2587><U+2581><U+2581><U+2581><U+2581>
newrel_f1524 7056 0.03 1161.85 4606.76 0 10.75 123.0 587.75 49491 <U+2587><U+2581><U+2581><U+2581><U+2581>
newrel_f2534 7058 0.03 1472.80 5259.59 0 18.00 161.0 762.50 44985 <U+2587><U+2581><U+2581><U+2581><U+2581>
newrel_f3544 7057 0.03 1125.01 4210.58 0 12.50 125.0 544.50 38804 <U+2587><U+2581><U+2581><U+2581><U+2581>
newrel_f4554 7057 0.03 877.27 3556.18 0 10.00 92.0 400.50 37138 <U+2587><U+2581><U+2581><U+2581><U+2581>
newrel_f5564 7057 0.03 686.41 3379.33 0 8.00 69.0 269.00 40892 <U+2587><U+2581><U+2581><U+2581><U+2581>
newrel_f65 7055 0.03 683.76 3618.47 0 9.00 69.0 339.00 47438 <U+2587><U+2581><U+2581><U+2581><U+2581>
  1. According to the data dictionary, the columns have the following meanings,
  1. Our first step is to pivot_longer. There is quite a bit of information implicitly stored in the column names, and we want to make those variables explicitly available for visual encoding.
who_longer <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65,  # notice we can refer to groups of columns without naming each one
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE # if a cell is empty, we do not keep it in the tidy version
  )

who_longer
# A tibble: 76,046 x 6
   country     iso2  iso3   year key          cases
   <chr>       <chr> <chr> <dbl> <chr>        <dbl>
 1 Afghanistan AF    AFG    1997 new_sp_m014      0
 2 Afghanistan AF    AFG    1997 new_sp_m1524    10
 3 Afghanistan AF    AFG    1997 new_sp_m2534     6
 4 Afghanistan AF    AFG    1997 new_sp_m3544     3
 5 Afghanistan AF    AFG    1997 new_sp_m4554     5
 6 Afghanistan AF    AFG    1997 new_sp_m5564     2
 7 Afghanistan AF    AFG    1997 new_sp_m65       0
 8 Afghanistan AF    AFG    1997 new_sp_f014      5
 9 Afghanistan AF    AFG    1997 new_sp_f1524    38
10 Afghanistan AF    AFG    1997 new_sp_f2534    36
# i 76,036 more rows
  1. The new column key contains several variables at once. We can separate it into gender and age group.
who_separate <- who_longer %>% 
  mutate(key = str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "type", "sexage"), sep = "_") %>%
  separate(sexage, c("sex", "age"), sep = 1)

who_separate
# A tibble: 76,046 x 9
   country     iso2  iso3   year new   type  sex   age   cases
   <chr>       <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
 1 Afghanistan AF    AFG    1997 new   sp    m     014       0
 2 Afghanistan AF    AFG    1997 new   sp    m     1524     10
 3 Afghanistan AF    AFG    1997 new   sp    m     2534      6
 4 Afghanistan AF    AFG    1997 new   sp    m     3544      3
 5 Afghanistan AF    AFG    1997 new   sp    m     4554      5
 6 Afghanistan AF    AFG    1997 new   sp    m     5564      2
 7 Afghanistan AF    AFG    1997 new   sp    m     65        0
 8 Afghanistan AF    AFG    1997 new   sp    f     014       5
 9 Afghanistan AF    AFG    1997 new   sp    f     1524     38
10 Afghanistan AF    AFG    1997 new   sp    f     2534     36
# i 76,036 more rows
  1. While we have performed each step one at a time, it’s possible to chain them into a single block of code. This is good practice, because it avoids having to define intermediate variables that are only ever used once. This is also typically more concise.
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE # if a cell is empty, we do not keep it in the tidy version
  ) %>%
  mutate(key = str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "type", "sexage"), sep = "_") %>%
  separate(sexage, c("sex", "age"), sep = 1)
# A tibble: 76,046 x 9
   country     iso2  iso3   year new   type  sex   age   cases
   <chr>       <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
 1 Afghanistan AF    AFG    1997 new   sp    m     014       0
 2 Afghanistan AF    AFG    1997 new   sp    m     1524     10
 3 Afghanistan AF    AFG    1997 new   sp    m     2534      6
 4 Afghanistan AF    AFG    1997 new   sp    m     3544      3
 5 Afghanistan AF    AFG    1997 new   sp    m     4554      5
 6 Afghanistan AF    AFG    1997 new   sp    m     5564      2
 7 Afghanistan AF    AFG    1997 new   sp    m     65        0
 8 Afghanistan AF    AFG    1997 new   sp    f     014       5
 9 Afghanistan AF    AFG    1997 new   sp    f     1524     38
10 Afghanistan AF    AFG    1997 new   sp    f     2534     36
# i 76,036 more rows
  1. A recommendation for visualization in javascript. We have only discussed tidying in R. While there is work to implement tidy-style transformations in javascript, the R tidyverse provides a more mature suite of tools. If you are making an interactive visualization in javascript, I recommend first tidying data in R so that each row corresponds to a visual mark and each column to a visual property. You can always save the result as either a json or csv, which can serve as the source data for your javascript visualization.

Footnotes

    Citation

    For attribution, please cite this work as

    Sankaran (2024, Jan. 7). STAT 436 (Spring 2024): Tidy Data Example. Retrieved from https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week02-04/

    BibTeX citation

    @misc{sankaran2024tidy,
      author = {Sankaran, Kris},
      title = {STAT 436 (Spring 2024): Tidy Data Example},
      url = {https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week02-04/},
      year = {2024}
    }