Tidy Data Example

An extended example of tidying a real-world dataset.

Kris Sankaran (UW Madison)
02-18-2021

Reading, Recording, Rmarkdown

  1. Let’s work through the example of tidying a WHO dataset. This was discussed in the reading and is good practice in pivoting and deriving new variables.

  2. The raw data, along with a summary from the skimr package, is shown below (notice the small multiples!).

who
# A tibble: 7,240 x 60
   country iso2  iso3   year new_sp_m014 new_sp_m1524 new_sp_m2534
   <chr>   <chr> <chr> <int>       <int>        <int>        <int>
 1 Afghan… AF    AFG    1980          NA           NA           NA
 2 Afghan… AF    AFG    1981          NA           NA           NA
 3 Afghan… AF    AFG    1982          NA           NA           NA
 4 Afghan… AF    AFG    1983          NA           NA           NA
 5 Afghan… AF    AFG    1984          NA           NA           NA
 6 Afghan… AF    AFG    1985          NA           NA           NA
 7 Afghan… AF    AFG    1986          NA           NA           NA
 8 Afghan… AF    AFG    1987          NA           NA           NA
 9 Afghan… AF    AFG    1988          NA           NA           NA
10 Afghan… AF    AFG    1989          NA           NA           NA
# … with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
#   new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
#   new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
#   new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
#   new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
#   new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
#   new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
#   new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
#   new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
#   new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
#   new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
#   new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
#   new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
#   new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
#   newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
#   newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
#   newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
#   newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
#   newrel_f65 <int>
skim(who)
Table 1: Data summary
Name who
Number of rows 7240
Number of columns 60
_______________________
Column type frequency:
character 3
numeric 57
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 4 52 0 219 0
iso2 0 1 2 2 0 219 0
iso3 0 1 3 3 0 219 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 1996.56 9.83 1980 1988.00 1997.0 2005.00 2013 ▇▇▇▇▇
new_sp_m014 4067 0.44 83.71 316.14 0 0.00 5.0 37.00 5001 ▇▁▁▁▁
new_sp_m1524 4031 0.44 1015.66 4885.38 0 9.00 90.0 502.00 78278 ▇▁▁▁▁
new_sp_m2534 4034 0.44 1403.80 5718.39 0 14.00 150.0 715.50 84003 ▇▁▁▁▁
new_sp_m3544 4021 0.44 1315.88 6003.26 0 13.00 130.0 583.50 90830 ▇▁▁▁▁
new_sp_m4554 4017 0.45 1103.86 5441.06 0 12.00 102.0 440.00 82921 ▇▁▁▁▁
new_sp_m5564 4022 0.44 800.70 4418.31 0 8.00 63.0 279.00 63814 ▇▁▁▁▁
new_sp_m65 4031 0.44 682.82 4089.14 0 8.00 53.0 232.00 70376 ▇▁▁▁▁
new_sp_f014 4066 0.44 114.33 504.63 0 1.00 7.0 50.75 8576 ▇▁▁▁▁
new_sp_f1524 4046 0.44 826.11 3552.02 0 7.00 66.0 421.00 53975 ▇▁▁▁▁
new_sp_f2534 4040 0.44 917.30 3580.15 0 9.00 84.0 476.25 49887 ▇▁▁▁▁
new_sp_f3544 4041 0.44 640.43 2542.51 0 6.00 57.0 308.00 34698 ▇▁▁▁▁
new_sp_f4554 4036 0.44 445.78 1799.23 0 4.00 38.0 211.00 23977 ▇▁▁▁▁
new_sp_f5564 4045 0.44 313.87 1381.25 0 3.00 25.0 146.50 18203 ▇▁▁▁▁
new_sp_f65 4043 0.44 283.93 1267.94 0 4.00 30.0 129.00 21339 ▇▁▁▁▁
new_sn_m014 6195 0.14 308.75 1727.25 0 1.00 9.0 61.00 22355 ▇▁▁▁▁
new_sn_m1524 6210 0.14 513.02 3643.27 0 2.00 15.5 102.00 60246 ▇▁▁▁▁
new_sn_m2534 6218 0.14 653.69 3430.03 0 2.00 23.0 135.50 50282 ▇▁▁▁▁
new_sn_m3544 6215 0.14 837.87 8524.53 0 2.00 19.0 132.00 250051 ▇▁▁▁▁
new_sn_m4554 6213 0.14 520.79 3301.70 0 2.00 19.0 127.50 57181 ▇▁▁▁▁
new_sn_m5564 6219 0.14 448.62 3488.68 0 2.00 16.0 101.00 64972 ▇▁▁▁▁
new_sn_m65 6220 0.14 460.36 3991.90 0 2.00 20.5 111.75 74282 ▇▁▁▁▁
new_sn_f014 6200 0.14 291.95 1647.30 0 1.00 8.0 58.00 21406 ▇▁▁▁▁
new_sn_f1524 6218 0.14 407.90 2379.13 0 1.00 12.0 89.00 35518 ▇▁▁▁▁
new_sn_f2534 6224 0.14 466.26 2272.86 0 2.00 18.0 103.25 28753 ▇▁▁▁▁
new_sn_f3544 6220 0.14 506.59 5013.53 0 1.00 11.0 82.25 148811 ▇▁▁▁▁
new_sn_f4554 6222 0.14 271.16 1511.72 0 1.00 10.0 76.75 23869 ▇▁▁▁▁
new_sn_f5564 6223 0.14 213.39 1468.62 0 1.00 8.0 56.00 26085 ▇▁▁▁▁
new_sn_f65 6221 0.14 230.75 1597.70 0 1.00 13.0 74.00 29630 ▇▁▁▁▁
new_ep_m014 6202 0.14 128.61 460.14 0 0.00 6.0 55.00 7869 ▇▁▁▁▁
new_ep_m1524 6214 0.14 158.30 537.74 0 1.00 11.0 88.00 8558 ▇▁▁▁▁
new_ep_m2534 6220 0.14 201.23 764.05 0 1.00 13.0 124.00 11843 ▇▁▁▁▁
new_ep_m3544 6216 0.14 272.72 3381.41 0 1.00 10.5 91.25 105825 ▇▁▁▁▁
new_ep_m4554 6220 0.14 108.11 380.61 0 1.00 8.5 63.25 5875 ▇▁▁▁▁
new_ep_m5564 6225 0.14 72.17 234.55 0 1.00 7.0 46.00 3957 ▇▁▁▁▁
new_ep_m65 6222 0.14 78.94 227.34 0 1.00 10.0 55.00 3061 ▇▁▁▁▁
new_ep_f014 6208 0.14 112.89 446.55 0 0.00 5.0 50.00 6960 ▇▁▁▁▁
new_ep_f1524 6219 0.14 149.17 543.89 0 1.00 9.0 78.00 7866 ▇▁▁▁▁
new_ep_f2534 6219 0.14 189.52 761.79 0 1.00 12.0 95.00 10759 ▇▁▁▁▁
new_ep_f3544 6219 0.14 241.70 3218.50 0 1.00 9.0 77.00 101015 ▇▁▁▁▁
new_ep_f4554 6223 0.14 93.77 339.33 0 1.00 8.0 56.00 6759 ▇▁▁▁▁
new_ep_f5564 6223 0.14 63.04 212.95 0 1.00 6.0 42.00 4684 ▇▁▁▁▁
new_ep_f65 6226 0.14 72.31 202.72 0 0.00 10.0 51.00 2548 ▇▁▁▁▁
newrel_m014 7050 0.03 538.18 2082.18 0 5.00 32.5 210.00 18617 ▇▁▁▁▁
newrel_m1524 7058 0.03 1489.51 6848.18 0 17.50 171.0 684.25 84785 ▇▁▁▁▁
newrel_m2534 7057 0.03 2139.72 7539.87 0 25.00 217.0 1091.00 76917 ▇▁▁▁▁
newrel_m3544 7056 0.03 2036.40 7847.94 0 24.75 208.0 851.25 84565 ▇▁▁▁▁
newrel_m4554 7056 0.03 1835.07 8324.28 0 19.00 175.0 688.50 100297 ▇▁▁▁▁
newrel_m5564 7055 0.03 1525.30 8760.27 0 13.00 136.0 536.00 112558 ▇▁▁▁▁
newrel_m65 7058 0.03 1426.00 9431.99 0 17.00 117.0 453.50 124476 ▇▁▁▁▁
newrel_f014 7050 0.03 532.84 2117.78 0 5.00 32.5 226.00 18054 ▇▁▁▁▁
newrel_f1524 7056 0.03 1161.85 4606.76 0 10.75 123.0 587.75 49491 ▇▁▁▁▁
newrel_f2534 7058 0.03 1472.80 5259.59 0 18.00 161.0 762.50 44985 ▇▁▁▁▁
newrel_f3544 7057 0.03 1125.01 4210.58 0 12.50 125.0 544.50 38804 ▇▁▁▁▁
newrel_f4554 7057 0.03 877.27 3556.18 0 10.00 92.0 400.50 37138 ▇▁▁▁▁
newrel_f5564 7057 0.03 686.41 3379.33 0 8.00 69.0 269.00 40892 ▇▁▁▁▁
newrel_f65 7055 0.03 683.76 3618.47 0 9.00 69.0 339.00 47438 ▇▁▁▁▁
  1. According to the data dictionary, the columns have the following meanings,
  1. Our first step is to pivot_longer. There is quite a bit of information implicitly stored in the column names, and we want to make those variables explicitly available for visual encoding.
who_longer <- who %>% 
  pivot_longer(
    cols = new_sp_m014:newrel_f65,  # notice we can refer to groups of columns without naming each one
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE # if a cell is empty, we do not keep it in the tidy version
  )

who_longer
# A tibble: 76,046 x 6
   country     iso2  iso3   year key          cases
   <chr>       <chr> <chr> <int> <chr>        <int>
 1 Afghanistan AF    AFG    1997 new_sp_m014      0
 2 Afghanistan AF    AFG    1997 new_sp_m1524    10
 3 Afghanistan AF    AFG    1997 new_sp_m2534     6
 4 Afghanistan AF    AFG    1997 new_sp_m3544     3
 5 Afghanistan AF    AFG    1997 new_sp_m4554     5
 6 Afghanistan AF    AFG    1997 new_sp_m5564     2
 7 Afghanistan AF    AFG    1997 new_sp_m65       0
 8 Afghanistan AF    AFG    1997 new_sp_f014      5
 9 Afghanistan AF    AFG    1997 new_sp_f1524    38
10 Afghanistan AF    AFG    1997 new_sp_f2534    36
# … with 76,036 more rows
  1. The new column key contains several variables at once. We can separate it into gender and age group.
who_separate <- who_longer %>% 
  mutate(key = str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "type", "sexage"), sep = "_") %>%
  separate(sexage, c("sex", "age"), sep = 1)

who_separate
# A tibble: 76,046 x 9
   country     iso2  iso3   year new   type  sex   age   cases
   <chr>       <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
 1 Afghanistan AF    AFG    1997 new   sp    m     014       0
 2 Afghanistan AF    AFG    1997 new   sp    m     1524     10
 3 Afghanistan AF    AFG    1997 new   sp    m     2534      6
 4 Afghanistan AF    AFG    1997 new   sp    m     3544      3
 5 Afghanistan AF    AFG    1997 new   sp    m     4554      5
 6 Afghanistan AF    AFG    1997 new   sp    m     5564      2
 7 Afghanistan AF    AFG    1997 new   sp    m     65        0
 8 Afghanistan AF    AFG    1997 new   sp    f     014       5
 9 Afghanistan AF    AFG    1997 new   sp    f     1524     38
10 Afghanistan AF    AFG    1997 new   sp    f     2534     36
# … with 76,036 more rows
  1. While we have performed each step one at a time, it’s possible to chain them into a single block of code. This is good practice, because it avoids having to define intermediate variables that are only ever used once. This is also typically more concise.
who %>%
  pivot_longer(
    cols = new_sp_m014:newrel_f65, 
    names_to = "key", 
    values_to = "cases", 
    values_drop_na = TRUE # if a cell is empty, we do not keep it in the tidy version
  ) %>%
  mutate(key = str_replace(key, "newrel", "new_rel")) %>%
  separate(key, c("new", "type", "sexage"), sep = "_") %>%
  separate(sexage, c("sex", "age"), sep = 1)
# A tibble: 76,046 x 9
   country     iso2  iso3   year new   type  sex   age   cases
   <chr>       <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
 1 Afghanistan AF    AFG    1997 new   sp    m     014       0
 2 Afghanistan AF    AFG    1997 new   sp    m     1524     10
 3 Afghanistan AF    AFG    1997 new   sp    m     2534      6
 4 Afghanistan AF    AFG    1997 new   sp    m     3544      3
 5 Afghanistan AF    AFG    1997 new   sp    m     4554      5
 6 Afghanistan AF    AFG    1997 new   sp    m     5564      2
 7 Afghanistan AF    AFG    1997 new   sp    m     65        0
 8 Afghanistan AF    AFG    1997 new   sp    f     014       5
 9 Afghanistan AF    AFG    1997 new   sp    f     1524     38
10 Afghanistan AF    AFG    1997 new   sp    f     2534     36
# … with 76,036 more rows
  1. A recommendation for visualization in javascript. We have only discussed tidying in R. While there is work to implement tidy-style transformations in javascript, the R tidyverse provides a more mature suite of tools. If you are making an interactive visualization in javascript, I recommend first tidying data in R so that each row corresponds to a visual mark and each column to a visual property. You can always save the result as either a json or csv, which can serve as the source data for your javascript visualization.