An extended example of tidying a real-world dataset.
Let’s work through the example of tidying a WHO dataset. This was discussed in the reading and is good practice in pivoting and deriving new variables.
The raw data, along with a summary from the skimr
package, is shown below (notice the small multiples!).
who
# A tibble: 7,240 x 60
country iso2 iso3 year new_sp_m014 new_sp_m1524 new_sp_m2534
<chr> <chr> <chr> <int> <int> <int> <int>
1 Afghan… AF AFG 1980 NA NA NA
2 Afghan… AF AFG 1981 NA NA NA
3 Afghan… AF AFG 1982 NA NA NA
4 Afghan… AF AFG 1983 NA NA NA
5 Afghan… AF AFG 1984 NA NA NA
6 Afghan… AF AFG 1985 NA NA NA
7 Afghan… AF AFG 1986 NA NA NA
8 Afghan… AF AFG 1987 NA NA NA
9 Afghan… AF AFG 1988 NA NA NA
10 Afghan… AF AFG 1989 NA NA NA
# … with 7,230 more rows, and 53 more variables: new_sp_m3544 <int>,
# new_sp_m4554 <int>, new_sp_m5564 <int>, new_sp_m65 <int>,
# new_sp_f014 <int>, new_sp_f1524 <int>, new_sp_f2534 <int>,
# new_sp_f3544 <int>, new_sp_f4554 <int>, new_sp_f5564 <int>,
# new_sp_f65 <int>, new_sn_m014 <int>, new_sn_m1524 <int>,
# new_sn_m2534 <int>, new_sn_m3544 <int>, new_sn_m4554 <int>,
# new_sn_m5564 <int>, new_sn_m65 <int>, new_sn_f014 <int>,
# new_sn_f1524 <int>, new_sn_f2534 <int>, new_sn_f3544 <int>,
# new_sn_f4554 <int>, new_sn_f5564 <int>, new_sn_f65 <int>,
# new_ep_m014 <int>, new_ep_m1524 <int>, new_ep_m2534 <int>,
# new_ep_m3544 <int>, new_ep_m4554 <int>, new_ep_m5564 <int>,
# new_ep_m65 <int>, new_ep_f014 <int>, new_ep_f1524 <int>,
# new_ep_f2534 <int>, new_ep_f3544 <int>, new_ep_f4554 <int>,
# new_ep_f5564 <int>, new_ep_f65 <int>, newrel_m014 <int>,
# newrel_m1524 <int>, newrel_m2534 <int>, newrel_m3544 <int>,
# newrel_m4554 <int>, newrel_m5564 <int>, newrel_m65 <int>,
# newrel_f014 <int>, newrel_f1524 <int>, newrel_f2534 <int>,
# newrel_f3544 <int>, newrel_f4554 <int>, newrel_f5564 <int>,
# newrel_f65 <int>
skim(who)
Name | who |
Number of rows | 7240 |
Number of columns | 60 |
_______________________ | |
Column type frequency: | |
character | 3 |
numeric | 57 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
country | 0 | 1 | 4 | 52 | 0 | 219 | 0 |
iso2 | 0 | 1 | 2 | 2 | 0 | 219 | 0 |
iso3 | 0 | 1 | 3 | 3 | 0 | 219 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 1996.56 | 9.83 | 1980 | 1988.00 | 1997.0 | 2005.00 | 2013 | ▇▇▇▇▇ |
new_sp_m014 | 4067 | 0.44 | 83.71 | 316.14 | 0 | 0.00 | 5.0 | 37.00 | 5001 | ▇▁▁▁▁ |
new_sp_m1524 | 4031 | 0.44 | 1015.66 | 4885.38 | 0 | 9.00 | 90.0 | 502.00 | 78278 | ▇▁▁▁▁ |
new_sp_m2534 | 4034 | 0.44 | 1403.80 | 5718.39 | 0 | 14.00 | 150.0 | 715.50 | 84003 | ▇▁▁▁▁ |
new_sp_m3544 | 4021 | 0.44 | 1315.88 | 6003.26 | 0 | 13.00 | 130.0 | 583.50 | 90830 | ▇▁▁▁▁ |
new_sp_m4554 | 4017 | 0.45 | 1103.86 | 5441.06 | 0 | 12.00 | 102.0 | 440.00 | 82921 | ▇▁▁▁▁ |
new_sp_m5564 | 4022 | 0.44 | 800.70 | 4418.31 | 0 | 8.00 | 63.0 | 279.00 | 63814 | ▇▁▁▁▁ |
new_sp_m65 | 4031 | 0.44 | 682.82 | 4089.14 | 0 | 8.00 | 53.0 | 232.00 | 70376 | ▇▁▁▁▁ |
new_sp_f014 | 4066 | 0.44 | 114.33 | 504.63 | 0 | 1.00 | 7.0 | 50.75 | 8576 | ▇▁▁▁▁ |
new_sp_f1524 | 4046 | 0.44 | 826.11 | 3552.02 | 0 | 7.00 | 66.0 | 421.00 | 53975 | ▇▁▁▁▁ |
new_sp_f2534 | 4040 | 0.44 | 917.30 | 3580.15 | 0 | 9.00 | 84.0 | 476.25 | 49887 | ▇▁▁▁▁ |
new_sp_f3544 | 4041 | 0.44 | 640.43 | 2542.51 | 0 | 6.00 | 57.0 | 308.00 | 34698 | ▇▁▁▁▁ |
new_sp_f4554 | 4036 | 0.44 | 445.78 | 1799.23 | 0 | 4.00 | 38.0 | 211.00 | 23977 | ▇▁▁▁▁ |
new_sp_f5564 | 4045 | 0.44 | 313.87 | 1381.25 | 0 | 3.00 | 25.0 | 146.50 | 18203 | ▇▁▁▁▁ |
new_sp_f65 | 4043 | 0.44 | 283.93 | 1267.94 | 0 | 4.00 | 30.0 | 129.00 | 21339 | ▇▁▁▁▁ |
new_sn_m014 | 6195 | 0.14 | 308.75 | 1727.25 | 0 | 1.00 | 9.0 | 61.00 | 22355 | ▇▁▁▁▁ |
new_sn_m1524 | 6210 | 0.14 | 513.02 | 3643.27 | 0 | 2.00 | 15.5 | 102.00 | 60246 | ▇▁▁▁▁ |
new_sn_m2534 | 6218 | 0.14 | 653.69 | 3430.03 | 0 | 2.00 | 23.0 | 135.50 | 50282 | ▇▁▁▁▁ |
new_sn_m3544 | 6215 | 0.14 | 837.87 | 8524.53 | 0 | 2.00 | 19.0 | 132.00 | 250051 | ▇▁▁▁▁ |
new_sn_m4554 | 6213 | 0.14 | 520.79 | 3301.70 | 0 | 2.00 | 19.0 | 127.50 | 57181 | ▇▁▁▁▁ |
new_sn_m5564 | 6219 | 0.14 | 448.62 | 3488.68 | 0 | 2.00 | 16.0 | 101.00 | 64972 | ▇▁▁▁▁ |
new_sn_m65 | 6220 | 0.14 | 460.36 | 3991.90 | 0 | 2.00 | 20.5 | 111.75 | 74282 | ▇▁▁▁▁ |
new_sn_f014 | 6200 | 0.14 | 291.95 | 1647.30 | 0 | 1.00 | 8.0 | 58.00 | 21406 | ▇▁▁▁▁ |
new_sn_f1524 | 6218 | 0.14 | 407.90 | 2379.13 | 0 | 1.00 | 12.0 | 89.00 | 35518 | ▇▁▁▁▁ |
new_sn_f2534 | 6224 | 0.14 | 466.26 | 2272.86 | 0 | 2.00 | 18.0 | 103.25 | 28753 | ▇▁▁▁▁ |
new_sn_f3544 | 6220 | 0.14 | 506.59 | 5013.53 | 0 | 1.00 | 11.0 | 82.25 | 148811 | ▇▁▁▁▁ |
new_sn_f4554 | 6222 | 0.14 | 271.16 | 1511.72 | 0 | 1.00 | 10.0 | 76.75 | 23869 | ▇▁▁▁▁ |
new_sn_f5564 | 6223 | 0.14 | 213.39 | 1468.62 | 0 | 1.00 | 8.0 | 56.00 | 26085 | ▇▁▁▁▁ |
new_sn_f65 | 6221 | 0.14 | 230.75 | 1597.70 | 0 | 1.00 | 13.0 | 74.00 | 29630 | ▇▁▁▁▁ |
new_ep_m014 | 6202 | 0.14 | 128.61 | 460.14 | 0 | 0.00 | 6.0 | 55.00 | 7869 | ▇▁▁▁▁ |
new_ep_m1524 | 6214 | 0.14 | 158.30 | 537.74 | 0 | 1.00 | 11.0 | 88.00 | 8558 | ▇▁▁▁▁ |
new_ep_m2534 | 6220 | 0.14 | 201.23 | 764.05 | 0 | 1.00 | 13.0 | 124.00 | 11843 | ▇▁▁▁▁ |
new_ep_m3544 | 6216 | 0.14 | 272.72 | 3381.41 | 0 | 1.00 | 10.5 | 91.25 | 105825 | ▇▁▁▁▁ |
new_ep_m4554 | 6220 | 0.14 | 108.11 | 380.61 | 0 | 1.00 | 8.5 | 63.25 | 5875 | ▇▁▁▁▁ |
new_ep_m5564 | 6225 | 0.14 | 72.17 | 234.55 | 0 | 1.00 | 7.0 | 46.00 | 3957 | ▇▁▁▁▁ |
new_ep_m65 | 6222 | 0.14 | 78.94 | 227.34 | 0 | 1.00 | 10.0 | 55.00 | 3061 | ▇▁▁▁▁ |
new_ep_f014 | 6208 | 0.14 | 112.89 | 446.55 | 0 | 0.00 | 5.0 | 50.00 | 6960 | ▇▁▁▁▁ |
new_ep_f1524 | 6219 | 0.14 | 149.17 | 543.89 | 0 | 1.00 | 9.0 | 78.00 | 7866 | ▇▁▁▁▁ |
new_ep_f2534 | 6219 | 0.14 | 189.52 | 761.79 | 0 | 1.00 | 12.0 | 95.00 | 10759 | ▇▁▁▁▁ |
new_ep_f3544 | 6219 | 0.14 | 241.70 | 3218.50 | 0 | 1.00 | 9.0 | 77.00 | 101015 | ▇▁▁▁▁ |
new_ep_f4554 | 6223 | 0.14 | 93.77 | 339.33 | 0 | 1.00 | 8.0 | 56.00 | 6759 | ▇▁▁▁▁ |
new_ep_f5564 | 6223 | 0.14 | 63.04 | 212.95 | 0 | 1.00 | 6.0 | 42.00 | 4684 | ▇▁▁▁▁ |
new_ep_f65 | 6226 | 0.14 | 72.31 | 202.72 | 0 | 0.00 | 10.0 | 51.00 | 2548 | ▇▁▁▁▁ |
newrel_m014 | 7050 | 0.03 | 538.18 | 2082.18 | 0 | 5.00 | 32.5 | 210.00 | 18617 | ▇▁▁▁▁ |
newrel_m1524 | 7058 | 0.03 | 1489.51 | 6848.18 | 0 | 17.50 | 171.0 | 684.25 | 84785 | ▇▁▁▁▁ |
newrel_m2534 | 7057 | 0.03 | 2139.72 | 7539.87 | 0 | 25.00 | 217.0 | 1091.00 | 76917 | ▇▁▁▁▁ |
newrel_m3544 | 7056 | 0.03 | 2036.40 | 7847.94 | 0 | 24.75 | 208.0 | 851.25 | 84565 | ▇▁▁▁▁ |
newrel_m4554 | 7056 | 0.03 | 1835.07 | 8324.28 | 0 | 19.00 | 175.0 | 688.50 | 100297 | ▇▁▁▁▁ |
newrel_m5564 | 7055 | 0.03 | 1525.30 | 8760.27 | 0 | 13.00 | 136.0 | 536.00 | 112558 | ▇▁▁▁▁ |
newrel_m65 | 7058 | 0.03 | 1426.00 | 9431.99 | 0 | 17.00 | 117.0 | 453.50 | 124476 | ▇▁▁▁▁ |
newrel_f014 | 7050 | 0.03 | 532.84 | 2117.78 | 0 | 5.00 | 32.5 | 226.00 | 18054 | ▇▁▁▁▁ |
newrel_f1524 | 7056 | 0.03 | 1161.85 | 4606.76 | 0 | 10.75 | 123.0 | 587.75 | 49491 | ▇▁▁▁▁ |
newrel_f2534 | 7058 | 0.03 | 1472.80 | 5259.59 | 0 | 18.00 | 161.0 | 762.50 | 44985 | ▇▁▁▁▁ |
newrel_f3544 | 7057 | 0.03 | 1125.01 | 4210.58 | 0 | 12.50 | 125.0 | 544.50 | 38804 | ▇▁▁▁▁ |
newrel_f4554 | 7057 | 0.03 | 877.27 | 3556.18 | 0 | 10.00 | 92.0 | 400.50 | 37138 | ▇▁▁▁▁ |
newrel_f5564 | 7057 | 0.03 | 686.41 | 3379.33 | 0 | 8.00 | 69.0 | 269.00 | 40892 | ▇▁▁▁▁ |
newrel_f65 | 7055 | 0.03 | 683.76 | 3618.47 | 0 | 9.00 | 69.0 | 339.00 | 47438 | ▇▁▁▁▁ |
3544
should be interpreted as 35 - 44 years old.pivot_longer
. There is quite a bit of information implicitly stored in the column names, and we want to make those variables explicitly available for visual encoding.who_longer <- who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65, # notice we can refer to groups of columns without naming each one
names_to = "key",
values_to = "cases",
values_drop_na = TRUE # if a cell is empty, we do not keep it in the tidy version
)
who_longer
# A tibble: 76,046 x 6
country iso2 iso3 year key cases
<chr> <chr> <chr> <int> <chr> <int>
1 Afghanistan AF AFG 1997 new_sp_m014 0
2 Afghanistan AF AFG 1997 new_sp_m1524 10
3 Afghanistan AF AFG 1997 new_sp_m2534 6
4 Afghanistan AF AFG 1997 new_sp_m3544 3
5 Afghanistan AF AFG 1997 new_sp_m4554 5
6 Afghanistan AF AFG 1997 new_sp_m5564 2
7 Afghanistan AF AFG 1997 new_sp_m65 0
8 Afghanistan AF AFG 1997 new_sp_f014 5
9 Afghanistan AF AFG 1997 new_sp_f1524 38
10 Afghanistan AF AFG 1997 new_sp_f2534 36
# … with 76,036 more rows
key
contains several variables at once. We can separate
it into gender and age group.who_separate <- who_longer %>%
mutate(key = str_replace(key, "newrel", "new_rel")) %>%
separate(key, c("new", "type", "sexage"), sep = "_") %>%
separate(sexage, c("sex", "age"), sep = 1)
who_separate
# A tibble: 76,046 x 9
country iso2 iso3 year new type sex age cases
<chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
1 Afghanistan AF AFG 1997 new sp m 014 0
2 Afghanistan AF AFG 1997 new sp m 1524 10
3 Afghanistan AF AFG 1997 new sp m 2534 6
4 Afghanistan AF AFG 1997 new sp m 3544 3
5 Afghanistan AF AFG 1997 new sp m 4554 5
6 Afghanistan AF AFG 1997 new sp m 5564 2
7 Afghanistan AF AFG 1997 new sp m 65 0
8 Afghanistan AF AFG 1997 new sp f 014 5
9 Afghanistan AF AFG 1997 new sp f 1524 38
10 Afghanistan AF AFG 1997 new sp f 2534 36
# … with 76,036 more rows
who %>%
pivot_longer(
cols = new_sp_m014:newrel_f65,
names_to = "key",
values_to = "cases",
values_drop_na = TRUE # if a cell is empty, we do not keep it in the tidy version
) %>%
mutate(key = str_replace(key, "newrel", "new_rel")) %>%
separate(key, c("new", "type", "sexage"), sep = "_") %>%
separate(sexage, c("sex", "age"), sep = 1)
# A tibble: 76,046 x 9
country iso2 iso3 year new type sex age cases
<chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <int>
1 Afghanistan AF AFG 1997 new sp m 014 0
2 Afghanistan AF AFG 1997 new sp m 1524 10
3 Afghanistan AF AFG 1997 new sp m 2534 6
4 Afghanistan AF AFG 1997 new sp m 3544 3
5 Afghanistan AF AFG 1997 new sp m 4554 5
6 Afghanistan AF AFG 1997 new sp m 5564 2
7 Afghanistan AF AFG 1997 new sp m 65 0
8 Afghanistan AF AFG 1997 new sp f 014 5
9 Afghanistan AF AFG 1997 new sp f 1524 38
10 Afghanistan AF AFG 1997 new sp f 2534 36
# … with 76,036 more rows