Statistical Data Visualization: Tidy Data


library("tidyr")
library("ggplot2")
theme_set(theme_bw())

A dataset is called tidy if rows correspond to distinct observations and columns correspond to distinct variables.

For visualization, it is important that data be in tidy format. This is because (a) each visual mark will be associated with a row of the dataset and (b) properties of the visual marks will be determined by values within the columns. A plot that is easy to create when the data are in tidy format might be very hard to create otherwise.
Tidy data might seem like an idea so natural that it’s not worth teaching (let alone formalizing). However, exceptions are encountered frequently, and it’s important that you be able to spot them. Further, there are now many utilities for “tidying” data, and they are worth becoming familiar with.
Here is an example of a tidy dataset.

table1

# A tibble: 6 x 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

It is easy to visualize the tidy dataset.

ggplot(table1, aes(x = year, y = cases, col = country)) +
  geom_point() +
  geom_line()

Below are three non-tidy versions of the same dataset. They are representative of more general classes of problems that may arise:
1. A variable might be implicitly stored within column names, rather than explicitly stored in its own column. Here, the years are stored as column names. It’s not really possible to create the plot above using the data in this format.

table4a # cases

# A tibble: 3 x 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

table4b # population

# A tibble: 3 x 3
  country         `1999`     `2000`
* <chr>            <int>      <int>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583
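
As a preview, here is a minimal sketch of how the two tables above could be brought into tidy form. It assumes tidyr's `pivot_longer()` for the reshaping and base R's `merge()` for combining the results; the names `cases_long`, `population_long`, and `tidy4` are illustrative and do not appear in the original notes.

```r
library("tidyr")
library("ggplot2")

# Move the year column names of table4a into an explicit `year` column.
cases_long <- pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)
# Column names are character strings, so convert the years to integers.
cases_long$year <- as.integer(cases_long$year)

# Apply the same reshaping to the population table.
population_long <- pivot_longer(
  table4b,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "population"
)
population_long$year <- as.integer(population_long$year)

# Combine the two long tables on their shared keys (assumption: base merge()
# is used here only to avoid loading another package).
tidy4 <- merge(cases_long, population_long, by = c("country", "year"))

# With years in their own column, the earlier plot can be created again.
ggplot(tidy4, aes(x = year, y = cases, col = country)) +
  geom_point() +
  geom_line()
```

Once the years are stored in a column rather than in column names, they can be mapped to the x-axis like any other variable.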

2. The same observation may appear in multiple rows, where each instance of the row is associated with a different variable. Here, the observations are the country by year combinations.

table2

# A tibble: 12 x 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
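
One possible way to repair this layout, sketched here under the assumption that tidyr's `pivot_wider()` is available, is to give each value of `type` its own column. The name `tidy2` is illustrative.

```r
library("tidyr")

# Spread the labels in `type` into separate columns, filled with values from `count`.
tidy2 <- pivot_wider(table2, names_from = type, values_from = count)
tidy2
```

Each country by year combination now occupies a single row, with `cases` and `population` as separate columns.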

3. A single column actually stores multiple variables. Here, rate is being used to store both the population and case count variables.

table3

# A tibble: 6 x 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

The trouble is that this variable has to be stored as a character string; otherwise, we would lose access to the original population and case count variables. But this makes the plot useless, since each distinct string is treated as its own category on the y-axis.

ggplot(table3, aes(x = year, y = rate)) +
  geom_point() +
  geom_line(aes(group = country))
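
For comparison, here is a minimal sketch of how the `rate` column might be split so that a meaningful plot becomes possible. It assumes tidyr's `separate()` function; the name `tidy3` is illustrative.

```r
library("tidyr")
library("ggplot2")

# Split `rate` at the "/" into two columns; convert = TRUE parses them as numbers.
tidy3 <- separate(
  table3,
  rate,
  into = c("cases", "population"),
  sep = "/",
  convert = TRUE
)

# The rate can now be computed numerically and plotted on a continuous axis.
ggplot(tidy3, aes(x = year, y = cases / population, col = country)) +
  geom_point() +
  geom_line()
```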

The next few lectures provide tools for addressing these three problems.

A few caveats are in order. It’s easy to become a tidy-data purist, and lose sight of the bigger data-analytic picture. To prevent that, first, remember that what is or is not tidy may be context dependent. Maybe you want to treat each week as an observation, rather than each day. Second, know that there are sometimes computational reasons to prefer non-tidy data. For example, “long” data often require more memory, since column names that were originally stored once now have to be copied onto each row. Certain statistical models are also sometimes best framed as matrix operations on non-tidy datasets.
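
The memory point can be checked with a small experiment. The sketch below builds a hypothetical wide dataset (the sizes and the names `wide`, `long`, `variable`, and `value` are arbitrary choices for illustration), reshapes it with `pivot_longer()`, and compares approximate sizes using base R's `object.size()`.

```r
library("tidyr")

set.seed(1)
# Hypothetical wide dataset: 1,000 rows, 50 numeric columns, plus an id column.
wide <- as.data.frame(matrix(rnorm(1000 * 50), nrow = 1000))
wide$id <- seq_len(1000)

# Long version: one row per (id, original column name) pair.
long <- pivot_longer(wide, cols = -id, names_to = "variable", values_to = "value")

# Approximate in-memory size of each representation.
object.size(wide)
object.size(long)
```

In the long version, the id and the original column name are repeated on every row, which is where the extra memory goes.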