A look at how visualization can help characterize missing data.
Visualization is about more than simply communicating results at the end of a data analysis. It can be used to improve the process of data analysis itself. This is the subject of this week’s notes.
When data science workflows are opaque — when sequences of commands are followed blindly — that’s when mistakes are most likely to be made¹. Visualization can help improve the transparency of different steps across the workflow.
To make this idea concrete, let’s consider missing data. We can encounter missing data for a variety of reasons. Perhaps a sensor was broken or no response was submitted. Maybe several datasets were merged, but some fields are only present in the more recent versions.
If there are missing data and we fail to account for them, we might accidentally draw incorrect conclusions. Or, if we attempt to impute the missing values, we should first verify the assumptions of whatever imputation algorithm we intend to use.
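As one concrete sketch of such a check, the naniar package (loaded here and used for all of the examples below) implements Little’s test of the “missing completely at random” (MCAR) hypothesis:

library(naniar)

# Little's (1988) MCAR test, run on the built-in airquality data
# that we return to below. A small p-value is evidence against the
# values being missing completely at random, which would rule out
# imputation methods that assume MCAR.
mcar_test(airquality)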
A first plot to make shows the proportion of missing data within each field. Let’s consider the riskfactors data, a subset of the Behavioral Risk Factor Surveillance System².
# Amount of missing data within each variable
gg_miss_var(riskfactors)

# Upset plot: which combinations of variables are missing together
gg_miss_upset(riskfactors)

# Heatmap of missingness across all rows and columns
vis_miss(riskfactors)
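The same information can also be read off as a table; one option is naniar’s miss_var_summary, which reports the count and percentage of missing values in each variable:

# Count and percentage of missing values for each variable
miss_var_summary(riskfactors)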
There doesn’t seem to be much clumping in that dataset, but consider the airquality dataset below, whose rows are sorted by time.
# Rows are sorted by time, so missing values that clump in time
# show up as contiguous blocks
vis_miss(airquality)
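To quantify this clumping, rather than only eyeballing it, one option is naniar’s miss_var_run, which computes the lengths of consecutive runs of missing and complete values in a single variable:

# Run lengths of missing vs. complete stretches in Ozone; long
# missing runs correspond to the contiguous blocks in the plot
miss_var_run(airquality, Ozone)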
¹ Prof. Broman has noted a compendium of horror stories.
² Which, when you think about it, sounds pretty dystopian / sci-fi.