A deeper look at missing data, imputation, and characterization.
The previous notes described how visualization can help quantify how much missingness is present and where it occurs. Here, we will explore how visualization also helps when imputing and characterizing missing values.
For example, visualization is helpful for understanding the results of imputation algorithms. Before describing this, it will be helpful to have a crash course on missing data imputation.
Imputation algorithms try to replace all the missing values with plausible values. We do this because we don’t want to discard every observation that has a missing value, yet many of our usual models throw errors when given data with missing values.
Median imputation replaces each missing value in a given column with the median of the observed values in that column. It works one column at a time.
library(dplyr)
library(naniar)
x <- data.frame(value = c(rnorm(450), rep(NA, 50))) %>%
  sample_frac(1) %>% # randomly reorder
  bind_shadow() %>% # add a shadow column flagging missing entries
  mutate(imputed = naniar::impute_median(value))
x
# A tibble: 500 x 3
     value value_NA imputed
     <dbl> <fct>      <dbl>
 1 -1.01   !NA      -1.01
 2 -0.0649 !NA      -0.0649
 3 -0.600  !NA      -0.600
 4  0.468  !NA       0.468
 5 -0.0619 !NA      -0.0619
 6 -1.90   !NA      -1.90
 7 -0.561  !NA      -0.561
 8 -0.677  !NA      -0.677
 9  0.226  !NA       0.226
10 -0.894  !NA      -0.894
# … with 490 more rows
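library(ggplot2)
# Histogram of the imputed column, colored by whether the value was originally missing
ggplot(x) +
  geom_histogram(aes(x = imputed, fill = value_NA)) +
  scale_fill_brewer(palette = "Set2")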
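As a rough sketch of what median imputation does, the base R snippet below fills the missing entries of a single vector by hand. The helper name impute_median_manual and the seed are assumptions made for illustration; this is not naniar's implementation. It also compares the spread before and after imputation, since stacking 50 values at the median tends to shrink the standard deviation.

```r
# Hypothetical helper (not part of naniar): median imputation for one numeric vector
impute_median_manual <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}

set.seed(1) # arbitrary seed, only for reproducibility
value <- c(rnorm(450), rep(NA, 50))
imputed <- impute_median_manual(value)

sum(is.na(imputed))     # no missing values remain
sd(value, na.rm = TRUE) # spread of the observed values
sd(imputed)             # spread after imputation
```

The histogram above shows the same effect visually: all imputed values pile up in a single bin.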
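Because median imputation treats each column separately, it ignores relationships between columns. Consider a simulated example with two strongly correlated variables, each missing 50 values.
# Simulate 500 points from a bivariate normal with correlation 0.9
Sigma <- matrix(c(1, 0.9, 0.9, 1), 2)
x <- MASS::mvrnorm(500, c(0, 0), Sigma = Sigma) %>%
  as.data.frame()
# Set 50 randomly chosen entries in each column to NA
for (j in seq_len(2)) {
  x[sample(nrow(x), 50), j] <- NA
}
# geom_miss_point draws points with a missing coordinate just below the observed range
ggplot(x) +
  geom_miss_point(aes(x = V1, y = V2), jitter = 0) +
  scale_color_brewer(palette = "Set2") +
  coord_fixed()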
Asked to fill in the points that are missing one coordinate, you would probably project them onto the line that goes through the bulk of the fully observed data.
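One way to make this projection concrete is to regress one variable on the other and use the fitted line to fill in the gaps. The sketch below is a simplification under stated assumptions: the data are simulated with base R rather than mvrnorm, only V2 has missing values, and the seed and the variable names (truth, miss, rmse_lm, rmse_median) are chosen for illustration. A regression line is not exactly the line a person would draw by eye, but it captures the same idea.

```r
set.seed(1) # arbitrary seed, only for reproducibility
n <- 500
V1 <- rnorm(n)
V2 <- 0.9 * V1 + sqrt(1 - 0.9^2) * rnorm(n) # correlation with V1 is about 0.9
x <- data.frame(V1 = V1, V2 = V2)

truth <- x$V2        # keep the true values so the imputations can be scored
miss <- sample(n, 50)
x$V2[miss] <- NA

# Fit on the complete rows (lm drops rows with NA by default), then predict the gaps
fit <- lm(V2 ~ V1, data = x)
x$V2_imputed <- x$V2
x$V2_imputed[miss] <- predict(fit, newdata = x[miss, ])

# Compare against filling every gap with the column median
median_fill <- median(x$V2, na.rm = TRUE)
rmse_lm <- sqrt(mean((x$V2_imputed[miss] - truth[miss])^2))
rmse_median <- sqrt(mean((median_fill - truth[miss])^2))
c(lm = rmse_lm, median = rmse_median)
```

Because V1 carries most of the information about V2 in this simulation, the regression-based fill should land much closer to the held-out values than the median does.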
This idea can be applied to the airquality dataset: the multivariate relationship between ozone, temperature, and wind is used to fill in missing values for ozone.
library(simputation) # provides impute_lm
aq_imputed <- airquality %>%
  bind_shadow() %>%
  as.data.frame() %>%
  impute_lm(Ozone ~ Temp + Wind)
# Observed and imputed values drawn alike
ggplot(aq_imputed) +
  geom_point(aes(x = Temp, y = Ozone))
# Color by the shadow column to show which ozone values were imputed
ggplot(aq_imputed) +
  geom_point(aes(x = Temp, y = Ozone, col = Ozone_NA)) +
  scale_color_brewer(palette = "Set2")
Finally, let’s ask: why are the data missing in the first place? A natural idea is to look for characteristics of observations that predict whether a particular field is missing. For example, maybe respondents within a given age group always leave a question blank. To this end, a well-chosen plot can be very suggestive.
We can illustrate this idea with the airquality data. Faceting the measurements by month suggests that the ozone sensor might have broken down in June. When there are many possible variables to compare against, a model can be especially helpful for guiding the search for informative plots. We’ll develop this idea further two lectures from now, when we look at how mutual information is used to guide the visualization search in Profiler.
library(rpart)
# Fit a regression tree predicting each row's proportion of missing values
rpart_model <- airquality %>%
  add_prop_miss() %>%
  rpart(prop_miss_all ~ ., data = .)
rpart_model$variable.importance
     Month       Temp    Solar.R        Day       Wind      Ozone
0.20606801 0.18878180 0.13064962 0.09943322 0.07316449 0.00638807
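The response in the tree is a per-row summary of missingness. Assuming that prop_miss_all holds the fraction of columns that are missing in each row, a base R approximation looks like the sketch below; the variable name prop_miss is chosen here for illustration and is not part of naniar.

```r
# Per-row proportion of missing entries (assumed to mirror prop_miss_all)
prop_miss <- rowMeans(is.na(airquality))

# How many rows have no, one, or several missing entries
table(round(prop_miss, 3))

# Average row-level missingness within each month
tapply(prop_miss, airquality$Month, mean)
```

Summaries like the last line are a quick check on whether a variable ranked highly by the tree, such as Month, really does separate rows with and without missing values.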
ggplot(airquality) +
  geom_miss_point(aes(x = Ozone, y = Solar.R)) +
  scale_color_brewer(palette = "Set2") +
  facet_wrap(~ Month)
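The pattern in the faceted plot can also be checked numerically. The base R sketch below counts the missing ozone readings within each month; the variable names ozone_missing and ozone_missing_frac are chosen here for illustration.

```r
# Number of missing Ozone readings in each month (months are coded 5 through 9)
ozone_missing <- tapply(is.na(airquality$Ozone), airquality$Month, sum)
ozone_missing

# The same counts as a fraction of the days recorded in each month
ozone_missing_frac <- tapply(is.na(airquality$Ozone), airquality$Month, mean)
round(ozone_missing_frac, 2)
```

A table like this complements the plot: the plot suggests where to look, and the counts confirm how concentrated the missingness is.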