Missing Data (Part 2)

A deeper look at missing data, imputation, and characterization.

Author: Kris Sankaran

Affiliation: UW Madison

Published: Feb. 22, 2021

Reading, Recording, Rmarkdown

library("MASS")
library("dplyr")
library("ggplot2")
library("naniar")
library("rpart")
library("simputation")

The previous notes described how visualization can help quantify how much missingness is present, and where it occurs. Here, we will explore how visualization is also helpful in efforts to impute and characterize missing values.
For example, visualization is helpful for understanding the results of imputation algorithms. Before describing this, it will be helpful to have a crash course on missing data imputation.
Imputation algorithms try to replace all the missing values with plausible values. We might do this because we don’t want to discard every observation that has a missing value, yet our usual models might throw errors if we provide data with missing values.
Median imputation replaces each missing value in a given column with the median of the observed values in that column. It works one column at a time.

x <- data.frame(value = c(rnorm(450), rep(NA, 50))) %>%
  sample_frac(1) %>% # randomly reorder
  bind_shadow() %>% # create column checking if missing
  mutate(imputed = naniar::impute_median(value))

x

# A tibble: 500 x 3
     value value_NA imputed
     <dbl> <fct>      <dbl>
 1 -1.01   !NA      -1.01  
 2 -0.0649 !NA      -0.0649
 3 -0.600  !NA      -0.600 
 4  0.468  !NA       0.468 
 5 -0.0619 !NA      -0.0619
 6 -1.90   !NA      -1.90  
 7 -0.561  !NA      -0.561 
 8 -0.677  !NA      -0.677 
 9  0.226  !NA       0.226 
10 -0.894  !NA      -0.894 
# … with 490 more rows

ggplot(x) +
  geom_histogram(aes(x = imputed, fill = value_NA)) +
  scale_fill_brewer(palette = "Set2")
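
To make the "one column at a time" behavior concrete, here is a minimal base R sketch of median imputation done by hand. The helper name `impute_median_by_hand` and the toy data frame `df` are made up for illustration; this is not how `naniar::impute_median` is necessarily implemented, only the same idea written out.

```r
set.seed(1)

# Toy data with missing values in two columns (illustrative only)
df <- data.frame(
  a = c(rnorm(8), NA, NA),
  b = c(NA, runif(9))
)

# Hypothetical helper: fill the NAs in one vector with its observed median
impute_median_by_hand <- function(v) {
  v[is.na(v)] <- median(v, na.rm = TRUE)
  v
}

# Apply the helper to each column separately; no column informs another
df_imputed <- as.data.frame(lapply(df, impute_median_by_hand))
colSums(is.na(df_imputed))
```

Because every missing entry in a column receives the same value, the imputed points pile up in a single histogram bin, which is what the plot above shows.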

We can be cleverer, though, if we imagine that multiple columns are related in some way. Consider the scatterplot below. When both values are present, the point is drawn in the interior of the plot. When a value is missing in one of the columns, we plot the one dimension that we do have along the corresponding edge. How would you impute the second column if you knew its value in the first column was low?

Sigma <- matrix(c(1, 0.9, 0.9, 1), 2)
x <- mvrnorm(500, c(0, 0), Sigma = Sigma) %>%
  as.data.frame()
for (j in seq_len(2)) {
  x[sample(nrow(x), 50), j] <- NA
}

ggplot(x) +
  geom_miss_point(aes(x = V1, y = V2), jitter = 0) +
  scale_color_brewer(palette = "Set2") +
  coord_fixed()

You would probably project the missing values onto the line that goes through the bulk of the fully observed data.
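
One way to sketch this projection is to fit a line to the complete cases and read off predictions for the incomplete ones. The example below is a simplified illustration in base R, under the assumption that only `V2` has missing values; the simulated data frame `x_sim` and the column `V2_imputed` are names introduced here, not part of the original notes.

```r
set.seed(1)
n <- 500

# Two correlated columns (correlation roughly 0.9), simulated in base R
V1 <- rnorm(n)
V2 <- 0.9 * V1 + sqrt(1 - 0.9^2) * rnorm(n)
x_sim <- data.frame(V1 = V1, V2 = V2)

# Hide 50 values of V2 only (a simplification of the setting above)
x_sim$V2[sample(n, 50)] <- NA

# lm() drops rows with missing values by default, so this fits the
# line through the fully observed points
fit <- lm(V2 ~ V1, data = x_sim)

# Fill each missing V2 with the fitted line evaluated at its V1
missing_v2 <- is.na(x_sim$V2)
x_sim$V2_imputed <- x_sim$V2
x_sim$V2_imputed[missing_v2] <- predict(fit, newdata = x_sim[missing_v2, ])

coef(fit)
```

Imputed points fall exactly on the fitted line, so they show less scatter than the observed data. That is worth keeping in mind when reading plots of imputed values.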

Multivariate imputation methods formalize this intuition. They make use of multivariate relationships between columns to guess plausible values for missing data. This is an example of regression-based imputation on the airquality dataset: a linear model relating ozone to temperature and wind is used to fill in missing values for ozone.

aq_imputed <- airquality %>%
  bind_shadow() %>%
  as.data.frame() %>%
  impute_lm(Ozone ~ Temp + Wind)

ggplot(aq_imputed) + 
  geom_point(aes(x = Temp, y = Ozone)) +
  scale_color_brewer(palette = "Set2")

But how can we tell if an imputation method is effective? We can plot the data, making sure to distinguish between observed and imputed values.

ggplot(aq_imputed) + 
  geom_point(aes(x = Temp, y = Ozone, col = Ozone_NA)) +
  scale_color_brewer(palette = "Set2")
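
A complementary numerical check, not part of the original notes, is to hide some values that were actually observed, impute them, and compare against the truth. The sketch below does this in base R with the same `Ozone ~ Temp + Wind` model; the hold-out size of 20 and the median baseline are arbitrary choices made for illustration.

```r
set.seed(1)
aq <- airquality

# Hold out 20 Ozone values that were actually observed
observed <- which(!is.na(aq$Ozone))
held_out <- sample(observed, 20)
truth <- aq$Ozone[held_out]
aq$Ozone[held_out] <- NA

# Refit on the remaining data and predict the held-out entries
# (Temp and Wind have no missing values in airquality)
fit <- lm(Ozone ~ Temp + Wind, data = aq)
pred <- predict(fit, newdata = aq[held_out, ])

# Compare against a median-imputation baseline on the same entries
rmse_lm <- sqrt(mean((pred - truth)^2))
rmse_median <- sqrt(mean((median(aq$Ozone, na.rm = TRUE) - truth)^2))
c(lm = rmse_lm, median = rmse_median)
```

This only measures accuracy on values that happened to be observed. If the truly missing values differ systematically from these, the error estimate may be optimistic.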

Finally, let’s ask: why are the data missing in the first place? A natural idea is to try to find characteristics of observations that are predictive of their having missing values for some particular field. For example, maybe respondents within a given age group always leave a question blank. To this end, a well-chosen plot can be very suggestive.
We can illustrate this idea with the airquality data. It seems that the sensor might have broken down in June. When there are very many possible variables to compare against, a model can be especially helpful for guiding the search for informative plots. We’ll develop this idea further two lectures from now, when we look at how mutual information is used to guide the visualization search in Profiler.

rpart_model <- airquality %>%
  add_prop_miss() %>%
  rpart(prop_miss_all ~ ., data = .)
rpart_model$variable.importance

     Month       Temp    Solar.R        Day       Wind      Ozone 
0.20606801 0.18878180 0.13064962 0.09943322 0.07316449 0.00638807

ggplot(airquality) +
  geom_miss_point(aes(x = Ozone, y = Solar.R)) +
  scale_color_brewer(palette = "Set2") +
  facet_wrap(~ Month)
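
The pattern suggested by the faceted plot can also be checked with a quick tabulation. This is a minimal base R sketch, with variable names chosen for illustration, that computes the fraction of missing `Ozone` readings within each month.

```r
# TRUE where the Ozone reading is missing
ozone_missing <- is.na(airquality$Ozone)

# Proportion of missing Ozone readings per month (5 = May, ..., 9 = September)
prop_by_month <- tapply(ozone_missing, airquality$Month, mean)
round(prop_by_month, 2)

# Month with the highest proportion of missing Ozone readings
names(which.max(prop_by_month))
```

A month that stands out here is the kind of characteristic the tree model above picks up on when it ranks `Month` as important.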

If we can predict missingness well from the other variables, then the data are not missing completely at random. The bad news is that more specialized imputation strategies may be necessary. The good news is that we may be able to actually characterize the mechanism behind the missing data.
