Visualization for Model Building

The relationship between exploratory analysis and model development.

Kris Sankaran (UW Madison)
2021-04-14

Reading, Recording, Rmarkdown

library("rstan")
library("dplyr")
library("ggplot2")
library("tidyr")
library("purrr")
theme479 <- theme_minimal() + 
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "#f7f7f7"),
    panel.border = element_rect(fill = NA, color = "#0c0c0c", size = 0.6),
    legend.position = "bottom"
  )
theme_set(theme479)
  1. Exploratory data analysis and model building complement each other well. In practical problems, visualization can guide us towards more plausible models.

  2. We rarely know the exact form of a model in advance, but usually have a few reasonable candidates. Exploratory analysis can rule out some candidates and suggest new, previously unanticipated, relationships.

  3. We will illustrate these ideas using an example. A researcher is interested in monitoring the level of PM2.5, a type of small air particlute that can be bad for public health. High quality data are available from weather stations scattered around the world, but their data only apply locally. On the other hand, low quality data, available from satellites, are available everywhere. A model is desired that uses the weather station measurements to calibrate the satellite data. If it works well, it could be used to monitor PM2.5 levels at global scale.

f <- tempfile()
download.file("https://github.com/jgabry/bayes-vis-paper/blob/master/bayes-vis.RData?raw=true", f)
GM <- get(load(f))
GM@data <- GM@data %>% 
  mutate(
    log_pm25 = log(pm25), 
    log_sat = log(sat_2014)
  )
  1. The simplest model simply fits \(\text{station} = a + b \times \text{satellite}\) at locations where they are both available. This model was used in practice by the Global Burden of Disease project until 2016.
ggplot(GM@data, aes(log_sat, log_pm25)) +
  geom_point(aes(col = super_region_name), size = 0.8, alpha = 0.7) +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "log(satellite)",
    y = "log(ground station)",
    col = "WHO Region"
  ) +
  coord_fixed()
The relationship between satellite and ground station estimates of PM2.5.

Figure 1: The relationship between satellite and ground station estimates of PM2.5.

  1. However, when we plot these two variables against one another, we notice that there is still quite a bit of heterogeneity. The residuals are large — what features might be correlated with these residuals, which if included, would improve the model fit?
    • The error \(\epsilon_{i}\) in a model \(y_i = f\left(x_{i}\right) + \epsilon_{i}\) represents out our ignorance of the myriad of unmeasured factors that determine the relationship between \(x\) and \(y\).
    • For example, desert sand is known to increase PM2.5, but it is not visible from space. The residuals are probably correlated with whether the model is in a desert area (we underpredict PM2.5 in deserts), and so would be improved if we included a term with this feature.
  2. One hypothesis is that country region is an important factor. Below, we fit regression lines separately for different country super-regions, as specified by the WHO. The fact that the slopes are not the same in each region means that we should modify our model to have a different slope in each region1.
ggplot(GM@data, aes(log_sat, log_pm25)) +
  geom_point(aes(col = super_region_name), size = 0.4, alpha = 0.7) +
  geom_smooth(aes(col = super_region_name), method = "lm", se = F, size = 2) +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "log(satellite)",
    y = "log(ground station)",
    col = "WHO Region"
  ) +
  coord_fixed()
The relationship between these variables is not the same across regions.

Figure 2: The relationship between these variables is not the same across regions.

  1. The WHO categorizations are somewhat arbitrary. Maybe there are better country groupings, tailored specifically to the PM2.5 problem? One idea is to cluster the ground stations based on PM2.5 level and use these clusters as a different region grouping.
average <- GM@data %>% 
  group_by(iso3) %>% 
  summarise(pm25 = mean(pm25))

clust <- dist(average) %>%
  hclust() %>%
  cutree(k = 6)

GM@data$cluster_region <- map_chr(GM@data$iso3, ~ clust[which(average$iso3 == .)])
ggplot(GM@data, aes(log_sat, log_pm25)) +
  geom_point(aes(col = cluster_region), size = 0.4, alpha = 0.7) +
  geom_smooth(aes(col = cluster_region), method = "lm", se = F, size = 2) +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "log(satellite)",
    y = "log(ground station)",
    col = "Cluster Region"
  ) +
  coord_fixed()
We can define clusters of regions on our own, using a hierarchical clustering.

Figure 3: We can define clusters of regions on our own, using a hierarchical clustering.

  1. We now have a « network » of models. We’re going to want more refined tools for distinguishing between them. This is the subject of the next two lectures.

  1. Viewed differently, this is like adding an interaction between the satellite measurements and WHO region.↩︎