The relationship between exploratory analysis and model development.
library(rstan)
library(tidyverse)
theme479 <- theme_minimal() + 
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "#f7f7f7"),
    panel.border = element_rect(fill = NA, color = "#0c0c0c", size = 0.6),
    legend.position = "bottom"
  )
theme_set(theme479)Exploratory data analysis and model building complement each other well. In practical problems, visualization can guide us towards more plausible models.
We rarely know the exact form of a model in advance, but usually have a few reasonable candidates. Exploratory analysis can rule out some candidates and suggest new, previously unanticipated, relationships.
We will illustrate these ideas using an example. A researcher is interested in monitoring the level of PM2.5, a type of small air particlute that can be bad for public health. High quality data are available from weather stations scattered around the world, but their data only apply locally. On the other hand, low quality data, available from satellites, are available everywhere. A model is desired that uses the weather station measurements to calibrate the satellite data. If it works well, it could be used to monitor PM2.5 levels at global scale.
ggplot(GM@data, aes(log_sat, log_pm25)) +
  geom_point(aes(col = super_region_name), size = 0.8, alpha = 0.7) +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "log(satellite)",
    y = "log(ground station)",
    col = "WHO Region"
  ) +
  coord_fixed() 
Figure 1: The relationship between satellite and ground station estimates of PM2.5.
ggplot(GM@data, aes(log_sat, log_pm25)) +
  geom_point(aes(col = super_region_name), size = 0.4, alpha = 0.7) +
  geom_smooth(aes(col = super_region_name), method = "lm", se = F, size = 2) +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "log(satellite)",
    y = "log(ground station)",
    col = "WHO Region"
  ) +
  coord_fixed() 
Figure 2: The relationship between these variables is not the same across regions.
average <- GM@data %>% 
  group_by(iso3) %>% 
  summarise(pm25 = mean(pm25))
clust <- dist(average) %>%
  hclust() %>%
  cutree(k = 6)
GM@data$cluster_region <- map_chr(GM@data$iso3, ~ clust[which(average$iso3 == .)])
ggplot(GM@data, aes(log_sat, log_pm25)) +
  geom_point(aes(col = cluster_region), size = 0.4, alpha = 0.7) +
  geom_smooth(aes(col = cluster_region), method = "lm", se = F, size = 2) +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "log(satellite)",
    y = "log(ground station)",
    col = "Cluster Region"
  ) +
  coord_fixed() 
Figure 3: We can define clusters of regions on our own, using a hierarchical clustering.
For attribution, please cite this work as
Sankaran (2024, Jan. 7). STAT 436 (Spring 2024): Visualization for Model Building. Retrieved from https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week12-3/
BibTeX citation
@misc{sankaran2024visualization,
  author = {Sankaran, Kris},
  title = {STAT 436 (Spring 2024): Visualization for Model Building},
  url = {https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week12-3/},
  year = {2024}
}