The relationship between exploratory analysis and model development.
library(rstan)
library(tidyverse)

# Shared plot theme for every figure in this post
theme479 <- theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "#f7f7f7"),
    # `linewidth` replaces the deprecated `size` argument (ggplot2 >= 3.4.0)
    panel.border = element_rect(fill = NA, color = "#0c0c0c", linewidth = 0.6),
    legend.position = "bottom"
  )
theme_set(theme479)
Exploratory data analysis and model building complement each other well. In practical problems, visualization can guide us towards more plausible models.
We rarely know the exact form of a model in advance, but usually have a few reasonable candidates. Exploratory analysis can rule out some candidates and suggest new, previously unanticipated, relationships.
We will illustrate these ideas using an example. A researcher is interested in monitoring the level of PM2.5, a type of small air particulate that can be bad for public health. High-quality data are available from weather stations scattered around the world, but these measurements only apply locally. Low-quality data from satellites, on the other hand, are available everywhere. The goal is a model that uses the weather station measurements to calibrate the satellite data. If it works well, it could be used to monitor PM2.5 levels at a global scale.
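To make the calibration idea concrete, here is a minimal sketch on simulated data. The data frame `sim`, the coefficient values, and the choice of an ordinary least squares fit are all assumptions made for illustration; they are not the real measurements or necessarily the model used for them. The column names `log_sat` and `log_pm25` mirror those in the plots that follow.

```r
# Simulated stand-in for the real data: every value below is hypothetical
set.seed(479)
n <- 200
sim <- data.frame(log_sat = rnorm(n, mean = 3, sd = 0.8))

# Assume ground readings track satellite readings linearly on the log scale
sim$log_pm25 <- 0.5 + 0.9 * sim$log_sat + rnorm(n, sd = 0.3)

# Calibration model: regress ground station readings on satellite readings
fit <- lm(log_pm25 ~ log_sat, data = sim)
coef(fit)

# Apply the fit at locations that only have satellite coverage,
# back-transforming from the log scale (ignoring retransformation bias)
new_sites <- data.frame(log_sat = c(2, 3, 4))
new_sites$pred_pm25 <- exp(predict(fit, newdata = new_sites))
new_sites
```

A single global line like this is only one candidate. Whether it is adequate is exactly the kind of question that plotting the data can help answer.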
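# Satellite vs. ground station readings on the log scale, colored by region.
# `GM` is assumed to be loaded already; its `data` slot holds the station-level data frame.
ggplot(GM@data, aes(log_sat, log_pm25)) +
  geom_point(aes(col = super_region_name), size = 0.8, alpha = 0.7) +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "log(satellite)",
    y = "log(ground station)",
    col = "WHO Region"
  ) +
  coord_fixed()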
# Same scatterplot, with a separate linear trend overlaid for each region
ggplot(GM@data, aes(log_sat, log_pm25)) +
  geom_point(aes(col = super_region_name), size = 0.4, alpha = 0.7) +
  # `linewidth` sets line thickness in ggplot2 >= 3.4.0 (previously `size`)
  geom_smooth(aes(col = super_region_name), method = "lm", se = FALSE, linewidth = 2) +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "log(satellite)",
    y = "log(ground station)",
    col = "WHO Region"
  ) +
  coord_fixed()
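# Average PM2.5 per country (one row per iso3 code)
average <- GM@data %>%
  group_by(iso3) %>%
  summarise(pm25 = mean(pm25))
# Hierarchically cluster countries on average PM2.5 and cut the tree into 6 groups.
# Only the numeric column is passed, so the iso3 codes stay out of the distances.
clust <- dist(average$pm25) %>%
  hclust() %>%
  cutree(k = 6)
# Look up each station's country cluster; character labels map to a discrete color scale
GM@data$cluster_region <- as.character(clust[match(GM@data$iso3, average$iso3)])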
# Linear trends again, this time grouped by the data-driven clusters
ggplot(GM@data, aes(log_sat, log_pm25)) +
  geom_point(aes(col = cluster_region), size = 0.4, alpha = 0.7) +
  # `linewidth` sets line thickness in ggplot2 >= 3.4.0 (previously `size`)
  geom_smooth(aes(col = cluster_region), method = "lm", se = FALSE, linewidth = 2) +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "log(satellite)",
    y = "log(ground station)",
    col = "Cluster Region"
  ) +
  coord_fixed()
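Group-specific trend lines like those drawn above suggest one way to turn a visual impression into a comparison between candidate models. The sketch below is an illustration on simulated data, not the analysis of the real measurements: the three groups, their slopes, and the use of `lm()` with an interaction term are all assumptions chosen for the example.

```r
# Simulated data with three hypothetical groups whose slopes differ
set.seed(436)
slopes <- c(A = 0.4, B = 0.9, C = 1.3)  # made-up group-specific slopes
sim <- data.frame(
  region = rep(names(slopes), each = 100),
  log_sat = rnorm(300, mean = 3, sd = 0.8)
)
sim$log_pm25 <- 0.5 + unname(slopes[sim$region]) * sim$log_sat +
  rnorm(300, sd = 0.3)

# Candidate 1: one calibration line shared by all groups
pooled <- lm(log_pm25 ~ log_sat, data = sim)

# Candidate 2: a separate intercept and slope for each group
by_region <- lm(log_pm25 ~ log_sat * region, data = sim)

# Compare the nested models with an F-test and with AIC
anova(pooled, by_region)
AIC(pooled, by_region)
```

When the groups really do differ, as they do by construction here, the group-specific model should be preferred by both criteria. On real data the grouping itself (predefined regions versus clusters) is another choice that plots can inform.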
For attribution, please cite this work as
Sankaran (2024, Jan. 7). STAT 436 (Spring 2024): Visualization for Model Building. Retrieved from https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week12-3/
BibTeX citation
@misc{sankaran2024visualization,
  author = {Sankaran, Kris},
  title = {STAT 436 (Spring 2024): Visualization for Model Building},
  url = {https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week12-3/},
  year = {2024}
}