The relationship between exploratory analysis and model development.
library("rstan")
library("dplyr")
library("ggplot2")
library("tidyr")
library("purrr")
theme479 <- theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "#f7f7f7"),
    panel.border = element_rect(fill = NA, color = "#0c0c0c", size = 0.6),
    legend.position = "bottom"
  )
theme_set(theme479)
Exploratory data analysis and model building complement each other well. In practical problems, visualization can guide us towards more plausible models.
We rarely know the exact form of a model in advance, but usually have a few reasonable candidates. Exploratory analysis can rule out some candidates and suggest new, previously unanticipated, relationships.
We will illustrate these ideas using an example. A researcher is interested in monitoring the level of PM2.5, a type of small air particulate that can be bad for public health. High-quality data are available from weather stations scattered around the world, but their measurements only apply locally. Low-quality data from satellites, on the other hand, are available everywhere. The goal is a model that uses the weather station measurements to calibrate the satellite data. If it works well, it could be used to monitor PM2.5 levels at a global scale.
ggplot(GM@data, aes(log_sat, log_pm25)) +
  geom_point(aes(col = super_region_name), size = 0.8, alpha = 0.7) +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "log(satellite)",
    y = "log(ground station)",
    col = "WHO Region"
  ) +
  coord_fixed()
# Same scatterplot, with a separate least-squares line for each region
ggplot(GM@data, aes(log_sat, log_pm25)) +
  geom_point(aes(col = super_region_name), size = 0.4, alpha = 0.7) +
  geom_smooth(aes(col = super_region_name), method = "lm", se = FALSE, size = 2) +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "log(satellite)",
    y = "log(ground station)",
    col = "WHO Region"
  ) +
  coord_fixed()
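# Average PM2.5 per country, then cluster countries on that average
average <- GM@data %>%
  group_by(iso3) %>%
  summarise(pm25 = mean(pm25))
# dist() expects numeric input, so cluster on the pm25 column only (not iso3)
clust <- dist(average$pm25) %>%
  hclust() %>%
  cutree(k = 6)
# cutree() returns integer labels; convert to character so map_chr() accepts them
GM@data$cluster_region <- map_chr(GM@data$iso3, ~ as.character(clust[which(average$iso3 == .)]))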
# Repeat the per-group fits, now grouping by the data-driven clusters
ggplot(GM@data, aes(log_sat, log_pm25)) +
  geom_point(aes(col = cluster_region), size = 0.4, alpha = 0.7) +
  geom_smooth(aes(col = cluster_region), method = "lm", se = FALSE, size = 2) +
  scale_color_brewer(palette = "Set2") +
  labs(
    x = "log(satellite)",
    y = "log(ground station)",
    col = "Cluster Region"
  ) +
  coord_fixed()
Viewed differently, this is like adding an interaction between the satellite measurements and WHO region.
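The interaction view can be checked numerically. The sketch below is only an illustration on simulated data, not the PM2.5 dataset: the data frame `sim`, the region labels `A`, `B`, `C`, and the "true" intercepts and slopes are all made up for the example. It fits one regression per region, then a single model with a satellite-by-region interaction, and compares the slopes.

```r
# Simulated stand-in for the calibration data (NOT the real PM2.5 measurements):
# three hypothetical regions, each with its own intercept and slope.
set.seed(479)
n <- 60
sim <- data.frame(
  region = rep(c("A", "B", "C"), each = n),
  log_sat = runif(3 * n, 1, 5)
)
true_intercept <- c(A = 0.5, B = 1.0, C = 0.2)
true_slope <- c(A = 0.6, B = 0.9, C = 1.2)
sim$log_pm25 <- unname(
  true_intercept[sim$region] + true_slope[sim$region] * sim$log_sat
) + rnorm(3 * n, sd = 0.3)

# (1) One least-squares fit per region, as in a per-group linear smooth.
separate_slopes <- sapply(
  split(sim, sim$region),
  function(d) coef(lm(log_pm25 ~ log_sat, data = d))[["log_sat"]]
)

# (2) A single model with a satellite-by-region interaction.
fit <- lm(log_pm25 ~ log_sat * region, data = sim)
b <- coef(fit)
interaction_slopes <- c(
  A = b[["log_sat"]],                           # baseline region
  B = b[["log_sat"]] + b[["log_sat:regionB"]],  # baseline + offset for B
  C = b[["log_sat"]] + b[["log_sat:regionC"]]   # baseline + offset for C
)

# Compare the two sets of slopes (all.equal allows for floating point error).
slopes_match <- isTRUE(all.equal(separate_slopes, interaction_slopes))
print(slopes_match)
```

The fitted lines coincide because the interaction model has a free intercept and slope for every region. The two approaches still differ in one respect: the single model pools the residual variance across regions, so standard errors need not match those of the separate fits.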