Visualizing and interpreting PCA.
library("ggplot2")
library("ggrepel")
library("readr")
library("stringr")
library("tidymodels")
library("tidytext")
theme479 <- theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "#f7f7f7"),
    panel.border = element_rect(fill = NA, color = "#0c0c0c", size = 0.6),
    legend.position = "bottom"
  )
theme_set(theme479)
# produced by code in previous notes
components <- read_csv("https://uwmadison.box.com/shared/static/dituepd0751qqsims22v2liukuk0v4bf.csv") %>%
  filter(component %in% str_c("PC", 1:5))
scores <- read_csv("https://uwmadison.box.com/shared/static/qisbw1an4lo8naifoxyu4dqv4bfsotcu.csv")
variances <- read_csv("https://uwmadison.box.com/shared/static/ye125xf8800zc5eh3rfeyzszagqkaswf.csv") %>%
  filter(terms == "percent variance")
ggplot(components, aes(value, terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~component, nrow = 1) +
  labs(y = NULL) +
  theme(axis.text = element_text(size = 7))
Using reorder(), or releveling the associated factor, will reorder all the facets in the same way. If we want to reorder each facet on its own, we can use the reorder_within function coupled with scale_*_reordered, both from the tidytext package.
components_ <- components %>%
  filter(component %in% str_c("PC", 1:3)) %>%
  mutate(terms = reorder_within(terms, abs(value), component))
ggplot(components_, aes(value, terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ component, scales = "free_y") +
  scale_y_reordered() +
  labs(y = NULL) +
  theme(axis.text = element_text(size = 7))
ggplot(scores, aes(PC1, PC2, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 1.5) +
  geom_text_repel(size = 3) +
  coord_fixed(sqrt(variances$value[2] / variances$value[1])) # rescale axes to reflect variance
For example, El Nino has a high value for PC1, which means it uses large amounts of the ingredients with positive loadings on PC1 (like simple syrup) and small amounts of those with negative loadings (like powdered sugar). Similarly, since \(\varphi_{2}\) puts high positive weight on vermouth-related variables, the H. P. W. Cocktail, which scores highly on PC2, has many vermouth-related ingredients.
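This kind of reading can be checked directly against the loadings. A minimal sketch, assuming the components data frame loaded above, lists the ingredients with the largest absolute loadings on PC1, most positive first:
# ingredients with the largest absolute loadings on PC1
components %>%
  filter(component == "PC1") %>%
  slice_max(abs(value), n = 10) %>%
  arrange(desc(value))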
In practice, it will often be important to visualize several pairs of PC dimensions against one another, not just the top 2.
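A minimal sketch of one such view, assuming the scores data frame read above includes a PC3 column (step_pca keeps five components by default), repeats the plot above with a different pair of dimensions:
# PC1 against PC3, with axes rescaled by their variance share as before
ggplot(scores, aes(PC1, PC3, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 1.5) +
  geom_text_repel(size = 3) +
  coord_fixed(sqrt(variances$value[3] / variances$value[1]))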
Let’s examine the original code in a little more detail. We are using tidymodels, a collection of packages that decouples the definition of a data pipeline from its execution. This compartmentalization makes pipelines easier to design and to reuse across settings.
pca_rec <- recipe(~., data = cocktails_df) %>%
  update_role(name, category, new_role = "id") %>%  # keep as identifiers, not predictors
  step_normalize(all_predictors()) %>%              # center and scale each ingredient column
  step_pca(all_predictors())                        # rotate onto the principal components
pca_prep <- prep(pca_rec)                           # estimate the normalization and PCA from the data
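The components, scores, and variances read from the links at the top were produced this way in the previous notes. A rough sketch of how they can be pulled from the prepared recipe, assuming the recipe defined here (the PCA is step 2), is:
# loadings for each (ingredient, component) pair
components_from_prep <- tidy(pca_prep, 2)
# variance explained by each component, including "percent variance" rows
variances_from_prep <- tidy(pca_prep, 2, type = "variance")
# id columns plus PC scores for every cocktail
scores_from_prep <- juice(pca_prep)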
# split name and category out of the data frame
pca_result <- cocktails_df %>%
  select(-name, -category) %>%
  scale() %>%
  princomp()
# join them back into the PCA result
metadata <- cocktails_df %>%
  select(name, category)
scores_direct <- cbind(metadata, pca_result$scores)
ggplot(scores_direct, aes(Comp.1, Comp.2, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 1.5) +
  geom_text_repel(size = 3) +
  coord_fixed(sqrt(variances$value[2] / variances$value[1])) # rescale axes to reflect variance
The equivalent tidymodels implementation handles the difference between supplementary and modeling data less bluntly, setting the name and category variables to id roles, so that all_predictors() knows to skip them.
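One quick way to confirm the role assignments, using the recipe object defined above, is to print the recipe's variable summary, which lists each variable with its type and role:
# name and category should show role "id"; everything else "predictor"
summary(pca_rec)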
We conclude with some characteristics of PCA, which can guide the choice between alternative dimensionality reduction methods.