Visualizing and interpreting PCA.
library(tidyverse)
library(ggrepel)
library(tidymodels)
library(tidytext)
theme479 <- theme_minimal() +
  theme(
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "#f7f7f7"),
    panel.border = element_rect(fill = NA, color = "#0c0c0c", size = 0.6),
    legend.position = "bottom"
  )
theme_set(theme479)
# produced by code in previous notes
components <- read_csv("https://uwmadison.box.com/shared/static/dituepd0751qqsims22v2liukuk0v4bf.csv") %>%
  filter(component %in% str_c("PC", 1:5))
scores <- read_csv("https://uwmadison.box.com/shared/static/qisbw1an4lo8naifoxyu4dqv4bfsotcu.csv")
variances <- read_csv("https://uwmadison.box.com/shared/static/ye125xf8800zc5eh3rfeyzszagqkaswf.csv") %>%
  filter(terms == "percent variance")
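Since the variances are loaded anyway, it can help to glance at how much variance each component explains before interpreting any loadings. A minimal sketch, assuming the variances data frame has component and value columns as in the tidied PCA output from the previous notes:

```r
# Sketch: bar plot of the percent variance explained by each component.
# Assumes `variances` has `component` and `value` columns, as loaded above.
ggplot(variances, aes(component, value)) +
  geom_col() +
  labs(x = NULL, y = "% variance explained")
```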
ggplot(components, aes(value, terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~component, nrow = 1) +
  labs(y = NULL) +
  theme(axis.text = element_text(size = 7))
Using reorder(), or releveling the associated factor, will reorder all the facets in the same way. If we want to reorder each facet on its own, we can use the reorder_within function coupled with scale_*_reordered, both from the tidytext package.

components_ <- components %>%
  filter(component %in% str_c("PC", 1:3)) %>%
  mutate(terms = reorder_within(terms, abs(value), component))
ggplot(components_, aes(value, terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~component, scales = "free_y") +
  scale_y_reordered() +
  labs(y = NULL) +
  theme(axis.text = element_text(size = 7))
ggplot(scores, aes(PC1, PC2, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 1.5) +
  geom_text_repel(size = 3) + # note: check_overlap is a geom_text() argument; geom_text_repel() ignores it
  coord_fixed(sqrt(variances$value[2] / variances$value[1])) # rescale axes to reflect variance
For example, El Nino has a high value for PC1, which means it has high values of the variables weighted positively in PC1 (like simple syrup) and low values of the variables weighted negatively (like powdered sugar). Similarly, since \(\varphi_{2}\) puts high positive weight on vermouth-related variables, the H. P. W. Cocktail must have many vermouth-related ingredients.
In practice, it will often be important to visualize several pairs of PC dimensions against one another, not just the top 2.
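The same scatterplot template can be reused for later components. A sketch, assuming the scores data frame also contains PC3 and PC4 columns:

```r
# Sketch: the PC1-vs-PC2 template applied to the next pair of components.
# Assumes `scores` includes PC3 and PC4 columns from the earlier PCA fit.
ggplot(scores, aes(PC3, PC4, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 1.5) +
  geom_text_repel(size = 3) +
  coord_fixed(sqrt(variances$value[4] / variances$value[3])) # rescale axes to reflect variance
```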
Let’s examine the original code in a little more detail. We are using tidymodels, a package for decoupling the definition and execution of a data pipeline. This compartmentalization makes it easier to design pipelines and to reuse them across settings.
# split name and category out of the data frame
pca_result <- cocktails_df %>%
  select(-name, -category) %>%
  scale() %>%
  princomp()

# join them back into the PCA result
metadata <- cocktails_df %>%
  select(name, category)
scores_direct <- cbind(metadata, pca_result$scores)
ggplot(scores_direct, aes(Comp.1, Comp.2, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 1.5) +
  geom_text_repel(size = 3) + # note: check_overlap is a geom_text() argument; geom_text_repel() ignores it
  coord_fixed(sqrt(variances$value[2] / variances$value[1])) # rescale axes to reflect variance
The equivalent tidymodels implementation handles the difference between supplementary and modeling data less bluntly, setting the name and category variables to id roles, so that all_predictors() knows to skip them.
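As a reminder, a sketch of how those id roles are declared with the recipes package (the original pipeline was defined in the previous notes):

```r
# Sketch: name and category are given "id" roles, so all_predictors()
# selects only the ingredient columns for normalization and PCA.
pca_rec <- recipe(~., data = cocktails_df) %>%
  update_role(name, category, new_role = "id") %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors())
pca_prep <- prep(pca_rec)
```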
We conclude with some characteristics of PCA, which can guide the choice between alternative dimensionality reduction methods.
For attribution, please cite this work as
Sankaran (2024, March 31). STAT 436 (Spring 2024): Principal Components Analysis II. Retrieved from https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week10-3/
BibTeX citation
@misc{sankaran2024principal,
  author = {Sankaran, Kris},
  title = {STAT 436 (Spring 2024): Principal Components Analysis II},
  url = {https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week10-3/},
  year = {2024}
}