Visualizing and interpreting PCA.
library("ggplot2")
library("ggrepel")
library("readr")
library("stringr")
library("tidymodels")
library("tidytext")
theme479 <- theme_minimal() +
theme(
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "#f7f7f7"),
panel.border = element_rect(fill = NA, color = "#0c0c0c", size = 0.6),
legend.position = "bottom"
)
theme_set(theme479)
# produced by code in previous notes
components <- read_csv("https://uwmadison.box.com/shared/static/dituepd0751qqsims22v2liukuk0v4bf.csv") %>%
filter(component %in% str_c("PC", 1:5))
scores <- read_csv("https://uwmadison.box.com/shared/static/qisbw1an4lo8naifoxyu4dqv4bfsotcu.csv")
variances <- read_csv("https://uwmadison.box.com/shared/static/ye125xf8800zc5eh3rfeyzszagqkaswf.csv") %>%
filter(terms == "percent variance")
Figure 1: Proportion of variance explained by each component in the PCA.
ggplot(components, aes(value, terms)) +
geom_col(show.legend = FALSE) +
facet_wrap(~component, nrow = 1) +
labs(y = NULL) +
theme(axis.text = element_text(size = 7))
Figure 2: The top 5 principal components associated with the cocktails dataset.
reorder()
or releveling the associated factor, will reorder all the facets in the same way. If we want to reorder each facet on its own, we can use the reorder_within
function coupled with scale_*_reordered
, both from the tidytext package.components_ <- components %>%
filter(component %in% str_c("PC", 1:3)) %>%
mutate(terms = reorder_within(terms, abs(value), component))
ggplot(components_, aes(value, terms)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ component, scales = "free_y") +
scale_y_reordered() +
labs(y = NULL) +
theme(axis.text = element_text(size = 7))
Figure 3: The top 3 principal components, with defining variables sorted by the magnitude of their coefficient.
ggplot(scores, aes(PC1, PC2, label = name)) +
geom_point(aes(color = category), alpha = 0.7, size = 1.5) +
geom_text_repel(check_overlap = TRUE, size = 3) +
coord_fixed(sqrt(variances$value[2] / variances$value[1])) # rescale axes to reflect variance
Figure 4: The scores associated with the cocktails dataset.
For example, El Nino has high value for PC1, which means it has a high value of variables that are positive for PC1 (like simple syrup) and low value for those variables that are negative (like powdered sugar). Similarly, since \(\varphi_{2}\) puts high positive weight on vermouth-related variables, so H. P. W. Cocktail has many vermouth-related ingredients.
In practice, it will often be important to visualize several pairs of PC dimensions against one another, not just the top 2.
Let’s examine the original code in a little more detail. We are using tidymodels, which is a package for decoupling the definition and execution of a data pipeline. This compartmentalization makes it easier to design and reuse across settings.
pca_rec <- recipe(~., data = cocktails_df) %>%
update_role(name, category, new_role = "id") %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors())
pca_prep <- prep(pca_rec)
# split name and category out of the data frame
pca_result <- cocktails_df %>%
select(-name, -category) %>%
scale() %>%
princomp()
# join them back into the PCA result
metadata <- cocktails_df %>%
select(name, category)
scores_direct <- cbind(metadata, pca_result$scores)
ggplot(scores_direct, aes(Comp.1, Comp.2, label = name)) +
geom_point(aes(color = category), alpha = 0.7, size = 1.5) +
geom_text_repel(check_overlap = TRUE, size = 3) +
coord_fixed(sqrt(variances$value[2] / variances$value[1])) # rescale axes to reflect variance
Figure 5: A plot of the PCA scores made without using tidymodels.
The equivalent tidymodels implementation handles the difference between supplementary and modeling data less bluntly, setting the name
and category
variables to id
roles, so that all_predictors()
knows to skip them.
We conclude with some characteristics of PCA, which can guide the choice between alternative dimensionality reduction methods.