Linear dimensionality reduction using PCA.
In our last notes, we saw how we could organize a collection of images based on average pixel brightness. We can think of average pixel brightness as a derived feature that can be used to build a low-dimensional map.
We can partially automate the process of deriving new features. Though, in general, finding the best way to combine raw features into derived ones is a complicated problem, we can simplify things by restricting attention to derived features that
- are linear combinations of the raw features,
- are orthogonal to one another, and
- have high variance.
Restricting to linear combinations allows for an analytical solution. We will relax this requirement when discussing UMAP.
Orthogonality means that the derived features will be uncorrelated with one another. This is a nice property, because it would be wasteful if features were redundant.
High variance is desirable because it means we preserve more of the essential structure of the underlying data. For example, a 2D view of a 3D object taken along a low-variance direction can be hard to recognize, while an alternative view that captures more of the variance makes the object easy to identify.
Principal Components Analysis (PCA) is the optimal dimensionality reduction under these three restrictions, in the sense that it finds derived features with the highest variance. Formally, PCA finds a matrix \(\Phi \in \mathbb{R}^{D \times K}\) and a set of vectors \(z_{i} \in \mathbb{R}^{K}\) such that \(x_{i} \approx \Phi z_{i}\) for all \(i\). The columns of \(\Phi\) are called principal components, and they specify the structure of the derived linear features. The vector \(z_{i}\) is called the score of \(x_{i}\) with respect to these components. The top component explains the most variance, the second captures the next most, and so on.
For example, if one of the columns of \(\Phi\) was equal to \(\left(\frac{1}{D}, \dots, \frac{1}{D}\right)\), then that feature computes the average of all coordinates (e.g., to get average brightness), and the corresponding \(z_{i}\) would be a measure of the average brightness of sample \(i\).
Geometrically, the columns of \(\Phi\) span a \(K\)-dimensional plane that best approximates the data, and the \(z_{i}\) give the coordinates of each point after projection onto this plane.
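To make these definitions concrete, here is a minimal sketch using base R's prcomp() on simulated data. It is our own illustration (the variable names are invented) and is separate from the tidymodels workflow used below.

# Simulate a centered data matrix with n = 100 samples and D = 5 raw features.
set.seed(1)
X <- scale(matrix(rnorm(100 * 5), nrow = 100), center = TRUE, scale = FALSE)
pca <- prcomp(X, center = FALSE)
Phi <- pca$rotation[, 1:2]  # D x K matrix whose columns are the top components
Z <- pca$x[, 1:2]           # n x K matrix of scores z_i
X_hat <- Z %*% t(Phi)       # rank-K approximation, x_i ≈ Phi z_i
mean((X - X_hat)^2)         # reconstruction error; shrinks as K grows
# The 'average' feature from the example above: a column of 1/D entries.
phi_avg <- rep(1 / ncol(X), ncol(X))
z_avg <- X %*% phi_avg      # per-sample averages of the raw features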
# tidymodels provides the recipes workflow below; readr provides read_csv().
library(tidymodels)
library(readr)

cocktails_df <- read_csv("https://uwmadison.box.com/shared/static/qyqof2512qsek8fpnkqqiw3p1jb77acf.csv")
cocktails_df[, 1:6]
# A tibble: 937 × 6
   name       category light_rum lemon_juice lime_juice sweet_vermouth
   <chr>      <chr>        <dbl>       <dbl>      <dbl>          <dbl>
 1 Gauguin    Cocktai…      2           1          1              0
 2 Fort Laud… Cocktai…      1.5         0          0.25           0.5
 3 Cuban Coc… Cocktai…      2           0          0.5            0
 4 Cool Carl… Cocktai…      0           0          0              0
 5 John Coll… Whiskies      0           1          0              0
 6 Cherry Rum Cocktai…      1.25        0          0              0
 7 Casa Blan… Cocktai…      2           0          1.5            0
 8 Caribbean… Cocktai…      0.5         0          0              0
 9 Amber Amo… Cordial…      0           0.25       0              0
10 The Joe L… Whiskies      0           0.5        0              0
# ℹ 927 more rows
The pca_rec object below defines a tidymodels recipe for performing PCA. Computation of the lower-dimensional representation is deferred until prep() is called. This delineation between workflow definition and execution helps clarify the overall workflow, and it is typical of the tidymodels package.

pca_rec <- recipe(~., data = cocktails_df) %>%
  update_role(name, category, new_role = "id") %>%  # keep ID columns out of the PCA
  step_normalize(all_predictors()) %>%              # center and scale each column
  step_pca(all_predictors())                        # define (but don't yet run) PCA
pca_prep <- prep(pca_rec)                           # execute the recipe steps
The step_normalize call is used to center and scale all the columns. This is needed because otherwise columns with larger variance would have more weight in the final dimensionality reduction, even though this weighting is not conceptually meaningful. For example, if one of the columns in a dataset measured length in kilometers, then we could artificially increase its influence in a PCA by expressing the same values in meters. To achieve invariance to this change in units, it is important to normalize first.
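As a quick sanity check of this unit-invariance claim (a made-up example, not part of the cocktails data), normalizing the same lengths expressed in kilometers or in meters yields identical values:

# Hypothetical lengths in kilometers, and the same lengths in meters.
km <- c(1.2, 3.4, 0.8)
m  <- km * 1000
# Centering and scaling removes any dependence on the unit of measurement.
all.equal(as.numeric(scale(km)), as.numeric(scale(m)))  # TRUE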
We can tidy each element of the workflow object. Since PCA was the second step in the workflow, the PCA components can be obtained by calling tidy() with the argument 2. The scores of each sample with respect to these components can be extracted using juice(). The amount of variance explained by each dimension is also given by tidy(), but with the argument type = "variance". We'll see how to visualize and interpret these results in the next lecture.
tidy(pca_prep, 2)
# A tibble: 1,600 × 4
   terms             value component id
   <chr>             <dbl> <chr>     <chr>
 1 light_rum        0.163  PC1       pca_6enHS
 2 lemon_juice     -0.0140 PC1       pca_6enHS
 3 lime_juice       0.224  PC1       pca_6enHS
 4 sweet_vermouth  -0.0661 PC1       pca_6enHS
 5 orange_juice     0.0308 PC1       pca_6enHS
 6 powdered_sugar  -0.476  PC1       pca_6enHS
 7 dark_rum         0.124  PC1       pca_6enHS
 8 cranberry_juice  0.0954 PC1       pca_6enHS
 9 pineapple_juice  0.119  PC1       pca_6enHS
10 bourbon_whiskey  0.0963 PC1       pca_6enHS
# ℹ 1,590 more rows
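One natural way to begin interpreting a component (our own aside; the next lecture covers visualization in depth) is to pull out the ingredients with the largest-magnitude loadings:

# Ingredients contributing most strongly, positively or negatively, to PC1.
tidy(pca_prep, 2) %>%
  filter(component == "PC1") %>%
  slice_max(abs(value), n = 5)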
juice(pca_prep)
# A tibble: 937 × 7
   name                 category    PC1     PC2     PC3     PC4    PC5
   <fct>                <fct>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
 1 Gauguin              Cocktai…  1.38  -1.15    1.34   -1.12    1.52
 2 Fort Lauderdale      Cocktai…  0.684  0.548   0.0308 -0.370   1.41
 3 Cuban Cocktail No. 1 Cocktai…  0.285 -0.967   0.454  -0.931   2.02
 4 Cool Carlos          Cocktai…  2.19  -0.935  -1.21    2.47    1.80
 5 John Collins         Whiskies  1.28  -1.07    0.403  -1.09   -2.21
 6 Cherry Rum           Cocktai… -0.757 -0.460   0.909   0.0154 -0.748
 7 Casa Blanca          Cocktai…  1.53  -0.392   3.29   -3.39    3.87
 8 Caribbean Champagne  Cocktai…  0.324  0.137  -0.134  -0.147   0.303
 9 Amber Amour          Cordial…  1.31  -0.234  -1.55    0.839  -1.19
10 The Joe Lewis        Whiskies  0.138 -0.0401 -0.0365 -0.100  -0.531
# ℹ 927 more rows
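Similarly, sorting this score table gives the cocktails at the extremes of a component. For example:

# Cocktails with the largest scores on the first component.
juice(pca_prep) %>%
  slice_max(PC1, n = 3)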
tidy(pca_prep, 2, type = "variance")
# A tibble: 160 × 4
   terms    value component id
   <chr>    <dbl>     <int> <chr>
 1 variance  2.00         1 pca_6enHS
 2 variance  1.71         2 pca_6enHS
 3 variance  1.50         3 pca_6enHS
 4 variance  1.48         4 pca_6enHS
 5 variance  1.37         5 pca_6enHS
 6 variance  1.32         6 pca_6enHS
 7 variance  1.30         7 pca_6enHS
 8 variance  1.20         8 pca_6enHS
 9 variance  1.19         9 pca_6enHS
10 variance  1.18        10 pca_6enHS
# ℹ 150 more rows
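The terms column of this output also stacks percent and cumulative variance summaries (assuming the current recipes behavior for tidying step_pca), so proportions of variance explained can be filtered out directly:

# Percent of total variance explained by each component.
tidy(pca_prep, 2, type = "variance") %>%
  filter(terms == "percent variance") %>%
  head()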