Linear dimensionality reduction using PCA.
In our last notes, we saw how we could organize a collection of images based on average pixel brightness. We can think of average pixel brightness as a derived feature that can be used to build a low-dimensional map.
We can partially automate the process of deriving new features. Though finding the best way to combine raw features into derived ones is, in general, a complicated problem, we can simplify things by restricting attention to derived features that are (1) linear combinations of the raw features, (2) orthogonal to one another, and (3) high variance.
Restricting to linear combinations allows for an analytical solution. We will relax this requirement when discussing UMAP.
Orthogonality means that the derived features will be uncorrelated with one another. This is a nice property, because it would be wasteful if features were redundant.
High variance is desirable because it means we preserve more of the essential structure of the underlying data. For example, a 2D projection of a 3D object that captures little variance can be hard to recognize, but an alternative projection of the same object with higher variance makes it much easier to identify.
Principal Components Analysis (PCA) is the optimal dimensionality reduction under these three restrictions, in the sense that it finds derived features with the highest variance. Formally, PCA finds a matrix \(\Phi \in \mathbb{R}^{D \times K}\) and a set of vectors \(z_{i} \in \mathbb{R}^{K}\) such that \(x_{i} \approx \Phi z_{i}\) for all \(i\). The columns of \(\Phi\) are called principal components, and they specify the structure of the derived linear features. The vector \(z_{i}\) is called the score of \(x_{i}\) with respect to these components. The top component explains the most variance, the second captures the next most, and so on.
For example, if one of the columns of \(\Phi\) was equal to \(\left(\frac{1}{D}, \dots, \frac{1}{D}\right)\), then that feature computes the average of all coordinates (e.g., to get average brightness), and the corresponding coordinate of \(z_{i}\) would be a measure of the average brightness of sample \(i\).
Geometrically, the columns of \(\Phi\) span a plane that approximates the data. The \(z_{i}\) provide coordinates of points projected onto this plane.
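To make the notation concrete, here is a minimal sketch using base R's prcomp() on simulated data; the data and the names X, phi, and z are illustrative assumptions, not objects from these notes. The rotation matrix returned by prcomp() plays the role of \(\Phi\), and the scores play the role of the \(z_{i}\).
set.seed(1)
X <- matrix(rnorm(100 * 3), ncol = 3)  # 100 samples with D = 3 raw features
X[, 3] <- X[, 1] + 0.1 * rnorm(100)    # make the third feature nearly redundant
pca <- prcomp(X, center = TRUE, scale. = TRUE)
phi <- pca$rotation[, 1:2]             # Phi: D x K matrix whose columns are principal components
z <- pca$x[, 1:2]                      # z_i: K-dimensional score for each sample (one row per sample)
x_hat <- z %*% t(phi)                  # reconstruction x_i approximately Phi z_i (of the normalized data)
round(cor(z), 2)                       # scores along different components are uncorrelated
Keeping K = 2 of the 3 components recovers most of the structure here because the third simulated feature is nearly a copy of the first, so the data lie close to a plane.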
Let's apply these ideas to a dataset describing the ingredients used in a few hundred cocktail recipes.
library(tidyverse)
cocktails_df <- read_csv("https://uwmadison.box.com/shared/static/qyqof2512qsek8fpnkqqiw3p1jb77acf.csv")
cocktails_df[, 1:6]
# A tibble: 937 × 6
name category light…¹ lemon…² lime_…³ sweet…⁴
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Gauguin Cocktail Clas… 2 1 1 0
2 Fort Lauderdale Cocktail Clas… 1.5 0 0.25 0.5
3 Cuban Cocktail No. 1 Cocktail Clas… 2 0 0.5 0
4 Cool Carlos Cocktail Clas… 0 0 0 0
5 John Collins Whiskies 0 1 0 0
6 Cherry Rum Cocktail Clas… 1.25 0 0 0
7 Casa Blanca Cocktail Clas… 2 0 1.5 0
8 Caribbean Champagne Cocktail Clas… 0.5 0 0 0
9 Amber Amour Cordials and … 0 0.25 0 0
10 The Joe Lewis Whiskies 0 0.5 0 0
# … with 927 more rows, and abbreviated variable names ¹light_rum,
# ²lemon_juice, ³lime_juice, ⁴sweet_vermouth
The pca_rec object below defines a tidymodels recipe for performing PCA. Computation of the lower-dimensional representation is deferred until prep() is called. This delineation between workflow definition and execution helps clarify the overall workflow, and it is typical of the tidymodels package. The step_normalize call is used to center and scale all the columns. This is needed because otherwise columns with larger variance would have more weight in the final dimensionality reduction, even though this difference is not conceptually meaningful. For example, if one of the columns in a dataset measured length in kilometers, then we could artificially increase its influence in a PCA by expressing the same value in meters. To achieve invariance to this change in units, it is important to normalize first.
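A recipe along these lines would do the job; this is a sketch consistent with the output printed below, where the formula, the id roles assigned to name and category, and the (default) number of components are assumptions rather than the original chunk.
library(tidymodels)
pca_rec <- recipe(~., data = cocktails_df) %>%
  update_role(name, category, new_role = "id") %>%  # keep identifier columns out of the PCA
  step_normalize(all_predictors()) %>%              # center and scale every ingredient column
  step_pca(all_predictors())                        # the PCA step (second step in the recipe)
pca_prep <- prep(pca_rec)                           # estimation happens only when prep() is called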
We can tidy each element of the workflow object. Since PCA was the second step in the workflow, the PCA components can be obtained by calling tidy with the argument 2. The scores of each sample with respect to these components can be extracted using juice. The amount of variance explained by each dimension is also given by tidy, but with the argument type = "variance". We'll see how to visualize and interpret these results in the next lecture.
tidy(pca_prep, 2)
# A tibble: 1,600 × 4
terms value component id
<chr> <dbl> <chr> <chr>
1 light_rum 0.163 PC1 pca_DMW4l
2 lemon_juice -0.0140 PC1 pca_DMW4l
3 lime_juice 0.224 PC1 pca_DMW4l
4 sweet_vermouth -0.0661 PC1 pca_DMW4l
5 orange_juice 0.0308 PC1 pca_DMW4l
6 powdered_sugar -0.476 PC1 pca_DMW4l
7 dark_rum 0.124 PC1 pca_DMW4l
8 cranberry_juice 0.0954 PC1 pca_DMW4l
9 pineapple_juice 0.119 PC1 pca_DMW4l
10 bourbon_whiskey 0.0963 PC1 pca_DMW4l
# … with 1,590 more rows
juice(pca_prep)
# A tibble: 937 × 7
name category PC1 PC2 PC3 PC4 PC5
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Gauguin Cocktai… 1.38 -1.15 1.34 -1.12 1.52
2 Fort Lauderdale Cocktai… 0.684 0.548 0.0308 -0.370 1.41
3 Cuban Cocktail No. 1 Cocktai… 0.285 -0.967 0.454 -0.931 2.02
4 Cool Carlos Cocktai… 2.19 -0.935 -1.21 2.47 1.80
5 John Collins Whiskies 1.28 -1.07 0.403 -1.09 -2.21
6 Cherry Rum Cocktai… -0.757 -0.460 0.909 0.0154 -0.748
7 Casa Blanca Cocktai… 1.53 -0.392 3.29 -3.39 3.87
8 Caribbean Champagne Cocktai… 0.324 0.137 -0.134 -0.147 0.303
9 Amber Amour Cordial… 1.31 -0.234 -1.55 0.839 -1.19
10 The Joe Lewis Whiskies 0.138 -0.0401 -0.0365 -0.100 -0.531
# … with 927 more rows
tidy(pca_prep, 2, type = "variance")
# A tibble: 160 × 4
terms value component id
<chr> <dbl> <int> <chr>
1 variance 2.00 1 pca_DMW4l
2 variance 1.71 2 pca_DMW4l
3 variance 1.50 3 pca_DMW4l
4 variance 1.48 4 pca_DMW4l
5 variance 1.37 5 pca_DMW4l
6 variance 1.32 6 pca_DMW4l
7 variance 1.30 7 pca_DMW4l
8 variance 1.20 8 pca_DMW4l
9 variance 1.19 9 pca_DMW4l
10 variance 1.18 10 pca_DMW4l
# … with 150 more rows