Linear dimensionality reduction using PCA.
In our last notes, we saw how we could organize a collection of images based on average pixel brightness. We can think of that brightness score as a derived feature that can be used to build a low-dimensional map.
We can partially automate the process of deriving new features. Though, in general, finding the best way to combine raw features into derived ones is a complicated problem, we can simplify things by restricting attention to derived features that (1) are linear combinations of the raw features, (2) are orthogonal to one another, and (3) have high variance.
Restricting to linear combinations allows for an analytical solution. We will relax this requirement when discussing UMAP.
Orthogonality means that the derived features will be uncorrelated with one another. This is a nice property, because it would be wasteful if features were redundant.
High variance is desirable because it means we preserve more of the essential structure of the underlying data. For example, a low-variance 2D projection of a 3D object can be nearly unrecognizable, while an alternative projection with higher variance makes the same object easy to identify.
Principal Components Analysis (PCA) is the optimal dimensionality reduction under these three restrictions: among orthogonal linear features, it finds those with the highest variance. Formally, PCA finds a matrix \(\Phi \in \mathbb{R}^{D \times K}\) and a set of vectors \(z_{i} \in \mathbb{R}^{K}\) such that \(x_{i} \approx \Phi z_{i}\) for all \(i\). The columns of \(\Phi\) are called principal components, and they specify the structure of the derived linear features. The vector \(z_{i}\) is called the score of \(x_{i}\) with respect to these components. The first component explains the most variance, the second the next most, and so on.
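To make "highest variance" precise, one standard formulation (my phrasing, consistent with the definitions above) casts PCA as a reconstruction problem over all \(N\) samples: minimizing the squared reconstruction error over orthonormal components is equivalent to maximizing the variance captured by the derived features,

\[
\min_{\Phi,\, z_{1}, \ldots, z_{N}} \sum_{i = 1}^{N} \left\lVert x_{i} - \Phi z_{i} \right\rVert_{2}^{2} \quad \text{subject to} \quad \Phi^{\top} \Phi = I_{K}.
\]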
For example, if one of the columns of \(\Phi\) were equal to \(\left(\frac{1}{D}, \dots, \frac{1}{D}\right)\), then that derived feature would compute the average of all coordinates (e.g., average pixel brightness), and the corresponding coordinate of \(z_{i}\) would measure the average brightness of sample \(i\).
Geometrically, the columns of \(\Phi\) span a \(K\)-dimensional plane that best approximates the data, and the \(z_{i}\) give the coordinates of each point's projection onto that plane.
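As a quick sanity check on this geometric picture, here is a minimal sketch (my own example, using base R's prcomp rather than the tidymodels interface we use below): we simulate points lying near a plane in three dimensions and recover both \(\Phi\) and the scores.

set.seed(1)
Phi_true <- qr.Q(qr(matrix(rnorm(3 * 2), 3, 2)))   # orthonormal 3 x 2 basis
z_true <- cbind(3 * rnorm(200), rnorm(200))        # scores with distinct variances
x <- z_true %*% t(Phi_true) + 0.05 * matrix(rnorm(200 * 3), 200, 3)
fit <- prcomp(x, rank. = 2)
fit$rotation   # estimated columns of Phi, matching Phi_true up to sign
head(fit$x)    # estimated scores: each row gives the z_i coordinates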
library(tidymodels)
library(readr)
cocktails_df <- read_csv("https://uwmadison.box.com/shared/static/qyqof2512qsek8fpnkqqiw3p1jb77acf.csv")
cocktails_df[, 1:6]
# A tibble: 937 x 6
name category light_rum lemon_juice lime_juice sweet_vermouth
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Gauguin Cocktail … 2 1 1 0
2 Fort La… Cocktail … 1.5 0 0.25 0.5
3 Cuban C… Cocktail … 2 0 0.5 0
4 Cool Ca… Cocktail … 0 0 0 0
5 John Co… Whiskies 0 1 0 0
6 Cherry … Cocktail … 1.25 0 0 0
7 Casa Bl… Cocktail … 2 0 1.5 0
8 Caribbe… Cocktail … 0.5 0 0 0
9 Amber A… Cordials … 0 0.25 0 0
10 The Joe… Whiskies 0 0.5 0 0
# … with 927 more rows
The pca_rec object below defines a tidymodels recipe for performing PCA. Computation of the lower-dimensional representation is deferred until prep() is called. This delineation between workflow definition and execution helps clarify the overall workflow, and it is typical of the tidymodels package.

pca_rec <- recipe(~., data = cocktails_df) %>%
update_role(name, category, new_role = "id") %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors())
pca_prep <- prep(pca_rec)
The step_normalize call is used to center and scale all of the columns. This is needed because otherwise columns with larger variances would have more weight in the final dimensionality reduction, even when those variances are not conceptually meaningful. For example, if one column of a dataset measured length in kilometers, then we could artificially increase its influence on the PCA simply by expressing the same values in meters. To make the result invariant to such changes of units, it is important to normalize first.
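To see this concretely, here is a small sketch with made-up columns (hypothetical data, not part of the cocktails analysis): re-expressing the same length in meters multiplies that column's variance by \(10^6\), so it dominates the unscaled PCA, while normalizing first restores invariance.

df <- tibble(length_km = runif(100), width = runif(100))
prcomp(df)$rotation                      # both columns on comparable scales
df_m <- mutate(df, length_m = 1000 * length_km, length_km = NULL)
prcomp(df_m)$rotation                    # length_m now dominates PC1
prcomp(df_m, scale. = TRUE)$rotation     # scaling first undoes the unit change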
We can tidy each element of the workflow object. Since PCA was the second step of the recipe, the principal components can be obtained by calling tidy with the argument 2. The scores of each sample with respect to these components can be extracted using juice; note that step_pca keeps five components by default, which is why the scores below run from PC1 through PC5.

The amount of variance explained by each dimension is also given by tidy, but with the argument type = "variance". We'll see how to visualize and interpret these results in the next lecture.
tidy(pca_prep, 2)
# A tibble: 1,600 x 4
terms value component id
<chr> <dbl> <chr> <chr>
1 light_rum 0.163 PC1 pca_pxTD1
2 lemon_juice -0.0140 PC1 pca_pxTD1
3 lime_juice 0.224 PC1 pca_pxTD1
4 sweet_vermouth -0.0661 PC1 pca_pxTD1
5 orange_juice 0.0308 PC1 pca_pxTD1
6 powdered_sugar -0.476 PC1 pca_pxTD1
7 dark_rum 0.124 PC1 pca_pxTD1
8 cranberry_juice 0.0954 PC1 pca_pxTD1
9 pineapple_juice 0.119 PC1 pca_pxTD1
10 bourbon_whiskey 0.0963 PC1 pca_pxTD1
# … with 1,590 more rows
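For a preview of how these loadings can be read (my own snippet, just standard dplyr verbs applied to the tidied output), we can sort the ingredients by the magnitude of their contribution to the first component:

tidy(pca_prep, 2) %>%
  filter(component == "PC1") %>%
  arrange(desc(abs(value))) %>%
  head(5)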
juice(pca_prep)
# A tibble: 937 x 7
name category PC1 PC2 PC3 PC4 PC5
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Gauguin Cocktail Clas… 1.38 -1.15 1.34 -1.12 1.52
2 Fort Lauderda… Cocktail Clas… 0.684 0.548 0.0308 -0.370 1.41
3 Cuban Cocktai… Cocktail Clas… 0.285 -0.967 0.454 -0.931 2.02
4 Cool Carlos Cocktail Clas… 2.19 -0.935 -1.21 2.47 1.80
5 John Collins Whiskies 1.28 -1.07 0.403 -1.09 -2.21
6 Cherry Rum Cocktail Clas… -0.757 -0.460 0.909 0.0154 -0.748
7 Casa Blanca Cocktail Clas… 1.53 -0.392 3.29 -3.39 3.87
8 Caribbean Cha… Cocktail Clas… 0.324 0.137 -0.134 -0.147 0.303
9 Amber Amour Cordials and … 1.31 -0.234 -1.55 0.839 -1.19
10 The Joe Lewis Whiskies 0.138 -0.0401 -0.0365 -0.100 -0.531
# … with 927 more rows
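As an aside (my addition, using the standard recipes interface rather than anything from the original notes), the same scores can be produced with bake(), which also lets us project new samples using the centering, scaling, and rotation learned during prep():

bake(pca_prep, new_data = NULL)                 # identical to juice(pca_prep)
bake(pca_prep, new_data = cocktails_df[1:5, ])  # apply the learned transformation to new rows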
tidy(pca_prep, 2, type = "variance")
# A tibble: 160 x 4
terms value component id
<chr> <dbl> <int> <chr>
1 variance 2.00 1 pca_pxTD1
2 variance 1.71 2 pca_pxTD1
3 variance 1.50 3 pca_pxTD1
4 variance 1.48 4 pca_pxTD1
5 variance 1.37 5 pca_pxTD1
6 variance 1.32 6 pca_pxTD1
7 variance 1.30 7 pca_pxTD1
8 variance 1.20 8 pca_pxTD1
9 variance 1.19 9 pca_pxTD1
10 variance 1.18 10 pca_pxTD1
# … with 150 more rows
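The 160 rows here stack four summary terms for each of the 40 components. Assuming the standard term names produced by step_pca's tidy method (an assumption on my part; check unique(.$terms) if in doubt), the percent of variance explained can be filtered out directly:

tidy(pca_prep, 2, type = "variance") %>%
  filter(terms == "percent variance") %>%  # assumed term name; inspect terms first
  head(5)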