Linear dimensionality reduction using PCA.
In our last notes, we saw how we could organize a collection of images based on average pixel brightness. We can think of that brightness score as a derived feature that can be used to build a low-dimensional map.
We can partially automate the process of deriving new features. Though, in general, finding the best way to combine raw features into derived ones is a complicated problem, we can simplify things by restricting attention to derived features that (1) are linear combinations of the raw features, (2) are orthogonal to one another, and (3) have high variance.
Restricting to linear combinations allows for an analytical solution. We will relax this requirement when discussing UMAP.
Orthogonality means that the derived features will be uncorrelated with one another. This is a nice property, because it would be wasteful if features were redundant.
High variance is desirable because it means we preserve more of the essential structure of the underlying data. For example, a low-variance 2D projection of a 3D object can be nearly unrecognizable, while an alternative projection with higher variance makes the same object easy to identify.
Principal Components Analysis (PCA) is the optimal dimensionality reduction under these three restrictions: among orthogonal linear features, it finds those with the highest variance. Formally, PCA finds a matrix \(\Phi \in \mathbb{R}^{D \times K}\) and a set of vectors \(z_{i} \in \mathbb{R}^{K}\) such that \(x_{i} \approx \Phi z_{i}\) for all \(i\). The columns of \(\Phi\) are called principal components, and they specify the structure of the derived linear features. The vector \(z_{i}\) is called the score of \(x_{i}\) with respect to these components. The first component explains the most variance, the second the next most, and so on.
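To make "highest variance" precise, one standard formulation (my phrasing, consistent with the definitions above) casts PCA as a reconstruction problem over all \(N\) samples: minimizing the squared reconstruction error over orthonormal components is equivalent to maximizing the variance captured by the derived features,

\[
\min_{\Phi,\, z_{1}, \ldots, z_{N}} \sum_{i = 1}^{N} \left\lVert x_{i} - \Phi z_{i} \right\rVert_{2}^{2} \quad \text{subject to} \quad \Phi^{\top} \Phi = I_{K}.
\]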
For example, if one of the columns of \(\Phi\) were equal to \(\left(\frac{1}{D}, \dots, \frac{1}{D}\right)\), then that derived feature would compute the average of all coordinates (e.g., average pixel brightness), and the corresponding coordinate of \(z_{i}\) would measure the average brightness of sample \(i\).
Geometrically, the columns of \(\Phi\) span a \(K\)-dimensional plane that best approximates the data, and the \(z_{i}\) give the coordinates of each point's projection onto that plane.
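As a quick sanity check on this geometric picture, here is a minimal sketch (my own example, using base R's prcomp rather than the tidymodels interface we use below): we simulate points lying near a plane in three dimensions and recover both \(\Phi\) and the scores.

set.seed(1)
Phi_true <- qr.Q(qr(matrix(rnorm(3 * 2), 3, 2)))   # orthonormal 3 x 2 basis
z_true <- cbind(3 * rnorm(200), rnorm(200))        # scores with distinct variances
x <- z_true %*% t(Phi_true) + 0.05 * matrix(rnorm(200 * 3), 200, 3)
fit <- prcomp(x, rank. = 2)
fit$rotation   # estimated columns of Phi, matching Phi_true up to sign
head(fit$x)    # estimated scores: each row gives the z_i coordinates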
library(tidymodels)
library(readr)
cocktails_df <- read_csv("https://uwmadison.box.com/shared/static/qyqof2512qsek8fpnkqqiw3p1jb77acf.csv")
cocktails_df[, 1:6]
# A tibble: 937 x 6
name category light_rum lemon_juice lime_juice sweet_vermouth
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Gauguin Cocktail … 2 1 1 0
2 Fort La… Cocktail … 1.5 0 0.25 0.5
3 Cuban C… Cocktail … 2 0 0.5 0
4 Cool Ca… Cocktail … 0 0 0 0
5 John Co… Whiskies 0 1 0 0
6 Cherry … Cocktail … 1.25 0 0 0
7 Casa Bl… Cocktail … 2 0 1.5 0
8 Caribbe… Cocktail … 0.5 0 0 0
9 Amber A… Cordials … 0 0.25 0 0
10 The Joe… Whiskies 0 0.5 0 0
# … with 927 more rows
The pca_rec object below defines a tidymodels recipe for performing PCA. Computation of the lower-dimensional representation is deferred until prep() is called. This delineation between workflow definition and execution helps clarify the overall workflow, and it is typical of the tidymodels package.

pca_rec <- recipe(~., data = cocktails_df) %>%
update_role(name, category, new_role = "id") %>%
step_normalize(all_predictors()) %>%
step_pca(all_predictors())
pca_prep <- prep(pca_rec)
The step_normalize call is used to center and scale all of the columns. This is needed because otherwise columns with larger variances would have more weight in the final dimensionality reduction, even when those variances are not conceptually meaningful. For example, if one column of a dataset measured length in kilometers, then we could artificially increase its influence on the PCA simply by expressing the same values in meters. To make the result invariant to such changes of units, it is important to normalize first.
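To see this concretely, here is a small sketch with made-up columns (hypothetical data, not part of the cocktails analysis): re-expressing the same length in meters multiplies that column's variance by \(10^6\), so it dominates the unscaled PCA, while normalizing first restores invariance.

df <- tibble(length_km = runif(100), width = runif(100))
prcomp(df)$rotation                      # both columns on comparable scales
df_m <- mutate(df, length_m = 1000 * length_km, length_km = NULL)
prcomp(df_m)$rotation                    # length_m now dominates PC1
prcomp(df_m, scale. = TRUE)$rotation     # scaling first undoes the unit change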
We can tidy each element of the workflow object. Since PCA was the second step of the recipe, the principal components can be obtained by calling tidy with the argument 2. The scores of each sample with respect to these components can be extracted using juice; note that step_pca keeps five components by default, which is why the scores below run from PC1 through PC5.

The amount of variance explained by each dimension is also given by tidy, but with the argument type = "variance". We'll see how to visualize and interpret these results in the next lecture.
tidy(pca_prep, 2)
# A tibble: 1,600 x 4
terms value component id
<chr> <dbl> <chr> <chr>
1 light_rum 0.163 PC1 pca_pxTD1
2 lemon_juice -0.0140 PC1 pca_pxTD1
3 lime_juice 0.224 PC1 pca_pxTD1
4 sweet_vermouth -0.0661 PC1 pca_pxTD1
5 orange_juice 0.0308 PC1 pca_pxTD1
6 powdered_sugar -0.476 PC1 pca_pxTD1
7 dark_rum 0.124 PC1 pca_pxTD1
8 cranberry_juice 0.0954 PC1 pca_pxTD1
9 pineapple_juice 0.119 PC1 pca_pxTD1
10 bourbon_whiskey 0.0963 PC1 pca_pxTD1
# … with 1,590 more rows
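For a preview of how these loadings can be read (my own snippet, just standard dplyr verbs applied to the tidied output), we can sort the ingredients by the magnitude of their contribution to the first component:

tidy(pca_prep, 2) %>%
  filter(component == "PC1") %>%
  arrange(desc(abs(value))) %>%
  head(5)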
juice(pca_prep)
# A tibble: 937 x 7
name category PC1 PC2 PC3 PC4 PC5
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Gauguin Cocktail Clas… 1.38 -1.15 1.34 -1.12 1.52
2 Fort Lauderda… Cocktail Clas… 0.684 0.548 0.0308 -0.370 1.41
3 Cuban Cocktai… Cocktail Clas… 0.285 -0.967 0.454 -0.931 2.02
4 Cool Carlos Cocktail Clas… 2.19 -0.935 -1.21 2.47 1.80
5 John Collins Whiskies 1.28 -1.07 0.403 -1.09 -2.21
6 Cherry Rum Cocktail Clas… -0.757 -0.460 0.909 0.0154 -0.748
7 Casa Blanca Cocktail Clas… 1.53 -0.392 3.29 -3.39 3.87
8 Caribbean Cha… Cocktail Clas… 0.324 0.137 -0.134 -0.147 0.303
9 Amber Amour Cordials and … 1.31 -0.234 -1.55 0.839 -1.19
10 The Joe Lewis Whiskies 0.138 -0.0401 -0.0365 -0.100 -0.531
# … with 927 more rows
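As an aside (my addition, using the standard recipes interface rather than anything from the original notes), the same scores can be produced with bake(), which also lets us project new samples using the centering, scaling, and rotation learned during prep():

bake(pca_prep, new_data = NULL)                 # identical to juice(pca_prep)
bake(pca_prep, new_data = cocktails_df[1:5, ])  # apply the learned transformation to new rows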
tidy(pca_prep, 2, type = "variance")
# A tibble: 160 x 4
terms value component id
<chr> <dbl> <int> <chr>
1 variance 2.00 1 pca_pxTD1
2 variance 1.71 2 pca_pxTD1
3 variance 1.50 3 pca_pxTD1
4 variance 1.48 4 pca_pxTD1
5 variance 1.37 5 pca_pxTD1
6 variance 1.32 6 pca_pxTD1
7 variance 1.30 7 pca_pxTD1
8 variance 1.20 8 pca_pxTD1
9 variance 1.19 9 pca_pxTD1
10 variance 1.18 10 pca_pxTD1
# … with 150 more rows
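The 160 rows here stack four summary terms for each of the 40 components. Assuming the standard term names produced by step_pca's tidy method (an assumption on my part; check unique(.$terms) if in doubt), the percent of variance explained can be filtered out directly:

tidy(pca_prep, 2, type = "variance") %>%
  filter(terms == "percent variance") %>%  # assumed term name; inspect terms first
  head(5)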