Introduction to Dimensionality Reduction
Examples of high-dimensional data.
Kris Sankaran (UW Madison)
2023-01-01
- High-dimensional data are data where many features are collected for each
observation. These tend to be wide datasets with many columns. The name comes
from the fact that each row of the dataset can be viewed as a vector in a
high-dimensional space (one dimension for each feature). These data are common
in modern applications. For example:
- Each cell in a genomics dataset might have measurements for hundreds of molecules.
- Each survey respondent might provide answers to dozens of questions.
- Each image might have several thousand pixels.
- Each document might have counts across several thousand relevant words.
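To make the "each row is a vector" view concrete, here is a minimal sketch that turns a few documents into word-count vectors. The corpus, the `count_vector` helper, and the variable names are hypothetical and chosen only for illustration; a real collection would have thousands of distinct words rather than eight.

```python
from collections import Counter

# Hypothetical toy corpus (an assumption for illustration only).
documents = [
    "the painting shows a river",
    "the sculpture shows a horse",
    "a river and a horse",
]

# One dimension per distinct word that appears anywhere in the corpus.
vocabulary = sorted({word for doc in documents for word in doc.split()})

def count_vector(doc):
    """Represent one document as a vector of counts, one entry per vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# Each row is one observation; each column is one feature (word).
matrix = [count_vector(doc) for doc in documents]

print(vocabulary)
for row in matrix:
    print(row)
```

Every document becomes a point in a space with `len(vocabulary)` dimensions, so the dimension grows with the vocabulary, not with the number of documents.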
For low-dimensional data, we could visually encode all the features in our
data directly, either using properties of marks or through faceting. In
high-dimensional data, this is no longer possible.
Even though there are many features associated with each observation, it
may still be possible to organize samples across a smaller number of meaningful,
derived features.
For example, consider the Metropolitan Museum of Art dataset, which contains
images of many artworks. Abstractly, each artwork is a high-dimensional object,
with one intensity value for each of its many pixels. But it is reasonable to derive
a single summary feature from those pixels, such as the average brightness.
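As a rough sketch of this kind of manual feature construction, the snippet below collapses each image's pixel vector into one number, its average brightness, and then orders the images along that single derived dimension. The three tiny "artworks" and their names are made up for illustration; they are not drawn from the Metropolitan Museum of Art dataset.

```python
from statistics import mean

# Hypothetical toy data: three 2x2 grayscale images, each flattened into a
# vector of four pixel intensities in [0, 1]. Real images have far more pixels.
artworks = {
    "dark_portrait": [0.10, 0.20, 0.15, 0.05],
    "bright_landscape": [0.90, 0.80, 0.95, 0.85],
    "mixed_still_life": [0.20, 0.90, 0.40, 0.60],
}

def average_brightness(pixels):
    """Collapse a vector of pixel intensities into one derived feature."""
    return mean(pixels)

brightness = {name: average_brightness(p) for name, p in artworks.items()}

# Order the artworks along the single derived dimension, darkest first.
ordered = sorted(brightness, key=brightness.get)
print(ordered)
```

The four-dimensional pixel vectors are replaced by one interpretable coordinate, at the cost of discarding everything about the image except its overall brightness.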
- In general, manual feature construction can be difficult. Algorithmic
approaches try to streamline the process of generating these derived features by
optimizing some more generic criterion. Different algorithms use different
criteria, which we will review in the next couple of lectures.
- Informally, the goal of dimensionality reduction techniques is to produce a
low-dimensional “atlas” relating members of a collection of complex objects.
Samples that are similar to one another in the high-dimensional space should be
placed near one another in the low-dimensional view. For example, we might want
to make an atlas of artworks, with artworks of similar style and historical
period placed near one another.
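The "atlas" idea can be sketched with one possible generic criterion: project the data onto the single direction of greatest variance (the first principal component), found here by power iteration. This is an illustrative choice, not necessarily the method these notes go on to use, and the six four-feature observations below are invented so that they form two obvious groups.

```python
import math

# Hypothetical toy data: six observations on four features, forming two groups.
X = [
    [1.0, 1.0, 0.0, 0.0],
    [1.1, 0.9, 0.0, 0.1],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.1, 0.0, 0.9, 1.1],
    [0.0, 0.1, 1.0, 0.9],
]
n, p = len(X), len(X[0])

# Center each feature so that directions of variance pass through the origin.
means = [sum(row[j] for row in X) / n for j in range(p)]
Xc = [[row[j] - means[j] for j in range(p)] for row in X]

# Sample covariance matrix (p x p).
cov = [
    [sum(r[i] * r[j] for r in Xc) / (n - 1) for j in range(p)]
    for i in range(p)
]

# Power iteration: repeatedly multiply by the covariance matrix and rescale.
# Assumption: the starting vector is not orthogonal to the leading direction.
v = [1.0] + [0.0] * (p - 1)
for _ in range(200):
    w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

# One-dimensional coordinates: each observation's position along direction v.
scores = [sum(r[j] * v[j] for j in range(p)) for r in Xc]
for row, s in zip(X, scores):
    print(row, round(s, 2))
```

In this toy case the first three observations land close together on one side of zero and the last three on the other, so observations that were neighbors in four dimensions remain neighbors on the one-dimensional map.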
Citation
For attribution, please cite this work as
Sankaran (2023, Jan. 1). STAT 436 (Spring 2023): Introduction to Dimensionality Reduction. Retrieved from https://krisrs1128.github.io/stat436_s23/website/stat436_s23/posts/2022-12-27-week10-1/
BibTeX citation
@misc{sankaran2023introduction,
author = {Sankaran, Kris},
title = {STAT 436 (Spring 2023): Introduction to Dimensionality Reduction},
url = {https://krisrs1128.github.io/stat436_s23/website/stat436_s23/posts/2022-12-27-week10-1/},
year = {2023}
}