An overview of the UMAP algorithm.
library("ggplot2")
library("readr")
library("tidymodels")
library("embed")
theme479 <- theme_minimal() +
theme(
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "#f7f7f7"),
panel.border = element_rect(fill = NA, color = "#0c0c0c", size = 0.6),
legend.position = "bottom"
)
theme_set(theme479)
Nonlinear dimension reduction methods can give a more faithful representation than PCA when the data don’t lie on a low-dimensional linear subspace.
For example, suppose the data were shaped like this. There is no one-dimensional line through these data that separate the groups well. We will need an alternative approach to reducing dimensionality if we want to preserve nonlinear structure.
moons <- read_csv("https://uwmadison.box.com/shared/static/kdt9qqvonhcz2ssb599p1nqganrg1w6k.csv")
ggplot(moons, aes(X, Y, col = Class)) +
geom_point() +
scale_color_brewer(palette = "Set2")
From a high-level, the intuition behind UMAP is to (a) build a graph joining nearby neighbors in the original high-dimensional space, and then (b) layout the graph in a lower-dimensional space.
For example, consider the 2-dimensional sine wave below. If we build a graph, we can try to layout the resulting nodes and edges on a 1-dimensional line in a way that approximately preserves the ordering.
One detail in the graph construction: In UMAP, the edges are assigned weights depending on the distance they span, normalized by the distance to the closest neighbor. Neighbors that are close, relative to the nearest neighbors, are assigned higher weights than those that are far away, and points that are linked by high weight edges are pulled together with larger force in the final graph layout. This is what the authors mean by using a “fuzzy” nearest neighbor graph. The fuzziness allows the algorithm to distinguish neighbors that are very close from those that are far, even though they all lie within a \(K\)-nearest-neighborhood.
Once the graph is constructed, there is the question of how the graph layout should proceed. UMAP uses a variant of force-directed layout, and the global strength of the springs is another hyperparameter. Lower tension on the springs allow the points to spread out more loosely, higher tension forces points closer together. This is a second hyperparameter of UMAP.
These two hyperparameters — the number of nearest neighbors \(K\) and the layout tension — are the only two hyperparameters of UMAP.
You can see more examples of what this algorithm does to toy datasets in the reading. Note in particular the properties that the algorithm does not preserve. The distance between clusters should not be interpreted, since it just means that the graph components were not connected. Similarly, the density of points is not preserved.
In R, we can implement this using almost the same code as we used for PCA. The step_umap
command is available through the embed package.
cocktails_df <- read_csv("https://uwmadison.box.com/shared/static/qyqof2512qsek8fpnkqqiw3p1jb77acf.csv")
umap_rec <- recipe(~., data = cocktails_df) %>%
update_role(name, category, new_role = "id") %>%
step_normalize(all_predictors()) %>%
step_umap(all_predictors(), neighbors = 20, min_dist = 0.1)
umap_prep <- prep(umap_rec)
ggplot(juice(umap_prep), aes(umap_1, umap_2)) +
geom_point(aes(color = category), alpha = 0.7, size = 0.8) +
geom_text(aes(label = name), check_overlap = TRUE, size = 3, hjust = "inward")