Clustering data at multiple scales using trees.
In reality, data are rarely separated into a clear number of homogeneous clusters. More often, even once a cluster formed, it’s possible to identify a few subclusters. For example, if you initially clustered movies into “drama” and “scifi”, you might be able to further refine the scifi cluster into “time travel” and “aliens.”
\(K\)-means only allows clustering at a single level of magnification. To instead simultaneously cluster across scales, you can use an approach called hierarchical clustering. As a first observation, note that a tree can be used to implicitly store many clusterings at once. You can get a standard clustering by cutting the tree at some level.
Figure 1: We can recover clusters at different levels of granularity, by cutting a hierarchical clustering tree.
robservable("@mbostock/tree-of-life", height = 1150)
Elaborating on this analogy, the leaves of a hierarchical clustering tree are the original observations. The more recently two nodes share a common ancestor, the more similar those observations are.
The specific algorithm proceeds as follows,
Figure 2: At initialization, the hierarchical clustering routine has a cluster for each observation.
Figure 3: Next, the two closest observations are merged into one cluster. This is the first merge point on the tree.
Figure 4: We continue this at the next iteration, though this time we have compute the pairwise distance between all clusters, not observations (technically, all the observations were their own cluster at the first step, and in both cases, we compare the pairwise distances between clusters).
Figure 5: We can continue this process…
Figure 6: … and eventually we will construct the entire tree.
In R, this can be accomplished by using the hclust
function. First, we compute the distances between all pairs of observations (this provides the similarities used in the algorithm). Then, we apply hclust
to the matrix of pairwise distances.
We apply this to a movie ratings dataset. Movies are considered similar if they tend to receive similar ratings across all audience members. The result is visualized below.
movies_mat <- read_csv("https://uwmadison.box.com/shared/static/wj1ln9xtigaoubbxow86y2gqmqcsu2jk.csv")
D <- movies_mat %>%
column_to_rownames(var = "title") %>%
dist()
hclust_result <- hclust(D)
plot(hclust_result, cex = 0.5)
as_tbl_graph
function from the network and trees lectures.hclust_graph <- as_tbl_graph(hclust_result, height = height)
hclust_graph <- hclust_graph %>%
mutate(height = ifelse(height == 0, 27, height)) # shorten the final edge
hclust_graph
# A tbl_graph: 99 nodes and 98 edges
#
# A rooted tree
#
# Node Data: 99 x 4 (active)
height leaf label members
<dbl> <lgl> <chr> <int>
1 27 TRUE "Schindler's List" 1
2 27 TRUE "Forrest Gump" 1
3 27 TRUE "Shawshank Redemption, The" 1
4 27 TRUE "Pulp Fiction" 1
5 27 TRUE "Silence of the Lambs, The" 1
6 58.7 FALSE "" 2
# … with 93 more rows
#
# Edge Data: 98 x 2
from to
<int> <int>
1 6 4
2 6 5
3 7 3
# … with 95 more rows
ggraph(hclust_graph, "dendrogram", height = height, circular = TRUE) +
geom_edge_elbow() +
geom_node_text(aes(label = label), size = 4) +
coord_fixed()
cluster_df <- cutree(hclust_result, k = 10) %>% # try changing K and regenerating the graph below
tibble(label = names(.), cluster = as.factor(.))
cluster_df
# A tibble: 50 x 3
. label cluster
<int> <chr> <fct>
1 1 Seven (a.k.a. Se7en) 1
2 1 Usual Suspects, The 1
3 2 Braveheart 2
4 2 Apollo 13 2
5 3 Pulp Fiction 3
6 4 Forrest Gump 4
7 2 Lion King, The 2
8 2 Mask, The 2
9 2 Speed 2
10 2 Fugitive, The 2
# … with 40 more rows
# colors chosen using https://medialab.github.io/iwanthue/
cols <- c("#51b48c", "#cf3d6e", "#7ab743", "#7b62cb", "#c49644", "#c364b9", "#6a803a", "#688dcd", "#c95a38", "#c26b7e")
hclust_graph %>%
left_join(cluster_df) %>%
ggraph("dendrogram", height = height, circular = TRUE) +
geom_edge_elbow() +
geom_node_text(aes(label = label, col = cluster), size = 4) +
coord_fixed() +
scale_color_manual(values = cols) +
theme(legend.position = "none")