Clustering data at multiple scales using trees.
In reality, data are rarely separated into a clear number of homogeneous clusters. More often, even once a cluster formed, it’s possible to identify a few subclusters. For example, if you initially clustered movies into “drama” and “scifi”, you might be able to further refine the scifi cluster into “time travel” and “aliens.”
\(K\)-means only allows clustering at a single level of magnification. To instead simultaneously cluster across scales, you can use an approach called hierarchical clustering. As a first observation, note that a tree can be used to implicitly store many clusterings at once. You can get a standard clustering by cutting the tree at some level.
Elaborating on this analogy, the leaves of a hierarchical clustering tree are the original observations. The more recently two nodes share a common ancestor, the more similar those observations are.
The specific algorithm proceeds as follows,
In R, this can be accomplished by using the hclust
function. First, we
compute the distances between all pairs of observations (this provides the
similarities used in the algorithm). Then, we apply hclust
to the matrix of
pairwise distances.
We apply this to a movie ratings dataset. Movies are considered similar if they tend to receive similar ratings across all audience members. The result is visualized below.
as_tbl_graph
function
from the network and trees lectures.hclust_graph <- as_tbl_graph(hclust_result, height = height)
hclust_graph <- hclust_graph %>%
mutate(height = ifelse(height == 0, 27, height)) # shorten the final edge
hclust_graph
# A tbl_graph: 99 nodes and 98 edges
#
# A rooted tree
#
# Node Data: 99 × 4 (active)
height leaf label members
<dbl> <lgl> <chr> <int>
1 27 TRUE "Schindler's List" 1
2 27 TRUE "Forrest Gump" 1
3 27 TRUE "Shawshank Redemption, The" 1
4 27 TRUE "Pulp Fiction" 1
5 27 TRUE "Silence of the Lambs, The" 1
6 58.7 FALSE "" 2
7 63.9 FALSE "" 3
8 64.9 FALSE "" 4
9 67.9 FALSE "" 5
10 27 TRUE "Star Wars: Episode IV - A New Hope" 1
# ℹ 89 more rows
#
# Edge Data: 98 × 2
from to
<int> <int>
1 6 4
2 6 5
3 7 3
# ℹ 95 more rows
ggraph(hclust_graph, "dendrogram", height = height, circular = TRUE) +
geom_edge_elbow() +
geom_node_text(aes(label = label), size = 4) +
coord_fixed()
cluster_df <- cutree(hclust_result, k = 10) %>% # try changing K and regenerating the graph below
tibble(label = names(.), cluster = as.factor(.))
cluster_df
# A tibble: 50 × 3
. label cluster
<int> <chr> <fct>
1 1 Seven (a.k.a. Se7en) 1
2 1 Usual Suspects, The 1
3 2 Braveheart 2
4 2 Apollo 13 2
5 3 Pulp Fiction 3
6 4 Forrest Gump 4
7 2 Lion King, The 2
8 2 Mask, The 2
9 2 Speed 2
10 2 Fugitive, The 2
# ℹ 40 more rows
# colors chosen using https://medialab.github.io/iwanthue/
cols <- c("#51b48c", "#cf3d6e", "#7ab743", "#7b62cb", "#c49644", "#c364b9", "#6a803a", "#688dcd", "#c95a38", "#c26b7e")
hclust_graph %>%
left_join(cluster_df) %>%
ggraph("dendrogram", height = height, circular = TRUE) +
geom_edge_elbow() +
geom_node_text(aes(label = label, col = cluster), size = 4) +
coord_fixed() +
scale_color_manual(values = cols) +
theme(legend.position = "none")
For attribution, please cite this work as
Sankaran (2024, April 21). STAT 436 (Spring 2024): Hierarchical Clustering. Retrieved from https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week09-02/
BibTeX citation
@misc{sankaran2024hierarchical, author = {Sankaran, Kris}, title = {STAT 436 (Spring 2024): Hierarchical Clustering}, url = {https://krisrs1128.github.io/stat436_s24/website/stat436_s24/posts/2024-12-27-week09-02/}, year = {2024} }