An introduction to clustering and how to manage its output.
The goal of clustering is to discover distinct groups within a dataset. In an ideal clustering, samples are very different between groups, but relatively similar within groups. At the end of a clustering routine, \(K\) clusters have been identified, and each sample is assigned to one of these \(K\) clusters. \(K\) must be chosen by the user.
Clustering gives a compressed representation of the dataset, which makes it useful for getting a quick overview of its high-level structure.
For example, clustering can be used in the following applications:
\(K\)-means is a particular algorithm for finding clusters. First, it randomly initializes \(K\) cluster centroids. Then, it alternates the following two steps until convergence: (1) assign each sample to the cluster whose centroid is nearest, and (2) recompute each centroid as the average of the samples currently assigned to it.
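To make these two steps concrete, here is a minimal sketch of the alternating updates on simulated data. The matrix x, the choice of K = 3, and the fixed number of iterations are placeholders for illustration only; in practice we simply call the built-in kmeans(), as we do below.
set.seed(1)
x <- matrix(rnorm(200 * 2), ncol = 2)   # 200 samples with 2 quantitative features
K <- 3
centroids <- x[sample(nrow(x), K), ]    # random initialization of K centroids

for (iter in 1:20) {
  # Step 1: assign each sample to the cluster with the nearest centroid
  d <- sapply(1:K, function(k) colSums((t(x) - centroids[k, ])^2))
  assignment <- apply(d, 1, which.min)
  # Step 2: recompute each centroid as the average of its assigned samples
  for (k in 1:K) {
    members <- x[assignment == k, , drop = FALSE]
    if (nrow(members) > 0) centroids[k, ] <- colMeans(members)
  }
}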
Here is an animation from the tidymodels page on \(K\)-means.
Note that, since we have to take an average for each coordinate, we require that our data be quantitative, not categorical.
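If some features are categorical, one common workaround is to expand them into 0/1 indicator (one-hot) columns before clustering, so that averaging is at least well-defined. The data frame below is made up purely for illustration, and model.matrix() is just one way to do the encoding.
# hypothetical data frame with one quantitative and one categorical column
df <- data.frame(
  rating = c(4, 5, 3, 2, 4.5, 1),
  genre  = factor(c("Comedy", "Drama", "Comedy", "Action", "Drama", "Action"))
)

# expand "genre" into indicator columns so every feature is quantitative
x <- model.matrix(~ rating + genre - 1, data = df)
kmeans(x, centers = 2)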
We illustrate this idea using the movielens dataset from the reading. This dataset has ratings (0.5 to 5) given by 671 users across 9066 movies. We can think of this as a matrix of movies vs. users, with ratings in the entries. For simplicity, we filter down to only the 50 most frequently rated movies. We will assume that if a user never rated a movie, they would have given that movie a zero. We’ve skipped a few steps used in the reading (subtracting movie / user averages and filtering to only active users), but the overall results are comparable.
data("movielens")
frequently_rated <- movielens %>%
group_by(movieId) %>%
summarize(n=n()) %>%
top_n(50, n) %>%
pull(movieId)
movie_mat <- movielens %>%
filter(movieId %in% frequently_rated) %>%
select(title, userId, rating) %>%
pivot_wider(title, names_from = userId, values_from = rating, values_fill = 0)
movie_mat[1:10, 1:20]
# A tibble: 10 x 20
title `2` `3` `4` `5` `6` `7` `8` `9` `10` `11`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Seven … 4 0 0 0 0 0 5 3 0 0
2 Usual … 4 0 0 0 0 0 5 0 5 5
3 Braveh… 4 4 0 0 0 5 4 0 0 0
4 Apollo… 5 0 0 4 0 0 0 0 0 0
5 Pulp F… 4 4.5 5 0 0 0 4 0 0 5
6 Forres… 3 5 5 4 0 3 4 0 0 0
7 Lion K… 3 0 5 4 0 3 0 0 0 0
8 Mask, … 3 0 4 4 0 3 0 0 0 0
9 Speed 3 2.5 0 4 0 3 0 0 0 0
10 Fugiti… 3 0 0 0 0 0 4.5 0 0 0
# … with 9 more variables: 12 <dbl>, 13 <dbl>, 14 <dbl>, 15 <dbl>,
# 16 <dbl>, 17 <dbl>, 18 <dbl>, 19 <dbl>, 20 <dbl>
We can now run kmeans on this dataset. I’ve used the dplyr pipe notation to run kmeans on the data above with “title” removed. augment is a function from the tidymodels package that adds the cluster labels identified by kmeans to the rows in the original dataset.
kclust <- movie_mat %>%
  select(-title) %>%
  kmeans(centers = 10)

movie_mat <- augment(kclust, movie_mat)  # creates column ".cluster" with each movie's cluster label
kclust <- tidy(kclust)                   # one row per cluster: centroid coordinates, size, and withinss
movie_mat %>%
  select(title, .cluster) %>%
  arrange(.cluster)
# A tibble: 50 x 2
title .cluster
<chr> <fct>
1 American Beauty 1
2 Godfather, The 1
3 Seven (a.k.a. Se7en) 2
4 Usual Suspects, The 2
5 Pulp Fiction 2
6 Braveheart 3
7 Apollo 13 3
8 Speed 3
9 Fugitive, The 3
10 Jurassic Park 3
# … with 40 more rows
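Before plotting the centroids, it can help to check how large and how tight each cluster is. The sketch below assumes the tidy() output stored in kclust above, which has one row per cluster along with size and withinss (within-cluster sum of squares) columns.
kclust %>%
  select(cluster, size, withinss) %>%
  arrange(desc(size))

The code below then visualizes each cluster’s centroid, i.e., its average rating for every user, with one facet per cluster.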
kclust_long <- kclust %>%
  pivot_longer(`2`:`671`, names_to = "userId", values_to = "rating")

ggplot(kclust_long) +
  geom_bar(
    aes(x = reorder(userId, rating), y = rating),
    stat = "identity"
  ) +
  facet_grid(cluster ~ .) +
  labs(x = "Users (sorted)", y = "Rating") +
  theme(
    axis.text.x = element_blank(),
    axis.text.y = element_text(size = 5),
    strip.text.y = element_text(angle = 0)
  )
It’s often of interest to relate the cluster assignments to complementary data, to see whether the clustering reflects previously known differences between the observations that weren’t directly used by the clustering algorithm.
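For example, we can join the cluster labels back to the genres recorded in movielens and tabulate the genre strings within each cluster. This is only a sketch: the genres column (a pipe-separated string in the dslabs version of movielens) is used as-is, without splitting it into individual genres.
movielens %>%
  distinct(title, genres) %>%
  inner_join(select(movie_mat, title, .cluster), by = "title") %>%
  count(.cluster, genres, sort = TRUE)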
Be cautious: Outliers, nonspherical shapes, and variations in density can throw off \(K\)-means.
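As a quick illustration of the shape issue, here is a sketch with two concentric rings (simulated data, not from movielens). Because \(K\)-means partitions the space into convex regions around the centroids, it tends to split this data by side rather than by ring, so each cluster mixes points from both rings.
set.seed(2)
theta <- runif(400, 0, 2 * pi)
radius <- rep(c(1, 5), each = 200) + rnorm(400, sd = 0.1)
rings <- data.frame(x = radius * cos(theta), y = radius * sin(theta))

# cross-tabulate the true ring against the K-means labels
table(
  ring = rep(c("inner", "outer"), each = 200),
  cluster = kmeans(rings, centers = 2)$cluster
)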