Visualizing table values, ordered by clustering results.
The direct outputs of a standard clustering algorithm are (a) cluster assignments for each sample and (b) the centroids associated with each cluster. A hierarchical clustering algorithm enriches this output with a tree, which provides (a) and (b) at multiple levels of resolution.
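As a minimal sketch of these outputs, the base R functions kmeans and hclust can be run on simulated data. The toy matrix, its row names, and the choice of two and three clusters are assumptions made for illustration; they are not part of the movies example.

```r
# Toy data: 6 samples, 2 features, forming two well-separated groups
set.seed(1)
x <- rbind(
  matrix(rnorm(6, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(6, mean = 5, sd = 0.1), ncol = 2)
)
rownames(x) <- paste0("sample_", 1:6)

# Outputs (a) and (b) from a standard algorithm (here, k-means)
km <- kmeans(x, centers = 2, nstart = 10)
km$cluster  # (a) one cluster assignment per sample
km$centers  # (b) one centroid per cluster

# A hierarchical algorithm returns a tree
tree <- hclust(dist(x))

# Cutting the tree at different levels gives assignments at several resolutions
assign_k2 <- cutree(tree, k = 2)
assign_k3 <- cutree(tree, k = 3)
```

Centroids at each level of the tree could then be computed by averaging the rows of x within each group of assign_k2 or assign_k3.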
These outputs can be used to improve visualizations. For example, they can be used to define small multiples, faceting across clusters. One especially common idea is to reorder the rows of a heatmap using the results of a clustering, and this is the subject of these notes.
In a heatmap, each mark (usually a small tile) corresponds to an entry of a matrix. The \(x\)-coordinate of the mark encodes the index of the observation, while the \(y\)-coordinate encodes the index of the feature. The color of each tile represents the value of that entry. For example, here are the first few rows of the movies data, along with the corresponding heatmap, made using the superheat package.
library(tidyverse)  # provides read_csv, %>%, and column_to_rownames
movies_mat <- read_csv("https://uwmadison.box.com/shared/static/wj1ln9xtigaoubbxow86y2gqmqcsu2jk.csv") %>%
  column_to_rownames(var = "title")
The heatmap below uses an adjacent panel on the right-hand side (yr) to encode the total number of ratings given to each movie. The yr.obs.col argument allows us to change the color of each observation in the adjacent plot. In this example, we change the color depending on which cluster the movie was found to belong to.
cluster_cols <- c('#8dd3c7','#ccebc5','#bebada','#fb8072','#80b1d3','#fdb462','#b3de69','#fccde5','#d9d9d9','#bc80bd')
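# cols, movies_clust, and users_clust are not defined in this excerpt;
# they are assumed to come from earlier steps
superheat(
  movies_mat,
  left.label.text.size = 4,
  order.rows = order(movies_clust$cluster),  # group movies by cluster
  order.cols = order(users_clust$cluster),   # group users by cluster
  heat.pal = cols,
  heat.lim = c(0, 5),
  yr = rowSums(movies_mat > 0),              # number of nonzero ratings per movie
  yr.axis.name = "Number of Ratings",
  yr.obs.col = cluster_cols[movies_clust$cluster],
  yr.plot.type = "bar"
)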
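To see what order(movies_clust$cluster) and rowSums(movies_mat > 0) compute, here is a small sketch on a made-up ratings matrix. The matrix, its labels, and the cluster vector are hypothetical stand-ins for movies_mat and movies_clust$cluster, and the sketch assumes that 0 marks a missing rating.

```r
# Hypothetical toy ratings matrix: rows are movies, columns are users,
# and 0 is assumed to mean "not rated"
ratings <- matrix(
  c(5, 0, 4,
    0, 1, 0,
    4, 5, 5,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("movie_", 1:4), paste0("user_", 1:3))
)

# Hypothetical cluster labels, standing in for movies_clust$cluster
cluster <- c(2, 1, 2, 1)

# order() returns the row permutation that places equal labels next to each other
row_order <- order(cluster)
ratings[row_order, ]  # rows now appear as movie_2, movie_4, movie_1, movie_3

# Counting nonzero entries per row gives the number of ratings per movie
n_ratings <- rowSums(ratings > 0)
```

Passing a permutation like row_order to order.rows is what places movies from the same cluster in adjacent rows of the heatmap.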
The pretty.order.rows and pretty.order.cols arguments use hierarchical clustering to reorder the heatmap. The trees estimated by pretty.order.rows and pretty.order.cols can also be visualized.
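The idea of reordering by a hierarchical clustering can be sketched with base R alone. This is an illustration on a simulated matrix, not a description of superheat's internals: it assumes complete-linkage hclust on Euclidean distances, which may differ from what the package does.

```r
# Simulated 8 x 5 matrix standing in for a table of values
set.seed(2)
x <- matrix(rnorm(40), nrow = 8, dimnames = list(paste0("row_", 1:8), NULL))

# One tree over the rows and one over the columns
row_tree <- hclust(dist(x))
col_tree <- hclust(dist(t(x)))

# The leaf order of each tree gives a permutation that places
# similar rows (or columns) next to each other
x_reordered <- x[row_tree$order, col_tree$order]

# The tree itself can be inspected or drawn; in an interactive session,
# plot(dend) would display it
dend <- as.dendrogram(row_tree)
```

Because the permutation comes from the leaves of a tree, neighboring rows in x_reordered tend to belong to the same low-level branch, which is what makes block structure visible in the reordered heatmap.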