An application to a gene expression dataset.
library("dplyr")
library("forcats")
library("ggplot2")
library("readr")
library("superheat")
library("tibble")
library("tidyr")
library("tidytext")
library("topicmodels")
theme479 <- theme_minimal() +
theme(
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "#f7f7f7"),
panel.border = element_rect(fill = NA, color = "#0c0c0c", size = 0.6),
legend.position = "bottom"
)
theme_set(theme479)
We have used text data analysis to motivate and illustrate the use of topic models. However, these models can be used whenever we have high-dimensional count data1. To illustrate this broad applicability, this lecture will consider an example from gene expression analysis.
The dataset we consider comes from the GTEX consortium. A variety of tissue samples have been subject to RNA-seq analysis, which measures how much of each type of gene is expressed within each sample. Intuitively, we relate,
x <- read_csv("https://uwmadison.box.com/shared/static/fd437om519i5mrnur14xy6dq3ls0yqt2.csv")
x
# A tibble: 2,000,000 x 6
sample gene tissue tissue_detail Description value
<chr> <chr> <chr> <chr> <chr> <dbl>
1 GTEX-NFK9-092… ENSG0000… Heart Heart - Left Ven… FGR 368
2 GTEX-OXRO-001… ENSG0000… Brain Brain - Frontal … FGR 593
3 GTEX-QLQ7-052… ENSG0000… Heart Heart - Left Ven… FGR 773
4 GTEX-POMQ-032… ENSG0000… Heart Heart - Left Ven… FGR 330
5 GTEX-QESD-052… ENSG0000… Heart Heart - Left Ven… FGR 357
6 GTEX-OHPN-001… ENSG0000… Brain Brain - Amygdala FGR 571
7 GTEX-OHPK-032… ENSG0000… Heart Heart - Left Ven… FGR 391
8 GTEX-OIZG-112… ENSG0000… Heart Heart - Left Ven… FGR 425
9 GTEX-O5YW-032… ENSG0000… Heart Heart - Left Ven… FGR 172
10 GTEX-REY6-102… ENSG0000… Heart Heart - Left Ven… FGR 875
# … with 1,999,990 more rows
The goal here is to find sets of genes that tend to be expressed together, because these co-expression patterns might be indications of shared biological processes. Unlike clustering, which assumes that each sample is described by one gene expression profile, a topic model will be able to model each tissue sample as a mixture of profiles (i.e., a mixture of underlying biological processes).
As a first step in our analysis, we need to prepare a DocumentTermMatrix
for use by the topicmodels package. Since the data were in tidy format, we can use the cast_dtm
function to spreaed genes across columns. From there, we can fit an LDA model. However, we’ve commented out the code (it takes a while to run) and instead just download the results that we’ve hosted on Box.
x_dtm <- cast_dtm(x, sample, gene, value)
#fit <- LDA(x_dtm, k = 10, control = list(seed = 479))
#save(fit, file = "lda_gtex.rda")
f <- tempfile()
download.file("https://uwmadison.box.com/shared/static/ifgo6fbvm8bdlshzegb5ty8xif5istn8.rda", f)
fit <- get(load(f))
discriminative_genes <- topics %>%
group_by(term) %>%
mutate(D = discrepancy(beta)) %>%
ungroup() %>%
slice_max(D, n = 400) %>%
mutate(term = fct_reorder(term, -D))
discriminative_genes %>%
pivot_wider(term, topic, values_from = "beta") %>%
column_to_rownames("term") %>%
superheat(
pretty.order.rows = TRUE,
left.label.size = 1.5,
left.label.text.size = 3,
bottom.label.size = 0.05,
legend = FALSE
)
keep_tissues <- memberships %>%
count(tissue) %>%
filter(n > 70) %>%
pull(tissue)
hclust_result <- hclust(dist(fit@gamma))
document_order <- fit@documents[hclust_result$order]
memberships <- memberships %>%
filter(tissue %in% keep_tissues) %>%
mutate(document = factor(document, levels = document_order))
ggplot(memberships, aes(gamma, document, fill = topic, col = topic)) +
geom_col(position = position_stack()) +
facet_grid(tissue ~ ., scales = "free", space = "free") +
scale_x_continuous(expand = c(0, 0)) +
scale_color_brewer(palette = "Set3", guide = FALSE) +
scale_fill_brewer(palette = "Set3") +
labs(x = "Topic Membership", y = "Sample", fill = "Topic") +
theme(
panel.spacing = unit(0.5, "lines"),
strip.switch.pad.grid = unit(0, "cm"),
strip.text.y = element_text(size = 8, angle = 0),
axis.text.y = element_blank(),
)
In fact, topic models are an example of a larger family of models, called mixed-membership models. All of these models generalize clustering, and different variants can be applied to other data types, like continuous, categorical, and network data.↩︎