Topic Models for Multiscale Analysis

`$\def\Dir{\text{Dir}}$`
`$\def\Mult{\text{Mult}}$`
`$\def\*#1{\mathbf{#1}}$`
`$\def\m#1{\boldsymbol{#1}}$`
`$\def\Unif{\text{Unif}}$`
`$\def\win{\tilde{w}_{\text{in}}}$`
`$\def\reals{\mathbb{R}}$`
`$\newcommand{\wout}{\tilde w_{\text{out}}}$`

## Topic Models in Microbiome Analysis

<div id="subtitle_right">
Kris Sankaran 
26 | February | 2025 
<a href="https://measurement-and-microbes.org">measurement-and-microbes.org</a> 
</div>

<div id="subtitle_left">
VUMC Department of Biostatistics 
Slides: <a href="https://go.wisc.edu/w380nq">go.wisc.edu/w380nq</a> 
Paper: <a href="https://go.wisc.edu/tify36">go.wisc.edu/tify36</a>
</div>

---

### Motivation: Enterotypes

The study [1] argued that all human gut microbiomes
could be divided into well-defined enterotypes:

.pull-left[
> ... We identified three robust clusters (enterotypes hereafter) that are not
nation or continent-specific.
]

.pull-right[
<img src="figure/enterotypes.png" width=300/> 

The visualization that motivated the definition of enterotypes.

]

---

### Motivation: Enterotypes

This idea caught on. Could we have discovered something as fundamental as blood type?

---

### Motivation: Enterotypes

Reality is more complex -- and interesting! Microbiomes
are dynamic and not so easily categorized
[2; 3; 4; 5].

The same person can jump between clusters in longitudinal sampling [5], which
complicates the simpler enterotypes story.

---

### Improved Exploration

Samples often can't be cleanly separated into clusters. For more informative
exploration of these data, it helps to consider:

1. **Mixtures**: Instead of matching each sample to a single cluster centroid,
view it as an intermediate between multiple representatives.

1. Multiple Scales: Allow transitions from coarse to fine-grained analysis.

---

### Improved Exploration

Samples often can't be cleanly separated into clusters. For more informative
exploration of these data, it helps to consider:

1. Mixtures: Instead of matching each sample to a single cluster centroid,
view it as an intermediate between multiple representatives.

1. **Multiple Scales**: Allow transitions from coarse to fine-grained analysis.

---

### Origins

Topic models were independently developed for analyzing genotype and text
data [7; 8] and are now widely used in
computational genomics
[9; 10; 11; 12].

---

### Example 1: GTEx

How are genes co-expressed across tissues? We can apply topic models to GTEx
Consortium data, following [13; 14].

---

### Example 1: GTEx

These are the expression patterns associated with each of those topics.

---

### Example 2: Gut Microbiome

How do microbiome communities change in response to antibiotics? The study [15] gathered samples before and after antibiotic
interventions.

.center[
<iframe src="https://krisrs1128.github.io/treelapse/pages/antibiotic.html#htmlwidget-f49f5eec4ced01f92314" width=800 height=360/>
]

---

### Example 2: Gut Microbiome

If we use topic models, we can see that Topic 2 increases after the antibiotic interventions,
especially the first [9].

---

### Example 2: Gut Microbiome

We can interpret topics by looking for representative taxa. These are species
that have much higher probabilities in one topic compared to the others.
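
One way to make "representative" concrete is to score each taxon by how much its largest topic probability exceeds the runner-up. The sketch below uses a toy topic matrix and a log-ratio score; both are illustrative assumptions, not necessarily the measure used in [9].

```r
# Toy topic matrix: rows are taxa, columns are topics (each column sums to 1).
B <- cbind(
  topic1 = c(0.70, 0.10, 0.10, 0.10),
  topic2 = c(0.05, 0.75, 0.10, 0.10),
  topic3 = c(0.05, 0.05, 0.10, 0.80)
)
rownames(B) <- paste0("taxon", 1:4)

# Assumed score: log-ratio of the largest to the second-largest probability.
score <- unname(apply(B, 1, function(b) {
  s <- sort(b, decreasing = TRUE)
  log(s[1] / s[2])
}))

# Topic in which each taxon is most probable (ties go to the first topic).
best_topic <- colnames(B)[apply(B, 1, which.max)]

data.frame(taxon = rownames(B), best_topic, score = round(score, 2))
```

A taxon that is equally probable in every topic scores zero, so it would not be flagged as representative of any of them.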

---

### Model

Topic models suppose that samples `$x_i \in \mathbb{R}^{D}$` are drawn independently:
`\begin{align*}
x_i \vert \gamma_i &\sim \text{Mult}\left(n_{i}, \*B\gamma_{i}\right) \\
\gamma_{i} &\sim \text{Dir}\left(\lambda_{\gamma} 1_{K}\right)
\end{align*}`
where the columns `$\beta_{k}$` of `$\*B \in \Delta^{K}_{D}$` lie in the `$D$`-dimensional simplex and are themselves drawn independently from,
`\begin{align*}
\beta_{k} \sim \text{Dir}\left(\lambda_{\beta} 1_{D}\right).
\end{align*}`

We vertically stack the `$N$` `$\gamma_i$`'s into `$\Gamma \in \Delta^{N}_{K}$`.
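
To make the notation concrete, here is a minimal base-R sketch that simulates from this generative process. The helper `rdirichlet`, the sizes `N`, `D`, `K`, and the depth `n_i` are hypothetical choices made for illustration.

```r
set.seed(1)

# Hypothetical helper: each row is one Dirichlet(alpha) draw,
# obtained by normalizing independent gamma variables.
rdirichlet <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha), nrow = n, byrow = TRUE)
  g / rowSums(g)
}

N <- 20; D <- 50; K <- 3                  # illustrative sizes
lambda_gamma <- 0.5; lambda_beta <- 0.1   # assumed hyperparameters

B <- t(rdirichlet(K, rep(lambda_beta, D)))     # D x K, columns are topics beta_k
Gamma <- rdirichlet(N, rep(lambda_gamma, K))   # N x K, rows are memberships gamma_i
n_i <- rep(1000, N)                            # total count per sample

# x_i | gamma_i ~ Mult(n_i, B gamma_i)
X <- t(sapply(seq_len(N), function(i) {
  rmultinom(1, size = n_i[i], prob = as.vector(B %*% Gamma[i, ]))
}))
dim(X)   # N x D count matrix
```

Small values of `lambda_beta` give sparse topics, and small values of `lambda_gamma` give samples dominated by a single topic.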

---

### Simplex View

This model considers two sets of mixtures simultaneously.

* Memberships: `$\gamma_{i}$` describes sample `$x_i$` as a mixture of topics.
* Topics: `$\beta_{k}$` describes the composition of topic `$k$`.

---

### Choice of `$K$`

> However, we stress that care should be taken in the interpretation of the
inferred value of `$K$`. To begin with, due to the very high dimensionality of the
parameter space, we found it difficult to obtain reliable estimates of
`$P\left(X \vert K\right)$`... There are also biological reasons to be careful
interpreting `$K$`.

-- From [7].

In practice, it's common to check results and goodness-of-fit measures across
many `$K$` [16; 17; 18].

---

### `alto`: Main Idea

We will fit an ensemble of models of varying complexities. Then,
post-estimation, we will build a compact representation of the result.

.center[
<img src="figure/alto_sketches_annotated alignment.png" width="750" style="display: block; margin: auto;" />

In the Sankey diagram, columns are models and rectangles are topics.

]

---

### Alignment as a Graph

We view an alignment as a graph across the ensemble. Index models by `$m$` and
topics by `$k$`. Then,
* Nodes `$V$` describe topics, parameterized by `$\{\beta^m_{k}, \gamma^m_{k}\}$`.
* Edges `$E$` link topics from neighboring models, i.e. `$K$` to `$K + 1$`.
* Weights `$W$` encode the similarity between topics.

---

### Notation

This graph-based view provides a convenient notation,

* `$m\left(v\right)$` is the model for node `$v$`
* `$k\left(v\right)$` is the topic for node `$v$`
* `$\Gamma\left(v\right) := \left(\gamma_{i k\left(v\right)}^{m\left(v\right)}\right)_{i = 1}^{N} \in \reals^{N}_{+}$` is the vector of
mixed memberships for topic `$v$`
* `$\beta\left(v\right) := \beta_{k\left(v\right)}^{m\left(v\right)} \in \Delta^{D}$` is the
corresponding topic distribution
* `$e = \left(v, v'\right)$` is an edge linking topics `$v$` and `$v'$`.

---

### Estimating Weights: Product Approach

.pull-left[
To compute weights, we can use,
`\begin{align*}
w\left(e\right) = \Gamma\left(v\right)^T\Gamma\left(v'\right)
\end{align*}`
]

.pull-right[
<img src="figure/product_alignment_conceptual.png" width="500" style="display: block; margin: auto;" />
]
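
As a rough illustration, the product weights between two neighboring models reduce to a single matrix product. The membership matrices below are random stand-ins, and `normalize_rows`, `Gamma_2`, and `Gamma_3` are hypothetical names.

```r
set.seed(1)
N <- 6

# Hypothetical memberships from two neighboring models (K = 2 and K = 3).
normalize_rows <- function(M) M / rowSums(M)
Gamma_2 <- normalize_rows(matrix(runif(N * 2), N, 2))
Gamma_3 <- normalize_rows(matrix(runif(N * 3), N, 3))

# W[k, l] = Gamma(v)^T Gamma(v') for topic k in the first model
# and topic l in the second.
W <- crossprod(Gamma_2, Gamma_3)   # same as t(Gamma_2) %*% Gamma_3
W
```

Because each sample's memberships sum to one in both models, the entries of `W` sum to `N`.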

---

### Estimating Weights: Transport Approach

Let `$V_p$` and `$V_q$` be two subsets of topics within the graph.

* Let the total "mass" of `$V_p$` be `$p = \left\{\Gamma\left(v\right)^T 1 : v \in V_{p}\right\}$`. Define `$q$` similarly.
* Define the transport cost `$C\left(v, v^\prime\right) := \text{JSD}\left(\beta\left(v\right), \beta\left(v^\prime\right)\right)$`, the Jensen-Shannon divergence between the pair of topic distributions [19].

---

### Estimating Weights: Transport Approach

The weights `$W$` can be estimated by solving the optimal transport problem,
`\begin{align*}
&\min_{W \in \mathcal{U}\left(p, q\right)} \left<C,W\right>
\end{align*}`

`\begin{align*}
\mathcal{U}\left(p, q\right) := &\{W\in \mathbb{R}^{\left|V_{p}\right| \times \left|V_{q}\right|}_{+} : W 1_{\left|V_{q}\right|} = p \text{ and } W^{T} 1_{\left|V_{p}\right|} = q\}.
\end{align*}`
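
The sketch below shows the shape of this computation in base R. It swaps the exact linear program for Sinkhorn iterations, which solve an entropically regularized version of the problem, so the result only approximates the minimizer above. The functions `jsd` and `sinkhorn`, the toy topics, and the settings `eps` and `n_iter` are all assumptions made for illustration.

```r
# Jensen-Shannon divergence between two probability vectors.
jsd <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

# Sinkhorn iterations for entropic optimal transport (an approximation).
sinkhorn <- function(C, p, q, eps = 0.1, n_iter = 1000) {
  Kmat <- exp(-C / eps)
  u <- rep(1, length(p))
  for (it in seq_len(n_iter)) {
    v <- q / as.vector(crossprod(Kmat, u))   # rescale to match column sums q
    u <- p / as.vector(Kmat %*% v)           # rescale to match row sums p
  }
  u * Kmat * rep(v, each = length(p))        # W = diag(u) K diag(v)
}

# Toy topics (columns) for two neighboring models; D = 3 taxa.
B_p <- cbind(c(0.8, 0.1, 0.1), c(0.1, 0.1, 0.8))
B_q <- cbind(c(0.8, 0.1, 0.1), c(0.1, 0.8, 0.1), c(0.1, 0.1, 0.8))
p <- c(6, 4)      # topic masses Gamma(v)^T 1 in the coarser model
q <- c(5, 2, 3)   # topic masses in the finer model (same total)

# Pairwise transport costs between topics.
C <- outer(seq_len(ncol(B_p)), seq_len(ncol(B_q)),
           Vectorize(function(k, l) jsd(B_p[, k], B_q[, l])))
W <- sinkhorn(C, p, q)
round(W, 2)
```

The row and column sums of `W` match `p` and `q`, and most mass flows between topics with similar compositions. An exact solution would need a linear programming solver in place of `sinkhorn`.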

---

### Diagnostics

* Paths: Partitions the Sankey diagram into connected sets of topics.
* Coherence: Measures transience of a topic along its path.
* Refinement: Reflects degree of mixing between descendant topics.

---

### True Model

Sanity check: what does the alignment look like when data are generated from a
topic model? Can you guess the true `$K$`?

* `$N = 250, D = 1000, \lambda_{\gamma} = 0.5, \lambda_{\beta} = 0.1$`

---

### Diagnostics

The diagnostics suggest that the true `$K$` is 5.

---

The diagnostics become more reliable as the sample size increases.

---

### Model with background variation

Can we detect systematic departures from the assumed model?  Consider the
following generative mechanism,

`\begin{align*}
x_{i} \vert \*B, \gamma_{i}, \nu_i &\sim \Mult\left(n_{i}, \alpha \*B\gamma_{i} + \left(1 - \alpha\right)\nu_i\right) \\
\nu_{i} &\sim \Dir\left(\lambda_{\nu}\right) \\
\gamma_i &\sim \Dir\left(\lambda_{\gamma}\right) \\
\beta_{k} &\sim \Dir\left(\lambda_{\beta}\right),
\end{align*}`

where `$\*B$` has the `$\beta_k$` as its columns, so that `$\*B\gamma_{i}$` is a mixture of topics as in the original model.
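
A minimal simulation of this mechanism in base R might look as follows. The helper `rdirichlet`, the sizes, the sequencing depth, and the hyperparameter values (including the one used for `$\lambda_{\nu}$`) are assumptions chosen for illustration.

```r
set.seed(1)

# Hypothetical helper: each row is one Dirichlet(alpha) draw.
rdirichlet <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha), nrow = n, byrow = TRUE)
  g / rowSums(g)
}

N <- 10; D <- 30; K <- 3
alpha <- 0.7   # alpha = 1 recovers the topic model, alpha = 0 is pure background

B <- t(rdirichlet(K, rep(0.1, D)))    # D x K, columns are topics
Gamma <- rdirichlet(N, rep(0.5, K))   # N x K memberships
Nu <- rdirichlet(N, rep(0.5, D))      # N x D sample-specific background

# Row i holds alpha * B gamma_i + (1 - alpha) * nu_i.
P <- alpha * Gamma %*% t(B) + (1 - alpha) * Nu
X <- t(sapply(seq_len(N), function(i) rmultinom(1, size = 1000, prob = P[i, ])))
```

Sweeping `alpha` from 1 down to 0 moves the data from a true topic model toward unstructured Dirichlet noise.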

---

### Result

Question: Which of these is closer to a true topic model, and which are closer
to random Dirichlets?

.pull-left[
<img src="figure/gradient_flow-1.png" width="300" style="display: block; margin: auto;" /><img src="figure/gradient_flow-2.png" width="300" style="display: block; margin: auto;" />
]

.pull-right[
<img src="figure/gradient_flow-3.png" width="300" style="display: block; margin: auto;" /><img src="figure/gradient_flow-4.png" width="300" style="display: block; margin: auto;" />
]

---

### Diagnostics

.pull-left-small[
This structure is consistent across simulation runs, and the diagnostics
quantify topic deterioration.
]

.pull-right-large[
<img src="figure/gradient-combined.png" width="635" style="display: block; margin: auto;" />
]

---

### Data Analysis Background

.pull-left[
Reference [20] used clustering to identify 5 Community State Types (CSTs) in
the vaginal microbiome.
  - Four healthy CSTs are dominated by Lactobacillus variants.
  - A fifth dysbiotic CST is more compositionally diverse and has been
  implicated in preterm birth [21; 22] and HIV transmission [23].
]

.pull-right[
Reference [20] grouped samples (columns) into CSTs.
]

---

### Deconstructing CSTs

.pull-left[
The follow-up study [6] had many more samples than [20] and so could begin to tease apart additional structure lying
behind CSTs.

* They had 2179 samples from 135 women, sampled longitudinally.
* The green and blue paths to the right reflect the known Lactobacillus CSTs.
]

.pull-right[
<img src="figure/pregnancy_sankey.jpg" width=500/>
]

---

### Coherence Scores

.pull-left[
* Overlaying coherence scores onto the alignment, we can distinguish
between high and low-coherence topics.

* Coherence is not a simple function of `$K$` alone.
]

---

### Metrics across Replicates

.pull-left[
* We can see dropoffs in topic quality for `$K = 6 \to 7$` and `$K = 9 \to 10$`.
* Each color in the top figure corresponds to a different set of topic models.
]

---

### Taxonomic Breakdown

At `$K = 5$`, four of the previously known CSTs are present. At `$K = 9$`, we can
see additional community structure in the dysbiotic CST.

---

### Time Series

As in the antibiotics study, we can visualize change in topics over time. This
highlights smooth transitions in topic memberships.

---

.pull-left[
**Test set perplexity** [24; 17]
  - Similar: Helps to choose `$K$`.
  - Different (-): Doesn't give topic-level quality.
  - Different (+): Evaluates generalization ability.
]

.pull-right[
**Hierarchical Topic Models** [25; 26]
  - Similar: Learns topics at multiple levels of resolution.
  - Different (-): Requires word-level representations.
  - Different (+/-): Words belong to individual subtrees.
]

---
class: small-code

### Software

Topic alignment is implemented in the R package [alto](https://lasy.github.io/alto).

.pull-left[
``` r
library(purrr)
library(alto)

# Define LDA parameters
params <- map(
 set_names(1:10),
 ~ list(k = .)
)
models <- run_lda_models(
 vm_data$counts,
 params
)
```
]

.pull-right[
``` r
# Run alignment
result <- align_topics(models, method = "transport")
plot(result)
```

<img src="20250226_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
]

All the simulations discussed today are vignettes in the package.

---

### Takeaways

Topic alignment is a simple but useful addition to the exploratory data analysis
toolbox for count data.

.pull-three-quarters-right[
[Paper](https://go.wisc.edu/tify36) 
<img src="figure/qr-paper.png" width=80/>
 
 
 
[Package](https://lasy.github.io/alto) 
<img src="figure/qr-package.png" width=80/>
]

---

* Collaborators: Laura Symul (UCLouvain), Julia Fukuyama (IU Bloomington)
* Lab Members: Margaret Thairu, Shuchen Yan, Yuliang Peng, Helena Huang
* Funding: NIGMS R01GM152744, NIAID R01AI184095

---

### References

[1] M. Arumugam et al. "Enterotypes of the human gut microbiome". In: _Nature_ 473.7346 (Apr. 2011), p. 174–180. ISSN: 1476-4687. DOI:
[10.1038/nature09944](https://doi.org/10.1038%2Fnature09944). URL: [http://dx.doi.org/10.1038/nature09944](http://dx.doi.org/10.1038/nature09944).

[2] E. Yong. "Gut microbial 'enterotypes' become less clear-cut". In: _Nature_ (Mar. 2012).

[3] O. Koren et al. "A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets". In: _PLoS Comput.
Biol._ 9.1 (Jan. 2013), p. e1002863.

[4] I. Bulygin et al. "Absence of enterotypes in the human gut microbiomes reanalyzed with non-linear dimensionality reduction methods". In: _PeerJ_ 11 (Sep. 2023),
p. e15838.

[5] D. Knights et al. "Rethinking enterotypes". In: _Cell Host Microbe_ 16.4 (Oct. 2014), p. 433–437.

[6] L. Symul et al. "Sub-communities of the vaginal microbiota in pregnant and non-pregnant women". In: _Proceedings of the Royal Society B: Biological Sciences_ 290.2011
(Nov. 2023). ISSN: 1471-2954. DOI: [10.1098/rspb.2023.1461](https://doi.org/10.1098%2Frspb.2023.1461). URL:
[http://dx.doi.org/10.1098/rspb.2023.1461](http://dx.doi.org/10.1098/rspb.2023.1461).

[7] J. K. Pritchard et al. "Inference of Population Structure Using Multilocus Genotype Data". In: _Genetics_ 155.2 (Jun. 2000), p. 945–959. ISSN: 1943-2631. DOI:
[10.1093/genetics/155.2.945](https://doi.org/10.1093%2Fgenetics%2F155.2.945). URL:
[http://dx.doi.org/10.1093/genetics/155.2.945](http://dx.doi.org/10.1093/genetics/155.2.945).

[8] D. M. Blei et al. "Latent dirichlet allocation". In: _J. Mach. Learn. Res._ 3 (Mar. 2003), p. 993–1022. ISSN: 1532-4435.

[9] K. Sankaran et al. "Latent variable modeling for the microbiome". In: _Biostatistics_ 20.4 (Jun. 2018), p. 599–614. ISSN: 1468-4357. DOI:
[10.1093/biostatistics/kxy018](https://doi.org/10.1093%2Fbiostatistics%2Fkxy018). URL:
[http://dx.doi.org/10.1093/biostatistics/kxy018](http://dx.doi.org/10.1093/biostatistics/kxy018).

[10] A. Kim et al. "Latent Dirichlet Allocation modeling of environmental microbiomes". In: _PLOS Computational Biology_ 19.6 (Jun. 2023). Ed. by G. Zeller, p. e1011075.
ISSN: 1553-7358. DOI: [10.1371/journal.pcbi.1011075](https://doi.org/10.1371%2Fjournal.pcbi.1011075). URL:
[http://dx.doi.org/10.1371/journal.pcbi.1011075](http://dx.doi.org/10.1371/journal.pcbi.1011075).

[11] C. Tataru et al. "Topic modeling for multi-omic integration in the human gut microbiome and implications for Autism". In: _Scientific Reports_ 13.1 (Jul. 2023).
ISSN: 2045-2322. DOI: [10.1038/s41598-023-38228-0](https://doi.org/10.1038%2Fs41598-023-38228-0). URL:
[http://dx.doi.org/10.1038/s41598-023-38228-0](http://dx.doi.org/10.1038/s41598-023-38228-0).

[12] X. Peng et al. "A topic modeling approach reveals the dynamic T cell composition of peripheral blood during cancer immunotherapy". In: _Cell Reports Methods_ 3.8
(Aug. 2023), p. 100546. ISSN: 2667-2375. DOI: [10.1016/j.crmeth.2023.100546](https://doi.org/10.1016%2Fj.crmeth.2023.100546). URL:
[http://dx.doi.org/10.1016/j.crmeth.2023.100546](http://dx.doi.org/10.1016/j.crmeth.2023.100546).

[13] K. K. Dey et al. "Visualizing the structure of RNA-seq expression data using grade of membership models". In: _PLOS Genetics_ 13.3 (Mar. 2017). Ed. by A. Kundaje, p.
e1006599. ISSN: 1553-7404. DOI: [10.1371/journal.pgen.1006599](https://doi.org/10.1371%2Fjournal.pgen.1006599). URL:
[http://dx.doi.org/10.1371/journal.pgen.1006599](http://dx.doi.org/10.1371/journal.pgen.1006599).

---

### References

[14] J. H. Kushal K Dey. "GTEX V6 analysis - stephenslab.github.io". [Accessed 21-07-2024].

[15] L. Dethlefsen et al. "Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation". In: _Proceedings of
the National Academy of Sciences_ 108.supplement_1 (Sep. 2010), p. 4554–4561. ISSN: 1091-6490. DOI: [10.1073/pnas.1000087107](https://doi.org/10.1073%2Fpnas.1000087107).
URL: [http://dx.doi.org/10.1073/pnas.1000087107](http://dx.doi.org/10.1073/pnas.1000087107).

[16] J. Novembre. "Pritchard, Stephens, and Donnelly on Population Structure". In: _Genetics_ 204.2 (Oct. 2016), p. 391–393. ISSN: 1943-2631. DOI:
[10.1534/genetics.116.195164](https://doi.org/10.1534%2Fgenetics.116.195164). URL:
[http://dx.doi.org/10.1534/genetics.116.195164](http://dx.doi.org/10.1534/genetics.116.195164).

[17] H. M. Wallach et al. "Evaluation methods for topic models". In: _International Conference on Machine Learning_ (2009). URL:
[https://api.semanticscholar.org/CorpusID:10910725](https://api.semanticscholar.org/CorpusID:10910725).

[18] D. J. Lawson et al. "A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots". In: _Nature Communications_ 9.1 (Aug. 2018). ISSN: 2041-1723. DOI:
[10.1038/s41467-018-05257-7](https://doi.org/10.1038%2Fs41467-018-05257-7). URL:
[http://dx.doi.org/10.1038/s41467-018-05257-7](http://dx.doi.org/10.1038/s41467-018-05257-7).

[19] G. Peyré et al. "Computational Optimal Transport".  (2018). DOI: [10.48550/ARXIV.1803.00567](https://doi.org/10.48550%2FARXIV.1803.00567). URL:
[https://arxiv.org/abs/1803.00567](https://arxiv.org/abs/1803.00567).

[20] J. Ravel et al. "Vaginal microbiome of reproductive-age women". In: _Proceedings of the National Academy of Sciences_ 108.supplement_1 (Jun. 2010), p. 4680–4687.
ISSN: 1091-6490. DOI: [10.1073/pnas.1002611107](https://doi.org/10.1073%2Fpnas.1002611107). URL:
[http://dx.doi.org/10.1073/pnas.1002611107](http://dx.doi.org/10.1073/pnas.1002611107).

[21] J. M. Fettweis et al. "The vaginal microbiome and preterm birth". In: _Nature Medicine_ 25.6 (May. 2019), p. 1012–1021. ISSN: 1546-170X. DOI:
[10.1038/s41591-019-0450-2](https://doi.org/10.1038%2Fs41591-019-0450-2). URL: [http://dx.doi.org/10.1038/s41591-019-0450-2](http://dx.doi.org/10.1038/s41591-019-0450-2).

[22] U. Gudnadottir et al. "The vaginal microbiome and the risk of preterm birth: a systematic review and network meta-analysis". In: _Scientific Reports_ 12.1 (May.
2022). ISSN: 2045-2322. DOI: [10.1038/s41598-022-12007-9](https://doi.org/10.1038%2Fs41598-022-12007-9). URL:
[http://dx.doi.org/10.1038/s41598-022-12007-9](http://dx.doi.org/10.1038/s41598-022-12007-9).

[23] C. Gosmann et al. "Lactobacillus-Deficient Cervicovaginal Bacterial Communities Are Associated with Increased HIV Acquisition in Young South African Women". In:
_Immunity_ 46.1 (Jan. 2017), p. 29–37. ISSN: 1074-7613. DOI: [10.1016/j.immuni.2016.12.013](https://doi.org/10.1016%2Fj.immuni.2016.12.013). URL:
[http://dx.doi.org/10.1016/j.immuni.2016.12.013](http://dx.doi.org/10.1016/j.immuni.2016.12.013).

[24] J. Foulds et al. "Annealing paths for the evaluation of topic models". In: _Proceedings of the Empirical Methods in Natural Language Processing Conference (EMNLP
2013)_ (Jan. 2013).

[25] D. M. Blei et al. "Hierarchical topic models and the nested chinese restaurant process". In: _Proceedings of the 16th International Conference on Neural Information
Processing Systems_. NIPS'03. Whistler, British Columbia, Canada: MIT Press, 2003, p. 17–24.

[26] A. Smith et al. "Hiérarchie: Visualization for Hierarchical Topic Models". (2014). DOI: [10.3115/v1/w14-3111](https://doi.org/10.3115%2Fv1%2Fw14-3111). URL:
[http://dx.doi.org/10.3115/v1/W14-3111](http://dx.doi.org/10.3115/v1/W14-3111).

---

### Paths

For each `$v$`, identify the incoming edge with the highest normalized weight,
`\begin{align*}
  e^\ast\left(v\right) = \arg \max_{e : \text{target}\left(e\right) = v} \tilde{w}_{\text{out}}\left(e\right) + \tilde{w}_{\text{in}}\left(e\right).
\end{align*}`
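
The slide does not define the normalized weights, so the sketch below assumes one natural choice: `w_out` divides each weight by the total weight leaving its source topic, and `w_in` divides it by the total weight entering its target topic. The toy matrix `W` and the name `parent` are hypothetical.

```r
# Toy weights between a K = 2 model (rows) and a K = 3 model (columns).
W <- matrix(c(5.0, 1.0, 0.2,
              0.3, 1.5, 3.0), nrow = 2, byrow = TRUE)

# Assumed normalizations (not defined on the slide).
w_out <- W / rowSums(W)                 # share of each source's outgoing weight
w_in  <- sweep(W, 2, colSums(W), "/")   # share of each target's incoming weight

# For each target topic v (column), keep the incoming edge that
# maximizes w_out(e) + w_in(e), recording the index of its source.
parent <- apply(w_out + w_in, 2, which.max)
parent
```

Following these selected edges from model to model links topics into chains, which is one way to read the paths in the Sankey diagram.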