Multiscale Topic Modeling with the Alto Package

`$\def\Dir{\text{Dir}}$`
`$\def\Mult{\text{Mult}}$`
`$\def\*#1{\mathbf{#1}}$`
`$\def\m#1{\boldsymbol{#1}}$`
`$\def\Unif{\text{Unif}}$`
`$\def\win{\tilde{w}_{\text{in}}}$`
`$\def\reals{\mathbb{R}}$`
`$\newcommand{\wout}{\tilde w_{\text{out}}}$`

## Multiscale Topic Modeling with the Alto Package

<div id="subtitle">
Kris Sankaran <br/>
30 | April | 2025 <br/>
<a href="https://measurement-and-microbes.org">measurement-and-microbes.org</a> <br/>
</div>

<div id="subtitle_right">
SDSS 2025<br/>
Slides: <a href="https://go.wisc.edu/9q5ls4">go.wisc.edu/9q5ls4</a><br/>
Paper: <a href="https://go.wisc.edu/tify36">go.wisc.edu/tify36</a>
</div>

---

### Model

Topic models suppose that samples `$x_i \in \mathbb{R}^{D}$` are drawn independently:
`\begin{align*}
x_i \vert \gamma_i &\sim \text{Mult}\left(n_{i}, \*B\gamma_{i}\right) \\
\gamma_{i} &\sim \text{Dir}\left(\lambda_{\gamma} 1_{K}\right)
\end{align*}`
where the columns `$\beta_{k}$` of `$\*B \in \Delta^{K}_{D}$` lie in the `$D$`-dimensional simplex and are themselves drawn independently from,
`\begin{align*}
\beta_{k} \sim \text{Dir}\left(\lambda_{\beta} 1_{D}\right).
\end{align*}`

We vertically stack the `$N$` `$\gamma_i$`'s into `$\Gamma \in \Delta^{N}_{K}$`.

---

### Simplex View

This model considers two sets of mixtures simultaneously.

* Memberships: `$\gamma_{i}$` describes sample `$x_i$` as a mixture of topics.
* Topics: `$\beta_{k}$` describes the composition of topic `$k$`.

---

### Example: Gut Microbiome

If we use topic models, we can see that <span style="color: #476b57;">Topic 2</span> increases during the antibiotic interventions,
especially the first [1].

---

### Example: Gut Microbiome

We can interpret topics by looking for representative taxa. These are species
that have much higher probabilities in one topic compared to the others.

---

---

### Choice of `$K$`

> However, we stress that care should be taken in the interpretation of the
inferred value of `$K$`. To begin with, due to the very high dimensionality of the
parameter space, we found it difficult to obtain reliable estimates of
`$P\left(X \vert K\right)$`... There are also biological reasons to be careful
interpreting `$K$`.

-- From [2].

In practice, it's common to check results and goodness-of-fit measures across
many `$K$` [3; 4; 5].

---

### `alto`: Main Idea

We will fit an ensemble of models of varying complexities. Then,
post-estimation, we will build a compact representation of the result.

.center[
<img src="figure/alto_sketches_annotated alignment.png" width="750" style="display: block; margin: auto;" />

<span style="font-size: 20px;">
In the Sankey diagram, columns are models and rectangles are topics.
</span>
]

---

### Alignment as a Graph

We view an alignment as a graph across the ensemble. Index models by `$m$` and
topics by `$k$`. Then,
* Nodes `$V$` describe topics, parameterized by `$\{\beta^m_{k}, \gamma^m_{k}\}$`.
* Edges `$E$` link topics from neighboring models, i.e. `$K$` to `$K + 1$`.
* Weights `$W$` encode the similarity between topics.

---

### Notation

This graph-based view provides a convenient notation,

* `$m\left(v\right)$` is the model for node `$v$`
* `$k\left(v\right)$` is the topic for node `$v$`
* `$\Gamma\left(v\right) := \left(\gamma_{ v\left(k\right)}^m\left(k\right)\right) \in \reals^n_{+}$` is the vector of
mixed memberships for topic `$v$`
* `$\beta\left(v\right) := \beta_{k}^m \in \Delta^{D}$` is the
corresponding topic distribution
* `$e = \left(v, v'\right)$` is an edge linking topics `$v$` and `$v'$`.

---

### Estimating Weights

Let `$V_p$` and `$V_q$` be two subsets of topics within the graph.

* Let the total "mass" of `$V_p$` be `$p = \left\{\Gamma\left(v\right)^T 1 : v \in V_{p}\right\}$`. Define `$q$` similarly.
* Define the transport cost `$C\left(v, v^\prime\right) := JSD\left(\beta\left(v\right), \beta\left(v^\prime\right)\right)$`, the Jensen-Shannon divergence between the pair of topic distributions [6].

---

### Estimating Weights

The weights `$W$` can be estimated by solving the optimal transport problem,
`\begin{align*}
&\min_{W \in \mathcal{U}\left(p, q\right)} \left<C,W\right>
\end{align*}`
<span style="font-size: 20px;">
`\begin{align*}
\mathcal{U}\left(p, q\right) := &\{W\in \mathbb{R}^{\left|V_{p}\right| \times \left|V_{q}\right|}_{+} : W 1_{\left|V_{q}\right|} = p \text{ and } W^{T} 1_{\left|V_{p}\right|} = q\}.
\end{align*}`
</span>

---

### Diagnostics

* Paths: Partitions the Sankey diagram into connected sets of topics.
* Coherence: Measures transience of a topic along its path.
* Refinement: Reflects degree of mixing between descendant topics.

---

---

### True Model

Sanity check - What is the alignment when data are generated from a topic model?
Can you guess the true `$K$`?

* `$N = 250, D = 1000, \lambda_{\gamma} = 0.5, \lambda_{\beta} = 0.1$`

---

### Diagnostics

The diagnostics suggest that the true `$K$` is 5.

---

### Model with background variation

Can we detect systematic departures from the assumed model?  Consider the
following generative mechanism,

`\begin{align*}
x_{i} \vert \*B, \gamma_{i}, \nu_i &\sim \Mult\left(n_{i}, \alpha \*B\gamma_{i} + \left(1 - \alpha\right)\nu_i\right) \\
\nu_{i} &\sim \Dir\left(\lambda_{\nu}\right) \\
\gamma_i &\sim \Dir\left(\lambda_{\gamma}\right) \\
\beta_{k} &\sim \Dir\left(\lambda_{\beta}\right),
\end{align*}`

where `$\*B$` stacks the `$\beta_k$` rowwise.

---

### Result

The alignment structure is sensitive to changes in `$\alpha$` and fragments when
structure is not present.

.pull-left[
<img src="figure/gradient_flow-1.png" width="300" style="display: block; margin: auto;" /><img src="figure/gradient_flow-2.png" width="300" style="display: block; margin: auto;" />
]

.pull-right[
<img src="figure/gradient_flow-3.png" width="300" style="display: block; margin: auto;" /><img src="figure/gradient_flow-4.png" width="300" style="display: block; margin: auto;" />
]

---

### Diagnostics

.pull-left-small[
This structure is consistent across simulation runs, and the diagnostics
quantify topic deterioration.
]

.pull-right-large[
<img src="figure/gradient-combined.png" width="635" style="display: block; margin: auto;" />
]

---

### Data Analysis Background

.pull-left[
[7] used clustering to identify 5 Community State Types (CSTs) in
the vaginal microbiome.
  - Four healthy CSTs are dominated by Lactobacillus variants.
  - A fifth dysbiotic CST is more compositionally diverse and has been
  implicated in preterm birth [8; 9] and HIV transmission [10].
]

.pull-right[
  <img src="figure/community_state_types.jpg" width=380/>
  <span style="font-size: 18px;">

[7] grouped samples (columns) into CSTs.
  </span>
]

---

### Deconstructing CSTs

.pull-left[
The follow-up study [11] had many more samples than [7] and so could begin to tease additional structure lying
behind CSTs.

* They had 2179 samples from 135 women, sampled longitudinally.
* The green and blue paths to the right reflect the known Lactobacillus CSTs.
]

<img src="figure/pregnancy_sankey.jpg" width=500/>
]

---

### Coherence Scores

.pull-left[
* Overlaying coherence scores onto the alignment, we can distinguish distinguish
between high and low-coherence topics.

* Coherence is not a simple function of `$K$` alone.
]

---

### Time Series

Like in the antibiotics study, we can visualize change in topics over time. This
highlights smooth transitions in topic memberships.

---

### Software

Topic alignment is implemented in the R package [alto](https://lasy.github.io/alto).

``` r
library(purrr)
library(alto)

# Define LDA parameters
params <- map(
  set_names(1:10),
  ~ list(k = .)
)
models <- run_lda_models(
  vm_data$counts,
  params
)
```
]

``` r
# Run alignment
result <- align_topics(models, method = "transport")
plot(result)
```

<img src="20250430_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" />
]

All the simulations discussed today are vignettes in the package.

---

### Takeaways

Topic alignment is a simple but useful addition to the exploratory data analysis
toolbox for count data.

.pull-three-quarters-right[
[Paper](https://go.wisc.edu/tify36)<br/>
<img src="figure/qr-paper.png" width=80/>
<br/>
<br/>
<br/>
[Package](https://lasy.github.io/alto)<br/>
<img src="figure/qr-package.png" width=80/>
]

---

* Collaborators: Laura Symul (UCLouvain), Julia Fukuyama (IU Bloomington)
* Lab Members: Margaret Thairu, Hanying Jiang, Shuchen Yan, Yuliang Peng, Kai Cui, Sam Merten
* Funding: NIGMS R01GM152744

---

### References

[1] K. Sankaran et al. "Latent variable modeling for the microbiome". In: _Biostatistics_ 20.4 (Jun. 2018), p. 599–614. ISSN: 1468-4357. DOI:
[10.1093/biostatistics/kxy018](https://doi.org/10.1093%2Fbiostatistics%2Fkxy018). URL:
[http://dx.doi.org/10.1093/biostatistics/kxy018](http://dx.doi.org/10.1093/biostatistics/kxy018).

[2] J. K. Pritchard et al. "Inference of Population Structure Using Multilocus Genotype Data". In: _Genetics_ 155.2 (Jun. 2000), p. 945–959. ISSN: 1943-2631. DOI:
[10.1093/genetics/155.2.945](https://doi.org/10.1093%2Fgenetics%2F155.2.945). URL: [http://dx.doi.org/10.1093/genetics/155.2.945](http://dx.doi.org/10.1093/genetics/155.2.945).

[3] J. Novembre. "Pritchard, Stephens, and Donnelly on Population Structure". In: _Genetics_ 204.2 (Oct. 2016), p. 391–393. ISSN: 1943-2631. DOI:
[10.1534/genetics.116.195164](https://doi.org/10.1534%2Fgenetics.116.195164). URL: [http://dx.doi.org/10.1534/genetics.116.195164](http://dx.doi.org/10.1534/genetics.116.195164).

[4] H. M. Wallach et al. "Evaluation methods for topic models". In: _International Conference on Machine Learning_ (2009). URL:
[https://api.semanticscholar.org/CorpusID:10910725](https://api.semanticscholar.org/CorpusID:10910725).

[5] D. J. Lawson et al. "A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots". In: _Nature Communications_ 9.1 (Aug. 2018). ISSN: 2041-1723. DOI:
[10.1038/s41467-018-05257-7](https://doi.org/10.1038%2Fs41467-018-05257-7). URL: [http://dx.doi.org/10.1038/s41467-018-05257-7](http://dx.doi.org/10.1038/s41467-018-05257-7).

[6] G. Peyré et al. "Computational Optimal Transport".  (2018). DOI: [10.48550/ARXIV.1803.00567](https://doi.org/10.48550%2FARXIV.1803.00567). URL:
[https://arxiv.org/abs/1803.00567](https://arxiv.org/abs/1803.00567).

[7] J. Ravel et al. "Vaginal microbiome of reproductive-age women". In: _Proceedings of the National Academy of Sciences_ 108.supplement_1 (Jun. 2010), p. 4680–4687. ISSN: 1091-6490.
DOI: [10.1073/pnas.1002611107](https://doi.org/10.1073%2Fpnas.1002611107). URL: [http://dx.doi.org/10.1073/pnas.1002611107](http://dx.doi.org/10.1073/pnas.1002611107).

[8] J. M. Fettweis et al. "The vaginal microbiome and preterm birth". In: _Nature Medicine_ 25.6 (May. 2019), p. 1012–1021. ISSN: 1546-170X. DOI:
[10.1038/s41591-019-0450-2](https://doi.org/10.1038%2Fs41591-019-0450-2). URL: [http://dx.doi.org/10.1038/s41591-019-0450-2](http://dx.doi.org/10.1038/s41591-019-0450-2).

[9] U. Gudnadottir et al. "The vaginal microbiome and the risk of preterm birth: a systematic review and network meta-analysis". In: _Scientific Reports_ 12.1 (May. 2022). ISSN:
2045-2322. DOI: [10.1038/s41598-022-12007-9](https://doi.org/10.1038%2Fs41598-022-12007-9). URL:
[http://dx.doi.org/10.1038/s41598-022-12007-9](http://dx.doi.org/10.1038/s41598-022-12007-9).

[10] C. Gosmann et al. "Lactobacillus-Deficient Cervicovaginal Bacterial Communities Are Associated with Increased HIV Acquisition in Young South African Women". In: _Immunity_ 46.1
(Jan. 2017), p. 29–37. ISSN: 1074-7613. DOI: [10.1016/j.immuni.2016.12.013](https://doi.org/10.1016%2Fj.immuni.2016.12.013). URL:
[http://dx.doi.org/10.1016/j.immuni.2016.12.013](http://dx.doi.org/10.1016/j.immuni.2016.12.013).

[11] L. Symul et al. "Sub-communities of the vaginal microbiota in pregnant and non-pregnant women". In: _Proceedings of the Royal Society B: Biological Sciences_ 290.2011 (Nov.
2023). ISSN: 1471-2954. DOI: [10.1098/rspb.2023.1461](https://doi.org/10.1098%2Frspb.2023.1461). URL:
[http://dx.doi.org/10.1098/rspb.2023.1461](http://dx.doi.org/10.1098/rspb.2023.1461).

---

### Paths

For each `$v$`, identify the incoming edge with the highest normalized weight,
`\begin{align*}
  e^\ast\left(v\right) = \arg \max_{e : \text{target}\left(e\right) = v} \tilde{w}_{\text{out}}\left(e\right) + \tilde{w}_{\text{in}}\left(e\right).
\end{align*}`