Multiscale Analysis of Microbiome Data with Alto

class: title

`$\def\Dir{\text{Dir}}$`
`$\def\Mult{\text{Mult}}$`
`$\def\*#1{\mathbf{#1}}$`
`$\def\m#1{\boldsymbol{#1}}$`
`$\def\Unif{\text{Unif}}$`
`$\def\win{\tilde{w}_{\text{in}}}$`
`$\def\reals{\mathbb{R}}$`
`$\newcommand{\wout}{\tilde w_{\text{out}}}$`

## Multiscale Analysis of Microbiome Data with Alto

<div id="subtitle">
Kris Sankaran <br/>
03 | May | 2024 <br/>
Lab: <a href="https://go.wisc.edu/pgb8nl">go.wisc.edu/pgb8nl</a> <br/>
</div>

<div id="subtitle_right">
STATGEN 2024 <br/>
Microbiome Data Analysis <br/>
Slides: <a href="https://go.wisc.edu/u10492">go.wisc.edu/u10492</a> 
</div>

---

### Motivation: Microbiome Data Analysis

.pull-left[
* How should we describe the bacterial ecosystems that surround (and
inhabit) us?
* An improved understanding would support human and environmental health
]

.pull-right[
<div class="figure" style="text-align: left">
<img src="https://whatislife.stanford.edu/images/spatial.png" alt="A 'photo' of the microbiome along the gut lining, from Earle et al. 2015."  />
<p class="caption">A 'photo' of the microbiome along the gut lining, from Earle et al. 2015.</p>
</div>
]

---

### Iterative Data Structuration

.pull-left[
* Iterative data structuration creates a sequence of simple `$\to$` complex models (Holmes, 1993; Mallows et al., 1982)
* It makes it easy to navigate the interpretability/expressivity tradeoff
  - Begin with the overview (context), then focus on residual variation (focus)
]

.pull-right[
<div class="figure" style="text-align: left">
<img src="figure/structuration.png" alt="A metaphor for structuration, from (Holmes, 1993)." width="340" />
<p class="caption">A metaphor for structuration, from (Holmes, 1993).</p>
</div>
]

---

### Latent Dirichlet Allocation

Latent Dirichlet Allocation (Blei et al., 2003) is often useful for
dimensionality reduction for microbiome data (Sankaran et al., 2018; Kim et al., 2023; Tataru et al., 2023).
It supposes that samples `$x_i \in \mathbb{R}^{D}$` are drawn independently:
`\begin{align*}
x_i \vert \gamma_i &\sim \text{Mult}\left(n_{i}, \*B\gamma_{i}\right) \\
\gamma_{i} &\sim \text{Dir}\left(\lambda_{\gamma} 1_{K}\right)
\end{align*}`
where the columns `$\beta_{k}$` of `$\*B \in \Delta^{K}_{D}$` lie in the `$D$`-dimensional simplex and are themselves drawn independently from,
`\begin{align*}
\beta_{k} \sim \text{Dir}\left(\lambda_{\beta} 1_{D}\right).
\end{align*}`

We vertically stack the `$N$` `$\gamma_i$`'s into `$\Gamma \in \Delta^{N}_{K}$`.

---

### Latent Dirichlet Allocation

Topic models are well-suited to count data dimensionality reduction. The
estimated parameters can be interpreted as,
* `$\Gamma \in \Delta_{K}^{N}$`: Per-document memberships across `$K$` topics.
* `$\*B \in \Delta_{D}^{K}$`: Per topic distributions over `$D$` words.
* `$K$` = Number of topics = Number of extreme points

---

### How to Choose `$K$`?

* It is often difficult to interpret these models when `$K$` is large
* Heuristic: Look at smaller `$K$` to build intuition about topics
  - Then analyze deviations between topics at coarse and fine scales
  - This is guided by iterative data structuration

---

### Main Idea

* We will fit an ensemble of models of varying complexities.
* Post-estimation, we will build a compact representation of the result.
* In the Sankey diagram, columns are models and rectangles are topics

---

### Alignment as a Graph

We view an alignment as a graph across the ensemble. Index models by `$m$` and
topics by `$k$`. Then,
* Nodes `$V$` correspond to topics, parameterized by `$\{\beta^m_{k}, \gamma^m_{k}\}$`
* Edges `$E$` are placed between topics from neighboring models, i.e. `$K$` to `$K + 1$`
* Weights `$W$` encode the similarity between topics

---

### Notation

This graph-based view provides a convenient notation,

* `$m\left(v\right)$` is the model for node `$v$`
* `$k\left(v\right)$` is the topic for node `$v$`
* `$\Gamma\left(v\right) := \left(\gamma_{ v\left(k\right)}^m\left(k\right)\right) \in \reals^n_{+}$` is the vector of
mixed memberships for topic `$v$`
* `$\beta\left(v\right) := \beta_{k}^m \in \Delta^{D}$` is the
corresponding topic distribution
* `$e = \left(v, v'\right)$` is an edge linking topics `$v$` and `$v'$`.

---

### Estimating Weights: Product approach

.pull-left[
To compute weights, we can use,
`\begin{align*}
w\left(e\right) = \Gamma\left(v\right)^T\Gamma\left(v'\right)
\end{align*}`
]

.pull-right[
<img src="figure/product_alignment_conceptual.png" width="500" style="display: block; margin: auto;" />
]

---

### Estimating Weights: Transport approach

Let `$V_p$` and `$V_q$` be two subsets of topics within the graph.

* Let the total "mass" of `$V_p$` be `$p = \left\{\Gamma\left(v\right)^T 1 : v \in V_{p}\right\}$`. Define `$q$` similarly.
* Define the transport cost `$C\left(v, v^\prime\right) := JSD\left(\beta\left(v\right), \beta\left(v^\prime\right)\right)$`, the Jensen-Shannon divergence between the pair of topic distributions.

---

### Estimating Weights: Transport approach

The weights `$W$` can be estimated by solving the optimal transport problem,
`\begin{align*}
&\min_{W \in \mathcal{U}\left(p, q\right)} \left<C,W\right> \\
\mathcal{U}\left(p, q\right) := &\{W\in \mathbb{R}^{\left|V_{p}\right| \times \left|V_{q}\right|}_{+} : W 1_{\left|V_{q}\right|} = p \text{ and } W^{T} 1_{\left|V_{p}\right|} = q\}.
\end{align*}`

---

### Diagnostics

* Paths: Partitions the Sankey diagram into connected sets of topics
* Coherence: Measures transience of a topic along its path
* Refinement: Reflects degree of mixing between descendant topics

---

### True LDA Model

Sanity check - What is the alignment when data are generated from LDA? Can you guess the true `$K$`?

* `$N = 250, D = 1000, \lambda_{\gamma} = 0.5, \lambda_{\beta} = 0.1$`

---

### Diagnostics

The diagnostics suggest that the true `$K$` is 5.

---

The diagnostics become more reliable as the sample size increases.

---

### LDA with background variation

What happens when the LDA model is mis-specified? Consider the following
generative mechanism,

`\begin{align*}
x_{i} \vert \*B, \gamma_{i}, \nu_i &\sim \Mult\left(n_{i}, \alpha \*B\gamma_{i} + \left(1 - \alpha\right)\nu_i\right) \\
\nu_{i} &\sim \Dir\left(\lambda_{\nu}\right) \\
\gamma_i &\sim \Dir\left(\lambda_{\gamma}\right) \\
\beta_{k} &\sim \Dir\left(\lambda_{\beta}\right),
\end{align*}`

where `$\*B$` stacks the `$\beta_k$` rowwise.

---

### Result

The alignment structure is sensitive to changes in `$\alpha$` and fragments when
LDA structure is not present.

.pull-left[
<img src="figure/gradient_flow-1.png" width="300" style="display: block; margin: auto;" /><img src="figure/gradient_flow-2.png" width="300" style="display: block; margin: auto;" />
]

.pull-right[
<img src="figure/gradient_flow-3.png" width="300" style="display: block; margin: auto;" /><img src="figure/gradient_flow-4.png" width="300" style="display: block; margin: auto;" />
]

---

### Diagnostics

.pull-left-small[
This structure is consistent across simulation runs, and the diagnostics 
quantify topic deterioration
]
  
.pull-right-large[
<img src="figure/gradient-combined.png" width="635" style="display: block; margin: auto;" />
]

---

### Strain switching

The final simulation mimics the strain switching problem.
  * Small subsets of species switch between two otherwise similar topics
  * Multiple resolutions are required to detect the difference

---

### Mechanism

We construct related topics `$\tilde{\beta}_k^r$` by perturbing an underlying
collection of `$\beta_{k}$`. Then, for each sample `$i$` and each `$k$`, we draw one
member from the class,

`\begin{align*}
\beta_{k}^{i} &\sim \Unif\left(\left\{\tilde{\beta}_{k}^{1}, \dots, \tilde{\beta}_{k}^{R}\right\}\right)
\end{align*}`

stack the results into `$\*B^{i}$`, and then draw,

`\begin{align}
x_{i} &\sim \Mult\left(n_{i}, \*B^{i}\gamma_{i}\right)
\end{align}`
as in standard LDA.

---

### Results
* There are five topics, two of which exhibit strain switching
* At smaller `$K$`, we recognize the five main topics
* At larger `$K$`, we can distinguish perturbed variants

.pull-left[
<img src="figure/equivalence_flow.png" width="960" style="display: block; margin: auto;" />
]

.pull-right[
<img src="figure/equivalence_betas.png" width="4320" style="display: block; margin: auto;" />
]

---

### Results
* At smaller `$K$`, we recognize the main community structure, but don't see strain switching
* At larger `$K$`, we can recognize instances of switching

---

### Data Analysis

.pull-left[
Ravel et al. (2011) use clustering to identify 5 Community State Types (CSTs) in
the vaginal microbiome
  - Four healthy CSTs are dominated by Lactobacillus variants
  - A fifth dysbiotic CST is more compositionally diverse and has been
  implicated in preterm birth (Fettweis et al., 2019; Gudnadottir et al., 2022) and HIV transmission (Gosmann et al., 2017).
]

.pull-right[
<div class="figure" style="text-align: center">
<img src="figure/community_state_types.jpg" alt="CSTs as found by Ravel et al. (2011)." width="420" />
<p class="caption">CSTs as found by Ravel et al. (2011).</p>
</div>
]

---

### Dataset

Consider the 16S modality from the follow-up study (Symul et al., 2023).
.pull-left[
* 135 individuals followed through pregnancy
* 2179 samples and 2699 species
]

.pull-right[

```r
library(alto)
library(purrr)
data(vm_data)
map(vm_data, dim)
```

```
## $sample_info
## [1] 2179   61
## 
## $counts
## [1] 2179 2699
## 
## $taxonomy
## [1] 2699    7
```
]

---

### Interpretations

* Using the paths diagnostic, either `$K = 7$` or `$12$` give justifiable models
  - We can also distinguish high and low coherence topics
* At `$K = 7$`, the four Lactobacillus CSTs are already present, and the remaining
represent coherent subtypes in the dysbiotic CSTs
 
.pull-left[
<img src="figure/microbiome_flow.png" width="280" style="display: block; margin: auto;" />
]

.pull-right[
<img src="figure/microbiome_betas.png" width="800" style="display: block; margin: auto;" />
]
 
---

.pull-left[
**Test set perplexity**
  - Similar: Helps to choose `$K$`.
  - Different (-): Doesn't give topic-level quality.
  - Different (+): Evaluates generalization ability.
]

.pull-right[
**Hierarchical Topic Models **
  - Similar: Learns topics at multiple levels of resolution
  - Different (-): Requires word-level representations.
  - Different (+/-): Words belong to individual subtrees
]

---

### Software

Topic alignment is implemented in the R package [alto](lasy.github.io/alto).
 
.pull-left[

```r
library(purrr)
library(alto)

# simulate data and fit models
x <- rmultinom(20, 500, rep(0.1, 50))
colnames(x) <- seq_len(ncol(x))
lda_params <- setNames(map(1:10, ~ list(k = .)), 1:10)
lda_models <- run_lda_models(x, lda_params)

# perform alignment
result <- align_topics(lda_models)
```
]

.pull-right[

```r
plot(result)
```

<img src="20240503_files/figure-html/unnamed-chunk-28-1.png" style="display: block; margin: auto auto auto 0;" />
]

All the simulations discussed today are vignettes in the package.

---

### Takeaways

- `alto` paper: [ https://go.wisc.edu/tify36]( https://go.wisc.edu/tify36)
- `alto` documentation: [lasy.github.io/alto](https://lasy.github.io/alto/)

> Exploratory analysis can guide critical examination of complex problems, and
topic alignment is a simple but useful addition to the toolbox for count data.

.pull-left[
<img src="figure/structuration.png" width="200" style="display: block; margin: auto;" />
]

.pull-right[
<img src="figure/alto_sketches_annotated alignment.png" width="1439" style="display: block; margin: auto;" />

]

---

class: background-rivers

.center[
### Thank you!
]

* Collaborators: Laura Symul (UCLouvain), Julia Fukuyama (IU Bloomington)
* Lab Members: Margaret Thairu, Hanying Jiang, Mason Garza, Yuliang Peng, Kai Cui, and Kobe Uko
* Funding: NIGMS R01GM152744

.center[
<img src="figure/Lena__credit_row_0.jpg"/>
]

---

### Paths

For each `$v$`, identify the incoming edge with the highest normalized weight,
`\begin{align*}
  e^\ast\left(v\right) = \arg \max_{e : \text{target}\left(e\right) = v} \tilde{w}_{\text{out}}\left(e\right) + \tilde{w}_{\text{in}}\left(e\right).
\end{align*}`