class: title `\(\def\Dir{\text{Dir}}\)` `\(\def\Mult{\text{Mult}}\)` `\(\def\*#1{\mathbf{#1}}\)` `\(\def\m#1{\boldsymbol{#1}}\)` `\(\def\Unif{\text{Unif}}\)` `\(\def\win{\tilde{w}_{\text{in}}}\)` `\(\def\reals{\mathbb{R}}\)` `\(\newcommand{\wout}{\tilde w_{\text{out}}}\)` ## Topic Models for Multiscale Analysis <div id="subtitle"> Kris Sankaran <br/> 30 | April | 2025 <br/> Lab: <a href="https://measurement-and-microbes.org">measurement-and-microbes.org</a> <br/> </div> <div id="subtitle_right"> SDSS 2025<br/> Slides: <a href="https://go.wisc.edu/9q5ls4">go.wisc.edu/9q5ls4</a><br/> Paper: <a href="https://go.wisc.edu/tify36">go.wisc.edu/tify36</a> </div> --- ### Model Topic models suppose that samples `\(x_i \in \mathbb{R}^{D}\)` are drawn independently: `\begin{align*} x_i \vert \gamma_i &\sim \text{Mult}\left(n_{i}, \*B\gamma_{i}\right) \\ \gamma_{i} &\sim \text{Dir}\left(\lambda_{\gamma} 1_{K}\right) \end{align*}` where the columns `\(\beta_{k}\)` of `\(\*B \in \Delta^{K}_{D}\)` lie in the `\(D\)`-dimensional simplex and are themselves drawn independently from, `\begin{align*} \beta_{k} \sim \text{Dir}\left(\lambda_{\beta} 1_{D}\right). \end{align*}` We vertically stack the `\(N\)` `\(\gamma_i\)`'s into `\(\Gamma \in \Delta^{N}_{K}\)`. --- ### Simplex View This model considers two sets of mixtures simultaneously. * Memberships: `\(\gamma_{i}\)` describes sample `\(x_i\)` as a mixture of topics. * Topics: `\(\beta_{k}\)` describes the composition of topic `\(k\)`. <img src="figure/topic_double_triangle.png"/> --- ### Example: Gut Microbiome If we use topic models, we can see that <span style="color: #476b57;">Topic 2</span> increases during the antibiotic interventions, especially the first [1]. .center[ <img src="figure/antibiotic_memberships.png" width=1000/> ] --- ### Example: Gut Microbiome We can interpret topics by looking for representative taxa. These are species that have much higher probabilities in one topic compared to the others. 
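One possible way to make "much higher probability in one topic" concrete is a log ratio against the strongest competing topic. The sketch below is an assumption for illustration, not necessarily the criterion behind the figure; `representative_taxa()` is a hypothetical helper and the matrix `B` is toy data.

```r
# Hypothetical helper (not part of any package): rank taxa by how specific
# they are to each topic. B is a D x K matrix whose columns are topics.
representative_taxa <- function(B, n_top = 3) {
  K <- ncol(B)
  lapply(seq_len(K), function(k) {
    # Largest probability that any *other* topic assigns to each taxon
    competitor <- apply(B[, -k, drop = FALSE], 1, max)
    # Log ratio of topic k's probability to its strongest competitor
    score <- log(B[, k]) - log(competitor)
    head(order(score, decreasing = TRUE), n_top)
  })
}

# Toy example with D = 4 taxa and K = 2 topics (made-up numbers)
B <- cbind(
  c(0.70, 0.10, 0.10, 0.10),
  c(0.05, 0.05, 0.45, 0.45)
)
top_taxa <- representative_taxa(B, n_top = 1)
```

Each list element holds the row indices of the most topic-specific taxa for that topic.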
.center[ <img src="figure/antibiotic_prototypes.png" width=900/> ] --- class: middle .center[ ## Topic Alignment: Method ] --- ### Choice of `\(K\)` > However, we stress that care should be taken in the interpretation of the inferred value of `\(K\)`. To begin with, due to the very high dimensionality of the parameter space, we found it difficult to obtain reliable estimates of `\(P\left(X \vert K\right)\)`... There are also biological reasons to be careful interpreting `\(K\)`. -- From [2]. In practice, it's common to check results and goodness-of-fit measures across many `\(K\)` [3; 4; 5]. --- ### `alto`: Main Idea We will fit an ensemble of models of varying complexities. Then, post-estimation, we will build a compact representation of the result. .center[ <img src="figure/alto_sketches_annotated alignment.png" width="750" style="display: block; margin: auto;" /> <span style="font-size: 20px;"> In the Sankey diagram, columns are models and rectangles are topics. </span> ] --- ### Alignment as a Graph We view an alignment as a graph across the ensemble. Index models by `\(m\)` and topics by `\(k\)`. Then, * Nodes `\(V\)` describe topics, parameterized by `\(\{\beta^m_{k}, \gamma^m_{k}\}\)`. * Edges `\(E\)` link topics from neighboring models, i.e. `\(K\)` to `\(K + 1\)`. * Weights `\(W\)` encode the similarity between topics. <img src="figure/alto_sketches_annotated alignment.png" width="560" style="display: block; margin: auto;" /> --- ### Notation This graph-based view provides a convenient notation, * `\(m\left(v\right)\)` is the model for node `\(v\)` * `\(k\left(v\right)\)` is the topic for node `\(v\)` * `\(\Gamma\left(v\right) := \left(\gamma_{i k\left(v\right)}^{m\left(v\right)}\right)_{i = 1}^{N} \in \reals^N_{+}\)` is the vector of mixed memberships for topic `\(v\)` * `\(\beta\left(v\right) := \beta_{k\left(v\right)}^{m\left(v\right)} \in \Delta^{D}\)` is the corresponding topic distribution * `\(e = \left(v, v'\right)\)` is an edge linking topics `\(v\)` and `\(v'\)`.
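As a small illustration of this notation, assume (hypothetically) an ensemble of three models with `\(K = 1, 2, 3\)` topics, indexed by `\(m = 1, 2, 3\)`, and let `\(v\)` be the second topic of the three-topic model. Then,

`\begin{align*} m\left(v\right) = 3, \quad k\left(v\right) = 2, \quad \beta\left(v\right) = \beta_{2}^{3}, \quad \left|V\right| = 1 + 2 + 3 = 6. \end{align*}`

If edges connect every pair of topics in neighboring models, then `\(\left|E\right| = 1 \cdot 2 + 2 \cdot 3 = 8\)`.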
<img src="figure/alto_sketches_annotated alignment.png" width="560" style="display: block; margin: auto;" /> --- ### Estimating Weights Let `\(V_p\)` and `\(V_q\)` be two subsets of topics within the graph. * Let the total "mass" of `\(V_p\)` be `\(p = \left\{\Gamma\left(v\right)^T 1 : v \in V_{p}\right\}\)`. Define `\(q\)` similarly. * Define the transport cost `\(C\left(v, v^\prime\right) := JSD\left(\beta\left(v\right), \beta\left(v^\prime\right)\right)\)`, the Jensen-Shannon divergence between the pair of topic distributions [6]. <img src="figure/transport_alignment_conceptual.png" width="420" style="display: block; margin: auto;" /> --- ### Estimating Weights The weights `\(W\)` can be estimated by solving the optimal transport problem, `\begin{align*} &\min_{W \in \mathcal{U}\left(p, q\right)} \left<C,W\right> \end{align*}` <span style="font-size: 20px;"> `\begin{align*} \mathcal{U}\left(p, q\right) := &\{W\in \mathbb{R}^{\left|V_{p}\right| \times \left|V_{q}\right|}_{+} : W 1_{\left|V_{q}\right|} = p \text{ and } W^{T} 1_{\left|V_{p}\right|} = q\}. \end{align*}` </span> <img src="figure/transport_alignment_conceptual.png" width="420" style="display: block; margin: auto;" /> --- ### Diagnostics * Paths: Partitions the Sankey diagram into connected sets of topics. * Coherence: Measures transience of a topic along its path. * Refinement: Reflects degree of mixing between descendant topics. <img src="figure/alto_sketches_diagnotics.png" width="2696" style="display: block; margin: auto;" /> --- class: middle .center[ ## Topic Alignment: Examples ] --- ### True Model Sanity check - What is the alignment when data are generated from a topic model? Can you guess the true `\(K\)`? * `\(N = 250, D = 1000, \lambda_{\gamma} = 0.5, \lambda_{\beta} = 0.1\)` <img src="figure/transport-true-lda.png" width="480" style="display: block; margin: auto;" /> --- ### Diagnostics The diagnostics suggest that the true `\(K\)` is 5. 
<img src="figure/lda-combined.png" width="2003" style="display: block; margin: auto;" /> --- ### Model with Background Variation Can we detect systematic departures from the assumed model? Consider the following generative mechanism, `\begin{align*} x_{i} \vert \*B, \gamma_{i}, \nu_i &\sim \Mult\left(n_{i}, \alpha \*B\gamma_{i} + \left(1 - \alpha\right)\nu_i\right) \\ \nu_{i} &\sim \Dir\left(\lambda_{\nu}\right) \\ \gamma_i &\sim \Dir\left(\lambda_{\gamma}\right) \\ \beta_{k} &\sim \Dir\left(\lambda_{\beta}\right), \end{align*}` where `\(\*B\)` stacks the `\(\beta_k\)` columnwise. --- ### Result The alignment structure is sensitive to changes in `\(\alpha\)` and fragments when structure is not present. .pull-left[ <img src="figure/gradient_flow-1.png" width="300" style="display: block; margin: auto;" /><img src="figure/gradient_flow-2.png" width="300" style="display: block; margin: auto;" /> ] .pull-right[ <img src="figure/gradient_flow-3.png" width="300" style="display: block; margin: auto;" /><img src="figure/gradient_flow-4.png" width="300" style="display: block; margin: auto;" /> ] --- ### Diagnostics .pull-left-small[ This structure is consistent across simulation runs, and the diagnostics quantify topic deterioration. ] .pull-right-large[ <img src="figure/gradient-combined.png" width="635" style="display: block; margin: auto;" /> ] --- ### Data Analysis Background .pull-left[ [7] used clustering to identify 5 Community State Types (CSTs) in the vaginal microbiome. - Four healthy CSTs are dominated by Lactobacillus variants. - A fifth dysbiotic CST is more compositionally diverse and has been implicated in preterm birth [8; 9] and HIV transmission [10]. ] .pull-right[ <img src="figure/community_state_types.jpg" width=380/> <span style="font-size: 18px;"> [7] grouped samples (columns) into CSTs.
</span> ] --- ### Deconstructing CSTs .pull-left[ The follow-up study [11] had many more samples than [7] and so could begin to tease out additional structure lying behind CSTs. * They had 2179 samples from 135 women, sampled longitudinally. * The green and blue paths to the right reflect the known Lactobacillus CSTs. ] .pull-right[ <img src="figure/pregnancy_sankey.jpg" width=500/> ] --- ### Coherence Scores .pull-left[ * Overlaying coherence scores onto the alignment, we can distinguish between high- and low-coherence topics. * Coherence is not a simple function of `\(K\)` alone. ] .pull-right[ <img src="figure/coherence_on_tree.png" width=340/> ] --- ### Metrics across Replicates .pull-left[ * We can see dropoffs in topic quality for `\(K = 6 \to 7\)` and `\(K = 9 \to 10\)`. * Each color in the top figure corresponds to a different set of topic models. ] .pull-right[ <img src="figure/alignment_scores_across_k.png"/> ] --- ### Time Series As in the antibiotics study, we can visualize change in topics over time. This highlights smooth transitions in topic memberships. .center[ <img src="figure/topic_trajectories_pregnancy.jpeg" width=700/> ] --- class: small-code ### Software Topic alignment is implemented in the R package [alto](https://lasy.github.io/alto). .pull-left[
``` r
library(purrr)
library(alto)

# Define LDA parameters: one setting per K = 1, ..., 10
params <- map(
  set_names(1:10),
  ~ list(k = .)
)

# Fit one LDA model per parameter setting
models <- run_lda_models(
  vm_data$counts,
  params
)
```
] .pull-right[
``` r
# Run alignment
result <- align_topics(models, method = "transport")
plot(result)
```
<img src="20250430_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> ] All the simulations discussed today are available as vignettes in the package. --- ### Takeaways Topic alignment is a simple but useful addition to the exploratory data analysis toolbox for count data.
.pull-three-quarters-left[ <img src="figure/alto_sketches_annotated alignment.png" width=700/> ] .pull-three-quarters-right[ [Paper](https://go.wisc.edu/tify36)<br/> <img src="figure/qr-paper.png" width=80/> <br/> <br/> <br/> [Package](https://lasy.github.io/alto)<br/> <img src="figure/qr-package.png" width=80/> ] --- class: background-rivers .center[ ### Thank you! ] * Collaborators: Laura Symul (UCLouvain), Julia Fukuyama (IU Bloomington) * Lab Members: Margaret Thairu, Hanying Jiang, Shuchen Yan, Yuliang Peng, Kai Cui, Sam Merten * Funding: NIGMS R01GM152744 .center[ <img src="figure/Lena__credit_row_0.jpg"/> ] --- class: reference ### References [1] K. Sankaran et al. "Latent variable modeling for the microbiome". In: _Biostatistics_ 20.4 (Jun. 2018), p. 599–614. ISSN: 1468-4357. DOI: [10.1093/biostatistics/kxy018](https://doi.org/10.1093%2Fbiostatistics%2Fkxy018). URL: [http://dx.doi.org/10.1093/biostatistics/kxy018](http://dx.doi.org/10.1093/biostatistics/kxy018). [2] J. K. Pritchard et al. "Inference of Population Structure Using Multilocus Genotype Data". In: _Genetics_ 155.2 (Jun. 2000), p. 945–959. ISSN: 1943-2631. DOI: [10.1093/genetics/155.2.945](https://doi.org/10.1093%2Fgenetics%2F155.2.945). URL: [http://dx.doi.org/10.1093/genetics/155.2.945](http://dx.doi.org/10.1093/genetics/155.2.945). [3] J. Novembre. "Pritchard, Stephens, and Donnelly on Population Structure". In: _Genetics_ 204.2 (Oct. 2016), p. 391–393. ISSN: 1943-2631. DOI: [10.1534/genetics.116.195164](https://doi.org/10.1534%2Fgenetics.116.195164). URL: [http://dx.doi.org/10.1534/genetics.116.195164](http://dx.doi.org/10.1534/genetics.116.195164). [4] H. M. Wallach et al. "Evaluation methods for topic models". In: _International Conference on Machine Learning_ (2009). URL: [https://api.semanticscholar.org/CorpusID:10910725](https://api.semanticscholar.org/CorpusID:10910725). [5] D. J. Lawson et al. "A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots". 
In: _Nature Communications_ 9.1 (Aug. 2018). ISSN: 2041-1723. DOI: [10.1038/s41467-018-05257-7](https://doi.org/10.1038%2Fs41467-018-05257-7). URL: [http://dx.doi.org/10.1038/s41467-018-05257-7](http://dx.doi.org/10.1038/s41467-018-05257-7). [6] G. Peyré et al. "Computational Optimal Transport". (2018). DOI: [10.48550/ARXIV.1803.00567](https://doi.org/10.48550%2FARXIV.1803.00567). URL: [https://arxiv.org/abs/1803.00567](https://arxiv.org/abs/1803.00567). [7] J. Ravel et al. "Vaginal microbiome of reproductive-age women". In: _Proceedings of the National Academy of Sciences_ 108.supplement_1 (Jun. 2010), p. 4680–4687. ISSN: 1091-6490. DOI: [10.1073/pnas.1002611107](https://doi.org/10.1073%2Fpnas.1002611107). URL: [http://dx.doi.org/10.1073/pnas.1002611107](http://dx.doi.org/10.1073/pnas.1002611107). [8] J. M. Fettweis et al. "The vaginal microbiome and preterm birth". In: _Nature Medicine_ 25.6 (May. 2019), p. 1012–1021. ISSN: 1546-170X. DOI: [10.1038/s41591-019-0450-2](https://doi.org/10.1038%2Fs41591-019-0450-2). URL: [http://dx.doi.org/10.1038/s41591-019-0450-2](http://dx.doi.org/10.1038/s41591-019-0450-2). [9] U. Gudnadottir et al. "The vaginal microbiome and the risk of preterm birth: a systematic review and network meta-analysis". In: _Scientific Reports_ 12.1 (May. 2022). ISSN: 2045-2322. DOI: [10.1038/s41598-022-12007-9](https://doi.org/10.1038%2Fs41598-022-12007-9). URL: [http://dx.doi.org/10.1038/s41598-022-12007-9](http://dx.doi.org/10.1038/s41598-022-12007-9). [10] C. Gosmann et al. "Lactobacillus-Deficient Cervicovaginal Bacterial Communities Are Associated with Increased HIV Acquisition in Young South African Women". In: _Immunity_ 46.1 (Jan. 2017), p. 29–37. ISSN: 1074-7613. DOI: [10.1016/j.immuni.2016.12.013](https://doi.org/10.1016%2Fj.immuni.2016.12.013). URL: [http://dx.doi.org/10.1016/j.immuni.2016.12.013](http://dx.doi.org/10.1016/j.immuni.2016.12.013). [11] L. Symul et al. 
"Sub-communities of the vaginal microbiota in pregnant and non-pregnant women". In: _Proceedings of the Royal Society B: Biological Sciences_ 290.2011 (Nov. 2023). ISSN: 1471-2954. DOI: [10.1098/rspb.2023.1461](https://doi.org/10.1098%2Frspb.2023.1461). URL: [http://dx.doi.org/10.1098/rspb.2023.1461](http://dx.doi.org/10.1098/rspb.2023.1461). --- ### Paths For each `\(v\)`, identify the incoming edge with the highest normalized weight, `\begin{align*} e^\ast\left(v\right) = \arg \max_{e : \text{target}\left(e\right) = v} \tilde{w}_{\text{out}}\left(e\right) + \tilde{w}_{\text{in}}\left(e\right). \end{align*}` * Iterate this process from large to small `\(l\)` to construct a set of distinct paths along the alignment * The number of unique paths is a useful property of an alignment <img src="figure/refinement-branches-1.png" width="270" style="display: block; margin: auto;" /> --- ### Paths For each `\(v\)`, identify the incoming edge with the highest normalized weight, `\begin{align*} e^\ast\left(v\right) = \arg \max_{e : \text{target}\left(e\right) = v} \tilde{w}_{\text{out}}\left(e\right) + \tilde{w}_{\text{in}}\left(e\right). \end{align*}` * Iterate this process from large to small `\(l\)` to construct a set of distinct paths along the alignment * The number of unique paths is a useful property of an alignment <img src="figure/refinement-branches-2.png" width="270" style="display: block; margin: auto;" /> --- ### Paths For each `\(v\)`, identify the incoming edge with the highest normalized weight, `\begin{align*} e^\ast\left(v\right) = \arg \max_{e : \text{target}\left(e\right) = v} \tilde{w}_{\text{out}}\left(e\right) + \tilde{w}_{\text{in}}\left(e\right). 
\end{align*}` * Iterate this process from large to small `\(l\)` to construct a set of distinct paths along the alignment * The number of unique paths is a useful property of an alignment <img src="figure/refinement-branches-3.png" width="270" style="display: block; margin: auto;" /> --- ### Paths For each `\(v\)`, identify the incoming edge with the highest normalized weight, `\begin{align*} e^\ast\left(v\right) = \arg \max_{e : \text{target}\left(e\right) = v} \tilde{w}_{\text{out}}\left(e\right) + \tilde{w}_{\text{in}}\left(e\right). \end{align*}` * Iterate this process from large to small `\(l\)` to construct a set of distinct paths along the alignment * The number of unique paths is a useful property of an alignment <img src="figure/refinement-branches-4.png" width="270" style="display: block; margin: auto;" /> --- ### Paths For each `\(v\)`, identify the incoming edge with the highest normalized weight, `\begin{align*} e^\ast\left(v\right) = \arg \max_{e : \text{target}\left(e\right) = v} \tilde{w}_{\text{out}}\left(e\right) + \tilde{w}_{\text{in}}\left(e\right). 
\end{align*}` * Iterate this process from large to small `\(l\)` to construct a set of distinct paths along the alignment * The number of unique paths is a useful property of an alignment <img src="figure/refinement-branches-5.png" width="270" style="display: block; margin: auto;" /> --- ### Coherence Coherence quantifies a topic's average connectedness to other topics along the same path, `\begin{align*} c(v) = \frac{1}{|\text{Path}\left(v\right)|} \sum_{v' \in \text{Path}\left(v\right)} \min\left(\win\left(v, v'\right), \wout\left(v, v'\right) \right) \end{align*}` * Transient topics (appearing at one `\(K\)` and disappearing at another) have low coherence scores * Consistently recovered topics across choices of `\(K\)` have high coherence .pull-right[ ] --- ### Refinement Parent specificity identifies two distinct regimes, * High Refinement: Each topic receives the most mass from a unique parent, corresponding to a true or "compromise" topic * Low Refinement: Each topic receives substantial mass from several parents, each corresponding to an arbitrary split of a true topic <img src="figure/refinement_diagnostic_example.png" width="350" style="display: block; margin: auto;" /> --- ### Refinement Parent specificity identifies two distinct regimes, * High Refinement: Each topic receives the most mass from a unique parent, corresponding to a true or "compromise" topic * Low Refinement: Each topic receives substantial mass from several parents, each corresponding to an arbitrary split of a true topic `\begin{align*} r(v)=\frac{\left|V_{m}\right|}{M-m} \sum_{m^{\prime}=m+1}^{M} \sum_{v_{m^{\prime}}^{\prime} \in V_{m^{\prime}}} \wout \left(v, v_{m^{\prime}}^{\prime}\right) \win\left(v, v_{m^{\prime}}^{\prime}\right) \end{align*}` --- The diagnostics become more reliable as the sample size increases. 
<img src="figure/summary_alto_asymptotic_behavior.png" width="1000" style="display: block; margin: auto;" /> --- .pull-left[ **Test set perplexity** [12; 4] - Similar: Helps to choose `\(K\)`. - Different (-): Doesn't give topic-level quality. - Different (+): Evaluates generalization ability. ] .pull-right[ **Hierarchical Topic Models** [13; 14] - Similar: Learns topics at multiple levels of resolution. - Different (-): Requires word-level representations. - Different (+/-): Words belong to individual subtrees. ] <img src="figure/hlda.png"/>
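
---
class: small-code

### Sketch: Simulating from the True Model

The "True Model" sanity check draws data from the topic model itself. Below is a minimal base-R sketch of that generative process, using the slide's `\(N, D, \lambda_{\gamma}, \lambda_{\beta}\)`. The value `K = 5` follows the diagnostics slide, the per-sample depth `n_i = 1000` is an assumption, and `rdirichlet()` and `simulate_lda()` are hypothetical helpers rather than `alto` functions.

```r
# Hypothetical helper: n draws from Dir(alpha), one per row,
# obtained by normalizing independent gamma variates.
rdirichlet <- function(n, alpha) {
  g <- matrix(rgamma(n * length(alpha), shape = alpha), nrow = n, byrow = TRUE)
  g / rowSums(g)
}

# Hypothetical helper: simulate counts from the topic model.
# n_i (sequencing depth per sample) is an assumed value.
simulate_lda <- function(N = 250, D = 1000, K = 5,
                         lambda_gamma = 0.5, lambda_beta = 0.1, n_i = 1000) {
  B <- t(rdirichlet(K, rep(lambda_beta, D)))    # D x K, columns are topics
  Gamma <- rdirichlet(N, rep(lambda_gamma, K))  # N x K, rows are memberships
  P <- Gamma %*% t(B)                           # N x D, row i is B gamma_i
  # One multinomial draw per sample; apply() returns D x N, so transpose
  x <- t(apply(P, 1, function(p) rmultinom(1, n_i, p)))
  list(x = x, B = B, Gamma = Gamma)
}

set.seed(1)
sim <- simulate_lda()
```

The count matrix `sim$x` has one row per sample, so it could be passed to a function such as `run_lda_models()` in place of `vm_data$counts`.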