class: title `\(\def\Dir{\text{Dir}}\)` `\(\def\Mult{\text{Mult}}\)` `\(\def\*#1{\mathbf{#1}}\)` `\(\def\m#1{\boldsymbol{#1}}\)` `\(\def\Unif{\text{Unif}}\)` `\(\def\win{\tilde{w}_{\text{in}}}\)` `\(\def\reals{\mathbb{R}}\)` `\(\newcommand{\wout}{\tilde w_{\text{out}}}\)` ## Topic Models in Microbiome Analysis <div id="subtitle_right"> Kris Sankaran <br/> 26 | February | 2025 <br/> <a href="https://measurement-and-microbes.org">measurement-and-microbes.org</a> <br/> </div> <div id="subtitle_left"> VUMC Department of Biostatistics<br/> Slides: <a href="https://go.wisc.edu/w380nq">go.wisc.edu/w380nq</a><br/> Paper: <a href="https://go.wisc.edu/tify36">go.wisc.edu/tify36</a> </div> --- ### Motivation: Enterotypes The study [1] argued that all human gut microbiomes could be divided into well-defined enterotypes: .pull-left[ > ... We identified three robust clusters (enterotypes hereafter) that are not nation or continent-specific. ] .pull-right[ <img src="figure/enterotypes.png" width=300/><br/> <span style="font-size: 18px;"> The visualization that motivated the definition of enterotypes. </span> ] --- ### Motivation: Enterotypes This idea caught on. Could we have discovered something as fundamental as blood type? .center[ <img src="figure/news-1.png" width=500/> ] --- ### Motivation: Enterotypes This idea caught on. Could we have discovered something as fundamental as blood type? .center[ <img src="figure/news-2.png" width=750/> ] --- ### Motivation: Enterotypes This idea caught on. Could we have discovered something as fundamental as blood type? .center[ <img src="figure/news-3.png" width=550/> ] --- ### Motivation: Enterotypes Reality is more complex -- and interesting! Microbiomes are dynamic and not so easily categorized [2; 3; 4; 5]. .center[ <img src="figure/enterotype-rebuttal.png" width=300/> ] <span style="font-size: 18px;"> The same person can jump between clusters in longitudinal sampling [5], which complicates the simpler enterotypes story. 
</span> --- ### Improved Exploration Samples often can't be cleanly separated into clusters. For more informative exploration of these data, it helps to consider: 1. **Mixtures**: Instead of matching each sample to a single cluster centroid, view samples as intermediates between multiple representatives. 1. Multiple Scales: Allow transitions from coarse to fine-grained analysis. <span style="font-size: 18px;"> .center[ <img src="figure/clusters_vs_mixtures.png" width=565/><br/> Figure adapted from [6]. ] </span> --- ### Improved Exploration Samples often can't be cleanly separated into clusters. For more informative exploration of these data, it helps to consider: 1. Mixtures: Instead of matching each sample to a single cluster centroid, view samples as intermediates between multiple representatives. 1. **Multiple Scales**: Allow transitions from coarse to fine-grained analysis. .center[ <img src="figure/multiscale.png"/> ] --- class: middle .center[ ## Topic Models ] --- ### Origins Topic models were independently developed for analyzing genotype and text data [7; 8] and are now widely used in computational genomics [9; 10; 11; 12]. .center[ <img src="figure/text_genotypes.png" width=1200/> ] --- ### Example 1: GTEx How are genes co-expressed across tissues? We can apply topic models to GTEx Consortium data, following [13; 14]. .center[ <img src="figure/gtex_memberships.png" width=780/> ] --- ### Example 1: GTEx These are the expression patterns associated with each of those topics. .center[ <img src="figure/GTex-topics.png" width=730/> ] <!-- [Example Interpretation](https://www.genecards.org/cgi-bin/carddisp.pl?id_type=hgnc&id=7939#summaries) --> --- ### Example 2: Gut Microbiome How do microbiome communities change in response to antibiotics? The study [15] gathered samples before and after antibiotic interventions. 
.center[ <iframe src="https://krisrs1128.github.io/treelapse/pages/antibiotic.html#htmlwidget-f49f5eec4ced01f92314" width=800 height=360/> ] --- ### Example 2: Gut Microbiome If we use topic models, we can see that <span style="color: #476b57;">Topic 2</span> increases after the antibiotic interventions, especially the first [9]. .center[ <img src="figure/antibiotic_memberships.png" width=1000/> ] --- ### Example 2: Gut Microbiome We can interpret topics by looking for representative taxa. These are species that have much higher probabilities in one topic compared to the others. .center[ <img src="figure/antibiotic_prototypes.png" width=900/> ] --- ### Model Topic models suppose that samples `\(x_i \in \mathbb{R}^{D}\)` are drawn independently: `\begin{align*} x_i \vert \gamma_i &\sim \text{Mult}\left(n_{i}, \*B\gamma_{i}\right) \\ \gamma_{i} &\sim \text{Dir}\left(\lambda_{\gamma} 1_{K}\right) \end{align*}` where the columns `\(\beta_{k}\)` of `\(\*B \in \Delta^{K}_{D}\)` lie in the `\(D\)`-dimensional simplex and are themselves drawn independently from, `\begin{align*} \beta_{k} \sim \text{Dir}\left(\lambda_{\beta} 1_{D}\right). \end{align*}` We vertically stack the `\(N\)` `\(\gamma_i\)`'s into `\(\Gamma \in \Delta^{N}_{K}\)`. --- ### Simplex View This model considers two sets of mixtures simultaneously. * Memberships: `\(\gamma_{i}\)` describes sample `\(x_i\)` as a mixture of topics. * Topics: `\(\beta_{k}\)` describes the composition of topic `\(k\)`. <center> <img src="figure/latent_dirichlet_v2.png" width=550/> </center> --- class: middle .center[ ## Topic Alignment: Method ] --- ### Choice of `\(K\)` > However, we stress that care should be taken in the interpretation of the inferred value of `\(K\)`. To begin with, due to the very high dimensionality of the parameter space, we found it difficult to obtain reliable estimates of `\(P\left(X \vert K\right)\)`... There are also biological reasons to be careful interpreting `\(K\)`. -- From [7]. 
In practice, it's common to check results and goodness-of-fit measures across many `\(K\)` [16; 17; 18]. --- ### `alto`: Main Idea We will fit an ensemble of models of varying complexities. Then, post-estimation, we will build a compact representation of the result. .center[ <img src="figure/alto_sketches_annotated alignment.png" width="750" style="display: block; margin: auto;" /> <span style="font-size: 20px;"> In the Sankey diagram, columns are models and rectangles are topics. </span> ] --- ### Alignment as a Graph We view an alignment as a graph across the ensemble. Index models by `\(m\)` and topics by `\(k\)`. Then, * Nodes `\(V\)` describe topics, parameterized by `\(\{\beta^m_{k}, \gamma^m_{k}\}\)`. * Edges `\(E\)` link topics from neighboring models, i.e. `\(K\)` to `\(K + 1\)`. * Weights `\(W\)` encode the similarity between topics. <img src="figure/alto_sketches_annotated alignment.png" width="560" style="display: block; margin: auto;" /> --- ### Notation This graph-based view provides a convenient notation, * `\(m\left(v\right)\)` is the model for node `\(v\)` * `\(k\left(v\right)\)` is the topic for node `\(v\)` * `\(\Gamma\left(v\right) := \left(\gamma_{i k\left(v\right)}^{m\left(v\right)}\right) \in \reals^N_{+}\)` is the vector of mixed memberships for topic `\(v\)` * `\(\beta\left(v\right) := \beta_{k\left(v\right)}^{m\left(v\right)} \in \Delta^{D}\)` is the corresponding topic distribution * `\(e = \left(v, v'\right)\)` is an edge linking topics `\(v\)` and `\(v'\)`. 
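
To make this notation concrete, here is a minimal base-R sketch. It assumes each model `\(m\)` stores its memberships as an `\(N \times K\)` matrix and its topics as a `\(D \times K\)` matrix with one topic per column. The `models` list, its field names, and the helpers `Gamma_of()` and `beta_of()` are hypothetical illustrations, not `alto` data structures, and the values are random placeholders rather than fitted estimates.

```r
set.seed(1)
N <- 4  # number of samples
D <- 6  # number of features (e.g., taxa)

# Hypothetical helper: random matrix whose rows lie on the simplex.
random_simplex_rows <- function(n_row, n_col) {
  x <- matrix(rexp(n_row * n_col), n_row, n_col)
  x / rowSums(x)
}

# Stand-ins for fitted models with K = 1, 2, 3 topics.
# gamma is N x K (rows sum to 1); beta is D x K (columns sum to 1).
models <- lapply(1:3, function(K) {
  list(
    gamma = random_simplex_rows(N, K),
    beta = t(random_simplex_rows(K, D))
  )
})

# A node v is a (model, topic) pair, so m(v) and k(v) are its two fields.
nodes <- do.call(rbind, lapply(seq_along(models), function(m) {
  data.frame(m = m, k = seq_len(ncol(models[[m]]$gamma)))
}))

# Gamma(v): length-N membership vector for node v.
Gamma_of <- function(v) models[[nodes$m[v]]]$gamma[, nodes$k[v]]
# beta(v): length-D topic distribution for node v.
beta_of <- function(v) models[[nodes$m[v]]]$beta[, nodes$k[v]]
```

With three models there are `1 + 2 + 3 = 6` nodes; for example, `Gamma_of(4)` returns the memberships for the first topic of the three-topic model.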
<img src="figure/alto_sketches_annotated alignment.png" width="560" style="display: block; margin: auto;" /> --- ### Estimating Weights: Product Approach .pull-left[ To compute weights, we can use, `\begin{align*} w\left(e\right) = \Gamma\left(v\right)^T\Gamma\left(v'\right) \end{align*}` ] .pull-right[ <img src="figure/product_alignment_conceptual.png" width="500" style="display: block; margin: auto;" /> ] --- ### Estimating Weights: Transport Approach Let `\(V_p\)` and `\(V_q\)` be two subsets of topics within the graph. * Let the total "mass" of `\(V_p\)` be `\(p = \left\{\Gamma\left(v\right)^T 1 : v \in V_{p}\right\}\)`. Define `\(q\)` similarly. * Define the transport cost `\(C\left(v, v^\prime\right) := JSD\left(\beta\left(v\right), \beta\left(v^\prime\right)\right)\)`, the Jensen-Shannon divergence between the pair of topic distributions [19]. <img src="figure/transport_alignment_conceptual.png" width="420" style="display: block; margin: auto;" /> --- ### Estimating Weights: Transport Approach The weights `\(W\)` can be estimated by solving the optimal transport problem, `\begin{align*} &\min_{W \in \mathcal{U}\left(p, q\right)} \left<C,W\right> \end{align*}` <span style="font-size: 20px;"> `\begin{align*} \mathcal{U}\left(p, q\right) := &\{W\in \mathbb{R}^{\left|V_{p}\right| \times \left|V_{q}\right|}_{+} : W 1_{\left|V_{q}\right|} = p \text{ and } W^{T} 1_{\left|V_{p}\right|} = q\}. \end{align*}` </span> <img src="figure/transport_alignment_conceptual.png" width="420" style="display: block; margin: auto;" /> --- ### Estimating Weights: Transport Approach The weights `\(W\)` can be estimated by solving the optimal transport problem, `\begin{align*} &\min_{W \in \mathcal{U}\left(p, q\right)} \left<C,W\right> \end{align*}` <span style="font-size: 20px;"> `\begin{align*} \mathcal{U}\left(p, q\right) := &\{W\in \mathbb{R}^{\left|V_{p}\right| \times \left|V_{q}\right|}_{+} : W 1_{\left|V_{q}\right|} = p \text{ and } W^{T} 1_{\left|V_{p}\right|} = q\}. 
\end{align*}` </span> <img src="figure/optimal_transport_matrices.png" width="800" style="display: block; margin: auto;" /> --- ### Diagnostics * Paths: Partitions the Sankey diagram into connected sets of topics. * Coherence: Measures transience of a topic along its path. * Refinement: Reflects degree of mixing between descendant topics. <img src="figure/alto_sketches_diagnotics.png" width="2696" style="display: block; margin: auto;" /> --- class: middle .center[ ## Topic Alignment: Examples ] --- ### True Model Sanity check: what is the alignment when data are generated from a topic model? Can you guess the true `\(K\)`? * `\(N = 250, D = 1000, \lambda_{\gamma} = 0.5, \lambda_{\beta} = 0.1\)` <img src="figure/transport-true-lda.png" width="480" style="display: block; margin: auto;" /> --- ### Diagnostics The diagnostics suggest that the true `\(K\)` is 5. <img src="figure/lda-combined.png" width="2003" style="display: block; margin: auto;" /> --- The diagnostics become more reliable as the sample size increases. <img src="figure/summary_alto_asymptotic_behavior.png" width="1000" style="display: block; margin: auto;" /> --- ### Model with Background Variation Can we detect systematic departures from the assumed model? Consider the following generative mechanism, `\begin{align*} x_{i} \vert \*B, \gamma_{i}, \nu_i &\sim \Mult\left(n_{i}, \alpha \*B\gamma_{i} + \left(1 - \alpha\right)\nu_i\right) \\ \nu_{i} &\sim \Dir\left(\lambda_{\nu}\right) \\ \gamma_i &\sim \Dir\left(\lambda_{\gamma}\right) \\ \beta_{k} &\sim \Dir\left(\lambda_{\beta}\right), \end{align*}` where `\(\*B\)` stacks the `\(\beta_k\)` columnwise, as in the original model. --- ### Result Question: Which of these are closer to a true topic model, and which are closer to random Dirichlets? 
.pull-left[ <img src="figure/gradient_flow-1.png" width="300" style="display: block; margin: auto;" /><img src="figure/gradient_flow-2.png" width="300" style="display: block; margin: auto;" /> ] .pull-right[ <img src="figure/gradient_flow-3.png" width="300" style="display: block; margin: auto;" /><img src="figure/gradient_flow-4.png" width="300" style="display: block; margin: auto;" /> ] --- ### Diagnostics .pull-left-small[ This structure is consistent across simulation runs, and the diagnostics quantify topic deterioration. ] .pull-right-large[ <img src="figure/gradient-combined.png" width="635" style="display: block; margin: auto;" /> ] --- ### Data Analysis Background .pull-left[ Reference [20] used clustering to identify five Community State Types (CSTs) in the vaginal microbiome. - Four healthy CSTs are dominated by Lactobacillus variants. - A fifth dysbiotic CST is more compositionally diverse and has been implicated in preterm birth [21; 22] and HIV transmission [23]. ] .pull-right[ <img src="figure/community_state_types.jpg" width=380/> <span style="font-size: 18px;"> [20] grouped samples (columns) into CSTs. </span> ] --- ### Deconstructing CSTs .pull-left[ The follow-up study [6] had many more samples than [20] and so could begin to tease out additional structure underlying the CSTs. * They had 2179 samples from 135 women, sampled longitudinally. * The green and blue paths to the right reflect the known Lactobacillus CSTs. ] .pull-right[ <img src="figure/pregnancy_sankey.jpg" width=500/> ] --- ### Coherence Scores .pull-left[ * Overlaying coherence scores onto the alignment, we can distinguish between high- and low-coherence topics. * Coherence is not a simple function of `\(K\)` alone. ] .pull-right[ <img src="figure/coherence_on_tree.png" width=340/> ] --- ### Metrics across Replicates .pull-left[ * We can see drop-offs in topic quality for `\(K = 6 \to 7\)` and `\(K = 9 \to 10\)`. 
* Each color in the top figure corresponds to a different set of topic models. ] .pull-right[ <img src="figure/alignment_scores_across_k.png"/> ] --- ### Taxonomic Breakdown .center[ <img src="figure/pregnancy_topics_heatmap.jpg" width=600/> ] <span style="font-size: 18px;"> At `\(K = 5\)`, four of the previously known CSTs are present. At `\(K = 9\)`, we can see additional community structure in the dysbiotic CST. </span> --- ### Time Series Like in the antibiotics study, we can visualize change in topics over time. This highlights smooth transitions in topic memberships. .center[ <img src="figure/topic_trajectories_pregnancy.jpeg" width=700/> ] --- .pull-left[ **Test set perplexity** [24; 17] - Similar: Helps to choose `\(K\)`. - Different (-): Doesn't give topic-level quality. - Different (+): Evaluates generalization ability. ] .pull-right[ **Hierarchical Topic Models** [25; 26] - Similar: Learns topics at multiple levels of resolution. - Different (-): Requires word-level representations. - Different (+/-): Words belong to individual subtrees. ] <img src="figure/hlda.png"/> --- class: small-code ### Software Topic alignment is implemented in the R package [alto](https://lasy.github.io/alto). .pull-left[ ``` r library(purrr) library(alto) # Define LDA parameters params <- map( set_names(1:10), ~ list(k = .) ) models <- run_lda_models( vm_data$counts, params ) ``` ] .pull-right[ ``` r # Run alignment result <- align_topics(models, method = "transport") plot(result) ``` <img src="20250226_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /> ] All the simulations discussed today are vignettes in the package. --- ### Takeaways Topic alignment is a simple but useful addition to the exploratory data analysis toolbox for count data. 
.pull-three-quarters-left[ <img src="figure/alto_sketches_annotated alignment.png" width=700/> ] .pull-three-quarters-right[ [Paper](https://go.wisc.edu/tify36)<br/> <img src="figure/qr-paper.png" width=80/> <br/> <br/> <br/> [Package](https://lasy.github.io/alto)<br/> <img src="figure/qr-package.png" width=80/> ] --- class: background-rivers .center[ ### Thank you! ] * Collaborators: Laura Symul (UCLouvain), Julia Fukuyama (IU Bloomington) * Lab Members: Margaret Thairu, Shuchen Yan, Yuliang Peng, Helena Huang * Funding: NIGMS R01GM152744, NIAID R01AI184095 .center[ <img src="figure/Lena__credit_row_0.jpg"/> ] --- class: reference ### References [1] M. Arumugam et al. "Enterotypes of the human gut microbiome". In: _Nature_ 473.7346 (Apr. 2011), p. 174–180. ISSN: 1476-4687. DOI: [10.1038/nature09944](https://doi.org/10.1038%2Fnature09944). URL: [http://dx.doi.org/10.1038/nature09944](http://dx.doi.org/10.1038/nature09944). [2] E. Yong. "Gut microbial 'enterotypes' become less clear-cut". In: _Nature_ (Mar. 2012). [3] O. Koren et al. "A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets". In: _PLoS Comput. Biol._ 9.1 (Jan. 2013), p. e1002863. [4] I. Bulygin et al. "Absence of enterotypes in the human gut microbiomes reanalyzed with non-linear dimensionality reduction methods". In: _PeerJ_ 11 (Sep. 2023), p. e15838. [5] D. Knights et al. "Rethinking enterotypes". In: _Cell Host Microbe_ 16.4 (Oct. 2014), p. 433–437. [6] L. Symul et al. "Sub-communities of the vaginal microbiota in pregnant and non-pregnant women". In: _Proceedings of the Royal Society B: Biological Sciences_ 290.2011 (Nov. 2023). ISSN: 1471-2954. DOI: [10.1098/rspb.2023.1461](https://doi.org/10.1098%2Frspb.2023.1461). URL: [http://dx.doi.org/10.1098/rspb.2023.1461](http://dx.doi.org/10.1098/rspb.2023.1461). [7] J. K. Pritchard et al. "Inference of Population Structure Using Multilocus Genotype Data". 
In: _Genetics_ 155.2 (Jun. 2000), p. 945–959. ISSN: 1943-2631. DOI: [10.1093/genetics/155.2.945](https://doi.org/10.1093%2Fgenetics%2F155.2.945). URL: [http://dx.doi.org/10.1093/genetics/155.2.945](http://dx.doi.org/10.1093/genetics/155.2.945). [8] D. M. Blei et al. "Latent dirichlet allocation". In: _J. Mach. Learn. Res._ 3.null (Mar. 2003), p. 993–1022. ISSN: 1532-4435. [9] K. Sankaran et al. "Latent variable modeling for the microbiome". In: _Biostatistics_ 20.4 (Jun. 2018), p. 599–614. ISSN: 1468-4357. DOI: [10.1093/biostatistics/kxy018](https://doi.org/10.1093%2Fbiostatistics%2Fkxy018). URL: [http://dx.doi.org/10.1093/biostatistics/kxy018](http://dx.doi.org/10.1093/biostatistics/kxy018). [10] A. Kim et al. "Latent Dirichlet Allocation modeling of environmental microbiomes". In: _PLOS Computational Biology_ 19.6 (Jun. 2023). Ed. by G. Zeller, p. e1011075. ISSN: 1553-7358. DOI: [10.1371/journal.pcbi.1011075](https://doi.org/10.1371%2Fjournal.pcbi.1011075). URL: [http://dx.doi.org/10.1371/journal.pcbi.1011075](http://dx.doi.org/10.1371/journal.pcbi.1011075). [11] C. Tataru et al. "Topic modeling for multi-omic integration in the human gut microbiome and implications for Autism". In: _Scientific Reports_ 13.1 (Jul. 2023). ISSN: 2045-2322. DOI: [10.1038/s41598-023-38228-0](https://doi.org/10.1038%2Fs41598-023-38228-0). URL: [http://dx.doi.org/10.1038/s41598-023-38228-0](http://dx.doi.org/10.1038/s41598-023-38228-0). [12] X. Peng et al. "A topic modeling approach reveals the dynamic T cell composition of peripheral blood during cancer immunotherapy". In: _Cell Reports Methods_ 3.8 (Aug. 2023), p. 100546. ISSN: 2667-2375. DOI: [10.1016/j.crmeth.2023.100546](https://doi.org/10.1016%2Fj.crmeth.2023.100546). URL: [http://dx.doi.org/10.1016/j.crmeth.2023.100546](http://dx.doi.org/10.1016/j.crmeth.2023.100546). [13] K. K. Dey et al. "Visualizing the structure of RNA-seq expression data using grade of membership models". In: _PLOS Genetics_ 13.3 (Mar. 2017). Ed. by A. 
Kundaje, p. e1006599. ISSN: 1553-7404. DOI: [10.1371/journal.pgen.1006599](https://doi.org/10.1371%2Fjournal.pgen.1006599). URL: [http://dx.doi.org/10.1371/journal.pgen.1006599](http://dx.doi.org/10.1371/journal.pgen.1006599). --- class: reference ### References [14] J. H. Kushal K Dey. "GTEX V6 analysis - stephenslab.github.io". (). [Accessed 21-07-2024]. [15] L. Dethlefsen et al. "Incomplete recovery and individualized responses of the human distal gut microbiota to repeated antibiotic perturbation". In: _Proceedings of the National Academy of Sciences_ 108.supplement_1 (Sep. 2010), p. 4554–4561. ISSN: 1091-6490. DOI: [10.1073/pnas.1000087107](https://doi.org/10.1073%2Fpnas.1000087107). URL: [http://dx.doi.org/10.1073/pnas.1000087107](http://dx.doi.org/10.1073/pnas.1000087107). [16] J. Novembre. "Pritchard, Stephens, and Donnelly on Population Structure". In: _Genetics_ 204.2 (Oct. 2016), p. 391–393. ISSN: 1943-2631. DOI: [10.1534/genetics.116.195164](https://doi.org/10.1534%2Fgenetics.116.195164). URL: [http://dx.doi.org/10.1534/genetics.116.195164](http://dx.doi.org/10.1534/genetics.116.195164). [17] H. M. Wallach et al. "Evaluation methods for topic models". In: _International Conference on Machine Learning_ (2009). URL: [https://api.semanticscholar.org/CorpusID:10910725](https://api.semanticscholar.org/CorpusID:10910725). [18] D. J. Lawson et al. "A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots". In: _Nature Communications_ 9.1 (Aug. 2018). ISSN: 2041-1723. DOI: [10.1038/s41467-018-05257-7](https://doi.org/10.1038%2Fs41467-018-05257-7). URL: [http://dx.doi.org/10.1038/s41467-018-05257-7](http://dx.doi.org/10.1038/s41467-018-05257-7). [19] G. Peyré et al. "Computational Optimal Transport". (2018). DOI: [10.48550/ARXIV.1803.00567](https://doi.org/10.48550%2FARXIV.1803.00567). URL: [https://arxiv.org/abs/1803.00567](https://arxiv.org/abs/1803.00567). [20] J. Ravel et al. "Vaginal microbiome of reproductive-age women". 
In: _Proceedings of the National Academy of Sciences_ 108.supplement_1 (Jun. 2010), p. 4680–4687. ISSN: 1091-6490. DOI: [10.1073/pnas.1002611107](https://doi.org/10.1073%2Fpnas.1002611107). URL: [http://dx.doi.org/10.1073/pnas.1002611107](http://dx.doi.org/10.1073/pnas.1002611107). [21] J. M. Fettweis et al. "The vaginal microbiome and preterm birth". In: _Nature Medicine_ 25.6 (May. 2019), p. 1012–1021. ISSN: 1546-170X. DOI: [10.1038/s41591-019-0450-2](https://doi.org/10.1038%2Fs41591-019-0450-2). URL: [http://dx.doi.org/10.1038/s41591-019-0450-2](http://dx.doi.org/10.1038/s41591-019-0450-2). [22] U. Gudnadottir et al. "The vaginal microbiome and the risk of preterm birth: a systematic review and network meta-analysis". In: _Scientific Reports_ 12.1 (May. 2022). ISSN: 2045-2322. DOI: [10.1038/s41598-022-12007-9](https://doi.org/10.1038%2Fs41598-022-12007-9). URL: [http://dx.doi.org/10.1038/s41598-022-12007-9](http://dx.doi.org/10.1038/s41598-022-12007-9). [23] C. Gosmann et al. "Lactobacillus-Deficient Cervicovaginal Bacterial Communities Are Associated with Increased HIV Acquisition in Young South African Women". In: _Immunity_ 46.1 (Jan. 2017), p. 29–37. ISSN: 1074-7613. DOI: [10.1016/j.immuni.2016.12.013](https://doi.org/10.1016%2Fj.immuni.2016.12.013). URL: [http://dx.doi.org/10.1016/j.immuni.2016.12.013](http://dx.doi.org/10.1016/j.immuni.2016.12.013). [24] J. Foulds et al. "Annealing paths for the evaluation of topic models". In: _Proceedings of the Empirical Methods in Natural Language Processing Conference (EMNLP 2013)_ (Jan. 2013). [25] D. M. Blei et al. "Hierarchical topic models and the nested chinese restaurant process". In: _Proceedings of the 16th International Conference on Neural Information Processing Systems_. NIPS'03. Whistler, British Columbia, Canada: MIT Press, 2003, p. 17–24. [26] A. Smith et al. "Hiearchie: Visualization for Hierarchical Topic Models". (2014). DOI: [10.3115/v1/w14-3111](https://doi.org/10.3115%2Fv1%2Fw14-3111). 
URL: [http://dx.doi.org/10.3115/v1/W14-3111](http://dx.doi.org/10.3115/v1/W14-3111). --- ### Paths For each `\(v\)`, identify the incoming edge with the highest normalized weight, `\begin{align*} e^\ast\left(v\right) = \arg \max_{e : \text{target}\left(e\right) = v} \tilde{w}_{\text{out}}\left(e\right) + \tilde{w}_{\text{in}}\left(e\right). \end{align*}` * Iterate this process from large to small `\(m\)` to construct a set of distinct paths along the alignment * The number of unique paths is a useful property of an alignment <img src="figure/refinement-branches-1.png" width="270" style="display: block; margin: auto;" /> --- ### Paths For each `\(v\)`, identify the incoming edge with the highest normalized weight, `\begin{align*} e^\ast\left(v\right) = \arg \max_{e : \text{target}\left(e\right) = v} \tilde{w}_{\text{out}}\left(e\right) + \tilde{w}_{\text{in}}\left(e\right). \end{align*}` * Iterate this process from large to small `\(m\)` to construct a set of distinct paths along the alignment * The number of unique paths is a useful property of an alignment <img src="figure/refinement-branches-2.png" width="270" style="display: block; margin: auto;" /> --- ### Paths For each `\(v\)`, identify the incoming edge with the highest normalized weight, `\begin{align*} e^\ast\left(v\right) = \arg \max_{e : \text{target}\left(e\right) = v} \tilde{w}_{\text{out}}\left(e\right) + \tilde{w}_{\text{in}}\left(e\right). 
\end{align*}` * Iterate this process from large to small `\(m\)` to construct a set of distinct paths along the alignment * The number of unique paths is a useful property of an alignment <img src="figure/refinement-branches-3.png" width="270" style="display: block; margin: auto;" /> --- ### Paths For each `\(v\)`, identify the incoming edge with the highest normalized weight, `\begin{align*} e^\ast\left(v\right) = \arg \max_{e : \text{target}\left(e\right) = v} \tilde{w}_{\text{out}}\left(e\right) + \tilde{w}_{\text{in}}\left(e\right). \end{align*}` * Iterate this process from large to small `\(m\)` to construct a set of distinct paths along the alignment * The number of unique paths is a useful property of an alignment <img src="figure/refinement-branches-4.png" width="270" style="display: block; margin: auto;" /> --- ### Paths For each `\(v\)`, identify the incoming edge with the highest normalized weight, `\begin{align*} e^\ast\left(v\right) = \arg \max_{e : \text{target}\left(e\right) = v} \tilde{w}_{\text{out}}\left(e\right) + \tilde{w}_{\text{in}}\left(e\right). 
\end{align*}` * Iterate this process from large to small `\(m\)` to construct a set of distinct paths along the alignment * The number of unique paths is a useful property of an alignment <img src="figure/refinement-branches-5.png" width="270" style="display: block; margin: auto;" /> --- ### Coherence Coherence quantifies a topic's average connectedness to other topics along the same path, `\begin{align*} c(v) = \frac{1}{|\text{Path}\left(v\right)|} \sum_{v' \in \text{Path}\left(v\right)} \min\left(\win\left(v, v'\right), \wout\left(v, v'\right) \right) \end{align*}` * Transient topics (appearing at one `\(K\)` and disappearing at another) have low coherence scores * Topics recovered consistently across choices of `\(K\)` have high coherence scores --- ### Refinement Parent specificity identifies two distinct regimes, * High Refinement: Each topic receives the most mass from a unique parent, corresponding to a true or "compromise" topic * Low Refinement: Each topic receives substantial mass from several parents, each corresponding to an arbitrary split of a true topic <img src="figure/refinement_diagnostic_example.png" width="350" style="display: block; margin: auto;" /> --- ### Refinement Parent specificity identifies two distinct regimes, * High Refinement: Each topic receives the most mass from a unique parent, corresponding to a true or "compromise" topic * Low Refinement: Each topic receives substantial mass from several parents, each corresponding to an arbitrary split of a true topic `\begin{align*} r(v)=\frac{\left|V_{m}\right|}{M-m} \sum_{m^{\prime}=m+1}^{M} \sum_{v_{m^{\prime}}^{\prime} \in V_{m^{\prime}}} \wout \left(v, v_{m^{\prime}}^{\prime}\right) \win\left(v, v_{m^{\prime}}^{\prime}\right) \end{align*}`
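
---
class: small-code

### Refinement: Sketch

The refinement score above can be sketched in a few lines of base R. This is a minimal illustration under stated assumptions: `\(\wout\)` is taken to divide each weight by the total mass leaving `\(v\)`, and `\(\win\)` by the total mass entering `\(v'\)`, since the slides do not spell out these normalizations. The function names are hypothetical and are not part of `alto`.

```r
# Hypothetical helper: normalize a weight matrix W whose rows are topics v in
# model m and whose columns are topics v' in a later model m'.
normalize_weights <- function(W) {
  list(
    w_out = W / rowSums(W),              # assumed: share of v's outgoing mass
    w_in = sweep(W, 2, colSums(W), "/")  # assumed: share of v''s incoming mass
  )
}

# Refinement r(v) for every topic v in model m, given one weight matrix per
# later model m' = m + 1, ..., M (so length(W_list) equals M - m).
refinement <- function(W_list) {
  n_topics <- nrow(W_list[[1]])  # |V_m|
  totals <- numeric(n_topics)
  for (W in W_list) {
    nw <- normalize_weights(W)
    totals <- totals + rowSums(nw$w_out * nw$w_in)
  }
  n_topics / length(W_list) * totals
}

# Clean split: topic 1 persists, topic 2 divides into two children.
W_clean <- rbind(c(0.5, 0, 0), c(0, 0.25, 0.25))
# Arbitrary split: both parents feed both children equally.
W_mixed <- matrix(0.25, nrow = 2, ncol = 2)

r_clean <- refinement(list(W_clean))
r_mixed <- refinement(list(W_mixed))
```

Under these assumptions, both parents in the clean split score 2, while both parents in the arbitrary split score 1, matching the high and low refinement regimes.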