class: title
background-size: cover
background-image: url("figure/background.png")

`$$\def\absarg#1{\left|#1\right|}$$`
`$$\def\Earg#1{\mathbf{E}\left[#1\right]}$$`
`$$\def\reals{\mathbb{R}} % Real number symbol$$`
`$$\def\integers{\mathbb{Z}} % Integer symbol$$`
`$$\def\*#1{\mathbf{#1}}$$`
`$$\def\m#1{\boldsymbol{#1}}$$`
`$$\def\FDR{\widehat{\operatorname{FDR}}}$$`
`$$\def\Gsn{\mathcal{N}}$$`
`$$\def\Unif{\operatorname{Unif}}$$`
`$$\def\Bern{\operatorname{Bern}}$$`

.pull-left[
<div id="title">
Multiple Testing in Quantitative Biology
</div>
<div id="subtitle">
BME 780: Methods in Quantitative Biology <br/>
Kris Sankaran <br/>
11 | October | 2023 <br/>
https://github.io/krisrs1128/LSLab
</div>
<div id="links">
Slides: https://go.wisc.edu/005s29 <br/>
Code: https://go.wisc.edu/p2oly1
</div>
]

---

### Learning Objectives

This talk is about helping collaborators solve scientific problems by adapting modern methods from multiple testing. By the end, we will be able to:

1. Discuss the role of multiple testing abstractions in computational genomics settings.
1. Interpret `\(p\)`-value histograms and explain how they motivate the Benjamini-Hochberg procedure.
1. Improve power in large-scale inference by focusing on the most promising contexts.
1. Diagnose miscalibration in large-scale inference and apply data splitting to address it.

---

### Large Scale Inference Notation

* Hypotheses of interest: `\(H_{1}, \dots, H_{M}\)`. Some of them are non-null, but we don't know which.
* Associated `\(p\)`-values: `\(p_{1}, \dots, p_{M}\)`.

Goal: Reject as many non-null hypotheses as possible while controlling the _False Discovery Rate_,

`\begin{align*}
\text{FDR} := \Earg{\frac{\absarg{\text{False Positives}}}{\absarg{\text{Rejections}} \vee 1}}
\end{align*}`

---

### Examples

.pull-left[
* Microbiome: Is taxon `\(m\)` associated with the development of autism in infants?
* Epigenetics: Is CpG site `\(m\)` differentially methylated among smokers?
* Cancer: Is elevated immune cell expression of gene `\(m\)` associated with improved survival rates?
]

.pull-right[
<img src="figure/microbiome_selection.jpg" width=450/>
]

---

### Examples

* [Comic](https://imgs.xkcd.com/comics/significant.png)
* [Search](https://www.biostars.org/post/search/?query=multiple+hypothesis+testing)

---

### `\(p\)`-value histogram

* Under the null, the `\(p\)`-values follow a uniform distribution.
* The spike near 0 `\(\implies\)` true alternative hypotheses.
* `\(\pi_{0}\)` refers to the true proportion of nulls.

.center[
<img src="figure/histogram-merged-1.png" width=800/>
]

---

### `\(p\)`-value histogram

* Under the null, the `\(p\)`-values follow a uniform distribution.
* The spike near 0 `\(\implies\)` true alternative hypotheses.
* `\(\pi_{0}\)` refers to the true proportion of nulls.

.center[
<img src="figure/histogram-1.png" width=800/>
]

---

### `\(p\)`-value histogram

* Under the null, the `\(p\)`-values follow a uniform distribution.
* The spike near 0 `\(\implies\)` true alternative hypotheses.
* `\(\pi_{0}\)` refers to the true proportion of nulls.

.center[
<img src="figure/histogram-2.png" width=800/>
]

---

### Benjamini-Hochberg

The Benjamini-Hochberg (BH) procedure [1] controls the FDR at level `\(q\)`.

1. Sort the `\(p\)`-values: `\(p_{(1)} \leq p_{(2)} \leq \dots \leq p_{(M)}\)`.
1. Find the largest `\(i\)` such that `\(p_{(i)} \leq \frac{i q}{M}\)`.
1. Reject the hypotheses corresponding to `\(p_{(1)}, \dots, p_{(i)}\)`.
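---

### Benjamini-Hochberg

A minimal sketch of these three steps. In practice, the built-in `p.adjust(p, method = "BH")` returns adjusted `\(p\)`-values directly:

```r
# Sketch of BH: compare the sorted p-values to the line iq / M
bh_reject <- function(p, q = 0.1) {
  M <- length(p)
  ord <- order(p)
  below <- p[ord] <= q * seq_len(M) / M          # step 2: p_(i) <= iq / M
  n_reject <- if (any(below)) max(which(below)) else 0
  rejected <- logical(M)
  rejected[ord[seq_len(n_reject)]] <- TRUE       # step 3: reject p_(1), ..., p_(i)
  rejected
}
```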
---

### Why?

At any threshold `\(t\)`, estimate the FDR using relative areas from the null and alternative densities.

.center[
<img src="figure/annotated_histogram.png" width=410/>
]

---

### Why?

At any threshold `\(t\)`, estimate the FDR using relative areas from the null and alternative densities.

.center[
<img src="figure/annotated_histogram-2.png" width=410/>
]

---

### Why?

Let `\(R\left(t\right)\)` be the number of rejected hypotheses at threshold `\(t\)`. Then,

.pull-left[
`\begin{align*}
\FDR\left(t\right) &= \frac{\pi_{0}Mt}{R\left(t\right)}
\end{align*}`
]
.pull-right[
<img src="figure/annotated_histogram-2.png" width=435/>
]

---

### Why?

Maximize the number of rejections while limiting false discoveries.

.pull-left[
Optimize:
`\begin{align*}
\text{maximize } &R\left(t\right) \\
\text{subject to } &\FDR\left(t\right) \leq q
\end{align*}`
]
.pull-right[
<img src="figure/annotated_histogram-2.png" width=435/>
]

---

### Why?

Since `\(R\left(t\right)\)` is nondecreasing in `\(t\)`, we can equivalently maximize the threshold itself.

.pull-left[
Optimize:
`\begin{align*}
\text{maximize } &t \\
\text{subject to } &\FDR\left(t\right) \leq q
\end{align*}`
]
.pull-right[
<img src="figure/annotated_histogram-2.png" width=435>
]

---

### Why?

Larger thresholds let more hypotheses through.

.pull-left[
Optimize:
`\begin{align*}
\text{maximize } &t \\
\text{subject to } &\FDR\left(t\right) \leq q
\end{align*}`
]
.pull-right[
<img src="figure/annotated_histogram-3.png" width=435>
]

---

### Why?

Larger thresholds let more hypotheses through.

.pull-left[
Optimize:
`\begin{align*}
\text{maximize } &t \\
\text{subject to } &\FDR\left(t\right) \leq q
\end{align*}`
]
.pull-right[
<img src="figure/annotated_histogram-4.png" width=450>
]

---

### Why?

Larger thresholds let more hypotheses through.

.pull-left[
Optimize:
`\begin{align*}
\text{maximize } &t \\
\text{subject to } &\FDR\left(t\right) \leq q
\end{align*}`
]
.pull-right[
<img src="figure/annotated_histogram-5.png" width=450>
]

---

### Why?

Look for the largest `\(p\)`-value satisfying this inequality.

.pull-left[
Optimize:
`\begin{align*}
\text{maximize } &t \\
\text{subject to } &\FDR\left(t\right) \leq q
\end{align*}`
]
.pull-right[
If we substitute `\(t = p_{(i)}\)`, then `\(R\left(p_{(i)}\right) = i\)` and

`\begin{align*}
&\frac{\pi_{0}M p_{(i)}}{i} \leq q \\
\iff & p_{(i)} \leq \frac{i q}{\pi_{0} M}
\end{align*}`

Since `\(\pi_{0} \leq 1\)`, it suffices (conservatively) to require `\(p_{(i)} \leq \frac{i q}{M}\)`.
]

---

Discussion, Q1: [https://go.wisc.edu/aq19dy](https://go.wisc.edu/aq19dy)

---

background-image: url("figure/split_night_sky.jpg")
background-size: cover

.middle[
## Power through Context
]

---

### Hypotheses are not identical

* We may know in advance that some hypotheses are more likely to be rejected than others.
  - Prior literature
  - Number of mapped reads
* This information can improve power. We'll discuss [2], but see also [3; 4; 5; 6].

.center[
<img width=620 src="figure/split-histograms.png"/>
]

---

### Example: Testing Pairs

.pull-left[
Suppose we are testing associations between,

* SNPs `\(x_{1}, \dots, x_{I}\)`
* Histone Markers `\(y_{1}, \dots, y_{J}\)`
]
.pull-right[
`\(H_{m}: \text{Cor}\left(x_{i}, y_{j}\right) = 0\)`, one hypothesis per pair. Even for moderate `\(I, J\)`, this gives `\(M = I \times J\)` tests!
]

.center[
<img src="figure/pairwise-cors.png" width=800/>
]

---

### Grouping by Distance

* Group the distances into bins `\(1, \dots, G\)`.
* We use different thresholds `\(\*t = \left(t_{1}, \dots, t_{G}\right)\)` across groups.

.center[
<img src="figure/pairwise-example-groups.png" width=800/>
]
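---

### Grouping by Distance

One way to form the bins, as a sketch. It assumes `I`, `J`, and `G` are defined, and reuses the `\(\absarg{\frac{i}{I} - \frac{j}{J}}\)` distance from the simulation that follows:

```r
# Assign each (SNP, histone marker) pair to one of G distance bins
pair_df <- expand.grid(i = seq_len(I), j = seq_len(J))
pair_df$distance <- abs(pair_df$i / I - pair_df$j / J)
pair_df$group <- cut(pair_df$distance, breaks = G, labels = FALSE)
```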
---

### Grouping by Distance

* Group the distances into bins `\(1, \dots, G\)`.
* We use different thresholds `\(\*t = \left(t_{1}, \dots, t_{G}\right)\)` across groups.

.pull-left[
`\begin{align*}
R\left(\*t\right) &:= \sum_{g} R_{g}\left(t_{g}\right) \\
\FDR\left(\*t\right) &:= \frac{\sum_{g = 1}^{G} m_{g}t_{g}}{R\left(\*t\right)}
\end{align*}`

where `\(m_{g}\)` is the number of hypotheses in group `\(g\)`.
]
.pull-right[
`\begin{align*}
\text{maximize } &R\left(\*t\right) \\
\text{subject to } &\FDR\left(\*t\right) \leq q
\end{align*}`
]

---

### Code Demo

We simulated a SNP-histone marker association dataset with `\(x_{n} \in \{0, 1\}^{I}\)` and `\(y_{n} \in \reals^{J}\)`.

.pull-left[
`\begin{align*}
y_{n} &= B x_{n} + \epsilon_{n} \\
\epsilon_{n} &\sim \mathcal{N}\left(0, 0.1^2\right) \\
x_{ni} &\sim \Bern\left(\frac{1}{2}\right) \\
B_{ji} &\sim \Bern\left(0.4 \exp\left(- \frac{1}{10}\absarg{\frac{i}{I} - \frac{j}{J}}\right)\right)
\end{align*}`
]
.pull-right[
<img src="figure/beta_heatmap.png" width=400/>
]

---

### Code Demo

We simulated a SNP-histone marker association dataset with `\(x_{n} \in \{0, 1\}^{I}\)` and `\(y_{n} \in \reals^{J}\)`.

.pull-left[
`\begin{align*}
y_{n} &= B x_{n} + \epsilon_{n} \\
\epsilon_{n} &\sim \mathcal{N}\left(0, 0.1^2\right) \\
x_{ni} &\sim \Bern\left(\frac{1}{2}\right) \\
B_{ji} &\sim \Bern\left(0.4 \exp\left(- \frac{1}{10}\absarg{\frac{i}{I} - \frac{j}{J}}\right)\right)
\end{align*}`
]
.pull-right[
<img src="figure/context_p_values_histogram.png"/>
]

---

### Code Demo

We can estimate `\(\FDR\left(\*t\right)\)` with a few lines of code:

```r
library(purrr)
m <- map_dbl(p_values, length)                       # group sizes m_g
R <- map2_dbl(p_values, thresholds, ~ sum(.x < .y))  # rejections R_g per group
sum(m * thresholds) / sum(R)                         # estimated FDR(t)
```

* `p_values` is a list of `\(p\)`-values for each group `\(g\)`
* `thresholds` is a vector storing `\(\*t\)`

---

### Code Demo

We can use more generous thresholds for `\(p\)`-values from nearby pairs. This invests the most "budget" in the most promising contexts.

.center[
<img src="figure/optimal-thresholds-1.png" width=700/>
]

---

### Code Demo

We can use more generous thresholds for `\(p\)`-values from nearby pairs. This invests the most "budget" in the most promising contexts.

.center[
<img src="figure/optimal-thresholds-2.png" width=700/>
]

---

### Optimization

* In practice, we would optimize `\(\*t\)` directly, rather than scanning the entire space.
* The number of rejections can be derived from the empirical CDF of the `\(p\)`-values, and the problem can be formulated as a convex optimization (see the backup slides).
* We haven't discussed it, but it is important to cross-weight to ensure independence between the thresholds and the `\(p\)`-values they are applied to.

.center[
<img src="figure/grenander_comparison.png" width=650>
]

---

background-image: url("figure/symmetry_cover.jpg")
background-size: cover

.middle[
## Calibration through Splitting
]

---

### Example

Imagine gene `\(m\)` in sample `\(n\)` is distributed according to,

`\begin{align*}
x_{nm} \vert \text{disease status } t \sim \text{NB}\left(\mu_{mt}, \varphi\right)
\end{align*}`

so that expression potentially varies according to disease status. We test,

`\begin{align*}
H_{0m}&: \mu_{m0} = \mu_{m1} \\
H_{1m}&: \mu_{m0} \neq \mu_{m1}
\end{align*}`

---

### Example

* We used the `edgeR` package to normalize these data (TMM + log transform) [7], inspired by the simulation in [8].
* Notice the nonuniformity of the null `\(p\)`-values from the two-sample `\(t\)`-tests.

.center[
<img src="figure/miscalibration_histogram.png" width=700/>
]
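---

### Example

This kind of miscalibration is easy to reproduce. A sketch with illustrative sizes and NB parameters (every gene is null here, so any departure from uniformity is miscalibration):

```r
library(edgeR)
set.seed(20231011)
N <- 100; M <- 1000
group <- rep(0:1, each = N / 2)
counts <- matrix(rnbinom(N * M, mu = 100, size = 2), nrow = M)   # all genes null
logcpm <- cpm(calcNormFactors(DGEList(counts)), log = TRUE)      # TMM + log
p <- apply(logcpm, 1, function(z) {
  t.test(z[group == 0], z[group == 1])$p.value
})
hist(p)   # visibly nonuniform despite all hypotheses being null
```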
---

### Key Idea

.pull-left[
We can avoid this problem if we bypass `\(p\)`-values altogether.

- Significance can be deduced from test statistics directly
- We do not have to use BH to ensure FDR control

This idea is highly adaptable. We'll discuss [9], but see also [10; 11].
]
.pull-right[
<img src="figure/mirror_inspiration_raw.png" width=650/>
]

---

### Testing `\(\to\)` Regression

We work with the original data `\(\*X \in \reals^{N \times M}\)` and `\(\*y \in \reals^{N}\)`. For example,

* `\(y_{n}\)`: Disease state for patient `\(n\)`
* `\(x_{nm}\)`: Expression level of gene `\(m\)` in patient `\(n\)`

Consider regressing `\(\*y = \*X \*\beta + \epsilon\)`. The `\(m^{th}\)` null hypothesis is that `\(\beta_{m} = 0\)`.

.center[
<img src="figure/regression-assumptions.png" width=600/>
]

---

### Symmetry and Mirrors

Randomly split the rows into `\(\left(\*X^{(1)}, \*y^{(1)}\right)\)` and `\(\left(\*X^{(2)}, \*y^{(2)}\right)\)`. Mirror statistics have the form,

`\begin{align*}
T_{m} &= \text{sign}\left(\hat{\beta}^{(1)}_{m}\hat{\beta}_{m}^{(2)}\right)f\left(\absarg{\hat{\beta}^{(1)}_{m}}, \absarg{\hat{\beta}^{(2)}_{m}}\right)
\end{align*}`

where `\(f\)` is nonnegative and the `\(\hat{\beta}^{(j)}\)` are estimated on the separate splits.

---

### Symmetry and Mirrors

Mirror statistics measure the agreement between estimates across splits. Example: `\(\text{sign}\left(\hat{\beta}^{(1)}_{m}\hat{\beta}_{m}^{(2)}\right)\left[\absarg{\hat{\beta}_{m}^{(1)}} + \absarg{\hat{\beta}_{m}^{(2)}}\right]\)`

.center[
<img src="figure/regression-mirrors.png" width=750>
]

---

### Symmetry and Mirrors

Mirror statistics measure the agreement between estimates across splits. Example: `\(\text{sign}\left(\hat{\beta}^{(1)}_{m}\hat{\beta}_{m}^{(2)}\right)\left[\absarg{\hat{\beta}_{m}^{(1)}} + \absarg{\hat{\beta}_{m}^{(2)}}\right]\)`

```r
ix <- sample(nrow(x), 0.5 * nrow(x))   # random half of the rows
beta1 <- train_fun(x[ix, ], y[ix])     # fit separately on each split
beta2 <- train_fun(x[-ix, ], y[-ix])
sign(beta1 * beta2) * (abs(beta1) + abs(beta2))
```

---

### Mirror Histogram

Let's revisit the negative binomial problem. Which hypotheses should we reject?

.center[
<img src="figure/mirror_histogram_merged.png" width=800/>
]

---

### Mirror Histogram

A reasonable strategy is to reject all `\(T_{m}\)` in the far right tail. How should we pick the threshold?

.center[
<img src="figure/mirror_histogram.png" width=800/>
]

---

### FDR Estimation

By symmetry, the size of the left tail `\(\approx\)` the number of nulls in the right tail,

.pull-left[
`\begin{align*}
\FDR\left(t\right) &= \frac{\absarg{\left\{m : T_{m} < -t\right\}}}{\absarg{\left\{m : T_{m} > t\right\}} \vee 1}
\end{align*}`

Since we reject the `\(T_{m}\)` above `\(t\)`, smaller thresholds now let more hypotheses through, which suggests solving,

`\begin{align*}
\text{minimize } &t \\
\text{subject to } &\FDR\left(t\right) \leq q
\end{align*}`
]
.pull-right[
<img src="figure/mirror_histogram.png"/>
]

---

### Multiple Splits

Issue:

* A downside is that sample splitting reduces power.
* Power can be recovered by aggregating over multiple iterations of splitting.

Main Idea:

* Create `\(K\)` different pairs of random splits
* Let `\(\hat{S}_{k}\)` be the features selected by mirror statistics on the `\(k^{th}\)` random split
* Look for consensus across the `\(\hat{S}_{k}\)`
* (See the backup slides for details)

---

### Code Demo

In our negative binomial example, this leads to FDR control and reasonable power. Here are the results over multiple runs.

.center[
<img src="figure/mirror_fdr_summary.png" width=600/>
]
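---

### Code Demo

The threshold search itself takes only a few lines. A minimal sketch, assuming `T_m` holds the mirror statistics and at least one cutoff satisfies the constraint:

```r
# Smallest cutoff whose estimated FDR falls below q
mirror_cutoff <- function(T_m, q = 0.1) {
  cutoffs <- sort(abs(T_m))
  fdr_hat <- sapply(cutoffs, function(t) sum(T_m < -t) / max(sum(T_m > t), 1))
  cutoffs[which(fdr_hat <= q)[1]]
}
# rejected <- T_m > mirror_cutoff(T_m, q = 0.1)
```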
---

### Conclusion

* Whenever the signal is subtle, hypothesis testing is crucial.
* The key abstractions we encountered were,
  - Optimization view of BH + Context = Focused Hypothesis Testing
  - Data splitting + Symmetry assumptions = Formal Error Control
* These abstractions are useful across the computational genomics workflow
  - E.g., in experimental design [12] and in interpreting features generated through unsupervised learning [13; 11].
  - Other variations have been developed in my lab [14; 15; 16]

---

Discussion, Q2: [https://go.wisc.edu/aq19dy](https://go.wisc.edu/aq19dy)

---

### References

[1] Y. Benjamini and Y. Hochberg. "Controlling the false discovery rate: a practical and powerful approach to multiple testing". In: _Journal of the Royal Statistical Society: Series B (Methodological)_ 57 (1995), pp. 289-300.

[2] N. Ignatiadis, B. Klaus, J. B. Zaugg, et al. "Data-driven hypothesis weighting increases detection power in genome-scale multiple testing". In: _Nature Methods_ 13 (2016), pp. 577-580.

[3] L. Lei and W. Fithian. "AdaPT: an interactive procedure for multiple testing with side information". In: _Journal of the Royal Statistical Society: Series B (Statistical Methodology)_ 80 (2016).

[4] N. Ignatiadis and W. Huber. "Covariate powered cross-weighted multiple testing". In: _Journal of the Royal Statistical Society: Series B (Statistical Methodology)_ 83 (2017).

---

### References

[5] M. J. Zhang, F. Xia, and J. Y. Zou. "Fast and covariate-adaptive method amplifies detection power in large-scale multiple hypothesis testing". In: _Nature Communications_ 10 (2019).

[6] R. Yurko, M. G. G'Sell, K. Roeder, et al. "A selective inference approach for false discovery rate control using multiomics covariates yields insights into disease risk". In: _Proceedings of the National Academy of Sciences of the United States of America_ 117 (2020), pp. 15028-15035.

[7] M. D. Robinson, D. J. McCarthy, and G. K. Smyth. "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data". In: _Bioinformatics_ 26 (2009), pp. 139-140.

[8] M. Stephens. "Simple Transform Simulation". In: _Online_ (2023). Accessed: 2023-07-14.

---

### References

[9] C. Dai, B. Lin, X. Xing, et al. "False Discovery Rate Control via Data Splitting". In: _Journal of the American Statistical Association_ (2020).

[10] X. Ge, Y. E. Chen, D. Song, et al. "Clipper: p-value-free FDR control on high-throughput data from two conditions". In: _Genome Biology_ 22 (2021).

[11] F. R. Guo and R. D. Shah. "Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values". In: _ArXiv_ (2023).

[12] C. Fannjiang, S. Bates, A. N. Angelopoulos, et al. "Conformal prediction under feedback covariate shift for biomolecular design". In: _Proceedings of the National Academy of Sciences of the United States of America_ 119 (2022).

---

### References

[13] D. Song and J. J. Li. "PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data". In: _Genome Biology_ 22 (2021).

[14] K. Sankaran and P. Jeganathan. "Microbiome Intervention Analysis with Transfer Functions and Mirror Statistics". In: _ArXiv_ (2023).

[15] K. Sankaran and S. P. Holmes. "structSSI: Simultaneous and Selective Inference for Grouped or Hierarchically Structured Data". In: _Journal of Statistical Software_ 59.13 (2014), pp. 1-21.

[16] K. Sankaran. "Functions for Performing Local FDR Estimation when Null and Alternative are Poisson [R package LocFDRPois version 1.0.0]". In: _CRAN_ (2015).
---

### Selective Inference

Step (1) is what distinguishes selective inference from ordinary statistical inference.

1. Search for interesting patterns.
1. Test whether they could have just been coincidences.

Looking for promising contexts `\(\implies\)` selective inference.

---

### Optimization

The threshold-selection problem can be handed to generic optimization solvers:

`\begin{align*}
\text{maximize } & \sum_{g = 1}^{G} m_{g}\hat{F}_{g}\left(t_{g}\right) \\
\text{subject to } & \sum_{g = 1}^{G} m_{g}t_{g} \leq q \sum_{g = 1}^{G} m_{g}\hat{F}_{g}\left(t_{g}\right)
\end{align*}`

where `\(\hat{F}_{g}\)` is the estimated `\(p\)`-value CDF in group `\(g\)`, so that `\(R_{g}\left(t_{g}\right) = m_{g}\hat{F}_{g}\left(t_{g}\right)\)`.

---

### Aggregation Procedure

We need a mechanism for aggregating the `\(\hat{S}_{k}\)` across iterations.

.center[
<img src="figure/selection.png"/>
]

---

### Aggregation Procedure

Let `\(\hat{I}_{1}, \dots, \hat{I}_{M}\)` store the per-feature averages of this matrix across the `\(K\)` splits. A larger `\(\hat{I}_{m}\)` means:

* The feature is selected even when `\(\absarg{\hat{S}_{k}}\)` is small
* The feature is selected frequently

.center[
<img src="figure/selection.png" width=700/>
]

---

### Aggregation Procedure

Let `\(\hat{I}_{1}, \dots, \hat{I}_{M}\)` store the per-feature averages of this matrix across the `\(K\)` splits. A larger `\(\hat{I}_{m}\)` means:

* The feature is selected even when `\(\absarg{\hat{S}_{k}}\)` is small
* The feature is selected frequently

.center[
<img src="figure/I_hat.png" width=800/>
]

---

### Aggregation Procedure

* For the final selection, choose the features with the largest `\(\hat{I}_{m}\)`, stopping once the selected "budget" `\(\sum \hat{I}_{m}\)` would exceed `\(1 - q\)`.
* This is guaranteed to (asymptotically) control the FDR at level `\(q\)`.

.center[
<img src="figure/I_hat_bar.png" width=800/>
]
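---

### Aggregation Procedure

A minimal sketch of this computation, assuming `S_hat` is the list of `\(K\)` selected index sets and that `M` and `q` are defined as before:

```r
# Matrix entry (m, k) is 1 / |S_hat[[k]]| if feature m was selected in split k
weights <- sapply(S_hat, function(s) {
  w <- numeric(M)
  w[s] <- 1 / length(s)
  w
})
I_hat <- rowMeans(weights)                    # per-feature average across splits
ord <- order(I_hat, decreasing = TRUE)
selected <- ord[cumsum(I_hat[ord]) <= 1 - q]  # spend the budget on the largest I_hat
```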