\def\absarg#1{\left|#1\right|} \def\Earg#1{\mathbf{E}\left[#1\right]} \def\reals{\mathbb{R}} % Real number symbol \def\integers{\mathbb{Z}} % Integer symbol \def\*#1{\mathbf{#1}} \def\m#1{\boldsymbol{#1}} \def\FDR{\widehat{\operatorname{FDR}}} \def\Gsn{\mathcal{N}} \def\Unif{\operatorname{Unif}} \def\Bern{\operatorname{Bern}}
We will be able to help collaborators solve scientific problems by adapting modern methods from selective inference.
By the end of this talk, we will be able to:
Notation
Goal: Reject as many non-null hypotheses as possible while controlling the False Discovery Rate,
\begin{align*} \text{FDR} := \Earg{\frac{\absarg{\text{False Positives}}}{\absarg{\text{Rejections}}\vee 1}} \end{align*}
The Benjamini-Hochberg (BH) procedure [1] controls the FDR at level q.
At any threshold t, estimate the FDR using relative areas from the null and alternative densities.
Let R\left(t\right) be the number of rejected hypotheses at threshold t. Then,
\begin{align*} \FDR\left(t\right) &= \frac{\pi_{0}Mt}{R\left(t\right)} \end{align*}
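In code, this plug-in estimate is just a ratio. A minimal R sketch, assuming a vector p of M p-values and (conservatively) \pi_{0} = 1; the function name is mine:

```r
# Plug-in FDR estimate at threshold t: (pi0 * M * t) / (R(t) v 1).
fdr_hat <- function(p, t, pi0 = 1) {
  M <- length(p)
  R <- max(sum(p <= t), 1)  # rejections at threshold t, floored at 1
  pi0 * M * t / R
}

fdr_hat(c(runif(90), runif(10, 0, 0.01)), t = 0.05)
```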
Maximize the number of rejections while limiting false discoveries.
Optimize: \begin{align*} \text{maximize } &R\left(t\right) \\ \text{subject to } &\FDR\left(t\right) \leq q \end{align*}
Optimize: \begin{align*} \text{maximize } &t \\ \text{subject to } &\FDR\left(t\right) \leq q \end{align*}
Larger thresholds let more hypotheses through.
Look for the largest p-value satisfying this inequality.
If we substitute t = p_{(i)}, then R\left(p_{(i)}\right) = i, so the constraint becomes
\begin{align*} &\frac{\pi_{0}M p_{(i)}}{i} \leq q \\ \implies & p_{(i)} \leq \frac{i q}{\pi_{0} M}. \end{align*}
Since \pi_{0} \leq 1, the conservative choice \pi_{0} = 1 recovers the BH rule p_{(i)} \leq \frac{i q}{M}.
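This rule is exactly what base R's p.adjust(..., "BH") implements. A short sketch of the step-up procedure (the function name is mine):

```r
# BH step-up: reject the k smallest p-values, where k is the largest i
# with p_(i) <= i * q / M.
bh_reject <- function(p, q) {
  M <- length(p)
  ord <- order(p)
  passing <- which(p[ord] <= seq_len(M) * q / M)
  k <- if (length(passing) > 0) max(passing) else 0
  rejected <- rep(FALSE, M)
  if (k > 0) rejected[ord[seq_len(k)]] <- TRUE
  rejected
}

p <- c(0.001, 0.008, 0.039, 0.041, 0.09, 0.2, 0.6)
bh_reject(p, q = 0.05)  # agrees with p.adjust(p, "BH") <= 0.05
```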
Suppose we are testing associations between,
H_{m}: \text{Cor}\left(x_{i}, y_{j}\right) = 0.
Even for moderate I, J, this is many tests!
\begin{align*} R\left(\*t\right) &:= \sum_{g} R_{g}\left(t_{g}\right) \\ \FDR\left(\*t\right) &:= \frac{\sum_{g = 1}^{G} m_{g}t_{g}}{R\left(\*t\right)} \end{align*}
\begin{align*} \text{maximize } &R\left(\*t\right) \\ \text{subject to } &\FDR\left(\*t\right) \leq q \end{align*}
We simulated a SNP-histone marker association dataset x_{n} \in \{0, 1\}^{I} and y_{n} \in \reals^{J}.
\begin{align*} y_{n} &= B x_{n} + \epsilon_{n} \\ \epsilon_{n} &\sim \mathcal{N}\left(0, 0.1^2\right) \\ x_{ni} &\sim \Bern\left(\frac{1}{2}\right) \\ B_{ji} &\sim \Bern\left(0.4 \exp\left(- \frac{1}{10}\left|\frac{i}{I} - \frac{j}{J}\right|\right)\right) \end{align*}
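A sketch of this simulation in R; the sizes N, I, J below are illustrative, not the ones used in the talk:

```r
set.seed(1)
N <- 100; I <- 20; J <- 15
x <- matrix(rbinom(N * I, 1, 0.5), N, I)               # x_{ni} ~ Bern(1/2)
prob <- 0.4 * exp(-abs(outer(1:J / J, 1:I / I, "-")) / 10)
B <- matrix(rbinom(J * I, 1, prob), J, I)              # B_{ji} ~ Bern(...)
y <- x %*% t(B) + matrix(rnorm(N * J, sd = 0.1), N, J) # y_n = B x_n + eps_n
```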
We can estimate \FDR\left(\*t\right) with a few lines of code:
library(purrr)
m <- map_dbl(p_values, length)                       # hypotheses per group
R <- map2_dbl(p_values, thresholds, ~ sum(.x < .y))  # rejections per group
sum(m * thresholds) / sum(R)                         # estimated FDR(t)
p_values is a list of p-values for each group g.
thresholds is a vector storing \*t.
We can have more generous thresholds for p-values between nearby pairs. This invests the most "budget" on the promising contexts.
Imagine gene m in sample n is distributed according to, \begin{align*} x_{nm} \vert \text{disease } t \sim \text{NB}\left(\mu_{mt}, \varphi\right) \end{align*} so that expression potentially varies according to disease status.
\begin{align*} H_{0m}: \mu_{m0} = \mu_{m1} \\ H_{1m}: \mu_{m0} \neq \mu_{m1} \end{align*}
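For intuition, here is how such counts could be simulated with base R's rnbinom; the means and dispersion below are made up, and size = 1 / phi gives the usual Var = mu + phi * mu^2 parameterization:

```r
set.seed(1)
phi <- 0.5; n <- 50  # illustrative dispersion and group size
# Under H_{0m}: the same mean in both disease groups.
null_0 <- rnbinom(n, mu = 100, size = 1 / phi)
null_1 <- rnbinom(n, mu = 100, size = 1 / phi)
# Under H_{1m}: the mean shifts with disease status.
alt_0 <- rnbinom(n, mu = 100, size = 1 / phi)
alt_1 <- rnbinom(n, mu = 150, size = 1 / phi)
```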
We use the edgeR package to normalize these data (TMM + log transform) [7], inspired by the simulation in [8].
We can avoid this problem if we bypass p-values altogether.
This idea is highly adaptable. We'll discuss [9], but see also [10; 11].
We work with the original data \*X \in \reals^{N \times M} and \*y \in \reals^{N}. For example,
Consider regressing \*y = \*X \*\beta + \epsilon. The null hypothesis is that \beta_{m} = 0.
Randomly split rows into \left(\*X^{(1)}, \*y^{(1)}\right) and \left(\*X^{(2)}, \*y^{(2)}\right).
Mirror statistics have the form, \begin{align*} T_{m} &= \text{sign}\left(\hat{\beta}^{(1)}_{m}\hat{\beta}_{m}^{(2)}\right)f\left(\absarg{\hat{\beta}^{(1)}_{m}}, \absarg{\hat{\beta}^{(2)}_{m}}\right) \end{align*}
where \hat{\beta}^{(j)} are estimated on the separate splits.
Mirror statistics measure the agreement in estimates across splits.
Example: \text{sign}\left(\hat{\beta}^{(1)}_{m}\hat{\beta}_{m}^{(2)}\right)\left[\absarg{\hat{\beta}_{m}^{(1)}} + \absarg{\hat{\beta}_{m}^{(2)}}\right]
ix <- sample(nrow(x), 0.5 * nrow(x))             # random 50/50 row split
beta1 <- train_fun(x[ix, ], y[ix])               # estimates from split 1
beta2 <- train_fun(x[-ix, ], y[-ix])             # estimates from split 2
sign(beta1 * beta2) * (abs(beta1) + abs(beta2))  # mirror statistics T_m
Let's revisit the negative binomial problem. Which hypotheses should we reject?
A reasonable strategy is to reject for all T_{m} in the far right tail. How should we pick the threshold?
By symmetry, the size of the left tail \approx the number of nulls in the right tail,
\begin{align*} \FDR\left(t\right) &= \frac{\#\left\{m : T_{m} < -t\right\}}{\#\left\{m : T_{m} > t\right\} \vee 1} \end{align*}
which suggests solving, \begin{align*} \text{maximize } &t \\ \text{subject to } &\FDR\left(t\right) \leq q \end{align*}
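Putting the pieces together, a sketch of this threshold search in R (the function name is mine):

```r
# Smallest threshold t whose estimated FDR (left tail over right tail)
# stays below q; reject every T_m at or above it.
mirror_threshold <- function(T, q = 0.1) {
  ts <- sort(abs(T))
  fdr <- sapply(ts, function(t) sum(T <= -t) / max(sum(T >= t), 1))
  ok <- which(fdr <= q)
  if (length(ok) > 0) ts[min(ok)] else Inf
}

T <- c(rnorm(90), rnorm(10, mean = 4))  # symmetric nulls plus a few signals
t_star <- mirror_threshold(T, q = 0.1)
rejected <- which(T >= t_star)
```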
Issue:
Main Idea:
In our Negative Binomial example, this leads to FDR control and reasonable power. Here are the results over multiple runs.
Mirror statistics support formal inference of delayed, nonlinear intervention effects in longitudinal microbiome studies [12].
Manuscript, Documentation, Demo
[1] Y. Benjamini and Y. Hochberg. "Controlling the false discovery rate: a practical and powerful approach to multiple testing". In: Journal of the Royal Statistical Society: Series B (Methodological) 57 (1995), pp. 289-300.
[2] N. Ignatiadis, B. Klaus, J. B. Zaugg, et al. "Data-driven hypothesis weighting increases detection power in genome-scale multiple testing". In: Nature Methods 13 (2016), pp. 577-580.
[3] L. Lei and W. Fithian. "AdaPT: an interactive procedure for multiple testing with side information". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 (2016).
[4] N. Ignatiadis and W. Huber. "Covariate powered cross‐weighted multiple testing". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 83 (2017).
[5] M. J. Zhang, F. Xia, and J. Y. Zou. "Fast and covariate-adaptive method amplifies detection power in large-scale multiple hypothesis testing". In: Nature Communications 10 (2019).
[6] R. Yurko, M. G. G'Sell, K. Roeder, et al. "A selective inference approach for false discovery rate control using multiomics covariates yields insights into disease risk". In: Proceedings of the National Academy of Sciences of the United States of America 117 (2020), pp. 15028-15035.
[7] M. D. Robinson, D. J. McCarthy, and G. K. Smyth. "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data". In: Bioinformatics 26 (2010), pp. 139-140.
[8] M. Stephens. "Simple Transform Simulation". In: Online (2023). Accessed: 2023-07-14.
[9] C. Dai, B. Lin, X. Xing, et al. "False Discovery Rate Control via Data Splitting". In: Journal of the American Statistical Association (2020).
[10] X. Ge, Y. E. Chen, D. Song, et al. "Clipper: p-value-free FDR control on high-throughput data from two conditions". In: Genome Biology 22 (2021).
[11] F. R. Guo and R. D. Shah. "Rank-transformed subsampling: inference for multiple data splitting and exchangeable p-values". In: ArXiv (2023).
[12] K. Sankaran and P. Jeganathan. "Microbiome Intervention Analysis with Transfer Functions and Mirror Statistics". In: ArXiv (2023).
[13] C. Fannjiang, S. Bates, A. N. Angelopoulos, et al. "Conformal prediction under feedback covariate shift for biomolecular design". In: Proceedings of the National Academy of Sciences of the United States of America 119 (2022).
[14] D. Song and J. J. Li. "PseudotimeDE: inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data". In: Genome Biology 22 (2021).
[15] K. Sankaran and S. P. Holmes. "structSSI: Simultaneous and Selective Inference for Grouped or Hierarchically Structured Data". In: Journal of Statistical Software 59.13 (2014), pp. 1-21.
[16] K. Sankaran. "Functions for Performing Local FDR Estimation when Null and Alternative are Poisson [R package LocFDRPois version 1.0.0]". In: CRAN (2015).
Step (1) distinguishes selective inference from ordinary statistical inference.
Looking for promising contexts \implies selective inference.
This can be passed to generic optimization solvers:
\begin{align*} \text{maximize } & \sum_{g} m_{g}\hat{F}\left(t_{g}\right) \\ \text{subject to } & \sum_{g = 1}^{G} m_{g}t_{g} \leq q \sum_{g = 1}^{G} m_{g}\hat{F}\left(t_{g}\right) \end{align*}
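For small G this can even be brute-forced. A toy sketch with G = 2 groups and a coarse grid, where the group sizes, p-values, and grid are all invented for illustration:

```r
set.seed(1)
q <- 0.1
m <- c(500, 500)                          # hypotheses per group
p1 <- c(runif(450), runif(50, 0, 0.001))  # group 1: some signal
p2 <- runif(500)                          # group 2: pure noise
Fhat <- function(p, t) mean(p <= t)       # empirical CDF per group

grid <- seq(0.0005, 0.05, by = 0.0005)
best_t <- c(0, 0); best_R <- 0
for (t1 in grid) for (t2 in grid) {
  R <- m[1] * Fhat(p1, t1) + m[2] * Fhat(p2, t2)  # estimated rejections
  feasible <- sum(m * c(t1, t2)) <= q * R         # FDR-hat <= q
  if (feasible && R > best_R) { best_t <- c(t1, t2); best_R <- R }
}
best_t
```

Group 1, with its excess of tiny p-values, should earn the more generous threshold.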
We need a mechanism for aggregating \hat{S}_{k} across iterations.
Let \hat{I}_{1}, \dots, \hat{I}_{m} store the row averages of this matrix. Larger \hat{I}_{m} means: