class: title <div id="title"> Trustworthy and Adaptable Biological Data Integration<br/> <br/> </div> <div id="under_title"> IMS - NUS Workshop Proposal </div> <div id="subtitle_right"> 03 | July | 2024 <br/> Slides: <a href="https://go.wisc.edu/8k8r2q">go.wisc.edu/8k8r2q</a> </div> <div id="subtitle"> <i>Organizing Team</i><br/> <span style="font-size: 18px;"> Kris Sankaran, UWM<br/> Wei-Yin Loh, UWM<br/> Susan Holmes, Stanford<br/> Bibhas Chakraborty, Duke-NUS<br/> Bee Choo Tai, NUS <br/> Wanjie Wang, NUS <br/> </span> </div> --- class: middle .center[ ## Scientific Context ] --- ### Data Integration For many biological problems, researchers gather data from multiple views, .center[ <img src="figures/integration_types.png" width=1000/> ] Data integration makes it easier to answer questions using all relevant data. --- ### Example: Cancer .pull-left[ 1. Data about molecular activity and cellular organization has clarified when treatments are more likely to be effective. 1. Several technologies and levels of resolution -- population, person, tissue, cell -- need to be studied together. ] .pull-right[ <img src="figures/Lung_cancer_HAVCR2.jpg" width=450/> <span style="font-size: 22px;"> Cancer cells in liver tissue, from [1]. </span> ] --- ### Example: COVID-19 1. Molecular data have shed light on the immune system mechanisms responsible for variable COVID-19 severity. 1. Multiple high-dimensional sources can be integrated to identify key molecular interactions. .center[ <img src="figures/covid_multiomics.jpg" width=550/> <br/> <span style="font-size: 22px;"> Graphical summary from [2]. </span> ] --- class: middle .center[ ## Statistical Challenges ] --- ### Adaptability How can we enable careful statistical analysis across diverse data types and experimental designs? .pull-left[ 1. General-purpose techniques 1. Effective diagnostics 1. Scalability ] .pull-right[ <img src="figures/data_types.png" width=500/> ] Biological insights can be hindered by inappropriate assumptions or the failure to use all of the data. --- ### Reproducibility Complex workflows which have many interlocking steps and may use black boxes challenge scientific reproducibility and interpretability. The community needs methods for, .pull-left[ 1. Assessing stability 1. Addressing selection biases 1. Supporting re-usable software ] .pull-right[ <img src="figures/multimodal_flow.png" width=500/> ] --- ### Community Resources Larger community databases can be informative, but may be noisier or only indirectly related to questions of interest. How should they be brought into a specific analysis? .pull-left[ 1. Cell type atlases 1. Gene function annotation 1. DNA foundation models ] .pull-right[ <img src="figures/NeuronsGraphic-Lee_Dalley.png" width=700/> ] --- class: middle .center[ ## Mathematical Opportunities ] --- ### Motivation Theoretical analysis has driven advances in biological data integration, .pull-left[ 1. **Improved Algorithms: Theory creates language for forming and answering novel questions.** 1. Consolidation: Theory can create coherent framework from which to view available methods. ] .pull-right[ <img src="figures/cnv.png" width=440/> <span style="font-size: 18px;"> Scan statistics offer a useful lens for analyzing copy number variation [3]. </span> ] --- ### Motivation Theoretical analysis has driven advances in biological data integration, .pull-left[ 1. Improved Algorithms: Theory creates language for forming and answering novel questions. 1. **Consolidation: Theory can create coherent framework from which to view available methods.** ] .pull-right[ <img src="figures/cfit.jpg" width=480/> <span style="font-size: 18px;"> Low-rank models are ubiquitous in integration across high-dimensional genomics data [4]. </span> ] --- ### Motivation Modern data can also inspire developments in theory, .pull-left[ 1. **Expanded Abstractions: As more applications emerge, theoretical concepts can be enriched.** 1. Revised Assumptions: Properties that are taken for granted in narrower analysis settings may no longer apply. ] .pull-right[ .center[ <img src="figures/lfdr.png" width=240/> ] <span style="font-size: 18px;"> Microarray analysis sparked developments in empirical bayes and hypothesis testing [5]. </span> ] --- ### Motivation Modern data can also inspire developments in theory, .pull-left[ 1. Expanded Abstractions: As more applications emerge, theoretical concepts can be enriched. 1. **Revised Assumptions: Properties that are taken for granted in narrower analysis settings may no longer apply.** ] .pull-right[ <img src="figures/sparse_gca.png"/> <span style="font-size: 18px;"> Problems from data integration can help inspire research in high-dimensional statistics [6]. </span> ] --- class: middle .center[ ## Workshop Structure ] --- ### Goals <img src="figures/dna-molecule.png" width=40 align="left" hspace=20/>Support discussions that could shape methodological progress in biological data integration. <br/> <br/> <br/> <img src="figures/normal-distn.png" width=80 align="left" hspace=10/> Introduce cutting-edge biological datasets and problems to researchers from statistics. <br/> <br/> <br/> <img src="figures/computer.png" width=80 align="left" hspace=10/>Provide training on current best practices for researchers who are entering the field either from statistical or biological backgrounds. --- ### Values: Interdisciplinarity Our workshop aims to promote effective communication between researchers working on complementary biological and statistical aspects of the data integration problem. .pull-three-quarters-left[ 1. Focused discussions will clarify challenges and identify strategies for interdisciplinary research. 1. Tutorials will help establish shared vocabulary that can serve as a reference throughout the workshop. ] .pull-three-quarters-right[ <img src="figures/bridge.png"/> ] --- ### Values: Hands-on Learning We want workshop participants with varying levels of expertise to leave with a concrete understanding of the current theory and practice of data integration. .pull-three-quarters-left[ 1. We will curate a collection of problems, datasets, and methods for data integration which can serve as a resource for the wider community. 1. We will create opportunities for participants to present short research demos and talks. ] .pull-three-quarters-right[ <img src="figures/workbook.png"/> ] --- ### Schedule 1. The program will span roughly three weeks with an initial tutorial period. 1. Following the tutorial, research talks will be broken up with discussion sessions, participant-led demos, and lightning talks. .center[ <img src="figures/timeline.png" width=750/> ] --- ### Tutorials: Biological Data Management and Visualization **Kris Sankaran** will provide an introduction to managing and visualizing multi-omics data using tools from the Bioconductor ecosystem. .center[ <img src="figures/visualization-1.png" width=800/> ] --- ### Tutorials: Generative Models for Multi-omics **Kris Sankaran** will review concepts from probabilistic modeling, evaluation, and interpretation that are the basis for many data integration methods. .center[ <img src="figures/blocking_omics.gif" width=800/> ] <span style="font-size: 18px;"> Simulation for power analysis from [8]. </span> --- ### Tutorials: Introduction to Regression Trees **Wei-Yin Loh** will offer a course on tree-based methods and their application to mixed data types that arise in biomedical settings. .center[ <img src="figures/guide_1.png" width=680/><br/> <span style="font-size: 18px;"> An example regression tree, from the tutorial [7]. </span> ] --- ### Tutorials: Introduction to Regression Trees **Wei-Yin Loh** will offer a course on tree-based methods and their application to mixed data types that arise in biomedical settings. .center[ <img src="figures/genie3_overview.png" width=720/> <span style="font-size: 18px;"> Trees are often used for integration [9]. </span> ] --- ### Tutorials: Case Studies from CPTAC **Xiaoyu Song** will lead a tutorial describing how problems have been formulated and solved in her expeience with the Clinical Proteomic Tumor Analysis Consortium (CPTAC). .center[ <img src="figures/CPTAC_overview.jpg" width=700/> ] --- ### Tutorials: Case Studies from CPTAC **Xiaoyu Song** will lead a tutorial describing how problems have been formulated and solved in her expeience with the Clinical Proteomic Tumor Analysis Consortium (CPTAC). .center[ <img src="figures/CPTAC_multiomics.png" width=800/> ] --- ### Talks Criteria We will invite speakers based on complementary criteria, .pull-three-quarters-left[ 1. **Compelling Applications**: Work that shows what is possible by carefully analyzing modern data. 1. **Statistical Foundations**: Research that draws deeply from the statistical canon and highlights opportunities for the field. 1. **Creative Perspectives**: Studies that are able to get more out of their data by taking a new approach. ] .pull-three-quarters-right[ <img src="figures/presentation.png"/> ] --- ### Interaction Beyond the featured talks, we will create opportunities for both formal and informal interaction among all participants. .pull-three-quarters-left[ 1. Facilitated discussion 1. Lightning research talks 1. Data analysis demos ] .pull-three-quarters-right[ <img src="figures/speech-bubbles.png"/> ] We will curate resources from these activities on a website dedicated to the workshop. --- ### Why IMS-NUS? 1. **Interdisciplinary emphasis**: The workshop relies on an environment that allows researchers from multiple backgrounds to feel welcome. 1. **Statistics reputation**: Unlike related workshops on integration, we emphasize the role of statistics, and the IMS - NUS connection will help. 1. **Location**: The strength of both the statistics and bioinformatics communities in Singapore will ensure that the workshop balances theoretical and applied perspectives. .center[ <img src="figures/nus_ims_full.png" width=550/> ] --- ### Recruitment The organizing team will recruit students, scientists, and researchers locally and abroad who have the potential to benefit from one another's expertise. 1. University Campuses: National University of Singapore, Nanyang Technological University 1. Research Institutes: Genome Institute of Singapore, The Bioinformatics Institute 1. Medical Schools/Hospitals: Duke - NUS Medical School, NUS Medicine, LKC Medicine --- ### Conclusion We believe that this topic has the potential both to demonstrate the value of statistics to the wider scientific community and to bring exciting, new problems into the conversation in statistics. Thank you for your attention and we are happy to take any questions. --- ### References [1] Wikimedia. _File:Lung cancer HAVCR2.jpg - Wikimedia Commons - commons.wikimedia.org_. <https://commons.wikimedia.org/wiki/File:Lung_cancer_HAVCR2.jpg>. [Accessed 28-06-2024]. [2] J. P. Gygi et al. "Integrated longitudinal multiomics study identifies immune programs associated with acute COVID-19 severity and mortality". In: _Journal of Clinical Investigation_ 134.9 (May. 2024). ISSN: 1558-8238. DOI: [10.1172/jci176640](https://doi.org/10.1172%2Fjci176640). URL: [http://dx.doi.org/10.1172/JCI176640](http://dx.doi.org/10.1172/JCI176640). [3] N. R. Zhang et al. "A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data". En. In: _Biometrics_ 63.1 (Mar. 2007), pp. 22-32. --- [4] M. Peng et al. "Integration and transfer learning of single-cell transcriptomes via cFIT". En. In: _Proc. Natl. Acad. Sci. U. S. A._ 118.10 (Mar. 2021). [5] B. Efron. "Size, power and false discovery rates". En. In: _Ann. Stat._ 35.4 (Aug. 2007), pp. 1351-1377. [6] S. Gao et al. "Sparse GCA and Thresholded Gradient Descent". In: _Journal of Machine Learning Research_ 24.135 (2023), pp. 1-61. URL: [http://jmlr.org/papers/v24/21-0745.html](http://jmlr.org/papers/v24/21-0745.html). [7] W. Loh. _Classification and Regression Trees by Example_. <https://pages.stat.wisc.edu/~loh/ims21.pdf>. Tutorial at the 2021 Cuasal Inference with Big Data Workshop hosted by IMS-NUS. [Accessed 28-06-2024]. 2021. --- [8] K. Sankaran et al. "Generative Models: An Interdisciplinary Perspective". In: _Annual Review of Statistics and Its Application_ 10.1 (Mar. 2023), p. 325–352. ISSN: 2326-831X. DOI: [10.1146/annurev-statistics-033121-110134](https://doi.org/10.1146%2Fannurev-statistics-033121-110134). URL: [http://dx.doi.org/10.1146/annurev-statistics-033121-110134](http://dx.doi.org/10.1146/annurev-statistics-033121-110134). [9] V. A. Huynh-Thu et al. "Inferring regulatory networks from expression data using tree-based methods". En. In: _PLoS One_ 5.9 (Sep. 2010), p. e12776. --- --- ### Figure Attribution * presentation button by Sara from <a href="https://thenounproject.com/browse/icons/term/presentation-button/" target="_blank" title="presentation button Icons">Noun Project</a> (CC BY 3.0) * discussion by Cuan Studio from <a href="https://thenounproject.com/browse/icons/term/discussion/" target="_blank" title="discussion Icons">Noun Project</a> (CC BY 3.0) * workbook by john sapuomah from <a href="https://thenounproject.com/browse/icons/term/workbook/" target="_blank" title="workbook Icons">Noun Project</a> (CC BY 3.0) * bridge by Edy Subiyanto from <a href="https://thenounproject.com/browse/icons/term/bridge/" target="_blank" title="bridge Icons">Noun Project</a> (CC BY 3.0) * Computer by annisa luthfiasari from <a href="https://thenounproject.com/browse/icons/term/computer/" target="_blank" title="Computer Icons">Noun Project</a> (CC BY 3.0)