Trustworthy and Adaptable Biological Data Integration

<div id="title">
Trustworthy and Adaptable Biological Data Integration 
 
</div>
<div id="under_title">
IMS - NUS Workshop Proposal 
</div>

<div id="subtitle_right">
03 | July | 2024 
Slides: <a href="https://go.wisc.edu/8k8r2q">go.wisc.edu/8k8r2q</a>
</div>

<div id="subtitle">
Organizing Team 

Kris Sankaran, UWM 
Wei-Yin Loh, UWM 
Susan Holmes, Stanford 
Bibhas Chakraborty, Duke-NUS 
Bee Choo Tai, NUS 
Wanjie Wang, NUS 

</div>

---

### Data Integration

For many biological problems, researchers gather data from multiple views.

Data integration makes it easier to answer questions using all relevant data.

---

### Example: Cancer

.pull-left[
1. Data about molecular activity and cellular organization has clarified
when treatments are more likely to be effective.

1. Several technologies and levels of resolution -- population,
person, tissue, cell -- need to be studied together.
]

.pull-right[
<img src="figures/Lung_cancer_HAVCR2.jpg" width=450/>

Cancer cells in lung tissue, from [1].

]

---

### Example: COVID-19

1. Molecular data have shed light on the immune system mechanisms responsible
for variable COVID-19 severity [2].

1. Multiple high-dimensional sources can be integrated to identify key molecular
interactions.

---

### Adaptability

How can we enable careful statistical analysis across diverse data types and
experimental designs?

Biological insights can be hindered by inappropriate assumptions or the failure
to use all of the data.

---

### Reproducibility

Complex workflows which have many interlocking steps and may use black boxes
challenge scientific reproducibility and interpretability. The community needs
methods for,

---

### Community Resources

Larger community databases can be informative, but may be noisier or only
indirectly related to questions of interest. How should they be brought into a
specific analysis?

---

### Motivation

Theoretical analysis has driven advances in biological data integration,

.pull-left[
1. **Improved Algorithms: Theory creates language for forming and answering novel questions.**

1. Consolidation: Theory can create a coherent framework from which to view
available methods.
]

.pull-right[
 <img src="figures/cnv.png" width=440/>
 
 Scan statistics offer a useful lens for analyzing copy number variation [3].
 
]

---

### Motivation

Theoretical analysis has driven advances in biological data integration,

.pull-left[
1. Improved Algorithms: Theory creates language for forming and answering novel questions.

1. **Consolidation: Theory can create a coherent framework from which to view available methods.**
]

.pull-right[
 <img src="figures/cfit.jpg" width=480/>
 
 Low-rank models are ubiquitous in integration across high-dimensional genomics
 data [4].
 
]

---

### Motivation

Modern data can also inspire developments in theory,

.pull-left[
1. **Expanded Abstractions: As more applications emerge, theoretical concepts can be enriched.**

1. Revised Assumptions: Properties that are taken for granted in narrower analysis settings may no longer apply.
]

.pull-right[
Microarray analysis sparked developments in empirical Bayes and hypothesis testing [5].

]

---

### Motivation

Modern data can also inspire developments in theory,

.pull-left[
1. Expanded Abstractions: As more applications emerge, theoretical concepts can be enriched.

1. **Revised Assumptions: Properties that are taken for granted in narrower analysis settings may no longer apply.**
]

.pull-right[
<img src="figures/sparse_gca.png"/>

Problems from data integration can help inspire research in high-dimensional
statistics [6].

]

---

### Goals

<img src="figures/dna-molecule.png" width=40 align="left" hspace=20/>Support discussions that could shape methodological progress in biological data integration.

<img src="figures/normal-distn.png" width=80 align="left" hspace=10/> Introduce cutting-edge biological datasets and problems to researchers from statistics.

<img src="figures/computer.png" width=80 align="left" hspace=10/>Provide training on current best practices for researchers who are entering the field either from statistical or biological backgrounds.

---

### Values: Interdisciplinarity

Our workshop aims to promote effective communication between researchers working
on complementary biological and statistical aspects of the data integration
problem.

.pull-three-quarters-left[
1. Focused discussions will clarify challenges and identify strategies for
interdisciplinary research.

1. Tutorials will help establish shared vocabulary that can serve as a reference
throughout the workshop.
]
.pull-three-quarters-right[
<img src="figures/bridge.png"/>
]

---

### Values: Hands-on Learning

We want workshop participants with varying levels of expertise to leave with a
concrete understanding of the current theory and practice of data integration.

.pull-three-quarters-left[
1. We will curate a collection of problems, datasets, and methods for data
integration which can serve as a resource for the wider community.

1. We will create opportunities for participants to present short research demos
and talks.
]
.pull-three-quarters-right[
<img src="figures/workbook.png"/>
]

---

### Schedule

1. The program will span roughly three weeks with an initial tutorial period.

1. Following the tutorial, research talks will be broken up with discussion
sessions, participant-led demos, and lightning talks.

---

### Tutorials: Biological Data Management and Visualization

**Kris Sankaran** will provide an introduction to managing and visualizing
multi-omics data using tools from the Bioconductor ecosystem.

---

### Tutorials: Generative Models for Multi-omics

**Kris Sankaran** will review concepts from probabilistic modeling, evaluation,
and interpretation that are the basis for many data integration methods.

.center[
<img src="figures/blocking_omics.gif" width=800/>
]

Simulation for power analysis from [8].

---

### Tutorials: Introduction to Regression Trees

**Wei-Yin Loh** will offer a course on tree-based methods and their application
to mixed data types that arise in biomedical settings.

.center[
<img src="figures/guide_1.png" width=680/> 

An example regression tree, from the tutorial [7].

]

---

### Tutorials: Introduction to Regression Trees

**Wei-Yin Loh** will offer a course on tree-based methods and their application
to mixed data types that arise in biomedical settings.

.center[
Trees are often used for integration [9].

]

---

### Tutorials: Case Studies from CPTAC

**Xiaoyu Song** will lead a tutorial describing how problems have been
formulated and solved in her experience with the Clinical Proteomic Tumor
Analysis Consortium (CPTAC).

---

### Talks Criteria

We will invite speakers based on complementary criteria,

.pull-three-quarters-left[
1. **Compelling Applications**: Work that shows what is possible by carefully analyzing modern data.

1. **Statistical Foundations**: Research that draws deeply from the statistical canon and highlights opportunities for the field.

1. **Creative Perspectives**: Studies that are able to get more out of their data by taking a new approach.
]

---

### Interaction

Beyond the featured talks, we will create opportunities for both formal and
informal interaction among all participants.

.pull-three-quarters-left[
1. Facilitated discussion
1. Lightning research talks
1. Data analysis demos 
]
.pull-three-quarters-right[
 <img src="figures/speech-bubbles.png"/>
]

We will curate resources from these activities on a website dedicated to the workshop.

---

### Why IMS-NUS?

1. **Interdisciplinary emphasis**: The workshop relies on an environment that
allows researchers from multiple backgrounds to feel welcome.

1. **Statistics reputation**: Unlike related workshops on integration, we
emphasize the role of statistics, and the IMS - NUS connection will reinforce that emphasis.

1. **Location**: The strength of both the statistics and bioinformatics
communities in Singapore will ensure that the workshop balances theoretical and applied perspectives.

---

### Recruitment

The organizing team will recruit students, scientists, and researchers locally
and abroad who have the potential to benefit from one another's expertise.

1. University Campuses: National University of Singapore, Nanyang Technological University
1. Research Institutes: Genome Institute of Singapore, The Bioinformatics Institute
1. Medical Schools/Hospitals: Duke - NUS Medical School, NUS Medicine, LKC Medicine

---

### Conclusion

We believe that this topic has the potential both to demonstrate the value of
statistics to the wider scientific community and to bring exciting, new problems
into the conversation in statistics.

Thank you for your attention. We are happy to take any questions.

---

### References

[1] Wikimedia. _File:Lung cancer HAVCR2.jpg - Wikimedia Commons - commons.wikimedia.org_. <https://commons.wikimedia.org/wiki/File:Lung_cancer_HAVCR2.jpg>. [Accessed 28-06-2024].

[2] J. P. Gygi et al. "Integrated longitudinal multiomics study identifies immune programs associated with acute COVID-19 severity and mortality". In: _Journal of Clinical Investigation_ 134.9 (May 2024). ISSN: 1558-8238. DOI: [10.1172/jci176640](https://doi.org/10.1172%2Fjci176640). URL: [http://dx.doi.org/10.1172/JCI176640](http://dx.doi.org/10.1172/JCI176640).

[3] N. R. Zhang et al. "A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data". In: _Biometrics_ 63.1 (Mar. 2007), pp. 22-32.

---

[4] M. Peng et al. "Integration and transfer learning of single-cell transcriptomes via cFIT". In: _Proc. Natl. Acad. Sci. U. S. A._ 118.10 (Mar. 2021).

[5] B. Efron. "Size, power and false discovery rates". In: _Ann. Stat._ 35.4 (Aug. 2007), pp. 1351-1377.

[6] S. Gao et al. "Sparse GCA and Thresholded Gradient Descent". In: _Journal of Machine Learning Research_ 24.135 (2023), pp. 1-61. URL:
[http://jmlr.org/papers/v24/21-0745.html](http://jmlr.org/papers/v24/21-0745.html).

[7] W. Loh. _Classification and Regression Trees by Example_. <https://pages.stat.wisc.edu/~loh/ims21.pdf>. Tutorial at the 2021 Causal Inference with Big Data Workshop hosted by IMS-NUS. [Accessed 28-06-2024]. 2021.

---

[8] K. Sankaran et al. "Generative Models: An Interdisciplinary Perspective". In: _Annual Review of Statistics and Its Application_ 10.1 (Mar. 2023), pp. 325-352. ISSN: 2326-831X. DOI: [10.1146/annurev-statistics-033121-110134](https://doi.org/10.1146%2Fannurev-statistics-033121-110134). URL: [http://dx.doi.org/10.1146/annurev-statistics-033121-110134](http://dx.doi.org/10.1146/annurev-statistics-033121-110134).

[9] V. A. Huynh-Thu et al. "Inferring regulatory networks from expression data using tree-based methods". In: _PLoS One_ 5.9 (Sep. 2010), p. e12776.

---

### Figure Attribution

* presentation button by Sara from <a href="https://thenounproject.com/browse/icons/term/presentation-button/" target="_blank" title="presentation button Icons">Noun Project</a> (CC BY 3.0)
* discussion by Cuan Studio from <a href="https://thenounproject.com/browse/icons/term/discussion/" target="_blank" title="discussion Icons">Noun Project</a> (CC BY 3.0)
* workbook by john sapuomah from <a href="https://thenounproject.com/browse/icons/term/workbook/" target="_blank" title="workbook Icons">Noun Project</a> (CC BY 3.0)
* bridge by Edy Subiyanto from <a href="https://thenounproject.com/browse/icons/term/bridge/" target="_blank" title="bridge Icons">Noun Project</a> (CC BY 3.0)
* Computer by annisa luthfiasari from <a href="https://thenounproject.com/browse/icons/term/computer/" target="_blank" title="Computer Icons">Noun Project</a> (CC BY 3.0)