Biomedical research and the basic sciences are increasingly dependent on high-throughput technologies that can simultaneously measure thousands of nucleic acid molecules in a sample. In combination with ingenious laboratory protocols, these technologies have permitted unprecedented ways of studying the molecular basis of disease and phenotypic variation. As a result of the increasing adoption of these technologies, more investigations rely on complex datasets and require the development of new statistical techniques to adequately interpret data. Today, applications of high-throughput technologies go far beyond their original task of studying DNA sequence itself and also include the measurement of quantitative and dynamic outcomes such as gene expression levels and DNA methylation (DNAm) status. These quantitative and dynamic outcomes introduce levels of variability that give rise to further data analytic challenges related to distinguishing unwanted sources of variability from biologically relevant signals. Furthermore, when measuring these quantitative outcomes, data are subject to severe technological and biological biases that can substantially impact downstream analyses. Our group has previously demonstrated that statistical methodology can provide great improvements over ad hoc algorithms offered as defaults by technology developers. Our highly cited statistical methodology and our widely used software demonstrate the success of our work. The National Research Council's Frontiers in Massive Data Analysis publication states that "the challenges for massive data go beyond the storage, indexing, and querying that have been the province of classical database systems and instead hinge on the ambitious goal of inference". Inference is particularly relevant in biomedical applications since we often look to draw conclusions based on observed differences between groups in the presence of within-group variability.
Two particularly challenging tasks relate to performing valid inference when 1) we perform scans over large spaces to identify small regions of interest and 2) the data are affected by unexpected systematic bias or batch effects. We will focus on these two general challenges. Our specific proposal is to work on the most urgent needs of researchers facing new challenges as they increasingly rely on high-throughput techniques. We will leverage the expertise of our collaborators to prioritize projects. We greatly appreciate the flexibility permitted by the R35 mechanism as it will help us maximize the impact of our work.

Public Health Relevance

High-throughput technologies are poised to become instrumental in the era of precision medicine. As a result of the increasing adoption of these technologies, more investigations rely on complex datasets and require the development of new techniques to adequately interpret data. We will develop the necessary statistical methods to help make these technologies primary tools for translational research and clinical applications.

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Unknown (R35)
Project #: 5R35GM131802-02
Application #: 9922327
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Ravichandran, Veerasamy
Project Start: 2019-05-01
Project End: 2024-04-30
Budget Start: 2020-05-01
Budget End: 2021-04-30
Support Year: 2
Fiscal Year: 2020
Total Cost:
Indirect Cost:
Name: Dana-Farber Cancer Institute
Department:
Type:
DUNS #: 076580745
City: Boston
State: MA
Country: United States
Zip Code: 02215