This project develops novel statistical inference procedures for biomedical big data (BBD), including data from diverse omics platforms, various medical imaging technologies and electronic health records. Statistical inference, i.e., assess- ing uncertainty, statistical signi?cance and con?dence, is a key step in computational pipelines that aim to discover new disease mechanisms and develop effective treatments using BBD. However, the development of statistical inference procedures for BBD has lagged behind technological advances. In fact, while point estimation and variable selection procedures for BBD have matured over the past two decades, existing inference procedures are either limited to simple methods for marginal inference and/or lack the ability to integrate biomedical data across multiple studies and plat- forms. This paucity is, in large part, due to the challenges of statistical inference in high-dimensional models, where the number of features is considerably larger than the number of subjects in the study. Motivated by our team's extensive and complementary expertise in analyzing multi-omics data from heterogenous studies, including the TOPMed project on which multiple team members currently collaborate, the current proposal aims to address these challenges. The ?rst aim of the project develops a novel inference procedure for conditional parameters in high-dimensional models based on dimension reduction, which facilitates seamless integration of external biological information, as well as biomedical data across multiple studies and platforms. To expand the application of this method to very high-dimensional models that arise in BBD applications, the second aim develops a data-adaptive screening procedure for selecting an optimal subset of relevant variables.
The third aim develops a novel inference procedure for high-dimensional mixed linear models. This method expands the application domain of high-dimensional inference procedures to studies with longitu- dinal data and repeated measures, which arise commonly in biomedical applications.
The fourth aim develops a novel data-driven procedure for controlling the false discovery rate (FDR), which facilitates the integration of evidence from multiple BBD sources, while minimizing the false negative rate (FNR) for optimal discovery. Upon evaluation using ex- tensive simulation experiments and application to multi-omics data from the TOPMed project, the last aim implements the proposed methods into easy-to-use open-source software tools leveraging the R programming language and the capabilities of the Galaxy work?ow system, thus providing an expandable platform for further developments for BBD methods and tools.

Public Health Relevance

Biomedical big data (BBD), including large collections of omics data, medical imaging data, and electronic health records, offer unprecedented opportunities for discovering disease mechanisms and developing effective treatments. However, despite their tremendous potential, discovery using BBD has been hindered by computational challenges, including limited advances in statistical inference procedures that allow biomedical researchers to investigate uncon- founded associations among biomarkers of interest and various biological phenotypes, while integrating data from multiple BBD sources. The current proposal bridges this gap by developing novel statistical machine learning methods and easy-to-use open-source software for statistical inference in BBD, which are designed to facilitate the integration of data from multiple studies and platforms.

Agency
National Institute of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
1R01GM133848-01A1
Application #
9969887
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Brazhnik, Paul
Project Start
2020-09-05
Project End
2024-08-31
Budget Start
2020-09-05
Budget End
2021-08-31
Support Year
1
Fiscal Year
2020
Total Cost
Indirect Cost
Name
University of Washington
Department
Biostatistics & Other Math Sci
Type
Schools of Public Health
DUNS #
605799469
City
Seattle
State
WA
Country
United States
Zip Code
98195