Complex biological processes, including organ development, immune response and disease progression, are inherently dynamic. Learning their regulatory architecture requires understanding how components of a large system dynamically interact with each other and give rise to emergent behavior. Recent experimental advances have made it possible to investigate these biological systems in a data-driven fashion at high temporal resolution, allowing identification of new genes and their regulatory interactions. Longitudinal omics data sets are becoming increasingly common in clinical practice as well. Information on these collections of interacting genes can be integrated to gain systems-level insights into the roles of biological pathways and processes, including progression of diseases. Consequently, developing interpretable methods for learning functional relationships among genes, proteins or metabolites from high-dimensional time series data has become a timely research problem. The nature of these time-course data sets presents exciting opportunities and interesting challenges from a statistical perspective. Typical time-course omics data sets are challenging because of their high dimensionality and the non-linear relationships among system components. To tackle these challenges, one needs sophisticated dimension-reduction techniques that are biologically meaningful, computationally efficient and allow uncertainty quantification. Methods that incorporate prior biological information (e.g., pathway membership, protein-protein interactions) into the data analysis are good candidates for analyzing such high-dimensional systems using small samples. Here, we will develop three core methods to address the above challenges - (Aim 1): an empirical Bayes framework for clustering high-dimensional omics time-course data using prior biological knowledge;
(Aim 2): a quantile-based Granger causality framework for learning interactions among genes or metabolites from their lead-lag relationships;
and (Aim 3): a decision tree ensemble framework for searching cascades of interactions among genes from their temporal expression profiles. Our interdisciplinary team of statisticians and scientists will analyze time-course omics data from three research projects: (i) innate immune response systems in Drosophila, (ii) developmental processes in mouse models, and (iii) longitudinal metabolite profiling of TB patients. Insights from these analyses will be used to build and validate our methodology, which will be implemented in publicly available software. This proposal is innovative in its incorporation of prior biological knowledge in the framework of novel dimension reduction techniques for interrogating high-dimensional time-course omics data. This research is significant in that it will impact basic sciences by elucidating data-driven, testable hypotheses on the regulatory architecture of biological processes, and clinical practice by monitoring disease progression and prognosis.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
5R01GM135926-02
Application #
10021429
Study Section
Special Emphasis Panel (ZGM1)
Program Officer
Brazhnik, Paul
Project Start
2019-09-23
Project End
2023-08-31
Budget Start
2020-09-01
Budget End
2021-08-31
Support Year
2
Fiscal Year
2020
Total Cost
Indirect Cost
Name
Cornell University
Department
Biostatistics & Other Math Sci
Type
Earth Sciences/Resources
DUNS #
872612445
City
Ithaca
State
NY
Country
United States
Zip Code
14850