Big data is now ubiquitous in every field of modern scientific research. Many contemporary applications, such as the recent national microbiome initiative (NMI), greatly demand highly flexible statistical machine learning methods that can produce both interpretable and reproducible results. Thus, it is of paramount importance to identify crucial causal factors that are responsible for the response from a large number of available covariates, which can be statistically formulated as the false discovery rate (FDR) control in general high-dimensional nonlinear models. Despite the enormous applications of shotgun metagenomic studies, most existing investigations concentrate on the study of bacterial organisms. However, viruses and virus-host interactions play important roles in controlling the functions of the microbial communities. In addition, viruses have been shown to be associated with complex diseases. Yet, investigations into the roles of viruses in human diseases are significantly underdeveloped. The objective of this proposal is to develop mathematically rigorous and computationally efficient approaches to deal with highly complex big data and the applications of these approaches to solve fundamental and important biological and biomedical problems. There are four interrelated aims.
In Aim 1, we will theoretically investigate the power of the recently proposed model-free knockoffs (MFK) procedure, which has been theoretically justified to control FDR in arbitrary models and arbitrary dimensions. We will also theoretically justify the robustness of MFK with respect to the misspecification of covariate distribution. These studies will lay the foundations for our developments in other aims.
In Aim 2, we will develop deep learning approaches to predict viral contigs with higher accuracy, integrate our new algorithm with MFK to achieve FDR control for virus motif discovery, and investigate the power and robustness of our new procedure.
In Aim 3, we will take into account the virus-host motif interactions and adapt our algorithms and theories in Aim 2 for predicting virus-host infectious interaction status.
In Aim 4, we will apply the developed methods from the first three aims to analyze the shotgun metagenomics data sets in ExperimentHub to identify viruses and virus-host interactions associated with several diseases at some target FDR level. Both the algorithms and results will be disseminated through the web. The results from this study will be important for metagenomics studies under a variety of environments.

Public Health Relevance

Big data is ubiquitous in biological research. Identifying causal factors associated with complex diseases or traits from big data is highly important and challenging. New statistical and computational tools will be developed to control False Discovery Rate (FDR) for molecular sequence data based on the novel model-free knockoffs framework. They will be used to detect sequence motifs for viruses and motif-pairs for virus-host interactions, and to analyze multiple metagenomics data sets related to complex diseases.

Agency
National Institute of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
5R01GM131407-03
Application #
9923688
Study Section
Special Emphasis Panel (ZGM1)
Program Officer
Ravichandran, Veerasamy
Project Start
2018-08-01
Project End
2022-04-30
Budget Start
2020-05-01
Budget End
2021-04-30
Support Year
3
Fiscal Year
2020
Total Cost
Indirect Cost
Name
University of Southern California
Department
Biostatistics & Other Math Sci
Type
Sch of Business/Public Admin
DUNS #
072933393
City
Los Angeles
State
CA
Country
United States
Zip Code
90089