Adaptive Reproducible High-Dimensional Nonlinear Inference for Big Biological Data

Fan, Yingying

Abstract

Big data is now ubiquitous in every field of modern scientific research. Many contemporary applications, such as the recent National Microbiome Initiative (NMI), demand highly flexible statistical machine learning methods that can produce both interpretable and reproducible results. It is therefore of paramount importance to identify, from a large number of available covariates, the crucial causal factors that are responsible for the response; statistically, this can be formulated as false discovery rate (FDR) control in general high-dimensional nonlinear models. Despite the many applications of shotgun metagenomic studies, most existing investigations concentrate on bacterial organisms. However, viruses and virus-host interactions play important roles in controlling the functions of microbial communities. In addition, viruses have been shown to be associated with complex diseases. Yet investigations into the roles of viruses in human diseases remain significantly underdeveloped. The objective of this proposal is to develop mathematically rigorous and computationally efficient approaches for highly complex big data and to apply these approaches to fundamental and important biological and biomedical problems. There are four interrelated aims.
In Aim 1, we will theoretically investigate the power of the recently proposed model-free knockoffs (MFK) procedure, which has been shown theoretically to control the FDR in arbitrary models and arbitrary dimensions. We will also theoretically justify the robustness of MFK to misspecification of the covariate distribution. These studies will lay the foundations for our developments in the other aims.
In Aim 2, we will develop deep learning approaches to predict viral contigs with higher accuracy, integrate our new algorithm with MFK to achieve FDR control for virus motif discovery, and investigate the power and robustness of our new procedure.
In Aim 3, we will take into account the virus-host motif interactions and adapt our algorithms and theories in Aim 2 for predicting virus-host infectious interaction status.
In Aim 4, we will apply the methods developed in the first three aims to analyze the shotgun metagenomics data sets in ExperimentHub, identifying viruses and virus-host interactions associated with several diseases at a target FDR level. Both the algorithms and the results will be disseminated through the web. The results from this study will be important for metagenomics studies under a variety of environments.
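
To make the notion of selecting features "at a target FDR level" concrete, the following is a minimal sketch of the selection step of a knockoff filter, using the standard knockoff+ threshold rule. It is an illustration under stated assumptions, not the project's implementation: the function names and the toy statistics are hypothetical, and the construction of the knockoff variables and of the feature statistics W_j (large positive values suggest a real signal) is not shown.

```python
from typing import List, Sequence


def knockoff_plus_threshold(w: Sequence[float], q: float) -> float:
    """Return the knockoff+ threshold for statistics w at target FDR level q.

    The threshold is the smallest t among the nonzero |w_j| such that
    (1 + #{j: w_j <= -t}) / max(1, #{j: w_j >= t}) <= q.
    Returns infinity when no candidate satisfies the bound.
    """
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # estimates false discoveries
        positives = sum(1 for x in w if x >= t)   # features that would be selected
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")


def select_features(w: Sequence[float], q: float) -> List[int]:
    """Return the indices j whose statistic w_j is at or above the threshold."""
    t = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= t]


# Hypothetical toy statistics: six strong positive values and four near zero.
toy_w = [5.0, 4.2, 3.9, 3.5, 3.1, 2.8, -0.4, 0.3, -0.2, 0.1]
selected = select_features(toy_w, q=0.2)
```

With these toy values and q = 0.2, the six features with large positive statistics are selected. Lowering q to 0.1 selects nothing, because the "+1" in the numerator makes the rule conservative when few features are available.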

Public Health Relevance

Big data is ubiquitous in biological research. Identifying causal factors associated with complex diseases or traits from big data is both highly important and challenging. New statistical and computational tools will be developed to control the false discovery rate (FDR) for molecular sequence data, based on the novel model-free knockoffs framework. These tools will be used to detect sequence motifs for viruses and motif pairs for virus-host interactions, and to analyze multiple metagenomics data sets related to complex diseases.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 1R01GM131407-01
Application #: 9674585
Study Section: Special Emphasis Panel (ZGM1)
Program Officer: Ravichandran, Veerasamy

Project Start: 2018-08-01
Project End: 2022-04-30
Budget Start: 2018-08-01
Budget End: 2019-04-30
Support Year: 1
Fiscal Year: 2018

Institution

Name: University of Southern California
Department: Biostatistics & Other Math Sci
Type: Sch of Business/Public Admin
DUNS #: 072933393

City: Los Angeles
State: CA
Country: United States
Zip Code: 90089

Related projects


NIH 2020 R01 GM	Adaptive Reproducible High-Dimensional Nonlinear Inference for Big Biological Data Fan, Yingying / University of Southern California
NIH 2019 R01 GM	Adaptive Reproducible High-Dimensional Nonlinear Inference for Big Biological Data Fan, Yingying / University of Southern California
NIH 2018 R01 GM	Adaptive Reproducible High-Dimensional Nonlinear Inference for Big Biological Data Fan, Yingying / University of Southern California
