The scientific community increasingly appreciates the important role that the microbiome plays in many diseases and health conditions. The structure of the microbial community (e.g., relative abundances of different taxa and microbial networks/interactions) changes in response to many environmental and host factors. Investigating how microbes interact with each other, with their environment and with their host can shed light on the biological mechanisms underlying microbiome-related diseases and health conditions. Despite intense research interest and the availability of massive data generated by cutting-edge techniques (16S rRNA gene sequencing, shotgun metagenomic sequencing and metabolomics), statistical tools that can fully handle the complexity of microbiome data, including high dimensionality, phylogenetic relatedness, relatively small sample sizes, compositional constraints and others, are still lacking. The main goal of this proposal is to develop statistically powerful and computationally efficient methods to address these challenges in analyzing microbiome data. In particular, this research will be applied to high-throughput microbiome data and lead to new statistical methods for controlled variable selection that 1) select a subgroup of taxa that are genuinely associated with disease-related outcomes under a pre-specified false discovery rate (FDR), where the outcome can be either a single disease outcome of interest or multivariate, such as multiple secondary phenotypes related to the disease; and 2) identify taxa and taxa-metabolite interactions that are associated with a disease outcome under a given FDR threshold.
Our proposed methods are innovative in that they both select important taxa or taxa-metabolite interactions and control the FDR, which substantially enhances the reproducibility and reliability of discoveries in microbiome association studies. The improved taxa selection will further facilitate downstream laboratory-based functional studies, eventually leading to potential improvements in the prevention, detection, treatment and monitoring of many health and disease conditions from a microbiome perspective. Completion of this proposal will also help bridge the gap between the burgeoning research interest in microbiome studies and the lack of analytical tools. In addition to publishing in peer-reviewed journals, we will disseminate our results through conferences and open-source software that is freely available to the wider scientific community. The proposed methods are essential for an improved understanding of microbiome mechanisms and of the microbiome's interaction with the host genome or metabolome in the pathology of certain diseases, which is of central importance to human health.

Public Health Relevance

The microbiome is considered an important component of many disease states and clinical conditions, including bacterial vaginosis, HIV risk, obesity, nonalcoholic fatty liver disease and many others. Critical questions about the microbiome are how members of the bacterial community interact with each other and with their environment, and how these interactions affect the host. This proposal is concerned with developing appropriate and efficient statistical methods to facilitate our understanding of the complex relationships among the microbiome, the metabolome, environmental factors and related disease outcomes.

National Institutes of Health (NIH)
National Institute of Allergy and Infectious Diseases (NIAID)
Exploratory/Developmental Grants (R21)
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Gezmu, Misrak
Pennsylvania State University
Public Health & Prev Medicine
Schools of Medicine
United States