The bacterial communities (microbiota) residing on the human body have been linked to a variety of acute and chronic diseases and conditions, such as obesity, inflammatory bowel disorders, Type 2 diabetes, depression, and urinary tract infections (UTIs), as well as to the host's response to a variety of treatments and health interventions for these diseases and conditions. As the critical role played by the microbiota has become increasingly recognized, microbiome sequencing data sets are now routinely generated under ever more sophisticated experimental designs and survey strategies. While such data share many common features and challenges of modern big data, such as high dimensionality and sparsity, they also possess characteristics peculiar to the microbiota, including (i) the explicit and latent contextual relationships among the bacterial species, such as their evolutionary and functional relationships; and (ii) the substantial heterogeneity across samples and complex structure in the sample-to-sample variation. Effective analysis of modern microbiome studies calls for new statistical methodology that incorporates these important characteristics in the data generative mechanism. This project's objective is to develop a suite of statistical models, methods, algorithms, and software that meet this urgent need. An initial aim is to develop a multi-scale probabilistic framework for modeling microbiome compositions that effectively characterizes the high dimensionality, sparsity, and substantial cross-sample variation in microbiome sequencing data, and accommodates a variety of common experimental design features, such as covariates, batch effects, and multiple time points, while striking a balance among flexibility, analytical parsimony, and computational tractability. An additional focus is to develop latent variable models for microbiome compositional data for the purpose of identifying latent structures such as sample clusters and species subcommunities.
A final aim is to produce user-friendly, open-source software that implements all of the proposed methods for the analysis of microbiome sequencing data. All of the models and methods developed are informed by two ongoing collaborative projects of PI Ma and his team. One is on the identification of microbial communities associated with UTIs in aging women, and the other is on the study of longitudinal changes in the microbiome of cancer patients undergoing hematopoietic stem cell transplantation. These studies will serve as testbeds for all development. The models, methods, and software developed will not only result in better prediction of health outcomes in these and other microbiome studies but also help decipher the roles of the microbiome in various diseases and biomedical processes, with the ultimate goal of personalized interventions on patients' microbiome compositions that lead to improved health.

Public Health Relevance

The goal of this research is to develop new statistical models, methods, algorithms, and software for more effective analysis of microbiome sequencing data collected under a variety of experimental and survey designs. These new statistical tools will not only lead to better prediction of health outcomes in modern microbiome studies, but also help decipher the roles of the microbiome in various diseases and biomedical processes, such as cancer and urinary tract infections. They will help advance toward the ultimate goal of personalized interventions on patients' microbiota that lead to improved health.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Biostatistical Methods and Research Design Study Section (BMRD)
Program Officer
Brazhnik, Paul
Duke University
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States