We propose a three-year interdisciplinary research plan to address two key issues currently facing the metagenomics community. The first issue concerns the accurate construction and annotation of OTU tables from millions of 16S rRNA sequences, which is one of the most important yet most difficult problems in microbiome data analysis. Currently, the field lacks computational algorithms capable of handling extremely large sequence datasets and constructing biologically consistent OTU tables. We propose a novel method that performs OTU table construction and annotation simultaneously by utilizing input and reference sequences, reference annotations, and data clustering structure within one analytical framework. Dynamic, data-driven cutoffs are derived to identify OTUs that are consistent not only with the data clustering structure but also with the reference annotations. When successfully implemented, our method will address the computational needs of processing the hundreds of millions of 16S rRNA reads currently being generated by large-scale studies. The second issue concerns developing novel methods to extract pertinent information from massive sequence data, thereby helping the field shift from descriptive research to mechanistic studies. We are particularly interested in microbial community dynamics analysis, which can provide a wealth of insight into disease development that is unattainable through a static experimental design, and which lays a critical foundation for developing probiotic and antibiotic strategies to manipulate microbial communities. Traditionally, system dynamics is approached through time-course studies. However, due to economic and logistical constraints, time-course studies are generally limited in the number of samples examined and the time period followed. With the rapid development of sequencing technology, many thousands of samples are being collected in large-scale studies.
This provides us with a unique opportunity to develop a novel analytical strategy that uses static data, instead of time-course data, to study microbial community dynamics. To our knowledge, this is the first time that massive static data have been used to study dynamic aspects of microbial communities. When successfully implemented, our approach can effectively overcome the sampling limitation of time-course studies and open a new avenue of research into the microbial dynamics underlying disease development, without requiring a resource-intensive time-course study. The proposed pipeline will be intensively tested on a large oral microbiome dataset consisting of ~2,600 subgingival samples (~330M reads). The analysis can significantly advance our understanding of the dynamic behaviors of oral microbial communities that may contribute to the development of periodontal disease. To our knowledge, no prior work has been performed on this scale to study oral microbial community dynamics. We have assembled a multidisciplinary team with expertise spanning machine learning, bioinformatics, and oral microbiology. The expected outcome of this work will be a set of computational tools of high utility for the microbiology community and beyond.

Public Health Relevance

The human microbiome plays essential roles in many important physiological processes. We propose an interdisciplinary research plan to address some major computational challenges in current microbiome research. If successfully implemented, this work could substantially expand the capacity of existing pipelines for large-scale data analysis and scientific discovery, with a significant impact on the field.

National Institutes of Health (NIH)
National Institute of Allergy and Infectious Diseases (NIAID)
Research Project (R01)
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Brown, Liliana L
State University of New York at Buffalo
Schools of Medicine
United States