We propose a three-year interdisciplinary research plan to address two key issues currently facing the metagenomics community. The first issue concerns accurate construction and annotation of OTU tables using of millions of 16S rRNA sequences, which is one of the most important yet most difficult problems inmicrobiome data analysis. Currently, it lacks computational algorithms capable of handling extremely large sequence data and constructing biologically consistent OTU tables. We propose a novel method that performs OTU table construction and annotation simultaneously by utilizing input and reference sequences, reference annotations, and data clustering structure within one analytical framework. Dynamic data-driven cutoffs are derived to identify OTUs that are consistent not only with data clustering structure but also with reference annotations. When successfully implemented, our method will generally address the computational needs of processing hundreds of millions of 16S rRNA reads that are currently being generated by large-scale studies. The second issue concerns developing novel methods to extract pertinent information from massive sequence data, thereby facilitating the field shifting from descriptive research to mechanistic studies. We are particularly interested in microbial community dynamics analysis, which can provide a wealth of insight into disease development unattainable through a static experiment design, and lays a critical foundation for developing probiotic and antibiotic strategies to manipulate microbial communities. Traditionally, system dynamics is approached through time-course studies. However, due to economical and logistical constraints, time-course studies are generally limited by the number of samples examined and the time period followed. With the rapid development of sequencing technology, many thousands of samples are being collected in large-scale studies. This provides us with a unique opportunity to develop a novel analytical strategy to use static data, instead of time-course data, to study microbial community dynamics. To our knowledge, this is the first time that massive static data is used to study dynamic aspects of microbial communities. When successfully implemented, our approach can effectively overcome the sampling limitation of time-course studies, and opens a new avenue of research to study microbial dynamics underlying disease development without performing a resource-intensive time-course study. The proposed pipeline will be intensively tested on a large oral microbiome dataset consisting of ~2,600 subgingival samples (~330M reads). The analysis can significantly advance our understanding of dynamic behaviors of oral microbial communities possibly contributing to the development of periodontal disease. To our knowledge, no prior work has been performed on this scale to study oral microbial community dynamics. We have assembled a multidisciplinary team that covers expertise spanning the areas of machine learning, bioinformatics, and oral microbiology. The expected outcome of this work will be a set of computational tools of high utility for the microbiology community and beyond.
The human microbiome plays essential roles in many important physiological processes. We propose an interdisciplinary research plan to address some major computational challenges in current microbiome research. If successfully implemented, this work could significantly expand the capacity of existing pipelines for large-scale data analysis and scientific discovery, resulting in a significant impact on the field.