The Human Microbiome Project will generate billions of high-throughput sequence reads from rRNA gene PCR products and metagenomic DNA; these data have the potential to revolutionize our understanding of the microbial inhabitants of humans, the putative functions of these microbes, and their associations with health and disease. However, limitations in processing this flood of data hinder our ability to make inferences or draw conclusions. Specifically, commonly available methods for identifying microbes from DNA or RNA sequences do not identify organisms to the species level, and may fail to make confident assignments at the genus level or higher even when the sequences carry sufficient phylogenetic information to do so. As a result, many publicly available classification tools lump sequences representing distinct species into less specific taxonomic categories, as we have found when applying these tools to several novel bacteria linked with vaginal disease. This proposal is significant because it offers solutions to these fundamental problems by developing and refining novel computational tools; prototypes of these tools have already demonstrated significantly improved results. Our freely available software will help catalyze research on the human microbiome by increasing the speed, accuracy, and specificity of microbial identification, as well as offering methods for between-sample comparison. There are several innovative features of this proposal. First, computationally efficient maximum-likelihood phylogenetic placement of sequences on trees will provide a robust method for identifying microbes and distinguishing between novelty and uncertainty. Second, this proposal will provide accurately annotated collections of reference sequences that can facilitate classification of organisms present in major human body sites.
More importantly, this proposal will develop software tools that will enable individual researchers to assemble sets of reference sequences using an approach that maximizes sequence diversity within each represented taxon while excluding poor-quality and mislabeled sequences. Third, this proposal will develop new analysis and visualization tools to aid statistical comparison of microbial communities across space and time, and to help capture complex changes in these communities in intuitive visualizations.
Aim 1: Develop and optimize phylogenetic placement software for the analysis of 16S rRNA and other phylogenetically informative loci to better describe bacterial diversity and community composition.
This aim will advance the development of our phylogenetic placement software pplacer, including the addition of algorithms for taxonomic annotation and species delineation, implementation of improved measures of uncertainty, and low-level code optimization.
Aim 2: Develop computational tools to curate project-specific sets of reference sequences from public repositories and local sources.
This aim is motivated by our observation that appropriately selected reference sequences and accurate phylogenies are critical and limiting components of the classification process.
Aim 3: Develop a software pipeline to integrate high-throughput sequencing data analysis, including preprocessing, phylogenetic placement, statistical comparison, and phylogenetic visualization.
This aim will result in two deliverables extending the capabilities of a broad spectrum of researchers: a web service for users who value simplicity, as well as R/Bioconductor software packages for users who value modularity, reproducibility, and extensibility.
Human-associated microbes can have a major impact on human health, either by promoting beneficial interactions (such as facilitating nutrient absorption) or by damaging host tissues, thereby producing disease. New sequencing technologies provide an unprecedented opportunity to explore the relationships between microbes and humans, but our computational tools for characterizing microbial populations have not kept pace with the technology. This project seeks to close this gap by developing computational tools for analyzing high-throughput sequence data so that the full power of sequencing technologies can be used to accurately identify microbes and assess their relationships with human health.