The study of the human microbiome, with its multitudes of host-associated organisms, holds great promise for increasing our understanding of human health and disease. Because metagenomic sequence data are fragmented and unlinked from genome-of-origin information, the particular challenge of metagenomics is to provide reliable functional annotation and taxonomic assignment. Here we address these issues by leveraging existing profile hidden Markov models (HMMs) of functionally characterized gene families. Instead of relying on fragment matches to full-length genes or gene models, we will determine which segments of gene models are capable of high-quality annotations of function and origin, and focus on those. By this approach, the portions of the gene models that have low sequence conservation or variable insertion/gap length (tending towards low recall), and those that are composed of sequence shared among multiple gene families and functions (tending towards low precision), are systematically eliminated, increasing the overall signal-to-noise ratio. The high-quality segments of the models ("mini" HMMs) will be our analytical tools. Using these methods we hope to provide a robust approach that frees metagenomics from the limitations of assembly-first strategies, and thereby provides access to information about the numerous low-abundance species in complex biological samples. We will use bacterial single-copy genes as taxonomic markers, and will produce a database of these genes from high-quality genomes. We expect to identify ~80 suitable marker genes, determined for several thousand genomes. For each of these genes, we will produce a corresponding reference phylogenetic tree. In the course of producing these resources, the existing models (TIGRFAMs and Pfam HMMs) will be updated based on the current set of reference genomes and a consistent, state-of-the-art construction process. These resources, and any software we produce, will be made available through our public website.
With these methods and resources, we will obtain taxonomic profiles, investigate genes of interest, and devise methods for linking those genes to the taxa in the profile. We will use real and synthetic metagenomes to validate the methods and to establish statistical confidence metrics for our results.
Metagenomes consist of short sequence fragments that are disconnected from information about their genome of origin. The methods proposed here attempt to overcome the limitations imposed by the fragmentary nature of the data by identifying reliable, short, fragment-sized markers of genes for use as detection, annotation and taxonomic placement tools. Because they are based on profile hidden Markov models (HMMs), these short markers are called "mini" HMMs (mHMMs).
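The segment-selection idea can be illustrated with a minimal sketch. It is not the project's actual procedure: it assumes that each candidate window along a full-length model has already been benchmarked against fragments of known origin, and the names `WindowStats` and `select_mini_hmm_windows`, as well as the threshold values, are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Benchmark counts for one candidate window of a full-length model (hypothetical)."""
    start: int      # first model position covered by the window
    end: int        # one past the last model position covered
    true_pos: int   # fragments from the gene family that the window detects
    false_pos: int  # fragments from other families that the window detects
    false_neg: int  # fragments from the gene family that the window misses


def precision(w: WindowStats) -> float:
    """Fraction of detected fragments that truly belong to the family."""
    detected = w.true_pos + w.false_pos
    return w.true_pos / detected if detected else 0.0


def recall(w: WindowStats) -> float:
    """Fraction of the family's fragments that the window detects."""
    members = w.true_pos + w.false_neg
    return w.true_pos / members if members else 0.0


def select_mini_hmm_windows(windows, min_precision=0.95, min_recall=0.90):
    """Keep only windows that pass both thresholds (threshold values are assumed)."""
    return [
        (w.start, w.end)
        for w in windows
        if precision(w) >= min_precision and recall(w) >= min_recall
    ]


# Illustrative, made-up counts for three windows of one model.
windows = [
    WindowStats(0, 60, true_pos=95, false_pos=1, false_neg=5),     # conserved, specific
    WindowStats(60, 120, true_pos=50, false_pos=0, false_neg=50),  # variable region: low recall
    WindowStats(120, 180, true_pos=98, false_pos=30, false_neg=2), # shared domain: low precision
]
selected = select_mini_hmm_windows(windows)
```

In this toy example only the first window would be retained. The second stands in for a region with variable insertion/gap length, and the third for sequence shared among multiple gene families.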