Analysis of the microbial communities present in or on the human body holds promise for explaining the dynamic basis of host-microbiome symbiosis and the contribution of these communities (the human microbiome) to health and disease. Vast amounts of metagenomic DNA sequence can be collected. However, current bioinformatics tools limit our ability to translate sequence into fundamentally new biomedical knowledge. There is a great need to improve existing tools and develop computational methods to address the complexity of data generated by human microbiome projects (HMP). This proposal takes a three-pronged approach to dramatically improve methods for extracting meaning from HMP sequence data. The first is to develop algorithms that build protein families, each family just inclusive enough that checking a genome for some cohort of families reveals whether a pathway is present. These algorithms resemble phylogenetic profiling, a data mining technique, but go through optimization steps that guide the building of each family. Pre-built families are not required. The result is new descriptive power that can discover and describe new systems and pathways. Thousands of new families will be created. The second is a new way to apply annotation rules. Large numbers of automatically created rules, each of which works on a fairly small number of proteins, can apply very exacting tests to determine whether one protein should be expected to have the same function as another that is already characterized. By deriving support from comparing gene regions or metabolic backgrounds in ways made possible only by having large numbers of complete genomes, these rules can achieve much greater confidence than simpler annotation techniques. The third is a systematic compilation of the right starting points for annotation.
Annotation methods today are built to achieve maximum leverage from those few proteins whose functions are known for sure, but searching for those good anchors is surprisingly difficult, and searching repeatedly is wasteful. The CHAR database will collect experimentally characterized proteins and make them "rule-ready" and universally available. All of the resources developed through this proposal will be made publicly available. These approaches combine to let us read metabolic properties from microbial genome sequences more accurately and identify better ways to fight disease.
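
To make the first approach more concrete, the following is a minimal sketch of how a family boundary might be chosen by profile matching. It is an illustration under stated assumptions, not the proposal's actual algorithm: the function names (`binomial_tail`, `best_family_cutoff`), the input layout, and the use of a binomial tail probability as the optimization score are all hypothetical choices made for this example. The idea shown is to walk down a ranked list of sequence matches and stop at the depth where the set of genomes hit agrees best with the set of genomes known to carry the pathway.

```python
from math import comb


def binomial_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def best_family_cutoff(ranked_hits, pathway_genomes, n_genomes):
    """Pick the list depth whose genomes best match a pathway profile.

    ranked_hits     -- list of (protein_id, genome_id), best match first
                       (hypothetical input format for this sketch)
    pathway_genomes -- set of genome ids believed to carry the pathway
    n_genomes       -- total number of genomes under consideration

    Returns (depth, score); a lower score means the agreement between the
    genomes hit and the pathway profile is less likely to arise by chance.
    """
    # Chance that a randomly chosen genome carries the pathway.
    p = len(pathway_genomes) / n_genomes
    seen = set()          # distinct genomes encountered so far
    in_profile = 0        # how many of those carry the pathway
    best = (0, 1.0)
    for depth, (_, genome) in enumerate(ranked_hits, start=1):
        if genome not in seen:
            seen.add(genome)
            if genome in pathway_genomes:
                in_profile += 1
        score = binomial_tail(in_profile, len(seen), p)
        if score < best[1]:
            best = (depth, score)
    return best
```

In this sketch, the proteins above the returned depth would form the candidate family. A real implementation would also need to handle tied scores, paralogs within a genome, and multiple-testing effects, none of which are addressed here.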

Public Health Relevance

The massive numbers of microbial species living in and on the human organism vary greatly from person to person, and transform our metabolism enough to impact our health. The work we propose reads patterns of DNA differences from microbe to microbe as a means to figure out which species do what inside the human gut, and therefore how we can make changes to treat or prevent disease.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Special Emphasis Panel (ZRG1-BST-F (50))
Program Officer
Bonazzi, Vivien
J. Craig Venter Institute, Inc.
United States
Ansong, Charles; Ortega, Corrie; Payne, Samuel H et al. (2013) Identification of widespread adenosine nucleotide binding in Mycobacterium tuberculosis. Chem Biol 20:123-33
Haft, Daniel H; Selengut, Jeremy D; Richter, Roland A et al. (2013) TIGRFAMs and Genome Properties in 2013. Nucleic Acids Res 41:D387-95
Madupu, Ramana; Richter, Alexander; Dodson, Robert J et al. (2012) CharProtDB: a database of experimentally characterized protein annotations. Nucleic Acids Res 40:D237-41
Eberhardt, Ruth Y; Haft, Daniel H; Punta, Marco et al. (2012) AntiFam: a tool to help identify spurious ORFs in protein annotation. Database (Oxford) 2012:bas003
Pathak, Darshankumar T; Wei, Xueming; Bucuvalas, Alex et al. (2012) Cell contact-dependent outer membrane exchange in myxobacteria: genetic determinants and mechanism. PLoS Genet 8:e1002626
Haft, Daniel H; Basu, Malay Kumar (2011) Biological systems discovery in silico: radical S-adenosylmethionine protein families and their target peptides for posttranslational modification. J Bacteriol 193:2745-55
Haft, Daniel H; Varghese, Neha (2011) GlyGly-CTERM and rhombosortase: a C-terminal protein processing signal in a many-to-one pairing with a rhomboid family intramembrane serine protease. PLoS One 6:e28886
Basu, Malay K; Selengut, Jeremy D; Haft, Daniel H (2011) ProPhylo: partial phylogenetic profiling to guide protein family construction and assignment of biological process. BMC Bioinformatics 12:434
Makarova, Kira S; Haft, Daniel H; Barrangou, Rodolphe et al. (2011) Evolution and classification of the CRISPR-Cas systems. Nat Rev Microbiol 9:467-77
Haft, Daniel H (2011) Bioinformatic evidence for a widely distributed, ribosomally produced electron carrier precursor, its maturation proteins, and its nicotinoprotein redox partners. BMC Genomics 12:21