For billions of years, nature has been conducting the greatest experiment of all time. Imagine one day gaining access to the detailed notes from this experiment. Today, with worldwide expeditions to collect samples from all habitats, single-cell sequencing of unculturable microbes, and the rapid drop in sequencing costs, we can finally tap into nature and gain access to these notes. All that is missing is a Rosetta Stone to interpret the data. The traditional approach to interpreting sequence data is comparison to known information, such as annotated genomes and/or experimentally characterized protein families. Unfortunately, nearly half of metagenomic data (from either environmental samples or microbiomes) lacks detectable sequence homology to any protein family, let alone to any isolated genome. Furthermore, the rate at which this "dark matter" is discovered far exceeds the rate at which experiments can be done to characterize it. An alternative approach is to learn a generative statistical model of the evolutionary process itself. The parameters of this model should in turn reflect the constraints imposed by natural selection. For protein-coding genes, these constraints include folding, stability, and function. Recently, it was shown that a global statistical model of a protein family that captures both conservation and coevolution patterns in the family possesses this quality. The strength of the coevolution term is correlated with residue-residue contacts in the 3D structure. These contacts have since been used to computationally determine the 3D structures of hundreds of unknown protein families and complexes. These structures, in turn, have been used to predict function by examining the arrangement of conserved residues and structural similarity to known protein structures. Structural matches can occur in the absence of detectable sequence similarity because structural similarity is retained over larger evolutionary distances.
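To make the link between coevolution and residue-residue contacts concrete, here is a minimal sketch that scores every pair of alignment columns by mutual information. This is a deliberately simpler, local statistic swapped in for illustration; it is not the global model described above, which fits all positions jointly so that direct couplings can be separated from indirect ones. The toy alignment and all function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def column(msa, i):
    """Return the residues found at position i across all aligned sequences."""
    return [seq[i] for seq in msa]


def entropy(symbols):
    """Shannon entropy (in nats) of the observed symbol frequencies."""
    n = len(symbols)
    return -sum((c / n) * math.log(c / n) for c in Counter(symbols).values())


def mutual_information(msa, i, j):
    """MI(i, j) = H(i) + H(j) - H(i, j); high values suggest covariation."""
    ci, cj = column(msa, i), column(msa, j)
    return entropy(ci) + entropy(cj) - entropy(list(zip(ci, cj)))


def coevolution_scores(msa):
    """Score every pair of columns in an equal-length alignment."""
    length = len(msa[0])
    return {
        (i, j): mutual_information(msa, i, j)
        for i, j in combinations(range(length), 2)
    }


# Hypothetical toy alignment: columns 0 and 3 change together,
# column 1 is fully conserved, column 2 varies independently.
toy_msa = ["ACDA", "ACEA", "GCDG", "GCEG"]

scores = coevolution_scores(toy_msa)
top_pair = max(scores, key=scores.get)
print("strongest covarying pair:", top_pair)
```

In this toy case the covarying columns stand out, while the conserved column contributes no pairwise signal; conservation is instead visible as zero per-column entropy. On real families, raw mutual information is confounded by phylogeny and by indirect correlations, which is part of the motivation for a global model.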
I propose to 1) develop an improved, unified statistical model of protein evolution that takes into account functional and lineage constraints; 2) apply the model to mine metagenomic "dark matter" sequences for new protein families, functions, and protein-protein interactions; and 3) probe the evolution of multicellularity through comparison of structures and interactions in the early tree of life. One of the results of the research will be a public database of new protein families and their predicted 3D structures and functions. These will be used by structural, molecular, and evolutionary biologists as a reference for future studies into the unknown protein universe.

Public Health Relevance

The goal of the proposed research is to develop new computational tools for the analysis of genomic and metagenomic data from environmental and microbiome samples (such as those from the human gut and other organs). One of the results of the research will be a public database of new protein families, including their predicted 3D structures and functions, to be used for future studies into the unknown protein universe. For public health, it is important to characterize these proteins both for potential therapeutic purposes and as possible drug targets involved in disease.

National Institutes of Health (NIH)
Office of the Director, National Institutes of Health (OD)
Early Independence Award (DP5)
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Miller, Becky
Harvard University
Schools of Arts and Sciences
United States