The traditional view of protein evolution has been that all protein domains are descendants of distinct evolutionary lines, that there are relatively few such lines (about 1000), and that these lines are all of relatively ancient origin. Two new bodies of evidence make that view untenable. First, analysis of sets of fully sequenced genomes shows that most protein families appear to be small and narrowly distributed in phylogenetic space, apparently implying recent emergence. Second, analysis of the relationships between known protein structures shows that there are many more than 1000 distinct folds, which appears to imply many more evolutionary lines. The large discrepancy between theory and fact has been clear for some time, and a number of explanations have been put forward, but so far there has been no definitive study of the alternatives. In this project we will systematically and quantitatively investigate four separate hypotheses, each of which may account for some share of the large number of protein folds and apparently young proteins: that these (1) are the result of generation of new open reading frames or frame-shifted older ones; (2) are the outcome of extensive recombination of portions of older proteins; (3) are laterally transferred from unexplored parts of phylogenetic space; (4) are part of larger, older families where there has been rapid sequence change, such that not all relatives are found. To investigate these hypotheses we will develop and extend a set of computational methods. These include methods for building protein families; reliably estimating the age of protein families; detecting lateral gene transfer effects; determining to what extent members of families are likely to have been detected with sequence methods; more quantitatively determining whether protein structures are evolutionarily related; searching for remote structure and sequence relationships; and analyzing a range of protein properties as a function of family age.
We will also construct a web resource for distributing the results and soliciting extensive community annotation and discussion.

Public Health Relevance

Understanding the structural and functional adaptive properties of protein molecules underpins many aspects of medicine, particularly the emergence of new viruses, drug design, and combating resistance to new therapeutics in infectious diseases.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Macromolecular Structure and Function D Study Section (MSFD)
Program Officer
Eckstrand, Irene A
University of Maryland College Park
Other Domestic Higher Education
College Park
United States
Jeong, Jinseon; Kim, Young-Jun; Yoon, Sun Young et al. (2016) PLAG (1-Palmitoyl-2-Linoleoyl-3-Acetyl-rac-Glycerol) Modulates Eosinophil Chemotaxis by Regulating CCL26 Expression from Epithelial Cells. PLoS One 11:e0151758
Yu, Guoqin; Stoltzfus, Arlin (2012) Population diversity of ORFan genes in Escherichia coli. Genome Biol Evol 4:1176-87
Yomtovian, Inbal; Teerakulkittipong, Nuttinee; Lee, Byungkook et al. (2010) Composition bias and the origin of ORFan genes. Bioinformatics 26:996-9