The large and growing databases of known protein sequences represent a knowledge base with the power to revolutionize biology, biochemistry, and biotechnology. These sequencing efforts have highlighted the growing gap between the volume of sequence data and our ability to analyze it. We are generally interested in answering specific questions about structure, function, and mechanisms. Much information can come from the identification of homologous proteins about which more is known. Identifying distant homologs is still difficult, even with the advent of new profile methods. Another powerful approach is to predict the tertiary structure. While progress is being made, we are still far from being able to reliably predict structures based on sequence data alone. Both of these techniques can be assisted by an analysis of the evolutionary record encoded in the sequences of available homologous proteins. We still do not have a good understanding of how to interpret this record, partly because of a lack of good models of the evolutionary process. Optimal score functions for the identification of distant homologies will be developed and analyzed, and the optimization techniques will be applied to the creation of optimal score functions for alignment of known homologs. Models of amino acid site substitutions will be used to create protein profiles that will allow the identification of further homologs and analogs. Optimization procedures will be developed for the identification of tertiary structures in proteins, including encoding the evolutionary patterns of sidechain conservation and variation. These techniques will be applied to the "inverse-folding" process, that is, identifying sequences that are likely to fold into a given structure. Simple models of the evolutionary process will be developed to examine how observed properties of proteins can be understood in an evolutionary context.
These models will be elaborated to include the effect of population dynamics on the evolutionary process, as well as selective pressure resulting from the need for the protein to be functional. These models will be used to explore which protein properties are likely to be inherent, and to understand how much information can be derived for proteins based on information about known homologs.

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM005770-09
Application #: 6638862
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Florance, Valerie
Project Start: 1995-04-01
Project End: 2005-03-31
Budget Start: 2003-04-01
Budget End: 2004-03-31
Support Year: 9
Fiscal Year: 2003
Total Cost: $186,287
Indirect Cost:

Name: University of Michigan Ann Arbor
Department: Genetics
Type: Schools of Medicine
DUNS #: 073133571
City: Ann Arbor
State: MI
Country: United States
Zip Code: 48109