Several lines of evidence suggest a strong analogy between natural languages and biological sequences: the collective of proteins encoded by the genomes of fully sequenced organisms appears to contain organism-specific words, phrases and paragraphs. It is proposed that the biological analog of meaning in a natural text is the ability of a protein sequence to fold into its functional three-dimensional structure. In natural languages, frequent words carry little meaning, while rare words often allow identification of the topic of a particular text. The hypothesis therefore predicts that rare stretches of amino acids indicate the location of folding initiators. When inverse frequencies of amino acid n-grams were plotted along the sequence of lysozyme, a model protein for protein folding studies, features in the distribution of global properties along the sequence could be recognized, supporting the analogy to natural languages. In the next 12 months the focus will be on studying the distribution of rare n-grams in the human genome. If a correlation between folding domains and the distribution of rare n-grams can be established, this would (i) provide compelling support for the hypothesis and (ii) shed light on one of the major unsolved questions in biology today: the mechanism by which functional three-dimensional structures are formed from a one-dimensional sequence of amino acids.
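The idea of plotting inverse n-gram frequencies along a sequence can be illustrated with a minimal sketch. This is not the project's actual method: the function names `ngram_counts` and `inverse_frequency_profile`, the toy sequences, the choice of n = 3, and the add-one smoothing are all assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(sequences, n):
    """Count every overlapping amino acid n-gram across a set of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def inverse_frequency_profile(query, counts, n):
    """Return one score per n-gram start position in `query`.

    The score is 1 / (count + 1). The +1 is add-one smoothing (an
    assumption of this sketch), so an n-gram never seen in the
    reference set gets the maximum score of 1.0.
    """
    return [
        1.0 / (counts[query[i:i + n]] + 1)
        for i in range(len(query) - n + 1)
    ]


# Toy reference set standing in for a proteome (hypothetical sequences).
corpus = ["MKVLAAGIVG", "MKVLSAGIVG", "MKVLAAGWCH"]
counts = ngram_counts(corpus, 3)

# Higher scores mark rarer n-grams along the query sequence.
profile = inverse_frequency_profile("MKVLAAGWCH", counts, 3)
```

In this toy case the highest scores fall on the C-terminal n-grams, which occur in only one reference sequence. On real data, such a profile would be plotted against residue position and compared with known folding domains.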

Project Start:
Project End:
Budget Start: 2001-12-15
Budget End: 2002-11-30
Support Year:
Fiscal Year: 2002
Total Cost: $99,596
Indirect Cost:
Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213