Sequence alignments of homologous proteins from evolutionarily distant organisms are used to pinpoint regions of structural and functional importance. Over long periods, only the most constrained segments retain detectable similarity to one another. This concept was extended to the whole database by cross-comparing comprehensive sets of sequences from various kingdoms and phyla, with evolutionary distances ranging from 2 billion years for the eukaryote/eubacteria divergence to 550 million years for the coelomate radiation. Significant similarities between these sets thus correspond to strongly conserved ancestral features. Using a series of matching/orthogonalization procedures, 500 independent ancestral types were detected within contemporary sequences. This fossil set represents only 4% of the original database but significantly matches 40% of the whole. It thus achieves a 10-fold enrichment in sequences of the greatest structural/functional significance and is an optimal source for the definition of motifs. Approximately 200 of these highly conserved sequences correspond to proteins whose role is not obviously central and which warrant further analysis. Theoretical computations suggest that the 500 ancestral types defined so far constitute most of the fossil sequences detectable in modern sequences. This is consistent with an independent study comparing three large new datasets: partial cDNAs from human and nematode, and ORFs from chromosome III of yeast. Thus, known proteins might already include representatives of most ancestral features predating the coelomate radiation.

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000012-01
Application #: 3845103
Support Year: 1
Fiscal Year: 1992

Name: National Library of Medicine
Country: United States