Molecular sequence databases contain approximately 5,000 independent families of protein sequences. A small number of these span multiple phyla and must represent ancient, evolutionarily conserved families of proteins. For well-studied phyla, most of these ancient families now appear to be represented in the molecular sequence databases. Proposed course: An algorithm, HHS, has been developed to take the pairwise similarity relations generated by the program BLASTP and assemble them into classes of mutually related proteins. Two phases are used. In the first phase, the ungapped high-scoring segments identified by BLAST are assembled into sets of mutually consistent diagonals, forming a gapped sequence alignment. In the second phase, the extents of these gapped alignments to each protein are compared. Overlapping alignments indicate the presence of a protein sequence domain. A connected-set definition is employed to map out each family of protein domains. The algorithm is computationally efficient and has been used to classify the results of BLAST searches run between all pairs of sequences in the NCBI non-redundant sequence database. Future work will implement a name generator for these protein domains so that they can be used as an automated source of protein annotation for molecular sequences. The evolution of individual domains is also being explored.
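
The second phase can be illustrated with a minimal sketch. It is not the HHS implementation, which the abstract does not specify. It assumes each gapped alignment has already been reduced to a pair of closed residue intervals, one on each protein, and that any overlap of at least `min_overlap` residues on the same protein links two alignments. The function names, the input tuple layout, and the `min_overlap` parameter are hypothetical. Connected sets are found with a simple union-find structure.

```python
from collections import defaultdict
from itertools import combinations


def find(parent, x):
    """Return the representative of x, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, x, y):
    """Merge the sets containing x and y."""
    rx, ry = find(parent, x), find(parent, y)
    if rx != ry:
        parent[ry] = rx


def overlap(a, b):
    """Number of residues shared by closed intervals a and b (0 if disjoint)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def domain_families(alignments, min_overlap=1):
    """Group alignment footprints into connected sets.

    alignments: list of (protein_a, (start, end), protein_b, (start, end)).
    Hypothetical layout: one tuple per gapped alignment from phase one.
    Returns a sorted list of sorted protein-name lists, one per connected set.
    """
    # Each alignment leaves one footprint on each of its two proteins.
    footprints = []
    for prot_a, iv_a, prot_b, iv_b in alignments:
        footprints.append((prot_a, iv_a))
        footprints.append((prot_b, iv_b))

    parent = list(range(len(footprints)))

    # The two footprints of one alignment belong to the same family.
    for k in range(len(alignments)):
        union(parent, 2 * k, 2 * k + 1)

    # Footprints that overlap on the same protein mark the same domain.
    by_protein = defaultdict(list)
    for idx, (prot, _) in enumerate(footprints):
        by_protein[prot].append(idx)
    for idxs in by_protein.values():
        for i, j in combinations(idxs, 2):
            if overlap(footprints[i][1], footprints[j][1]) >= min_overlap:
                union(parent, i, j)

    # Collect the proteins in each connected set.
    families = defaultdict(set)
    for idx, (prot, _) in enumerate(footprints):
        families[find(parent, idx)].add(prot)
    return sorted(sorted(members) for members in families.values())


example = [
    ("P1", (1, 100), "P2", (10, 110)),
    ("P2", (50, 120), "P3", (5, 75)),
    ("P1", (200, 300), "P4", (1, 100)),
]
print(domain_families(example))  # [['P1', 'P2', 'P3'], ['P1', 'P4']]
```

In this toy input, P1 appears in two families because its two footprints do not overlap, which is the sense in which overlapping alignments delimit separate domains. The pairwise comparison within each protein is quadratic and is used here only for clarity; the abstract gives no detail on how HHS achieves its efficiency.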

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Intramural Research (Z01)
Project #
1Z01LM000028-01
Application #
3845116
Study Section
Project Start
Project End
Budget Start
Budget End
Support Year
1
Fiscal Year
1992
Total Cost
Indirect Cost
Name
National Library of Medicine
Department
Type
DUNS #
City
State
Country
United States
Zip Code