Recent genomic data have revealed that approximately two fifths of all cellular proteins contain regions that have been detectably conserved among species that diverged more than 500 million years ago. It is estimated that fewer than 1000 such regions exist, and that current protein databases contain most of them. Proteins containing these regions will be of particular use in identifying novel sequences from the various genome projects. We have therefore attempted to construct a relatively small "core" database of such proteins. This database, while under 15% the size of comprehensive protein sequence databases, is nevertheless able to detect virtually all the significant similarities that newly sequenced proteins show to the complete databases. It may be used to speed up database searches and to reduce the amount of redundant output they generate, which in turn improves the ability to detect relatively subtle similarities. The core database is also interesting as an object of study in its own right, permitting improved statistical studies of protein sequences and suggesting the broad outline of the proteins necessary for life.
Our aim has not been to provide a rigorous definition for the core of a protein sequence database, but rather to construct a simple but useful tool where none existed before.

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000047-01
Application #: 3781280
Support Year: 1
Fiscal Year: 1993
Name: National Library of Medicine
Country: United States