We are examining theoretical aspects of the structure, function, and evolution of proteins with emphasis upon protein sequences and upon those problems for which a computer is essential. We detect distant relationships and infer evolutionary trees of proteins and phylogenetic trees of species in which they occur, using sequence data. We organize all known sequences into the Superfamily List, a hierarchical tabulation with five levels of distinction based on sequence similarity. We plan to develop an improved computer model of the evolutionary process by incorporating additional data on point mutations, parameters for deletion-insertion events, and parameters to allow variable mutability at different positions in the chain. Groups of simulated sequences of known evolutionary distances will be constructed and used to test and improve the performance of our programs for detecting relationships and constructing trees. This grant also partially supports the Atlas of Protein Sequence and Structure Reference Data Center, which contains a complete, currently correct, continuing collection of protein sequence data and files of background information including evolutionary history, distant relationships, alignments, genetic relationships, and three-dimensional structures. The protein sequence data are made available to the scientific community in several forms: published volumes of the Atlas of Protein Sequence and Structure and of the Protein Segment Dictionary, and computer-readable tapes of the sequence data. These are periodically updated. Data searches and other computer services using the up-to-date sequence data collection are performed at cost for other research workers upon request. In the 1980-81 grant year we obtained an administrative supplement to partially support the preparation of the information and the development of an efficient computer retrieval system for our Nucleic Acid Sequence Database.