The most highly conserved regions of proteins can be represented as blocks of aligned sequence segments, typically with multiple blocks for a given protein family. During the previous funding period: 1) We developed an automated system for finding a set of blocks for each protein family represented in a catalog of families. The resulting database of blocks formed the basis for a gene classification system sensitive to both local and global relationships. 2) This system was used to detect distant homologies, cross-family relationships and repeated motifs, using protein or DNA sequences as queries to search the database. 3) The alignments in blocks were used to construct a series of amino acid substitution matrices for scoring local alignments in general (the BLOSUM series). 4) To test these matrices, we developed comprehensive evaluation procedures using the hundreds of protein families represented in our database of blocks. These matrices were found to strikingly outperform matrices based on the Dayhoff evolutionary model. 5) The results of this work have reached the community through our e-mail server, which has more than 100 users, and through the replacement of Dayhoff matrices by our best overall performer, BLOSUM 62, in searching applications implemented by others. Since the detection of protein homology is often the best clue to the function of a gene, improved methodology is important to many fields of biological research. For the next funding period, we propose to expand the database of blocks, to improve the systems used for making and searching blocks, to apply our evaluation methods to these potential improvements, and to evaluate other substitution matrix strategies for sequence database searching. The protein comparison tools that we are developing are expected to have wide utility, such as in the detection and exon prediction of genes in raw genomic sequence.
We also propose to extend the utility of our database for other applications, such as blocks-based design of variable-length PCR primers for sensitive detection of gene family members in whole genomic DNA.
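
As an illustration of how a substitution matrix can be derived from aligned blocks, the following is a minimal sketch of a log-odds calculation over residue pairs observed in block columns. It is a simplified assumption-laden example, not the procedure used to build the BLOSUM series: it omits the clustering of similar sequences that distinguishes members of the series, the function names `pair_frequencies` and `log_odds_scores` are hypothetical, and `toy_block` is invented data.

```python
import math
from collections import Counter
from itertools import combinations


def pair_frequencies(block):
    """Return observed frequencies of unordered residue pairs,
    counted within each column of an ungapped block."""
    counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            counts[tuple(sorted((a, b)))] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def log_odds_scores(block):
    """Return rounded half-bit log-odds scores for each observed pair."""
    q = pair_frequencies(block)

    # Marginal residue frequencies implied by the pair frequencies.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Expected pair frequency if residues paired independently.
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# Hypothetical four-sequence block, three columns wide.
toy_block = ["ACA", "ACS", "ACS", "GCA"]
scores = log_odds_scores(toy_block)
for pair, score in sorted(scores.items()):
    print(pair, score)
```

Pairs that never occur in the block receive no score in this sketch; a real matrix needs a value for every residue pair, which is one reason large collections of blocks are required.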