The most highly conserved regions of proteins can be represented as blocks of aligned sequence segments, typically with multiple blocks for a given protein family. During the previous funding period: 1) We developed an automated system for finding a set of blocks for each protein family represented in a catalog of families. The resulting database of blocks formed the basis for a gene classification system sensitive to both local and global relationships. 2) This system was used to detect distant homologies, cross-family relationships, and repeated motifs using protein or DNA sequences as queries to search the database. 3) The alignments in blocks were used to construct a series of amino acid substitution matrices for scoring local alignments in general (the BLOSUM series). 4) To test these matrices, we developed comprehensive evaluation procedures using the hundreds of protein families represented in our database of blocks. These matrices were found to strikingly outperform matrices based on the Dayhoff evolutionary model. 5) The results of this work have been made available to the community via our e-mail server, which has more than 100 users, and through the replacement of Dayhoff matrices by our best overall performer, BLOSUM 62, in searching applications implemented by others. Since the detection of protein homology is often the best clue to the function of a gene, improved methodology is important to many fields of biological research. For the next funding period, we propose to expand the database of blocks, to improve the systems used for making and searching blocks, to apply our evaluation methods to these potential improvements, and to evaluate other substitution matrix strategies for sequence database searching. The protein comparison tools that we are developing are expected to have wide utility, for example in detecting genes and predicting their exons in raw genomic sequence. We also propose to extend the utility of our database to other applications, such as blocks-based design of variable-length PCR primers for sensitive detection of gene family members in whole genomic DNA.
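
As an illustration of item 3, the sketch below shows one way log-odds substitution scores can be derived from the columns of aligned blocks. It is a simplified toy, not the project's actual software: the function names `pair_counts` and `log_odds_scores` and the example block are hypothetical, and the sequence clustering step that distinguishes the members of the BLOSUM series (such as the 62 in BLOSUM 62) is deliberately left out.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pair_counts(blocks):
    """Count unordered residue pairs in every column of every block.

    Each block is a list of equal-length, ungapped sequence segments.
    """
    counts = Counter()
    for block in blocks:
        for column in zip(*block):
            for a, b in combinations(column, 2):
                counts[tuple(sorted((a, b)))] += 1
    return counts


def log_odds_scores(blocks):
    """Return half-bit log-odds scores for every observed residue pair."""
    counts = pair_counts(blocks)
    total = sum(counts.values())
    # Observed frequency of each unordered pair.
    q = {pair: n / total for pair, n in counts.items()}

    # Probability of each residue occurring in a pair.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Expected pair frequency if residues paired at random.
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * log2(freq / expected))
    return scores


# Hypothetical toy block: three aligned two-residue segments.
toy_blocks = [["WL", "WL", "WI"]]
toy_scores = log_odds_scores(toy_blocks)
```

Pairs that never co-occur in a column receive no score in this sketch; a real matrix needs far more data than a toy block, so the values here only demonstrate the arithmetic.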

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM029009-17
Application #: 2391895
Study Section: Genome Study Section (GNM)
Project Start: 1981-04-01
Project End: 1998-03-31
Budget Start: 1997-04-01
Budget End: 1998-03-31
Support Year: 17
Fiscal Year: 1997
Total Cost:
Indirect Cost:
Name: Fred Hutchinson Cancer Research Center
Department:
Type:
DUNS #: 075524595
City: Seattle
State: WA
Country: United States
Zip Code: 98109
Ng, Pauline C; Henikoff, Steven (2003) SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res 31:3812-4
Tompa, Rachel; McCallum, Claire M; Delrow, Jeffrey et al. (2002) Genome-wide profiling of DNA methylation reveals transposon targets of CHROMOMETHYLASE3. Curr Biol 12:65-8
Lindroth, A M; Cao, X; Jackson, J P et al. (2001) Requirement of CHROMOMETHYLASE3 for maintenance of CpXpG methylation. Science 292:2077-80
Colbert, T; Till, B J; Tompa, R et al. (2001) High-throughput screening for induced point mutations. Plant Physiol 126:480-4
Kunin, V; Chan, B; Sitbon, E et al. (2001) Consistency analysis of similarity between multiple alignments: prediction of protein function and fold structure from analysis of local sequence motifs. J Mol Biol 307:939-49
Malik, H S; Eickbush, T H (2001) Phylogenetic analysis of ribonuclease H domains suggests a late, chimeric origin of LTR retrotransposable elements and retroviruses. Genome Res 11:1187-97
Malik, H S; Henikoff, S (2000) Dual recognition-incision enzymes might be involved in mismatch repair and meiosis. Trends Biochem Sci 25:414-8
Henikoff, J G; Pietrokovski, S; McCallum, C M et al. (2000) Blocks-based methods for detecting protein homology. Electrophoresis 21:1700-6
McCallum, C M; Comai, L; Greene, E A et al. (2000) Targeted screening for induced mutations. Nat Biotechnol 18:455-7
Malik, H S; Burke, W D; Eickbush, T H (2000) Putative telomerase catalytic subunits from Giardia lamblia and Caenorhabditis elegans. Gene 251:101-8

Showing the most recent 10 out of 44 publications