The generation of protein sequence data on a genome scale has greatly increased the demand for rapid, sensitive, and reliable methods for detecting functionally important conserved motifs (cm) in proteins. A method for detecting cm in protein sequence databases and assessing their statistical significance was developed and implemented in the CAP (Consistent Alignment Parser) and MoST (Motif Search Tool) programs. The MoST procedure is iterative: a weight matrix representing the cm is abstracted from an alignment block, the database is scanned with this matrix, and the new segments that are located are added to the alignment block. The approach is based on the statistics of score distributions for position-dependent weight matrices. The method was generalized to allow searches with two alignment blocks separated by a variable distance; this procedure was implemented in the MoST2 program. Methods for motif detection are further used in conjunction with other methods of protein sequence analysis to identify conserved domains and delineate protein superfamilies. This strategy was applied to a variety of biologically important groups of proteins. Selected examples follow. S-adenosyl methionine-binding motifs were identified in fibrillarins, which are eukaryotic nucleolar proteins, and it was predicted that fibrillarins possess rRNA methyltransferase activity. A dinucleotide-binding domain was detected in a family of guanine nucleotide exchange proteins, one of which is implicated in human hereditary blindness. A superfamily of proteins containing a lyase domain was delineated and, unexpectedly, such a domain was detected in adducin, a eukaryotic cytoskeletal protein implicated in hereditary hypertension. A nucleotidyltransferase domain, an acetyltransferase domain, and a putative new protein-protein interaction domain were detected in a family of eukaryotic translation initiation factors.
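The iterative loop described above (abstract a weight matrix from the alignment block, scan the database, add the new segments to the block, repeat) can be illustrated with a minimal Python sketch. This is not the MoST implementation. All function names are hypothetical, the background residue frequencies are assumed to be uniform, a simple pseudocount is assumed, and a fixed score threshold stands in for the cutoff that MoST derives from the statistics of score distributions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(block, pseudocount=1.0):
    """Build a log-odds weight matrix from equal-length aligned segments.

    Assumption: uniform background frequencies (1/20 per residue).
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*block):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the per-position weights; unknown residues contribute 0."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, segment))


def scan(pssm, sequences, threshold):
    """Return every ungapped window that scores at or above the threshold."""
    width = len(pssm)
    hits = set()
    for seq in sequences:
        for start in range(len(seq) - width + 1):
            segment = seq[start:start + width]
            if score_segment(pssm, segment) >= threshold:
                hits.add(segment)
    return hits


def iterative_search(seed_block, sequences, threshold, max_iterations=10):
    """Rebuild the matrix and rescan until no new segments are found."""
    block = list(dict.fromkeys(seed_block))  # drop duplicates, keep order
    for _ in range(max_iterations):
        pssm = build_pssm(block)
        new_segments = sorted(scan(pssm, sequences, threshold) - set(block))
        if not new_segments:
            break  # converged: the block no longer grows
        block.extend(new_segments)
    return block
```

For example, calling `iterative_search(["GAGKS", "GSGKT"], ["MMGAGKTLL", "WWWWWWWW", "PPGSGKSPP"], threshold=5.0)` on this toy data adds the segments `GAGKT` and `GSGKS` to the block in the first pass and stops in the second pass, when the rebuilt matrix finds nothing new. A fixed threshold is used only to keep the sketch short; a significance-based cutoff would have to be recomputed for each new matrix.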
A library of conserved motifs that characterize protein families with representatives encoded in the Escherichia coli genome was constructed. The library consists of 166 conserved alignment blocks that can be used by the MoST program. The significance of the project lies in the development of a coherent strategy for identifying cm and domains and delineating protein superfamilies, and in the prediction of the functions of a number of biologically important proteins using these methods.