Motifs--short, gapless regions of similar sequence--have been shown to be useful for understanding the evolutionary and functional relationships among biopolymers. We are making good progress on automating the process of extracting motif descriptions from groups of related protein or DNA sequences and using those descriptions to search sequence databases and to analyze newly sequenced molecules. Recent results show that our methods are able to detect distant and subtle relationships among proteins that are not apparent using other sequence-based methods. We have made progress on three fronts. First, we have improved MEME (Multiple Expectation maximization for Motif Elicitation) [1], our algorithm for motif discovery. Second, we have ported MEME to the Intel Paragon massively parallel computer and made it available to the world as a WWW server. Third, we have developed MAST (Motif Alignment Search Tool), which is based on a new method for assessing the statistical significance of multiple motif scores. The MEME algorithm has been improved so that it can discover motif patterns when little is known about the number or arrangement of motifs within the training sequences. This was accomplished by using background information about protein motifs, encoded as a mixture of Dirichlet priors, in a novel way [2]. This technique dramatically improves the ability of MEME to discover motif patterns when the pattern occurs in only a few of the sequences in the training set, and when the pattern is very weak but occurs multiple times in some sequences in the training set. The biological community is now served by a parallel implementation of MEME running on SDSC's Intel Paragon computer. We have made this service available via a World Wide Web site [3] and are proceeding to advertise it.
We expect this service to be a valuable addition to single-sequence search tools (e.g., BLAST, FASTA) and multiple alignment tools, because MEME patterns are able to detect more distant relationships than single-sequence searches and because MEME can be used in situations where the sequences are too distantly related to be multiply aligned reliably. We have developed a new method for searching sequence databases with one or more motifs that characterize a protein family and implemented it in the MAST algorithm (Bailey and Gribskov, in preparation). One novel feature of this program is a method for calculating the p-value for multiple motif scores. This allows biologists to evaluate the statistical significance of apparent sequence similarities. We are planning to make MAST available on-line through a web site to complement the MEME web site.

[1] T.L. Bailey and C. Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proc. Second Int. Conf. Intelligent Sys. Molec. Biol., pp. 28-36, AAAI Press, 1994.
[2] T.L. Bailey and M. Gribskov, "The megaprior heuristic for discovering sequence patterns", to appear: Proc. Fourth Int. Conf. Intelligent Sys. Molec. Biol., AAAI Press, 1996.
[3] W. Grundy, T.L. Bailey, and C. Elkan, "ParaMEME: Discovering DNA and protein motifs with a scalable parallel computer--RESEARCH ABSTRACT", to appear: Proc. Fourth Int. Conf. Intelligent Sys. Molec. Biol., AAAI Press, 1996.