Computational methods will play a key role in extracting medically and scientifically useful information from the complete genome sequences of humans and other model organisms. Recognition of similarities to known biological sequence families has recently been significantly enhanced by the introduction of full probabilistic consensus models. The proposed research aims to further develop probabilistic models of protein and RNA consensus structure in order to improve recognition of distantly related protein and RNA homologues. These methods will be applied to large-scale genome analysis, protein fold recognition, and RNA secondary structure prediction. Hidden Markov modeling methods will be extended to include structural information in addition to consensus sequence information from a protein family, in order to increase the sensitivity of protein fold recognition. A library of hidden Markov models of several hundred known protein structure families will be built and incorporated into the publicly available SCOP protein structure database on the World Wide Web. RNA covariance models describe RNA secondary structure in addition to sequence consensus, but their use is currently limited to small RNAs. Algorithmic improvements will be developed that greatly extend the useful range of covariance models. Sensitive secondary-structure-based recognition and alignment of most RNAs will be made feasible, as will consensus secondary structure prediction. Models will be developed for identifying homologues of a number of RNA gene families. These methods will be applied to the analysis of protein and RNA genes in the genome sequences of Caenorhabditis elegans, Saccharomyces cerevisiae, and other organisms obtained by genome sequencing projects in St. Louis and at the Sanger Center in Cambridge, UK.