Interpretation of genome sequence data relies heavily upon computational analysis. The goals of the Human Genome Project include the development of improved computer algorithms for more accurate identification of genes and for more sensitive recognition of homologies. Both kinds of algorithms have been improved by the use of probabilistic modeling methods. The first half of this proposal uses probabilistic modeling methods to identify noncoding RNA genes in genome sequences. Most research into genefinding algorithms has understandably focused on protein coding genes. However, an unknown number of genes make functional noncoding RNAs instead of coding for proteins. Three different computational approaches are proposed. First, a computational screen for about 10-18 new pseudouridylation guide small nucleolar RNA genes in Saccharomyces cerevisiae is proposed, based on structural and sequence homology. Second, a probabilistic model of RNA secondary structure will be developed for use as an RNA genefinder program, identifying novel structural RNAs by significant secondary structure content. Third, comparative sequence analysis of the Caenorhabditis elegans and Caenorhabditis briggsae genomes will be used to identify conserved sequences that do not correspond to coding regions. The second half of the proposal focuses on using profile hidden Markov models to improve functional annotation of predicted protein coding genes. HMMER, a profile HMM software package, will continue to be supported and developed, and will continue to support the PFAM database of more than 800 common protein domains. Additionally, a "simulated evolution" algorithm is proposed for increasing HMMER's sensitivity.