One of the big challenges in genomics is to organize and classify the huge amount of sequence data. This motivates the development of computational methods that can infer biological information from sequence alone. A number of computer programs have been designed for computational gene annotation, with varying degrees of success. One class of programs uses algorithms based on Hidden Markov Models (HMMs) to locate translational and transcriptional features of the genome, such as coding regions, splice sites, and initiation and termination signals. These signals are then used to predict gene structures. A second class of gene-finding programs builds on sequence similarity: such programs either align a new sequence to a known protein or align two syntenic sequences. The success of these homology-based methods comes from the fact that coding regions are generally well conserved between species that diverged as long as 450 million years ago. At evolutionary distances of around 50-100 million years, as between human and mouse, the conservation also extends to other functional regions important for gene expression, such as promoters, UTRs, and other regulatory domains. In this project we intend to construct an annotation tool that combines and generalizes these two approaches, HMMs and sequence alignment. The actual prediction of genes and other functionally related elements will be carried out by a generalized form of HMM called a generalized pair HMM (GPHMM). The computational complexity of the problem is greatly reduced by the use of what we call an approximate alignment.
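To make the HMM idea concrete, the following is a minimal sketch of Viterbi decoding for a toy two-state model that labels each base as coding or noncoding. It is not the GPHMM of this project and does not model splice sites, a second sequence, or the approximate alignment. The state names, the transition and emission probabilities, and the function name `viterbi` are all assumptions chosen only for illustration.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical parameters, chosen only for illustration (not estimated from data).
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},     # GC-rich
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy model."""
    if not seq:
        return []
    # scores[s] is the log-probability of the best path ending in state s.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores = {}
        pointers = {}
        for s in STATES:
            # Best predecessor state for arriving in s at this position.
            best_prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(EMIT[s][base])
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    sequence = "ATATATATGCGCGCGC"
    for base, label in zip(sequence, viterbi(sequence)):
        print(base, label)
```

With these made-up parameters, an AT-rich stretch followed by a GC-rich stretch is labeled noncoding and then coding. A realistic gene finder would use many more states, with explicit length distributions and signal models, in place of this single composition contrast.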