One of the major challenges in genomics is organizing and classifying the vast amount of sequence data. This motivates the development of computational methods that can infer biological information from sequence alone. A number of computer programs have been designed for computational gene annotation, with varying degrees of success. Algorithms based on hidden Markov models (HMMs) locate translational and transcriptional features of the genome, such as coding regions, splice sites, and initiation and termination signals. These signals are then used to predict gene structures. A second class of gene-finding programs builds on sequence similarity, producing an alignment of a new sequence to a known protein or aligning two syntenic sequences. The success of such homology-based methods comes from the fact that coding regions are generally well conserved between species that diverged as far back as 450 million years ago. At evolutionary distances of around 50-100 million years, as between human and mouse, the conservation also extends to other functional regions important for gene expression, such as promoters, UTRs, and other regulatory domains. In this project we intend to construct an annotation tool that combines and generalizes these two approaches, HMMs and sequence alignment. The actual prediction of genes and other functionally related elements will be carried out by a generalized form of HMM called a generalized pair HMM (GPHMM). The computational complexity of the problem is greatly reduced by the use of what we call an approximate alignment.
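As an illustration of the HMM-based approach described above, the following is a minimal sketch of Viterbi decoding over a toy two-state model that labels each base as coding or noncoding. It is an ordinary single-sequence HMM, not the generalized pair HMM proposed in this project, and every state name and probability in it is a hypothetical value chosen for illustration only.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical parameters, chosen for illustration only (not from the project).
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM."""
    if not seq:
        return []
    # score[s] is the log-probability of the best path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for landing in s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    sequence = "AAAAAAAAAAGCGCGCGCGC"
    for base, label in zip(sequence, viterbi(sequence)):
        print(base, label)
```

Working in log space avoids numerical underflow on long sequences. A realistic gene finder would use many more states (exons, introns, splice sites, start and stop signals) with parameters estimated from annotated data.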

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG002362-02
Application #: 6623989
Study Section: Genome Study Section (GNM)
Program Officer: Good, Peter J
Project Start: 2002-06-01
Project End: 2005-05-31
Budget Start: 2003-06-01
Budget End: 2004-05-31
Support Year: 2
Fiscal Year: 2003
Total Cost: $308,937
Indirect Cost:

Name: University of California Berkeley
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 124726725
City: Berkeley
State: CA
Country: United States
Zip Code: 94704

Snir, Sagi; Rao, Satish (2010) Quartets MaxCut: a divide and conquer quartets algorithm. IEEE/ACM Trans Comput Biol Bioinform 7:704-18
Snir, Sagi; Warnow, Tandy; Rao, Satish (2008) Short quartet puzzling: a new quartet-based phylogeny reconstruction algorithm. J Comput Biol 15:91-103
Begun, David J; Holloway, Alisha K; Stevens, Kristian et al. (2007) Population genomics: whole-genome analysis of polymorphism and divergence in Drosophila simulans. PLoS Biol 5:e310
Schwartz, Ariel S; Pachter, Lior (2007) Multiple alignment by sequence annealing. Bioinformatics 23:e24-9
Chatterji, Sourav; Pachter, Lior (2007) Patterns of gene duplication and intron loss in the ENCODE regions suggest a confounding factor. Genomics 90:44-8
Beerenwinkel, Niko; Drton, Mathias (2007) A mutagenetic tree hidden Markov model for longitudinal clonal HIV sequence data. Biostatistics 8:53-71
Chen, K; Rajewsky, N (2006) Deep conservation of microRNA-target relationships and 3'UTR motifs in vertebrates, flies, and nematodes. Cold Spring Harb Symp Quant Biol 71:149-56
Snir, Sagi; Rao, Satish (2006) Using max cut to enhance rooted trees consistency. IEEE/ACM Trans Comput Biol Bioinform 3:323-33
Dewey, Colin N; Pachter, Lior (2006) Evolution at the nucleotide level: the problem of multiple whole-genome alignment. Hum Mol Genet 15 Spec No 1:R51-6
Dewey, Colin N; Huggins, Peter M; Woods, Kevin et al. (2006) Parametric alignment of Drosophila genomes. PLoS Comput Biol 2:e73

Showing the most recent 10 out of 26 publications