De novo gene prediction is the automated identification of gene structures using genome sequences as the only inputs. We propose to continue a project that has significantly improved the accuracy of de novo gene prediction in vertebrates. When we started, GENSCAN predicted a correct exon-intron structure throughout one open reading frame (ORF) at only 10% of human gene loci. We have now published systems that predict a correct ORF at 35% of human loci. RT-PCR and sequencing of our predictions have verified hundreds of new human genes. With this renewal we aim to continue driving improvements in the accuracy of vertebrate gene prediction and its utility for biomedical applications.
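The locus-level figures quoted above (10% and 35%) count a locus as correct only when a predicted exon-intron structure matches an annotated one throughout the ORF. A minimal sketch of such an exact-match score follows. The scoring rule, the function name `exact_orf_accuracy`, and the locus identifiers and coordinates are assumptions made for illustration, not the evaluation procedure used in the project.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]        # (start, end) of one coding exon; coordinates are hypothetical
Structure = List[Exon]        # one exon-intron structure, i.e. the coding exons of one ORF


def exact_orf_accuracy(
    annotated: Dict[str, List[Structure]],
    predicted: Dict[str, List[Structure]],
) -> float:
    """Fraction of annotated loci where at least one predicted structure
    matches at least one annotated structure at every exon boundary."""
    if not annotated:
        return 0.0
    hits = 0
    for locus, ref_structures in annotated.items():
        # Normalize each structure to a sorted tuple so exon order does not matter.
        ref_set = {tuple(sorted(s)) for s in ref_structures}
        pred_set = {tuple(sorted(s)) for s in predicted.get(locus, [])}
        if ref_set & pred_set:
            hits += 1
    return hits / len(annotated)


# Hypothetical toy data: locus "a" is predicted exactly, locus "b" is off by one base.
annotated = {"a": [[(1, 10), (20, 30)]], "b": [[(5, 15)]]}
predicted = {"a": [[(1, 10), (20, 30)]], "b": [[(5, 14)]]}
accuracy = exact_orf_accuracy(annotated, predicted)
```

Under this rule a single misplaced splice site makes the whole locus count as wrong, which is why exact-ORF accuracy is much lower than exon-level or nucleotide-level accuracy for the same predictor.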
Aim 1 Improve the accuracy of gene structure prediction in vertebrates - A. Develop improved models of informative patterns in multi-genome alignments Comparing the sequences of multiple vertebrate genomes should allow us to estimate the degree and pattern of selection at each site, lead ingto more accurate gene predictions. We propose a robust approach based on learning the patterns that exist in real alignment columns, even if they are due in part to sequencing, alignment, and assembly errors. The proposed model is a generalization of our successful TWINSCAN gene predictor. In preliminary studies its accuracy surpassed that of any previous gene prediction system for human. B. Develop improved models of informative patterns in the target DNA sequence We propose to systematically model regularities in gene structure that were previously considered too rare or elusive to be worthy of attention, such as splicing enhancers and suppressors, correlations between intron length and splice site sequence, and differential patterns of repeat insertion in introns versus non-transcribed regions.
Aim 2 Develop and maintain software, web server, and genome annotations Our goal is to improve scientific understanding and human health by providing more accurate gene predictions to the biomedical research community. Therefore, we will develop high quality, open source software, parameter sets for a variety of genomes, and a web server where users can submit sequences for annotation. Finally, we will distribute and display annotation for each new assembly of every vertebrate genome. This project will result in open source software that predicts exon-intron structures in vertebrate genomes more accurately than any current system. It will also increase the sensitivity and specificity of gene verification by RT-PCR. ? ? ?

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
2R01HG002278-04A1
Application #
7105917
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Good, Peter J
Project Start
2000-08-01
Project End
2009-03-31
Budget Start
2006-04-01
Budget End
2007-03-31
Support Year
4
Fiscal Year
2006
Total Cost
$305,250
Indirect Cost
Name
Washington University
Department
Genetics
Type
Schools of Medicine
DUNS #
068552207
City
Saint Louis
State
MO
Country
United States
Zip Code
63130
Brent, Michael R (2008) Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet 9:62-73
Tenney, Aaron E; Wu, Jia Qian; Langton, Laura et al. (2007) A tale of two templates: automatically resolving double traces has many applications, including efficient PCR-based elucidation of alternative splices. Genome Res 17:212-8
Keibler, Evan; Arumugam, Manimozhiyan; Brent, Michael R (2007) The Treeterbi and Parallel Treeterbi algorithms: efficient, optimal decoding for ordinary, generalized and pair HMMs. Bioinformatics 23:545-54
van Baren, Marijke J; Koebbe, Brian C; Brent, Michael R (2007) Using N-SCAN or TWINSCAN to predict gene structures in genomic DNA sequences. Curr Protoc Bioinformatics Chapter 4:Unit 4.8
Brent, Michael R (2007) How does eukaryotic gene prediction work? Nat Biotechnol 25:883-5
Arumugam, Manimozhiyan; Wei, Chaochun; Brown, Randall H et al. (2006) Pairagon+N-SCAN_EST: a model-based gene annotation pipeline. Genome Biol 7 Suppl 1:S5.1-10
Flicek, Paul; Brent, Michael R (2006) Using several pair-wise informant sequences for de novo prediction of alternatively spliced transcripts. Genome Biol 7 Suppl 1:S8.1-9
Gross, Samuel S; Brent, Michael R (2006) Using multiple alignments to improve gene prediction. J Comput Biol 13:379-93
van Baren, Marijke J; Brent, Michael R (2006) Iterative gene prediction and pseudogene removal improves genome annotation. Genome Res 16:678-85
Wei, Chaochun; Brent, Michael R (2006) Using ESTs to improve the accuracy of de novo gene prediction. BMC Bioinformatics 7:327

Showing the most recent 10 out of 21 publications