A steadily increasing proportion of biomedical research is conducted on organisms for which genomic sequence is available. For many research questions, however, a genome is important primarily because of its protein products. Thus, a critical question in genome analysis is: What are the structures of all the genes and the exact amino acid sequences of their translation products? Despite significant contributions from experimental biology, high-throughput sequencing, and bioinformatics, we are still far from being able to answer this question accurately. The proposed research aims to improve gene-structure prediction by exploiting patterns of evolutionary conservation.
Aim 1 focuses on developing probability models for exploiting genomic homology to improve gene-structure prediction. A novel aspect of the proposed models is their use of "conservation sequence" to represent the degree and pattern of evolutionary conservation at each nucleotide in the genome to be annotated. A conservation sequence is a synthesis of potentially overlapping local alignments into one sequence. Our probability models build on the Hidden Markov Model (HMM) approach used in state-of-the-art gene-structure prediction systems.
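As an illustration only, the idea of collapsing overlapping local alignments into one per-nucleotide conservation sequence might be sketched as follows. The abstract does not specify the symbol alphabet or how overlaps are resolved, so the three symbols (unaligned, match, mismatch), the gapless alignment representation, the rule that the highest-scoring alignment wins at each position, and all names in the code are assumptions made for this sketch.

```python
from typing import List, NamedTuple


class LocalAlignment(NamedTuple):
    """Hypothetical gapless local alignment against the target genome."""
    start: int       # 0-based offset of the alignment in the target
    target: str      # aligned target bases
    informant: str   # aligned informant bases, same length as target
    score: float     # alignment score, used only to resolve overlaps


# Assumed three-symbol alphabet; the abstract does not define one.
UNALIGNED, MATCH, MISMATCH = ".", "|", ":"


def conservation_sequence(target_len: int,
                          alignments: List[LocalAlignment]) -> str:
    """Collapse possibly overlapping alignments into one symbol per base."""
    cons = [UNALIGNED] * target_len
    # Assumption: where alignments overlap, the higher-scoring one wins,
    # so process alignments from best to worst and never overwrite.
    for aln in sorted(alignments, key=lambda a: a.score, reverse=True):
        for offset, (t, i) in enumerate(zip(aln.target, aln.informant)):
            pos = aln.start + offset
            if 0 <= pos < target_len and cons[pos] == UNALIGNED:
                cons[pos] = MATCH if t == i else MISMATCH
    return "".join(cons)


alignments = [
    LocalAlignment(start=1, target="ACGT", informant="ACTT", score=10.0),
    LocalAlignment(start=3, target="GTAA", informant="CTAA", score=5.0),
]
print(conservation_sequence(10, alignments))
```

A sequence of this kind has one symbol per target nucleotide, so an HMM could in principle emit it in parallel with the DNA sequence itself; how the proposed models actually do this is not described in the abstract.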
Aim 2 focuses on developing probability models for improving gene prediction by exploiting cDNA and EST alignments. The fundamental approach is similar to that of Aim 1. The most important new question is how to combine information from transcript alignments with information from genomic homology in a way that does not count the same evidence twice.
Aim 3 focuses on analysis of vertebrate genomes using homology from multiple vertebrate genomes. The best method is expected to depend on the evolutionary distances among the genomes. Our investigations will focus on gene-structure prediction in human, with homology provided by (a) both mouse and pufferfish, and (b) both mouse and rat. Our gene-structure predictions will be provided to the research community through our web site and that of our collaborators at Ensembl. The ability to predict complete gene structures reliably would constitute significant progress in high-throughput biology. Potential biomedical applications include (1) identifying novel protein families that could serve as drug targets, and (2) accelerating positional cloning projects for the identification of disease-related genes.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
1R01HG002278-01A2
Application #
6473279
Study Section
Genome Study Section (GNM)
Program Officer
Good, Peter J
Project Start
2002-04-19
Project End
2005-03-31
Budget Start
2002-04-19
Budget End
2003-03-31
Support Year
1
Fiscal Year
2002
Total Cost
$398,500
Indirect Cost
Name
Washington University
Department
Biostatistics & Other Math Sci
Type
Schools of Engineering
DUNS #
062761671
City
Saint Louis
State
MO
Country
United States
Zip Code
63130
Brent, Michael R (2008) Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet 9:62-73
Tenney, Aaron E; Wu, Jia Qian; Langton, Laura et al. (2007) A tale of two templates: automatically resolving double traces has many applications, including efficient PCR-based elucidation of alternative splices. Genome Res 17:212-8
Keibler, Evan; Arumugam, Manimozhiyan; Brent, Michael R (2007) The Treeterbi and Parallel Treeterbi algorithms: efficient, optimal decoding for ordinary, generalized and pair HMMs. Bioinformatics 23:545-54
van Baren, Marijke J; Koebbe, Brian C; Brent, Michael R (2007) Using N-SCAN or TWINSCAN to predict gene structures in genomic DNA sequences. Curr Protoc Bioinformatics Chapter 4:Unit 4.8
Brent, Michael R (2007) How does eukaryotic gene prediction work? Nat Biotechnol 25:883-5
Arumugam, Manimozhiyan; Wei, Chaochun; Brown, Randall H et al. (2006) Pairagon+N-SCAN_EST: a model-based gene annotation pipeline. Genome Biol 7 Suppl 1:S5.1-10
Flicek, Paul; Brent, Michael R (2006) Using several pair-wise informant sequences for de novo prediction of alternatively spliced transcripts. Genome Biol 7 Suppl 1:S8.1-9
Gross, Samuel S; Brent, Michael R (2006) Using multiple alignments to improve gene prediction. J Comput Biol 13:379-93
van Baren, Marijke J; Brent, Michael R (2006) Iterative gene prediction and pseudogene removal improves genome annotation. Genome Res 16:678-85
Wei, Chaochun; Brent, Michael R (2006) Using ESTs to improve the accuracy of de novo gene prediction. BMC Bioinformatics 7:327

Showing the most recent 10 out of 21 publications