De novo gene prediction is the automated identification of gene structures using genome sequences as the only inputs. We propose to continue a project that has significantly improved the accuracy of de novo gene prediction in vertebrates. When we started, GENSCAN predicted a correct exon-intron structure throughout one open reading frame (ORF) at only 10% of human gene loci. We have now published systems that predict a correct ORF at 35% of human loci. RT-PCR and sequencing of our predictions have verified hundreds of new human genes. With this renewal we aim to continue driving improvements in the accuracy of vertebrate gene prediction and its utility for biomedical applications.
Aim 1: Improve the accuracy of gene structure prediction in vertebrates.
A. Develop improved models of informative patterns in multi-genome alignments. Comparing the sequences of multiple vertebrate genomes should allow us to estimate the degree and pattern of selection at each site, leading to more accurate gene predictions. We propose a robust approach based on learning the patterns that exist in real alignment columns, even if they are due in part to sequencing, alignment, and assembly errors. The proposed model is a generalization of our successful TWINSCAN gene predictor. In preliminary studies its accuracy surpassed that of any previous gene prediction system for human.
B. Develop improved models of informative patterns in the target DNA sequence. We propose to systematically model regularities in gene structure that were previously considered too rare or elusive to be worthy of attention, such as splicing enhancers and suppressors, correlations between intron length and splice site sequence, and differential patterns of repeat insertion in introns versus non-transcribed regions.
Aim 2: Develop and maintain software, a web server, and genome annotations. Our goal is to improve scientific understanding and human health by providing more accurate gene predictions to the biomedical research community. Therefore, we will develop high-quality, open-source software, parameter sets for a variety of genomes, and a web server where users can submit sequences for annotation. Finally, we will distribute and display annotations for each new assembly of every vertebrate genome.
This project will result in open-source software that predicts exon-intron structures in vertebrate genomes more accurately than any current system. It will also increase the sensitivity and specificity of gene verification by RT-PCR.