The first human genome sequencing efforts are complete, which has opened the door to many new and challenging questions. Among these are the number and location of genes in the genome, both of which have proven surprisingly difficult to pinpoint. Even further from a definitive answer is the question of how many distinct functional RNA and protein products each gene produces through mechanisms such as alternative splicing. These unanswered questions impede a full understanding of the genome and how it functions in relation to human disease. We propose innovative software technology that has the potential to help overcome this obstacle by using mass spectrometry measurements of proteins to reveal the location and structure of the genes encoding those proteins within the genome. This technology can be applied to help answer several critical questions. For example, where are all the genes located in the genome? What are their exon-intron structures? How many distinct products do they encode?
We propose to modify and combine the already proven software programs TWINSCAN and GFS, which were developed by our labs for genomic and proteomic purposes, respectively, to address these new challenges in genome analysis. TWINSCAN is a highly accurate, automated gene finder, and GFS is a proteomics tool that matches mass spectrometry (MS) peptide data from enzymatically digested proteins directly to raw (even unfinished) genome sequence, identifying the coding loci for the proteins. Here, we propose a two-pronged approach to produce a novel, protein-based method for finding genes and determining their structure.
Our aims comprise the following: a) extending GFS for automated use with multi-exon genes and very large genomes, to facilitate discovery of novel genes and gene structures; b) modifying TWINSCAN to use peptide data from GFS to enhance its rapid, automated gene-finding capabilities; c) combining the two programs into an automated protein-based gene finder; and d) validating the approach for gene finding using synthetic and experimental data sets.