The first human genome sequencing efforts are complete, which has opened the door to many new and challenging questions. Among these are the number and location of genes in the genome, both of which have proven surprisingly difficult to pinpoint. Even further from a definitive answer is the question of how many distinct functional RNA and protein products each gene produces through mechanisms such as alternative splicing. These unanswered questions impede a full understanding of the genome and how it functions in relation to human disease. We propose innovative software technology that has the potential to help overcome this obstacle, using mass spectrometry measurements of proteins to reveal the location and structure of the genes that encode those proteins within the genome. This technology can be applied to help answer several critical questions. For example, where are all the genes located in the genome? What are their exon-intron structures? How many distinct products do they encode?

We propose to modify and combine the already proven software programs TWINSCAN and GFS, which were developed by our labs for genomic and proteomic purposes, respectively, to address these new challenges in genome analysis. TWINSCAN is a highly accurate, automated gene finder, and GFS is a proteomics tool that matches mass spectrometry (MS) peptide data from enzymatically digested proteins directly to raw (even unfinished) genome sequence, identifying the coding loci for the proteins. Here, we propose a two-pronged approach to produce a novel, protein-based method for finding genes and determining their structure.

Our aims comprise the following: a) extending GFS for automated use with multi-exon genes and very large genomes, to facilitate discovery of novel genes and gene structures; b) modifying TWINSCAN to use peptide data from GFS to enhance its rapid, automated gene-finding capabilities; c) combining the two programs into an automated protein-based gene finder; and d) validating the approach for gene finding using synthetic and experimental data sets.
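To make the idea of mapping peptides directly onto raw genome sequence concrete, the following is a minimal illustrative sketch, not the GFS implementation. It assumes the peptides have already been reduced to amino acid sequences, and it swaps in exact string matching against a six-frame translation in place of the mass-based matching that a tool working from MS data would perform. All function names (`reverse_complement`, `translate`, `find_peptide_loci`) are hypothetical and chosen for illustration only.

```python
from typing import List, Tuple

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str, frame: int) -> str:
    """Translate one reading frame; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(frame, len(dna) - 2, 3)
    )


def find_peptide_loci(
    genome: str, peptides: List[str]
) -> List[Tuple[str, str, int, int, int]]:
    """Locate each peptide in all six reading frames of the genome.

    Returns (peptide, strand, frame, start, end) tuples, where start and end
    are 0-based, end-exclusive coordinates on the forward strand, and frame
    is the offset within the strand that was translated.
    Assumption: exact sequence matching stands in for mass-based matching.
    """
    genome = genome.upper()
    n = len(genome)
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq, frame)
            for peptide in peptides:
                idx = protein.find(peptide)
                while idx != -1:
                    start = frame + 3 * idx
                    end = start + 3 * len(peptide)
                    if strand == "-":
                        # Convert reverse-strand offsets to forward coordinates.
                        start, end = n - end, n - start
                    hits.append((peptide, strand, frame, start, end))
                    idx = protein.find(peptide, idx + 1)
    return hits
```

For example, calling `find_peptide_loci("CCATGGCTAAAGGTTT", ["MAKG"])` reports a single forward-strand hit spanning positions 2 to 14. A sketch like this handles only peptides that lie within a single exon; peptides spanning splice junctions, which the multi-exon aim above is concerned with, would need additional logic.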

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG003700-02
Application #
7122546
Study Section
Special Emphasis Panel (ZRG1-BDMA (01))
Program Officer
Good, Peter J
Project Start
2005-09-16
Project End
2008-06-30
Budget Start
2006-07-01
Budget End
2007-06-30
Support Year
2
Fiscal Year
2006
Total Cost
$402,318
Indirect Cost
Name
University of North Carolina Chapel Hill
Department
Microbiology/Immun/Virology
Type
Schools of Medicine
DUNS #
608195277
City
Chapel Hill
State
NC
Country
United States
Zip Code
27599
Risk, Brian A; Edwards, Nathan J; Giddings, Morgan C (2013) A peptide-spectrum scoring system based on ion alignment, intensity, and pair probabilities. J Proteome Res 12:4240-7
Risk, Brian A; Spitzer, Wendy J; Giddings, Morgan C (2013) Peppy: proteogenomic search software. J Proteome Res 12:3019-25
Su, Hsun-Cheng; Khatun, Jainab; Kanavy, Dona M et al. (2013) Comparative genome analysis of ciprofloxacin-resistant Pseudomonas aeruginosa reveals genes within newly identified high variability regions associated with drug resistance development. Microb Drug Resist 19:428-36
Khatun, Jainab; Yu, Yanbao; Wrobel, John A et al. (2013) Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions. BMC Genomics 14:141
Djebali, Sarah; Davis, Carrie A; Merkel, Angelika et al. (2012) Landscape of transcription in human cells. Nature 489:101-8
ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489:57-74
Miller, Jameson; Parker, Miles; Bourret, Robert B et al. (2010) An agent-based model of signal transduction in bacterial chemotaxis. PLoS One 5:e9454
Maier, Christopher W; Long, Jeffrey G; Hemminger, Bradley M et al. (2009) Ultra-Structure database design methodology for managing systems biology data and analyses. BMC Bioinformatics 10:254
Giddings, Morgan C (2008) On the process of becoming a great scientist. PLoS Comput Biol 4:e33
Khatun, Jainab; Hamlett, Eric; Giddings, Morgan C (2008) Incorporating sequence information into the scoring function: a hidden Markov model for improved peptide identification. Bioinformatics 24:674-81

Showing the most recent 10 out of 15 publications