Genome sequencing efforts are producing ever greater quantities of raw DNA sequence, but the annotation process for locating genetic elements and determining their function has not kept up. While many aspects of annotation are difficult, it is particularly challenging to determine which parts of a genome sequence encode proteins, and therefore how the processes leading to protein translation are regulated. Technologies for examining proteins are more limited than those for studying RNA transcription, and an extensive study of transcription by the Encyclopedia of DNA Elements (ENCODE) consortium revealed a picture of great complexity. The project uncovered many novel exons, alternative splice forms, and novel regulatory elements. These results indicate that nearly nine-tenths of human genes undergo alternative splicing, and that the average gene produces approximately six splice variants. Rather than solidifying knowledge regarding the location and function of genes, these results call into question whether we accurately know what constitutes a gene and how the products encoded by genes determine the function of cells. The results particularly obscure which transcripts are selected for translation into protein, further complicating annotation efforts. To address that gap, our project will determine which transcripts encode proteins, and how these are affected in several tissue types and disease conditions. We will use large tandem mass spectrometry-based proteomic data sets, mapping the analyzed protein data directly to several available human genome sequences, along with sets of predicted transcripts produced by the N-SCAN and CONTRAST gene finders, to reveal which parts of transcripts are translated into proteins, and in which types of cells this translation occurs.
To accomplish this, our project has three specific aims: 1) to develop high-accuracy methods and software for mapping proteomic data from proteins analyzed by mass spectrometry directly to the genomic locus encoding them; 2) to develop an analysis pipeline software system using a novel rule-based information management approach; and 3) to apply these developments to the high-throughput analysis of large proteomic data sets, identifying the transcripts that encode proteins in distinct tissue types and disease conditions, and placing the results in a publicly accessible track in the UCSC genome browser. We believe this project will yield significant knowledge about the location and timing of protein translation in cells, which will enable further investigation of how misregulation of the path from transcription to translation leads to human disease conditions.
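
The idea behind the first aim, mapping a peptide back to the genomic locus that encodes it, can be illustrated with a six-frame translation search. The following is a minimal Python sketch and not the project's software: it assumes the peptide sequence is already known and uses exact string matching, whereas a real pipeline would score tandem mass spectra against candidate peptides and handle splicing. The function names and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def map_peptide(genome, peptide):
    """Find exact peptide matches in all six reading frames.

    Returns (strand, start, end) tuples in 0-based, half-open
    coordinates on the forward strand.
    """
    n = len(genome)
    hits = []
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            protein = translate(seq[frame:])
            pos = protein.find(peptide)
            while pos != -1:
                start = frame + 3 * pos
                end = start + 3 * len(peptide)
                if strand == "-":
                    # Convert reverse-strand offsets to forward coordinates.
                    start, end = n - end, n - start
                hits.append((strand, start, end))
                pos = protein.find(peptide, pos + 1)
    return hits


# Hypothetical toy locus: the peptide MKT is encoded by ATG AAA ACC.
toy_genome = "CCATGAAAACCTAG"
print(map_peptide(toy_genome, "MKT"))
```

Reporting forward-strand coordinates for both strands keeps the output in a form that could be written as intervals for a genome browser track, which is the kind of output the third aim describes.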

Public Health Relevance

Sequencing of the human genome is complete, but the work of figuring out where genes are located, how they function, and how they cause or prevent human diseases like cancer has only just begun. Genes act as blueprints for RNA and proteins, the workhorses of the cell. We are developing technologies to address the key challenges of determining which genes specify the building of which proteins and how this process is orchestrated, with the ultimate goal of unraveling how disease processes occur.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG003700-05
Application #
7802061
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Good, Peter J
Project Start
2005-09-16
Project End
2011-01-31
Budget Start
2010-04-01
Budget End
2011-01-31
Support Year
5
Fiscal Year
2010
Total Cost
$435,435
Indirect Cost
Name
University of North Carolina Chapel Hill
Department
Microbiology/Immun/Virology
Type
Schools of Medicine
DUNS #
608195277
City
Chapel Hill
State
NC
Country
United States
Zip Code
27599
Risk, Brian A; Edwards, Nathan J; Giddings, Morgan C (2013) A peptide-spectrum scoring system based on ion alignment, intensity, and pair probabilities. J Proteome Res 12:4240-7
Risk, Brian A; Spitzer, Wendy J; Giddings, Morgan C (2013) Peppy: proteogenomic search software. J Proteome Res 12:3019-25
Su, Hsun-Cheng; Khatun, Jainab; Kanavy, Dona M et al. (2013) Comparative genome analysis of ciprofloxacin-resistant Pseudomonas aeruginosa reveals genes within newly identified high variability regions associated with drug resistance development. Microb Drug Resist 19:428-36
Khatun, Jainab; Yu, Yanbao; Wrobel, John A et al. (2013) Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions. BMC Genomics 14:141
Djebali, Sarah; Davis, Carrie A; Merkel, Angelika et al. (2012) Landscape of transcription in human cells. Nature 489:101-8
ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489:57-74
Miller, Jameson; Parker, Miles; Bourret, Robert B et al. (2010) An agent-based model of signal transduction in bacterial chemotaxis. PLoS One 5:e9454
Maier, Christopher W; Long, Jeffrey G; Hemminger, Bradley M et al. (2009) Ultra-Structure database design methodology for managing systems biology data and analyses. BMC Bioinformatics 10:254
Giddings, Morgan C (2008) On the process of becoming a great scientist. PLoS Comput Biol 4:e33
Khatun, Jainab; Hamlett, Eric; Giddings, Morgan C (2008) Incorporating sequence information into the scoring function: a hidden Markov model for improved peptide identification. Bioinformatics 24:674-81

Showing the most recent 10 out of 15 publications