Genome sequencing efforts are producing ever greater quantities of raw DNA sequence, but the annotation process for locating and determining the function of genetic elements has not kept up. While many aspects of annotation are difficult, it is particularly challenging to determine which parts of a genome sequence encode proteins, and therefore how the processes leading to protein translation are regulated. Not only are technologies for examining proteins more limited than those for studying RNA transcription, in an extensive study of transcription by the Encyclopedia of DNA elements consortium, a picture of great complexity emerged. The project uncovered many novel exons, alternative splice forms, and novel regulatory elements. These results indicate that nearly 9/10ths of human genes undergo alternative splicing, and the average gene produces approximately 6 splice variants. Rather than solidify knowledge regarding the location and function of genes, these results question whether we accurately know what constitutes a gene, and how the products encoded by genes determine the function of cells. The results particularly obfuscate determination of which transcripts are selected for translation to protein, further complicating annotation efforts. To address that gap, our project will determine which transcripts encode proteins, and how these are affected in several tissue types and disease conditions. We will use large tandem mass spectrometry-based proteomic data sets, mapping the analyzed protein data directly to several available human genome sequences, along with sets of predicted transcripts produced by the N-SCAN and CONTRAST gene finders, to reveal which parts of transcripts are translated into proteins, and in which types of cells this translation occurs. To accomplish this, our project has three specific aims: 1) to develop high-accuracy methods and software for mapping proteomic data from mass spec analyzed proteins directly to the genome locus encoding them;2) to develop an analysis pipeline software system using a novel rule-based information management approach;and 3) to apply these developments for the high-throughput analysis of large proteomic data sets, identifying the transcripts that encode proteins in distinct tissue types and disease conditions, and placing the results in a publicly accessible track in the UCSC genome browser. We believe this project will yield significant knowledge about the location and timing of protein translation in cells, which will potentiate further investigation of how misregulation of the path from transcription to translation leads to human disease conditions.
Sequencing of the human genome is complete, but figuring out where genes are located, how they function, and how they cause or prevent human diseases like cancer has only just begun. Genes act as blueprints for RNA and proteins, the workhorses of the cell. We are developing technologies to address the key challenges of determining which genes specify the building of which proteins and how this process is orchestrated to ultimately unravel how disease processes occur.
Showing the most recent 10 out of 15 publications