The specific aim of this proposal is to annotate all evidence-based gene features on the human genome reference sequence at high accuracy. This includes identifying all protein-coding loci with their associated alternative variants, non-coding loci that have transcript evidence available in the public nucleotide databases (NCBI/EMBL/DDBJ), and pseudogenes. To achieve this goal we will integrate computational approaches, including recent comparative methods; expert manual annotation, which can incorporate information from the literature; and targeted experimental approaches. Based on the exhaustive experimental and computational investigation of our initial GENCODE annotation of the ENCODE regions, we are confident that we can deliver a gene set with high specificity and sensitivity that will provide critical information to other biologists and other ENCODE groups. As part of this process we will label all apparent gene loci clearly, classifying them according to their likely current functional status, so that users are informed where regions that appear gene-like are most likely pseudogenes or where transcript evidence is most likely artefactual. A number of motivated groups are working on defining protein-coding genes for the human genome. This proposal includes most of these groups and coordinates with the other key groups. Critically, all the groups bring extensive experience of data integration and evaluation, which allows annotation discrepancies to be resolved by multiple approaches. This gives us confidence that through this integrated project we will be able to eliminate many of the remaining uncertainties about the precise location of genes, their component exons, and their transcript structures in the human genome. Genome-wide, highly accurate transcript definition will be of enormous value to the many researchers working on the human genome.
It will both yield large cost savings worldwide, through increased specificity of reagent design, and provide a more complete view of human genes, in particular those associated with disease. This foundation will support more accurate descriptions of the genetic causes of disease.