The goal of this proposal is to characterize the gene content of the ENCODE regions. This means delineating one complete mRNA sequence for at least one splice isoform of each protein-coding gene in the ENCODE regions, and inferring a number of additional alternative splice forms, either complete or partial. The proposal builds on the complementary strengths of a team with unique expertise in computational gene prediction, experimental verification of DNA functional domains, and genome annotation systems, a team that has already proven successful in designing efficient high-throughput mammalian gene identification systems. Complementary to other undirected large-scale gene characterization projects, our proposal emphasizes a targeted approach in which computational gene predictions guide the subsequent experimental verification. In this way, genes and exonic variants likely to be underrepresented in the current catalog of human genes can be specifically targeted. These include: short and intronless genes; genes undergoing non-canonical splicing; selenoprotein genes (genes in which the TGA stop codon is translated into a selenocysteine residue); genes with unusual codon composition that may be expressed at very low levels or with a very restricted pattern; and human-specific genes and rapidly evolving genes, whose homologues either do not exist in other species or are difficult to identify. Our strategy uses a variety of existing computational and experimental techniques, often applied in novel ways. Among these techniques, those that take advantage of the conservation of characteristic features between human genes and their orthologs in other vertebrate species will play an essential role. By the end of the ENCODE project, we expect our strategy to be implemented in a largely automated pipeline that can be efficiently applied to the analysis of the entire human genome sequence.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project--Cooperative Agreements (U01)
Study Section: Special Emphasis Panel (ZHG1-HGR-P (02))
Program Officer: Good, Peter J
Municipal Institute of Medical Research
Djebali, Sarah; Lagarde, Julien; Kapranov, Philipp et al. (2012) Evidence for transcript networks composed of chimeric RNAs in human cells. PLoS One 7:e28213
Harrow, Jennifer; Nagy, Alinda; Reymond, Alexandre et al. (2009) Identifying protein-coding genes in genomic sequences. Genome Biol 10:201
Djebali, Sarah; Kapranov, Philipp; Foissac, Sylvain et al. (2008) Efficient targeted transcript discovery via array-based normalization of RACE libraries. Nat Methods 5:629-35
Keibler, Evan; Arumugam, Manimozhiyan; Brent, Michael R (2007) The Treeterbi and Parallel Treeterbi algorithms: efficient, optimal decoding for ordinary, generalized and pair HMMs. Bioinformatics 23:545-54
ENCODE Project Consortium (2007) Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447:799-816
Denoeud, France; Kapranov, Philipp; Ucla, Catherine et al. (2007) Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res 17:746-59
Zheng, Deyou; Frankish, Adam; Baertsch, Robert et al. (2007) Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and evolution. Genome Res 17:839-51
Chatterji, Sourav; Pachter, Lior (2007) Patterns of gene duplication and intron loss in the ENCODE regions suggest a confounding factor. Genomics 90:44-8
Washietl, Stefan; Pedersen, Jakob S; Korbel, Jan O et al. (2007) Structured RNAs in the ENCODE selected regions of the human genome. Genome Res 17:852-64
Arumugam, Manimozhiyan; Wei, Chaochun; Brown, Randall H et al. (2006) Pairagon+N-SCAN_EST: a model-based gene annotation pipeline. Genome Biol 7 Suppl 1:S5.1-10

Showing the most recent 10 out of 16 publications