The goal of this proposal is to characterize the gene content of the ENCODE regions. This means delineating one complete mRNA sequence for at least one splice isoform of each protein-coding gene in the ENCODE regions, and inferring a number of additional alternative splice forms, either complete or partial. The proposal builds on the complementary strengths of a team with unique expertise in computational gene prediction, experimental verification of DNA functional domains, and genome annotation systems, a team that has already proven successful in designing efficient high-throughput mammalian gene identification systems. Complementary to other undirected large-scale gene characterization projects, our proposal emphasizes a targeted approach in which computational gene predictions guide the subsequent experimental verification. In this way, genes and exonic variants likely to be underrepresented in the current catalog of human genes can be specifically targeted. These include: short and intronless genes; genes undergoing non-canonical splicing; selenoprotein genes (genes in which the TGA stop codon is translated as a selenocysteine residue); genes with unusual codon composition that may be expressed at very low levels or with a very restricted pattern; and human-specific genes and very rapidly evolving genes, whose homologues either do not exist in other species or are difficult to identify. Our strategy draws on a variety of existing computational and experimental techniques, often applied in novel ways. Among these techniques, those that take advantage of the conservation of characteristic features between human genes and their orthologs in other vertebrate species will play an essential role. By the end of the ENCODE project, we expect our strategy to be implemented in a largely automated pipeline that can be efficiently applied to the analysis of the entire human genome sequence.