We developed a suite of four transcript prediction algorithms collectively called "FEAST" (Fast Empirical Algorithms Suggesting Transcripts), which are conceptually independent of the two established classes of gene discovery algorithms, namely "ab initio" and database search methods. The main goals of this proposal are (1) to further develop this independent third class of gene prediction algorithms, (2) to apply them to the identification of novel genes in the genome, and (3) to test the hypothesis that non-coding transcripts are prevalent in the genome and are the medium for the expression of small RNA genes and other functional genomic elements. We will extend the statistical model and develop the software towards a fully integrated gene prediction tool capable of discovering genes in genomic sequences of one species, or in multiple species simultaneously for higher precision. We will use the new tool to produce a comprehensive catalog of predicted genes. This is the genetic "parts list" that is required for the construction of metabolic and regulatory models of cell function. We will correlate the transcript predictions with expression data from hybridization array technology, and validate novel genes experimentally by RT-PCR and sequencing. We identified an unusual class of genes (which we call "stencil" genes) in which the exons play no other role than the production of introns as precursor material for deriving one or more functional RNA molecules, such as miRNAs and snoRNAs. We will put special emphasis on obtaining a comprehensive catalog of such "stencil" genes and will computationally study their prevalence, their modes of regulation, and how they evolve. We expect many of the novel transcripts to be central to the genetic regulation of development, and therefore of direct importance to cancer research.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section: Special Emphasis Panel (ZRG1-BST-E (51))
Program Officer: Lyster, Peter
Institute for Systems Biology
United States
Caballero, Juan; Smit, Arian F A; Hood, Leroy et al. (2014) Realistic artificial DNA sequences as negative controls for computational genomics. Nucleic Acids Res 42:e99
Roach, Jared C; Glusman, Gustavo; Hubley, Robert et al. (2011) Chromosomal haplotypes by genetic phasing of human families. Am J Hum Genet 89:382-97
Glusman, Gustavo; Caballero, Juan; Mauldin, Denise E et al. (2011) Kaviar: an accessible system for testing SNV novelty. Bioinformatics 27:3216-7
Roach, Jared C; Glusman, Gustavo; Smit, Arian F A et al. (2010) Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328:636-9
Dishaw, Larry J; Mueller, M Gail; Gwatney, Natasha et al. (2008) Genomic complexity of the variable region-containing chitin-binding proteins in amphioxus. BMC Genet 9:78