High-throughput genomic sequencing efforts must be accompanied by high-throughput, cost-effective sequence annotation to fully realize the value of the data. Annotation encompasses the identification and archiving of putative biological signals, sequence characteristics, and features, including genes, and, wherever cost-effective, the further experimental characterization of those features. While one might hope that annotation could be entirely computational, and thus inexpensive and rapid, computational predictions, especially of gene models, must ultimately be confirmed experimentally, both as an additional, independent validation of the genomic sequence data and as a means to establish the firm foundation necessary to simplify and accelerate future biological research. The proposed work integrates computational and experimental approaches, creating a test-bed and ultimately a production system for high-throughput, high-information-gain annotation. It is designed as an open system in which new computational and experimental components, and new scientific visualization tools, can be easily installed and maintained within the data management and analysis framework. Experimental annotation will rely on streamlined, targeted versions of standard techniques, including single-pass sequencing of cDNAs selected from EST hits of genomic DNA, RT-PCR across inter- and intra-regions of putative genes, and dot-blots of plasmid DNA used in genomic sequencing against labeled mRNA. The three basic goals of experimental annotation are to 1) establish laboratory protocols, management structures, and automation techniques for high-throughput experimental annotation; 2) validate and refine computational annotation, especially for gene model finders such as GRAIL; and 3) extract high-information-gain data, for example by concentrating single-pass cDNA sequencing efforts on ESTs from unknown gene classes, to extend the sequence similarity databases and computational gene finders.
In its initial phase, development of the system infrastructure will be tightly coupled to ongoing high-throughput sequencing at the University of Oklahoma with the goal of transitioning the technology for deployment to the genomics research community at large.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section: Genome Study Section (GNM)
University of Pennsylvania
Schools of Medicine
United States
Mazzarelli, Joan M; White, Peter; Gorski, Regina et al. (2006) Novel genes identified by manual annotation and microarray expression analysis in the pancreas. Genomics 88:752-61
Schug, Jonathan; Schuller, Winfried-Paul; Kappen, Claudia et al. (2005) Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biol 6:R33
Ananko, E A; Podkolodny, N L; Stepanenko, I L et al. (2005) GeneNet in 2005. Nucleic Acids Res 33:D425-7
Jones, Andrew; Hunt, Ela; Wastling, Jonathan M et al. (2004) An object model and database for functional genomics. Bioinformatics 20:1583-90
Manduchi, E; Grant, G R; He, H et al. (2004) RAD and the RAD Study-Annotator: an approach to collection, organization and exchange of all relevant information for high-throughput gene expression studies. Bioinformatics 20:452-9
Levitsky, Victor G; Katokhin, Alexey V (2003) Recognition of eukaryotic promoters using a genetic algorithm based on iterative discriminant analysis. In Silico Biol 3:81-7
Grant, G R; Manduchi, E; Pizarro, A et al. (2003) Maintaining data integrity in microarray data management. Biotechnol Bioeng 84:795-800
Schug, Jonathan; Diskin, Sharon; Mazzarelli, Joan et al. (2002) Predicting gene ontology functions from ProDom and CDD protein domains. Genome Res 12:648-55
Crabtree, J; Wiltshire, T; Brunk, B et al. (2001) High-resolution BAC-based map of the central portion of mouse chromosome 5. Genome Res 11:1746-57
Bailey Jr, L C; Searls, D B; Overton, G C (1998) Analysis of EST-driven gene annotation in human genomic sequence. Genome Res 8:362-76
