High Throughput Annotation of Genomic DNA Sequence

Stoeckert, Christian

Abstract

Now that a working draft sequence of the human genome is in hand and an effort is under way to provide a draft of the mouse genome, the challenge is to identify the genes encoded by these genomes. Several efforts are under way in this regard, including our own, which uses ab initio gene finders and transcribed sequences in the form of mRNAs and ESTs. Gene prediction is the first step in identifying genes. Additional steps are to predict the function of those genes and to associate other information, such as where (and when) the gene might be expressed. The goal of the proposed project is to provide a public database that will serve as a central repository of gene predictions and associated annotation. The project will provide data integration such that predictions and annotations for the same gene (as defined by co-localizing to the same genomic location) will be linked. Associated annotation will be extended to include functional predictions and expression profiles. The intended users of the database are researchers seeking to extend their knowledge of a gene starting with an expression profile, a cDNA, or a genetic locus, or to search generally for candidate genes. The prototype annotation framework for genomic sequence (GAIA) has been combined with prototypes for a gene index of ESTs and mRNAs (DoTS) and for gene integration (EpoDB). The result is a database based on a global schema, GUS, that integrates sequence-centered entries from GenBank, dbEST, and SWISS-PROT and transforms the entries into gene-centered entities. This process includes data cleansing and adding value through annotation of the resultant genes (mRNAs and proteins). A first pass of this resource is on-line with ad hoc boolean queries and integrated visual tools at www.allgenes.org. The resource will provide an integrated set of known and predicted genes from GenBank, gene finders, and assembled ESTs and mRNAs.
Ontologies will be used to structure the annotations of biological concepts and gene function. Gene expression information will be augmented with data from RAD (the RNA Abundance Database). No other public resource of this nature currently exists. The resource will be kept current through periodic updates every 2-3 months. The updates will include integration of previously annotated genes with newly available GenBank and dbEST entries and recalculation of gene similarities, gene location, tissue distribution, and gene function. An annotation interface has been developed to complement and extend computational analysis through manual assessment of predictions for genes and their functions. Radiation hybrid mapping data for mouse sequences will be incorporated, as has been done for human ESTs. Links between the genes in GUS and gene expression data in RAD will be established. In response to requests from users of the allgenes.org site, queries will be added to the web interface and bulk files will be provided. On-demand annotation of new contigs is also planned.
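
The abstract defines "the same gene" as predictions that co-localize to the same genomic location. As a rough illustration of that linking step, the sketch below groups predictions from different sources into clusters when their coordinates overlap. It is not the GUS implementation: the `Prediction` record, the `link_colocalized` function, the source labels, and the rule that co-localization means overlap on the same chromosome and strand are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Prediction:
    """Hypothetical record for one gene prediction or transcript alignment."""
    source: str   # illustrative labels, e.g. "gene_finder", "EST_assembly"
    chrom: str
    strand: str   # "+" or "-"
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive


def link_colocalized(predictions):
    """Group predictions that overlap on the same chromosome and strand.

    Assumption: two predictions describe the same gene if their intervals
    overlap, directly or through a chain of overlapping predictions.
    """
    ordered = sorted(predictions, key=lambda p: (p.chrom, p.strand, p.start, p.end))
    clusters = []
    for _, group in groupby(ordered, key=lambda p: (p.chrom, p.strand)):
        current, current_end = [], None
        for p in group:
            # A gap between the running cluster and the next prediction
            # closes the cluster and starts a new one.
            if current and p.start > current_end:
                clusters.append(current)
                current, current_end = [], None
            current.append(p)
            current_end = p.end if current_end is None else max(current_end, p.end)
        if current:
            clusters.append(current)
    return clusters


# Made-up coordinates, for illustration only.
predictions = [
    Prediction("gene_finder", "chr5", "+", 100, 900),
    Prediction("EST_assembly", "chr5", "+", 850, 1500),
    Prediction("GenBank_mRNA", "chr5", "-", 200, 800),
    Prediction("gene_finder", "chr5", "+", 5000, 6000),
]
clusters = link_colocalized(predictions)
for cluster in clusters:
    print([(p.source, p.start, p.end) for p in cluster])
```

Here the first two plus-strand predictions fall into one cluster because they overlap, while the minus-strand mRNA and the distant gene-finder prediction each form their own. A production pipeline would likely need stricter criteria (for example, exon-level overlap), which the abstract does not specify.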

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG001539-06
Application #: 6649208
Study Section: Special Emphasis Panel (ZRG1-SSS-Y (04))
Program Officer: Good, Peter J

Project Start: 1997-02-12
Project End: 2004-08-31
Budget Start: 2003-09-01
Budget End: 2004-08-31
Support Year: 6
Fiscal Year: 2003
Total Cost: $580,163
Indirect Cost

Institution

Name: University of Pennsylvania
Department: Genetics
Type: Schools of Medicine
DUNS #: 042250712

City: Philadelphia
State: PA
Country: United States
Zip Code: 19104

Related projects


NIH 2003 R01 HG	High Throughput Annotation of Genomic DNA Sequence	Stoeckert, Christian J. / University of Pennsylvania	$580,163
NIH 2002 R01 HG	High Throughput Annotation of Genomic DNA Sequence	Stoeckert, Christian J. / University of Pennsylvania	$566,554
NIH 2001 R01 HG	High Throughput Annotation of Genomic DNA Sequence	Stoeckert, Christian J. / University of Pennsylvania	$557,847
NIH 2000 R01 HG	High Throughput Annotation of Genomic DNA Sequence	Stoeckert, Christian J. / University of Pennsylvania	$376,401
NIH 1999 R01 HG	High Throughput Annotation of Genomic DNA Sequence	Overton, G. / University of Pennsylvania
NIH 1998 R01 HG	High Throughput Annotation of Genomic DNA Sequence	Overton, G. / University of Pennsylvania
NIH 1998 R01 HG	High Throughput Annotation of Genomic DNA Sequence	Overton, G. / University of Pennsylvania
NIH 1997 R01 HG	High Throughput Annotation of Genomic DNA Sequence	Overton, G. / University of Pennsylvania

Publications

Mazzarelli, Joan M; White, Peter; Gorski, Regina et al. (2006) Novel genes identified by manual annotation and microarray expression analysis in the pancreas. Genomics 88:752-61

Schug, Jonathan; Schuller, Winfried-Paul; Kappen, Claudia et al. (2005) Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biol 6:R33

Ananko, E A; Podkolodny, N L; Stepanenko, I L et al. (2005) GeneNet in 2005. Nucleic Acids Res 33:D425-7

Jones, Andrew; Hunt, Ela; Wastling, Jonathan M et al. (2004) An object model and database for functional genomics. Bioinformatics 20:1583-90

Manduchi, E; Grant, G R; He, H et al. (2004) RAD and the RAD Study-Annotator: an approach to collection, organization and exchange of all relevant information for high-throughput gene expression studies. Bioinformatics 20:452-9

Levitsky, Victor G; Katokhin, Alexey V (2003) Recognition of eukaryotic promoters using a genetic algorithm based on iterative discriminant analysis. In Silico Biol 3:81-7

Grant, G R; Manduchi, E; Pizarro, A et al. (2003) Maintaining data integrity in microarray data management. Biotechnol Bioeng 84:795-800

Schug, Jonathan; Diskin, Sharon; Mazzarelli, Joan et al. (2002) Predicting gene ontology functions from ProDom and CDD protein domains. Genome Res 12:648-55

Crabtree, J; Wiltshire, T; Brunk, B et al. (2001) High-resolution BAC-based map of the central portion of mouse chromosome 5. Genome Res 11:1746-57

Overton, G C; Bailey, C; Crabtree, J et al. (1998) The GAIA software framework for genome annotation. Pac Symp Biocomput :291-302

Showing the most recent 10 out of 13 publications
