Now that a working draft sequence of the human genome is in hand and an effort is under way to provide a draft of the mouse genome, the challenge is to identify the genes encoded by these genomes. Several efforts are under way in this regard, including our own, which uses ab initio gene finders and transcribed sequences in the form of mRNAs and ESTs. Gene prediction is the first step in identifying genes. Additional steps are to predict the function of those genes and to associate other information with them, such as where (and when) a gene might be expressed. The goal of the proposed project is to provide a public database that serves as a central repository of gene predictions and associated annotation. The project will integrate data so that predictions and annotations for the same gene (defined as those co-localizing to the same genomic location) are linked. Associated annotation will be extended to include functional predictions and expression profiles. The intended users of the database are researchers seeking to extend their knowledge of a gene starting from an expression profile, a cDNA, or a genetic locus, or to search generally for candidate genes. The prototype annotation framework for genomic sequence, GAIA, has been combined with prototypes for a gene index of ESTs and mRNAs, DoTS, and for gene integration, EpoDB. The result is a database based on a global schema, GUS, that integrates sequence-centered entries from GenBank, dbEST, and SWISS-PROT and transforms them into gene-centered entities. This process includes data cleansing and adding value through annotation of the resultant genes (mRNAs and proteins). A first pass of this resource is online, with ad hoc boolean queries and integrated visual tools, at www.allgenes.org. The resource will provide an integrated set of known and predicted genes from GenBank, gene finders, and assembled ESTs and mRNAs.
Ontologies will be used to structure the annotations of biological concepts and gene function. Gene expression information will be augmented with RAD (RNA Abundance Database). No other public resource of this nature currently exists. The resource will be kept current through updates every 2-3 months. The updates will include integration of previously annotated genes with newly available GenBank and dbEST entries and recalculation of gene similarities, gene location, tissue distribution, and gene function. An annotation interface has been developed to complement and extend computational analysis through manual assessment of predictions for genes and their functions. Radiation hybrid mapping data for mouse sequences will be incorporated, as has been done for human ESTs. Links between the genes in GUS and gene expression data in RAD will be established. In response to requests from users of the allgenes.org site, new queries will be added to the web interface and bulk files will be provided. On-demand annotation of new contigs is also planned.
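
The abstract defines predictions as belonging to the same gene when they co-localize to the same genomic location. As a rough illustration only, the sketch below groups predictions whose coordinates overlap on the same chromosome and strand. It is not the GUS or DoTS implementation; the `Prediction` record, its field names, the source labels, and the overlap rule are all assumptions made for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class Prediction(NamedTuple):
    """Hypothetical record for one gene prediction or transcript alignment."""
    source: str   # assumed label, e.g. "gene_finder" or "est_assembly"
    chrom: str
    strand: str   # "+" or "-"
    start: int    # inclusive genomic coordinate
    end: int      # inclusive genomic coordinate


def link_by_location(predictions):
    """Group predictions that overlap on the same chromosome and strand.

    Assumption: any coordinate overlap counts as co-localization.
    """
    by_locus = defaultdict(list)
    for p in predictions:
        by_locus[(p.chrom, p.strand)].append(p)

    groups = []
    for key in sorted(by_locus):
        current, current_end = [], None
        # Sweep left to right, extending the current group while intervals overlap.
        for p in sorted(by_locus[key], key=lambda q: (q.start, q.end)):
            if current and p.start <= current_end:
                current.append(p)
                current_end = max(current_end, p.end)
            else:
                if current:
                    groups.append(current)
                current, current_end = [p], p.end
        if current:
            groups.append(current)
    return groups


if __name__ == "__main__":
    demo = [
        Prediction("gene_finder", "chr5", "+", 100, 500),
        Prediction("est_assembly", "chr5", "+", 450, 900),
        Prediction("gene_finder", "chr5", "+", 2000, 2500),
        Prediction("est_assembly", "chr5", "-", 120, 400),
    ]
    for group in link_by_location(demo):
        print([p.source for p in group], group[0].chrom, group[0].strand)
```

Keeping strands separate is a design choice for the sketch; a real pipeline might also require a minimum overlap or shared exon boundaries before linking two predictions.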

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG001539-06
Application #: 6649208
Study Section: Special Emphasis Panel (ZRG1-SSS-Y (04))
Program Officer: Good, Peter J
Project Start: 1997-02-12
Project End: 2004-08-31
Budget Start: 2003-09-01
Budget End: 2004-08-31
Support Year: 6
Fiscal Year: 2003
Total Cost: $580,163
Indirect Cost:

Name: University of Pennsylvania
Department: Genetics
Type: Schools of Medicine
DUNS #: 042250712
City: Philadelphia
State: PA
Country: United States
Zip Code: 19104

Mazzarelli, Joan M; White, Peter; Gorski, Regina et al. (2006) Novel genes identified by manual annotation and microarray expression analysis in the pancreas. Genomics 88:752-61
Schug, Jonathan; Schuller, Winfried-Paul; Kappen, Claudia et al. (2005) Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biol 6:R33
Ananko, E A; Podkolodny, N L; Stepanenko, I L et al. (2005) GeneNet in 2005. Nucleic Acids Res 33:D425-7
Jones, Andrew; Hunt, Ela; Wastling, Jonathan M et al. (2004) An object model and database for functional genomics. Bioinformatics 20:1583-90
Manduchi, E; Grant, G R; He, H et al. (2004) RAD and the RAD Study-Annotator: an approach to collection, organization and exchange of all relevant information for high-throughput gene expression studies. Bioinformatics 20:452-9
Levitsky, Victor G; Katokhin, Alexey V (2003) Recognition of eukaryotic promoters using a genetic algorithm based on iterative discriminant analysis. In Silico Biol 3:81-7
Grant, G R; Manduchi, E; Pizarro, A et al. (2003) Maintaining data integrity in microarray data management. Biotechnol Bioeng 84:795-800
Schug, Jonathan; Diskin, Sharon; Mazzarelli, Joan et al. (2002) Predicting gene ontology functions from ProDom and CDD protein domains. Genome Res 12:648-55
Crabtree, J; Wiltshire, T; Brunk, B et al. (2001) High-resolution BAC-based map of the central portion of mouse chromosome 5. Genome Res 11:1746-57
Bailey Jr, L C; Searls, D B; Overton, G C (1998) Analysis of EST-driven gene annotation in human genomic sequence. Genome Res 8:362-76