It is our view that for a genomic database to be complete it must contain at least the following data:

1. Molecular mapping information, hierarchically (in resolution) arranged from cytogenetic resolution down to individual nucleotides;
2. Genetic information, hierarchically arranged from loci ordered by linkage analysis down to allelic and mutational phenotypic variation;
3. Biochemical information regarding gene expression patterns (e.g., tissue and cell cycle specificity), gene expression regulation (e.g., transcription activation), signal transduction (e.g., upstream regulators/downstream targets), and protein activity regulation (e.g., phosphorylation).

For this information to be useful for planning experiments key to discovery, the computer system must have the following features:

4. Timely, direct access to all relevant genomic data from human and all model organisms, in a highly specific manner such that only those data that are of interest are retrieved (the corollary of access is the ability to directly submit the same type of data to the same, actual or virtual, database; this allows rapid flow of data from and to investigators);
5. Interactive tools for manipulating and analyzing data and for exploring relationships between data elements within the system (e.g., the ability to map, using a computer algorithm, a newly determined cDNA sequence (or restriction enzyme digested plasmid) to an existing genomic sequence (or predicted restriction sites) provided by one of the large sequencing centers);
6. Ability to collect and store logical collections of related data (individual projects) in a way that facilitates analysis and sharing with colleagues involved in the project (from postdocs and technicians in the same lab to groups at different institutions);
7. The software on which the system is built must be capable of rapid change and extension, to be able to handle changes in understanding of existing data and the development of new technologies involved in genomic investigations;
8. The software must be technically powerful but relatively easy to use and uniform in general operation, like, say, a Macintosh computer or a powerful word processor.

We have developed an advanced prototype of a system that has the basic functionality outlined above. This system is under test and evaluation at several locations. During the next budget period, we need to optimize the performance of the system and to develop new software and strategies to address the eight issues listed above.
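
As an illustration of the kind of analysis tool described in item 5, the following is a minimal sketch of predicting restriction sites in a known sequence and the fragment lengths a complete digest would yield, which could then be compared with an observed digest. It is not taken from the project's software. The function names `find_sites` and `fragment_lengths` and the toy sequence are hypothetical, and the sketch assumes a linear molecule, a single enzyme, and exact matching on one strand only.

```python
def find_sites(sequence, recognition_site):
    """Return the 0-based start position of every occurrence of the site."""
    sequence = sequence.upper()
    site = recognition_site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        # Step forward by one so overlapping occurrences are also reported.
        start = sequence.find(site, start + 1)
    return positions


def fragment_lengths(sequence, recognition_site, cut_offset):
    """Predict fragment lengths for a complete digest of a linear molecule.

    cut_offset is the number of bases from the start of the recognition
    site to the cut position on the strand being examined.
    """
    cuts = [p + cut_offset for p in find_sites(sequence, recognition_site)]
    boundaries = [0] + cuts + [len(sequence)]
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]


# Toy sequence (hypothetical, for illustration only).
genomic = "AAGAATTCGGGGAATTCTT"
ECORI_SITE = "GAATTC"   # EcoRI recognizes GAATTC and cuts after the first G
ECORI_CUT_OFFSET = 1

sites = find_sites(genomic, ECORI_SITE)
lengths = fragment_lengths(genomic, ECORI_SITE, ECORI_CUT_OFFSET)
print("predicted sites:", sites)
print("predicted fragment lengths:", lengths)
```

A real tool would also need to handle circular molecules, both strands, degenerate recognition sequences, and the measurement error in observed fragment sizes; none of that is attempted here.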

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
2R01HG000203-04
Application #
2208626
Study Section
Genome Study Section (GNM)
Project Start
1991-08-01
Project End
1995-05-31
Budget Start
1994-08-01
Budget End
1995-05-31
Support Year
4
Fiscal Year
1994
Total Cost
Indirect Cost
Name
Cold Spring Harbor Laboratory
Department
Type
DUNS #
065968786
City
Cold Spring Harbor
State
NY
Country
United States
Zip Code
11724
Wang, J T; Marr, T G; Shasha, D et al. (1996) Complementary classification approaches for protein sequences. Protein Eng 9:381-6
Zhang, M Q; Marr, T G (1995) Alignment of molecular sequences seen as random path analysis. J Theor Biol 174:119-29
Zhang, M Q; Marr, T G (1994) Fission yeast gene structure and recognition. Nucleic Acids Res 22:1750-9
Wang, J T; Marr, T G; Shasha, D et al. (1994) Discovering active motifs in sets of related protein sequences and using them for classification. Nucleic Acids Res 22:2769-75
Stamm, S; Zhang, M Q; Marr, T G et al. (1994) A sequence compilation and comparison of exons that are alternatively spliced in neurons. Nucleic Acids Res 22:1515-26
Mizukami, T; Chang, W I; Garkavtsev, I et al. (1993) A 13 kb resolution cosmid map of the 14 Mb fission yeast genome by nonrandom sequence-tagged site mapping. Cell 73:121-32
Zhang, M Q; Marr, T G (1993) Genome mapping by nonrandom anchoring: a discrete theoretical analysis. Proc Natl Acad Sci U S A 90:600-4