The modENCODE project is a key sequel to the sequencing of the fly and worm genomes, and will have an enormous impact on our understanding of biological processes in all higher eukaryotes, including human. In order to manage the diverse, large-scale datasets that will be produced by modENCODE, we propose to create a data coordinating center (DCC) to track the data, integrate it with other information sources, and make it available to the research community in a timely and open fashion.
This proposal brings together four groups with highly relevant backgrounds: The Micklem group, through its work on the InterMine system and FlyMine database, has extensive experience in integrating diverse types of data into high-performance data mining systems. The Stein and Lewis groups bring to the project an intimate familiarity with the C. elegans and D. melanogaster genomes, their reagents and research communities, and are well-positioned by their work with the WormBase and FlyBase databases to liaise with those MODs. The Kent group is responsible for the DCC for the Human ENCODE pilot project, and has extensive practical knowledge of developing and managing projects of this sort.
We will assemble a team of three data managers stationed at CSHL and at Berkeley, who have a background in the bioinformatics of C. elegans and/or D. melanogaster. The managers will liaise with their contacts at the data provider sites to determine data file formats, milestones and quality control procedures for their datasets. They will also liaise with representatives from NCBI to coordinate modENCODE activities with the primary data repositories at GenBank and GEO. Data providers will upload their data sets to a staging server where they will be able to preview their data on an instance of the GBrowse genome browser. The data managers will QC the data before approving its transfer to the production database.
Data will be integrated in the production database using InterMine, and from there released to the public on a monthly schedule. Researchers will be able to access the data via the GBrowse genome browser, bulk downloads, and via complex queries and reports mediated by InterMine and the BioMart data warehousing system. All major software systems used by the proposed DCC will be based on open source tools from the Generic Model Organism Database (GMOD), human ENCODE, and other sources. Throughout the project, Lewis and Stein will work closely with FlyBase and/or WormBase to ensure that data collected by modENCODE becomes an integral part of the relevant model organism database. In addition, we will dedicate a significant part of a data manager's effort to transferring data from modENCODE into the MODs during the last year of the project.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Biotechnology Resource Cooperative Agreements (U41)
Project #
3U41HG004269-05S1
Application #
8249234
Study Section
Special Emphasis Panel (ZHG1-HGR-P (J1))
Program Officer
Feingold, Elise A
Project Start
2007-05-04
Project End
2014-03-31
Budget Start
2011-04-01
Budget End
2014-03-31
Support Year
5
Fiscal Year
2011
Total Cost
$1,326,619
Indirect Cost
Name
Ontario Institute for Cancer Research
Department
Type
DUNS #
205540219
City
Toronto
State
ON
Country
Canada
Zip Code
M5 0-A3
Kalderimis, Alex; Lyne, Rachel; Butano, Daniela et al. (2014) InterMine: extensive web services for modern biology. Nucleic Acids Res 42:W468-72
Trinh, Quang M; Jen, Fei-Yang Arthur; Zhou, Ziru et al. (2013) Cloud-based uniform ChIP-Seq processing tools for modENCODE and ENCODE. BMC Genomics 14:494
Kuhn, Robert M; Haussler, David; Kent, W James (2013) The UCSC genome browser and associated tools. Brief Bioinform 14:144-61
Meyer, Laurence R; Zweig, Ann S; Hinrichs, Angie S et al. (2013) The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res 41:D64-9
Contrino, Sergio; Smith, Richard N; Butano, Daniela et al. (2012) modMine: flexible access to modENCODE data. Nucleic Acids Res 40:D1082-8
Dreszer, Timothy R; Karolchik, Donna; Zweig, Ann S et al. (2012) The UCSC Genome Browser database: extensions and updates 2011. Nucleic Acids Res 40:D918-23
Fujita, Pauline A; Rhead, Brooke; Zweig, Ann S et al. (2011) The UCSC Genome Browser database: update 2011. Nucleic Acids Res 39:D876-82
Washington, Nicole L; Stinson, E O; Perry, Marc D et al. (2011) The modENCODE Data Coordination Center: lessons in harvesting comprehensive experimental details. Database (Oxford) 2011:bar023
McKay, Sheldon J; Vergara, Ismael A; Stajich, Jason E (2010) Using the Generic Synteny Browser (GBrowse_syn). Curr Protoc Bioinformatics Chapter 9:Unit 9.12
Kuhn, R M; Karolchik, D; Zweig, A S et al. (2009) The UCSC Genome Browser Database: update 2009. Nucleic Acids Res 37:D755-61