The Open Science Data Framework (OSDF) will provide the architecture and software necessary for conducting bioinformatics analyses in a federated, cloud-enabled data and computational environment. The OSDF consists of a data store and data exchange procedures with an Application Programming Interface (API) to support data submission, retrieval, and analysis for the user community. OSDF will support a diverse set of users, including (1) sequence generators that need to store and process raw data, (2) tool and pipeline developers that need access to reference data sets, and (3) web-based resources that need real-time querying of reference data. The genomics community will be able to use this resource to process human genomic, transcriptomic, and metagenomic data and to conduct analyses that include human variation detection, transcriptome analysis, epigenetic analysis, and microbiome analysis. To accomplish these goals we propose to: (1) establish the OSDF software stack; (2) ensure the usability of this data by integrating OSDF with established community-supported pipelines in cloud-enabled virtual machines; (3) create two OSDF instances where we will host publicly available genomic, transcriptomic, and metagenomic data from the 1000 Genomes Project, MG-RAST, and the Human Microbiome Project, along with some of the intermediate and final analysis results; and (4) provide adequate documentation and training to enable the user community to use the system.
With the technological innovations and improvements in genome sequencing over the past decade, sequencing is becoming cheaper and will soon become an integral part of medical research and practice. However, the computational resources needed to process this sequence data have not kept pace. With the OSDF, researchers will be able to share and reuse expensive analysis results, thereby reducing the overall cost of conducting translational research that uses genomic data.