Development of a Software Pipeline for Sequence Data

Bloom, Toby

Abstract

Next-generation sequencing technologies are generating an incredible amount of data in a very short time span. While the raw sequence data is submitted to NCBI, at present there is no standard pipeline at NIH that can process this vast amount of data in a uniform, robust, fast and accurate manner to produce the variant calls needed for further biological research. For large collaborative projects, such as 1000 Genomes or TCGA, it is critical to the quality of the results that all data for the project be processed consistently, through a single, validated analysis pipeline. The pipelines must be able to validate the data, recalibrate error rates, merge data for each sample across multiple sources and technologies, align to reference, and call SNPs and structural variants. Further, if increases in data production continue along current trajectories, these pipelines will need to process terabases of data per day. At present, every large project either coordinates its own pipeline infrastructure and analysis processes or reconciles results generated through inconsistent processes. Furthermore, next-generation technologies make it possible for small labs to generate huge datasets with only one or two instruments. But those labs are likely not equipped with the IT and informatics infrastructure needed to make full use of these data. They will therefore need to process their data at some external location to make the potential of these instruments a reality. We propose to build and deploy a massively parallel, high-throughput analysis pipeline infrastructure to be managed by NCBI and hosted at Amazon Web Services (the Amazon "cloud"). We will further develop several pre-configured analysis pipeline workflows to run common types of sequence analysis on that infrastructure.
Users will be able to modify and extend the pre-configured pipeline workflows, or design and deploy new pipelines as new types of sequencing analyses develop, using tools we provide. Those new pipelines will be able to incorporate analysis algorithms implemented in a variety of programming languages, and will be able to use available compute resources to run as much as possible in parallel, thus reducing the time to delivery of results. Finally, we will provide a catalog of algorithm implementations, already configured to run within the pipeline infrastructure, from which new pipeline workflows can be constructed. These components will include quality recalibration steps, SNP detectors, and indel detection algorithms.
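As a rough illustration of the idea of composing workflows from a catalog of steps and running samples in parallel, here is a minimal Python sketch. It is not the project's actual design: every name in it (`CATALOG`, `DEFAULT_WORKFLOW`, `run_workflow`, `run_all`) is hypothetical, the steps are placeholders that only record that they ran, and the step order simply follows the order in which the abstract lists the stages.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A step takes a sample record (a dict) and returns an updated copy.
Step = Callable[[dict], dict]


def _mark(name: str) -> Step:
    """Build a placeholder step that only records that it ran."""
    def step(sample: dict) -> dict:
        return {**sample, "history": sample["history"] + [name]}
    return step


# Hypothetical catalog of pre-configured steps (assumption: real steps
# would wrap external tools, possibly written in other languages).
CATALOG: Dict[str, Step] = {
    name: _mark(name)
    for name in ("validate", "recalibrate", "merge", "align",
                 "call_snps", "call_indels")
}

# Hypothetical pre-configured workflow; the order is an assumption.
DEFAULT_WORKFLOW: List[str] = [
    "validate", "recalibrate", "merge", "align", "call_snps",
]


def run_workflow(sample_id: str, workflow: List[str]) -> dict:
    """Run the named steps in order on a single sample."""
    sample = {"id": sample_id, "history": []}
    for name in workflow:
        sample = CATALOG[name](sample)
    return sample


def run_all(sample_ids: List[str], workflow: List[str],
            max_workers: int = 4) -> List[dict]:
    """Process independent samples in parallel, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: run_workflow(s, workflow), sample_ids))
```

Under these assumptions, a user would extend a pre-configured workflow by appending another catalog entry, for example `run_all(samples, DEFAULT_WORKFLOW + ["call_indels"])`, without changing the steps themselves.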

Public Health Relevance

Next-generation sequencing technologies are generating an incredible amount of data in a very short time span, and analysis pipelines are needed to process this raw data to produce usable biological information. We propose to build and deploy a massively-parallel, high-throughput analysis pipeline infrastructure to be managed by NCBI. We will further develop several pre-configured analysis pipeline workflows to run common types of sequence analysis on that infrastructure.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: High Impact Research and Research Infrastructure Programs (RC2)
Project #: 3RC2HG005546-02S1
Application #: 8542292
Study Section: Special Emphasis Panel (ZHG1-HGR-N (O1))
Program Officer: Bonazzi, Vivien

Project Start: 2009-09-30
Project End: 2013-08-31
Budget Start: 2010-09-01
Budget End: 2013-08-31
Support Year: 2
Fiscal Year: 2012
Total Cost: $200,000
Indirect Cost: $84,393

Institution

Name: Broad Institute, Inc.
DUNS #: 623544785

City: Cambridge
State: MA
Country: United States
Zip Code: 02142

Related projects


NIH 2012 RC2 HG	Development of a Software Pipeline for Sequence Data Bloom, Toby / Broad Institute, Inc.	$200,000
NIH 2010 RC2 HG	Development of a Software Pipeline for Sequence Data McCarroll, Steven Andrew / Broad Institute, Inc.	$597,624
NIH 2009 RC2 HG	Development of a Software Pipeline for Sequence Data Bloom, Toby / Broad Institute, Inc.	$621,047
