Next-generation sequencing technologies are generating enormous amounts of data in a very short time span. While the raw sequence data are submitted to NCBI, at present there is no standard pipeline at NIH that can process this volume of data in a uniform, robust, fast, and accurate manner to produce the variant calls needed for further biological research. For large collaborative projects, such as 1000 Genomes or TCGA, it is critical to the quality of the results that all data for the project be processed consistently, through a single, validated analysis pipeline. The pipelines must be able to validate the data, recalibrate error rates, merge data for each sample across multiple sources and technologies, align to a reference, and call SNPs and structural variants. Further, if increases in data production continue along current trajectories, these pipelines will need to process terabases of data per day. At present, every large project either coordinates its own pipeline infrastructure and analysis processes or reconciles results generated through inconsistent processes. Furthermore, next-generation technologies make it possible for small labs to generate huge datasets with only one or two instruments, but those labs are likely not equipped with the IT and informatics infrastructure needed to make full use of these data. They will therefore need to process their data at some external location to realize the potential of these instruments. We propose to build and deploy a massively parallel, high-throughput analysis pipeline infrastructure to be managed by NCBI and hosted at Amazon Web Services (the Amazon "cloud"). We will further develop several pre-configured analysis pipeline workflows to run common types of sequence analysis on that infrastructure.
Users will be able to modify and extend the pre-configured pipeline workflows, or design and deploy new pipelines as new types of sequencing analyses develop, using tools we provide. Those new pipelines will be able to incorporate analysis algorithms implemented in a variety of programming languages, and will be able to use available compute resources to run as much as possible in parallel, thus reducing the time to delivery of results. Finally, we will provide a catalog of algorithm implementations, already configured to run within the pipeline infrastructure, from which new pipeline workflows can be constructed. These components will include quality recalibration steps, SNP detectors, and indel detection algorithms.

Public Health Relevance

Next-generation sequencing technologies are generating enormous amounts of data in a very short time span, and analysis pipelines are needed to process this raw data to produce usable biological information. We propose to build and deploy a massively parallel, high-throughput analysis pipeline infrastructure to be managed by NCBI. We will further develop several pre-configured analysis pipeline workflows to run common types of sequence analysis on that infrastructure.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
High Impact Research and Research Infrastructure Programs (RC2)
Project #
1RC2HG005546-01
Application #
7852966
Study Section
Special Emphasis Panel (ZHG1-HGR-N (O1))
Program Officer
Bonazzi, Vivien
Project Start
2009-09-30
Project End
2011-08-31
Budget Start
2009-09-30
Budget End
2010-08-31
Support Year
1
Fiscal Year
2009
Total Cost
$621,047
Indirect Cost
Name
Broad Institute, Inc.
Department
Type
DUNS #
623544785
City
Cambridge
State
MA
Country
United States
Zip Code
02142