Informatics Tools for High-Throughput Sequences Data Analysis

Banks, Eric

Abstract

The Genome Analysis Toolkit (GATK) is a suite of best-in-class, widely used, well-supported, open-source tools for processing and analysis of next-generation DNA sequencing (NGS) data. These tools currently include a multiple sequence realigner, a covariate-correcting base quality score recalibrator, multi-sample SNP, indel, and CNV genotypers, machine learning algorithms for false positive identification, variant evaluation modules, somatic SNP and indel callers, and hundreds of other tools. Underlying all of these tools is our structured programming framework (GATK-Engine), which uses the functional programming philosophy of MapReduce to make it easy to write feature-rich, efficient, and robust analysis tools. By centralizing common data management infrastructure, all GATK-based tools benefit from the engine's correctness, CPU and memory efficiency, and automatic distributed and shared-memory parallelization, which are essential capabilities given the massive and growing size of NGS datasets. The GATK currently supports all of the major sequencing technologies, including Illumina, Life Sciences 454, and ABI SOLiD, and study designs ranging from hybrid capture of exomes to 1000s of low-pass samples in the 1000 Genomes Project. Our emphasis on technology-agnostic processing tools has helped to popularize the now standard SAM/BAM and VCF formats for representing NGS data and variation calls, respectively.
In this RFA we propose to continue to develop the GATK-Engine and data processing tools to (1) achieve complete and accurate variation discovery and genotyping for all major sequencing study designs and NGS technologies, (2) optimize the GATK-Engine and pipelining infrastructure to operate efficiently on distributed data sets at the scale of tens of thousands of samples, (3) extend the GATK data processing tools to support the upcoming sequencing technologies of Complete Genomics, Ion Torrent, and Pacific Biosciences as well as they support current technologies, and (4) significantly expand our educational and support structures to ensure that the long tail of future NGS users can benefit from the best-practice data processing and analysis tools in the GATK.
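
As a rough illustration of the MapReduce-style design the abstract describes, the Python sketch below separates data traversal (the "engine") from the per-locus analysis (the "tool"). It is not GATK code: `Pileup`, `traverse`, `traverse_sharded`, and `count_mismatches` are hypothetical names invented for this example, and the toy data stand in for real aligned reads.

```python
from functools import reduce
from typing import Callable, Iterable, List, NamedTuple


class Pileup(NamedTuple):
    """Toy stand-in for the bases observed at one reference position."""
    position: int
    ref_base: str
    read_bases: str


def traverse(loci: Iterable[Pileup],
             map_fn: Callable[[Pileup], int],
             reduce_fn: Callable[[int, int], int],
             initial: int) -> int:
    """Hypothetical 'engine': apply map_fn at each locus, then fold the results."""
    return reduce(reduce_fn, (map_fn(p) for p in loci), initial)


def traverse_sharded(shards: List[List[Pileup]],
                     map_fn: Callable[[Pileup], int],
                     reduce_fn: Callable[[int, int], int],
                     initial: int) -> int:
    """Process each shard independently, then combine the partial results.

    This only matches traverse() when reduce_fn is associative and
    `initial` is its identity element (e.g. addition with 0).
    The shards are handled sequentially here; because they are
    independent, a real engine could hand them to separate workers.
    """
    partials = [traverse(s, map_fn, reduce_fn, initial) for s in shards]
    return reduce(reduce_fn, partials, initial)


def count_mismatches(p: Pileup) -> int:
    """Hypothetical 'tool' logic: number of read bases differing from the reference."""
    return sum(1 for b in p.read_bases if b != p.ref_base)


def add(a: int, b: int) -> int:
    return a + b


# Toy data, invented for illustration only.
PILEUPS = [
    Pileup(100, "A", "AAAT"),
    Pileup(101, "C", "CCCC"),
    Pileup(102, "G", "GTTG"),
]

total = traverse(PILEUPS, count_mismatches, add, 0)
sharded = traverse_sharded([PILEUPS[:2], PILEUPS[2:]], count_mismatches, add, 0)
print(total, sharded)
```

The point of the split is that the tool author writes only the map and reduce functions, while sharding and combination live in the shared traversal code, which is the property the abstract credits for parallelization across tools.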

Public Health Relevance

The proposed project aims to continue to develop the Genome Analysis Toolkit (GATK), a suite of widely used and mission-critical tools for analyzing next-generation DNA sequencing data. With this grant we will improve these tools, make them more robust, and extend them to new sequencing technologies. This is essential to realize the potential of DNA sequencing to understand human history and diversity and to discover new loci associated with human disease, leading to new biologic hypotheses and new drug targets.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project--Cooperative Agreements (U01)
Project #: 5U01HG006569-03
Application #: 8601147
Study Section: Special Emphasis Panel (ZHG1-HGR-M (O3))
Program Officer: Sofia, Heidi J

Project Start: 2012-02-01
Project End: 2015-12-31
Budget Start: 2014-01-01
Budget End: 2014-12-31
Support Year: 3
Fiscal Year: 2014
Total Cost: $909,000
Indirect Cost: $367,181

Institution

Name: Broad Institute, Inc.
DUNS #: 623544785

City: Cambridge
State: MA
Country: United States
Zip Code: 02142

Related projects


NIH 2015 U01 HG	Informatics Tools for High-Throughput Sequences Data Analysis Banks, Eric / Broad Institute, Inc.	$967,608
NIH 2014 U01 HG	Informatics Tools for High-Throughput Sequences Data Analysis Banks, Eric / Broad Institute, Inc.	$909,000
NIH 2013 U01 HG	Informatics Tools for High-Throughput Sequences Data Analysis Banks, Eric / Broad Institute, Inc.	$964,551
NIH 2012 U01 HG	Informatics Tools for High-Throughput Sequences Data Analysis Depristo, Mark A. / Broad Institute, Inc.	$1,010,000

Publications

Castel, Stephane E; Levy-Moonshine, Ami; Mohammadi, Pejman et al. (2015) Tools and best practices for data processing in allelic expression analysis. Genome Biol 16:195

Van der Auwera, Geraldine A; Carneiro, Mauricio O; Hartl, Chris et al. (2013) From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics 43:11.10.1-33

1000 Genomes Project Consortium; Abecasis, Goncalo R; Auton, Adam et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56-65
