The Genome Analysis Toolkit (GATK) is a suite of best-in-class, widely used, well-supported, open-source tools for processing and analyzing next-generation DNA sequencing (NGS) data. These tools currently include a multiple sequence realigner, a covariate-correcting base quality score recalibrator, multi-sample SNP, indel, and CNV genotypers, machine learning algorithms for false-positive identification, variant evaluation modules, somatic SNP and indel callers, and hundreds of other tools. Underlying all of these tools is our structured programming framework (GATK-Engine), which uses the functional programming philosophy of MapReduce to make it easy to write feature-rich, efficient, and robust analysis tools. Because the engine centralizes common data management infrastructure, all GATK-based tools benefit from its correctness, CPU and memory efficiency, and automatic distributed and shared-memory parallelization, which are essential capabilities given the massive and growing size of NGS datasets. The GATK currently supports all of the major sequencing technologies, including Illumina, Life Sciences 454, and ABI SOLiD, and study designs ranging from hybrid capture of exomes to thousands of low-pass samples in the 1000 Genomes Project. Our emphasis on technology-agnostic processing tools has helped to popularize the now-standard SAM/BAM and VCF formats for representing NGS data and variation calls, respectively.
In response to this RFA, we propose to continue to develop the GATK-Engine and data processing tools to (1) achieve complete and accurate variation discovery and genotyping for all major sequencing study designs and NGS technologies; (2) optimize the GATK-Engine and pipelining infrastructure to operate efficiently on distributed data sets at the scale of tens of thousands of samples; (3) extend the GATK data processing tools to support the upcoming sequencing technologies of Complete Genomics, Ion Torrent, and Pacific Biosciences as well as we support current technologies; and (4) significantly expand our educational and support structures to ensure that the long tail of future NGS users can benefit from the best-practice data processing and analysis tools in the GATK.
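As a rough illustration of the MapReduce style the abstract attributes to the GATK-Engine, the sketch below separates a toy "engine" (the traversal) from a tool (the map and reduce functions). It is not the GATK API: `Locus`, `traverse`, `count_mismatches`, and the sample data are all hypothetical names and values invented for this example, and the real engine is far more elaborate.

```python
from functools import reduce
from typing import Callable, Iterable, NamedTuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


class Locus(NamedTuple):
    """Hypothetical per-site record: a reference base and the read bases covering it."""
    contig: str
    position: int
    ref_base: str
    read_bases: str


def traverse(loci: Iterable[Locus],
             map_fn: Callable[[Locus], T],
             reduce_fn: Callable[[R, T], R],
             initial: R) -> R:
    """Toy engine: apply map_fn to every locus, then fold the results with reduce_fn."""
    return reduce(reduce_fn, map(map_fn, loci), initial)


def count_mismatches(locus: Locus) -> int:
    """Map step: count read bases that differ from the reference base at this site."""
    return sum(1 for base in locus.read_bases if base != locus.ref_base)


def add(total: int, value: int) -> int:
    """Reduce step: accumulate per-locus counts into a running total."""
    return total + value


# Invented example data, not real sequencing output.
loci = [
    Locus("chr1", 100, "A", "AAAA"),
    Locus("chr1", 101, "C", "CCTC"),
    Locus("chr1", 102, "G", "GTTG"),
]

total_mismatches = traverse(loci, count_mismatches, add, 0)

# Because addition is associative, shards can be traversed independently
# and their partial results combined afterwards.
shard_total = (traverse(loci[:2], count_mismatches, add, 0)
               + traverse(loci[2:], count_mismatches, add, 0))
```

In this sketch the tool author writes only `count_mismatches` and `add`; how the loci are iterated, or split into shards, is left to the traversal, which is the division of labor the abstract credits for the engine's parallelization.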

Public Health Relevance

The proposed project aims to continue to develop the Genome Analysis Toolkit (GATK), a suite of widely used and mission-critical tools for analyzing next-generation DNA sequencing data. With this grant we will improve these tools, make them more robust, and extend them to new sequencing technologies. This work is essential to realizing the potential of DNA sequencing to understand human history and diversity and to discover new loci associated with human disease, leading to new biological hypotheses and new drug targets.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project--Cooperative Agreements (U01)
Study Section
Special Emphasis Panel (ZHG1-HGR-M (O3))
Program Officer
Sofia, Heidi J
Broad Institute, Inc.
United States
Castel, Stephane E; Levy-Moonshine, Ami; Mohammadi, Pejman et al. (2015) Tools and best practices for data processing in allelic expression analysis. Genome Biol 16:195
Van der Auwera, Geraldine A; Carneiro, Mauricio O; Hartl, Chris et al. (2013) From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline. Curr Protoc Bioinformatics 43:11.10.1-33
1000 Genomes Project Consortium; Abecasis, Goncalo R; Auton, Adam et al. (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56-65