Structural variation (SV) is a major driver of genome organization, content, and diversity. Over the last decade, many studies have demonstrated the significance of SV to the genetic architecture of neuropsychiatric disorders (NPDs) such as autism spectrum disorder (ASD), schizophrenia, bipolar disorder, and ADHD. These studies have suggested a significant impact of SV within individual disorders, as well as shared genetic etiology across a spectrum of NPDs. However, despite this etiological relevance, most studies of SV in NPDs have focused on large canonical copy number variation (CNV) using microarray technologies. Population genetic studies have paralleled these efforts, as most SV databases are dominated by array-based CNV data. Several whole-genome sequencing (WGS) references have now been created to characterize SV, such as the 1000 Genomes Project in ~2,500 individuals. These datasets have been invaluable to human genetic research; however, they have captured only a small fraction of the SV that is accessible to WGS and are limited in ancestral diversity, primarily due to limitations in technologies, algorithms, and sample sizes. These challenges have also reduced the value of these references for clinical interpretation of SV in diagnostic screening.
This study will provide maps of canonical and complex SVs on a scale >50-fold that of the 1000 Genomes Project by systematically analyzing aggregated WGS datasets in the Genome Aggregation Database (gnomAD). We will integrate our completed prototype of a scalable tool for cloud-based SV discovery within the universally accessible Genome Analysis Toolkit (GATK-SV; Aim 1). GATK-SV will provide an open-source framework that can capture a spectrum of canonical and complex SV, within the capabilities of short-read WGS, and will include a module for extensibility to long-read WGS. We will apply these methods across the aggregation of diverse ancestries in gnomAD, a WGS extension of our Exome Aggregation Consortium (ExAC) (Aim 2). The gnomAD dataset currently includes 85,000 WGS samples, and this resource will exceed 150,000 genomes by the conclusion of Aim 2. We will use this reference to define genomic regions recalcitrant to SV and provide systematic measures of SV constraint. We will then perform WGS association analyses across >60,000 genomes in individuals with NPDs, including ASD, schizophrenia, and bipolar disorder cases (Aim 3). In combination with the gnomAD SV maps and the integration of microarray-based CNV aggregation, these analyses will be well powered to quantify the relative risk conferred by SV in each individual disorder, and to explore shared risk across the NPD spectrum.
Each aim will apply innovative approaches to yield novel products, and we will freely distribute these tools, maps, and analyses without restriction. Importantly, these data will also provide benchmarked references for diagnostic interpretation across diverse ancestries, and an analytical framework for future population-scale genomic medicine initiatives.

Public Health Relevance

This project will provide a new tool (GATK-SV) to the basic and clinical research communities for characterizing changes in genome organization from whole-genome sequencing data. We will apply this tool to an aggregation of >150,000 genomes to derive broad reference maps, which will be openly distributed to the research community through an existing, widely accessed web interface. We will then use these resources to quantify the impact of structural variation across complex neuropsychiatric disorders.

National Institutes of Health (NIH)
National Institute of Mental Health (NIMH)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Arguello, Alexander
Broad Institute, Inc.
United States