Structural variation (SV) is a major driver of genome organization, content, and diversity. Over the last decade, many studies have demonstrated the significance of SV to the genetic etiology of neuropsychiatric disorders (NPDs) such as autism spectrum disorder (ASD), schizophrenia, bipolar disorder, and ADHD. These studies have suggested a variable impact of SV within individual disorders, as well as a shared genetic etiology across a spectrum of NPDs. However, despite the relevance of SV to NPDs, virtually all previous association studies of SV have been restricted to large canonical copy number variation (CNV) detected using microarray technologies. Population genetic studies have paralleled these efforts, as most SV databases are dominated by array-based CNV data. Early sequencing-based reference maps, such as the 1000 Genomes Project, used low-coverage whole-genome sequencing (WGS) to characterize other classes of SV in ~2,500 individuals. Collectively, these existing datasets have captured only a small fraction of the SV that exists in each human genome and in the greater population, primarily due to limitations in SV ascertainment technologies, algorithms, and sample sizes. This study will provide maps of canonical and complex SVs on a scale 20- to 40-fold greater than the 1000 Genomes Project by systematically analyzing aggregated high-coverage WGS datasets. We will first develop a scalable tool for cloud-based SV discovery within the universally accessible Genome Analysis Toolkit (GATK-SV; Aim 1).
GATK-SV will build upon our existing integration of twelve algorithms in a framework that can capture all known classes of canonical and complex SV, and will include a module for extensibility to long-read WGS. We will then apply these methods across the aggregation of diverse WGS datasets in gnomAD, a whole-genome extension of our Exome Aggregation Consortium (ExAC) project (Aim 2). The gnomAD dataset currently includes 58,000 WGS samples, and we anticipate that this resource will exceed 100,000 genomes by the conclusion of Aim 2. We will use this dataset to define regions of the genome that are recalcitrant to alteration and provide the first systematic measures of SV constraint. Finally, we will perform association analyses among >25,000 genomes in individuals with NPDs, including ASD, schizophrenia, and bipolar disorder (Aim 3). In combination with our gnomAD reference SV map, these analyses will be adequately powered to quantify the relative risk conferred by SV in each individual disorder and to explore shared risk between disorders. These studies are significant, as our team brings together complementary expertise to address substantial voids in the fields of computational genomics, population genetics, and neuropsychiatric genetics.
Each aim will apply innovative approaches to yield novel products, and we will freely distribute these tools, maps, and analyses to the global community without restriction. Importantly, these studies will also provide infrastructure for clinical diagnostic interpretation of WGS and a framework for future population-scale genomic medicine initiatives.
This project will develop new tools to characterize changes in genome structure from genome sequencing data (GATK-SV), and apply these tools to an aggregation of 50,000-100,000 genomes to derive reference maps. We will then use these resources to quantify the impact of structural variation across complex neuropsychiatric disorders. We will disseminate all tools and maps to the global community without use restriction, using an established infrastructure, as we have done previously with the GATK tool and our exome aggregation.