The identification of structural variants (SVs), including deletions, insertions, duplications, and inversions, from human whole-genome sequencing (WGS) data is essential for genomic research and precision medicine. However, SV discovery remains a challenge because no single sequencing technology or computer algorithm effectively captures the full spectrum of SVs. The Investigators of this project have made substantial advances toward comprehensive SV discovery by combining analyses from long- and short-read sequencing platforms, as well as incorporating other technologies such as jumping libraries, linked-read sequencing, and chromosomal strand-specific sequencing. Application of this approach to the genomes of three father-mother-child trios identified approximately threefold more SVs than could be detected using standard short-read WGS alone. This project builds on the Investigators' ongoing work to develop optimized, integrated multi-technology computational pipelines for the comprehensive identification of SVs in human genomes.
In Aim 1, computational methods will be developed for SV detection in WGS datasets generated using the multiple genomic technologies described above, and the combination of computational methods yielding the most comprehensive and accurate SV callset will be established as computational pipelines that will be packaged for broad sharing. This work will focus on family trios and unrelated individuals from all 26 populations of the 1000 Genomes Project. Use of trios will also enable determination of SV mutation rates for the different SV classes.
Aim 2 will develop novel SV calling methods that address the challenging task of SV detection in short-read-only WGS datasets. This work will focus on genomes sequenced by large-scale NHGRI-funded initiatives that aim to identify genetic variants associated with disease, such as the Centers for Common Disease Genomics (CCDG) and Centers for Mendelian Genomics (CMG). Analyses of these short-read WGS datasets will yield a gold standard for genome-wide SV datasets and serve as a resource that can be used to genotype common variants across the larger number of CCDG, CMG, and other short-read WGS datasets. Execution of this project will generate deep-coverage WGS and multi-technology genomic datasets, as well as new SV callsets, for individuals across 26 populations around the world. These data will be made widely available through an open FTP site. SV datasets for patient samples from CCDG and CMG will be accessible through dbGaP and will enable a more comprehensive association of genetic variants with human diseases. All computational pipelines will be made available in a portable framework to promote wide adoption by other users. Overall, this project will establish SV reference sets spanning many human populations around the world in which all SVs (and small insertions and deletions) have been sequence resolved and correctly phased along the entire length of the chromosomes. These reference sets will serve as a valuable community resource for benchmarking SV discovery and genotyping across WGS datasets in the clinical and genomic research domains.
Identifying and cataloging the occurrence of structural variants (SVs), including deletions, insertions, duplications, and inversions, in human chromosomes is essential for understanding genetic disease risk and developing precision medicine approaches to disease treatment. However, SV discovery remains a challenge because no single genomic sequencing technology or computer algorithm effectively captures the full spectrum of SVs. With this project, we will enable comprehensive SV discovery in human chromosomes by combining multiple cutting-edge genome-analyzing technologies and computational tools to generate a novel SV reference set encompassing ethnically diverse populations, which will serve as a valuable resource to clinicians and researchers who investigate the causes of genetic diseases.