Understanding the genetic basis of human disease requires a comprehensive assessment of the full spectrum of human genetic variation. Genome structural variation, including large deletions, insertions, and inversions (>50 bp), has been more difficult to characterize because of its association with repetitive DNA. The majority of structural variation, including common structural variants (SVs), has not yet been discovered using short-read whole-genome datasets and standard SV callers. Advances in sequencing technology over the last three years, however, have made the systematic discovery of this variation possible for the first time. This proposal focuses on the discovery, sequence resolution, and genotyping of the most complex and under-ascertained forms of human genetic variation, including multi-copy number variants (mCNVs), inversions, and intermediate-size insertions and deletions. We target a diversity panel of 34 human genomes and partition long-read single-molecule, real-time sequencing data using 10X linked reads and Strand-seq data in order to fully phase and sequence-resolve SVs on each human haplotype. Using these long-read sequence data, we further develop a computational graph-based approach to distinguish and assemble distinct copies underlying large mCNVs mapping to high-identity segmental duplications. Finally, we take advantage of the sequence structure, including breakpoints and sequence differences among the copies, to more accurately genotype these variants in a diversity panel of >2,800 human genomes where short-read whole-genome sequence data are already available. The work will develop new methods to characterize more complex forms of human genetic variation and provide fundamental insight into their diversity, mechanism of origin, and mutational properties.
This research will also improve genome assembly, characterize new human genome sequence, identify a large class of missing genetic variation, and make it possible to explore this form of human genetic variation systematically as part of disease-association studies.
This proposal focuses on the discovery, sequencing, and genotyping of more complex structural variation that has been overlooked by standard whole-genome sequencing efforts. The work will help complete our understanding of the full spectrum of human genetic variation and develop methods to associate such variants with disease.