Understanding the genetic basis of human disease requires a comprehensive assessment of the full spectrum of human genetic variation. Genome structural variation, including larger deletions, insertions, and inversions (>50 bp), has been more difficult to characterize because of its association with repetitive DNA. The majority of structural variation, including common structural variants (SVs), has not yet been discovered using short-read whole-genome datasets and standard SV callers. Advances in sequencing technology over the last three years, however, have made the systematic discovery of this variation possible for the first time. This proposal focuses on the discovery, sequence resolution, and genotyping of the most complex and under-ascertained forms of human genetic variation, including multi-copy number variants (mCNVs), inversions, and intermediate-size insertions and deletions. We target a diversity panel of 34 human genomes and partition long-read single-molecule, real-time sequencing data using 10X linked reads and Strand-seq data in order to fully phase and sequence-resolve SVs on each human haplotype. Using these long-read sequence data, we further develop a computational graph-based approach to distinguish and assemble the distinct copies underlying large mCNVs that map to high-identity segmental duplications. Finally, we take advantage of the sequence structure, including breakpoints and sequence differences among the copies, to more accurately genotype these variants in a diversity panel of >2,800 human genomes for which short-read whole-genome sequence data are already available. The work will develop new methods to characterize more complex forms of human genetic variation and provide fundamental insight into their diversity, mechanism of origin, and mutational properties.
This research has the additional benefit that it will improve genome assembly, characterize new human genome sequence, identify a large class of missing genetic variation, and provide us with the ability to systematically explore this form of human genetic variation as part of disease-association studies.

Public Health Relevance

This proposal focuses on the discovery, sequencing, and genotyping of more complex structural variation that has been overlooked as part of standard whole-genome sequencing efforts. The work will help complete our understanding of the full spectrum of human genetic variation and develop methods to associate such variants with disease.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG010169-02
Application #
9778893
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Brooks, Lisa
Project Start
2018-09-06
Project End
2022-06-30
Budget Start
2019-07-01
Budget End
2020-06-30
Support Year
2
Fiscal Year
2019
Total Cost
Indirect Cost
Name
University of Washington
Department
Genetics
Type
Schools of Medicine
DUNS #
605799469
City
Seattle
State
WA
Country
United States
Zip Code
98195