Structural variation (SV), is a diverse class of genome variation that includes copy number variants (CNVs) such as deletions and duplications, as well as balanced rearrangements, such as inversions and reciprocal translocations. A typical human genome harbors >4,000 SVs larger than 300bp and their large size increases the potential to delete or duplicate genes, disrupt chromatin structure, and alter expression. Despite their prevalence and potential for phenotypic consequence, SVs remain notoriously difficult to detect and genotype with high accuracy. Much of this difficulty is driven by the fact DNA sequence alignment ?signals? indicating SVs are far more complex than for single-nucleotide and insertion deletion variants. Unlike SNP alignments that vary only in allele state, alignments supporting SVs vary in state (supports an alternate structure or not) alignment location, and type. Consequently, the accuracy of SV discovery is much lower than that of SNPs and INDELs. Furthermore, SV pipelines scale poorly and are difficult to run. These challenges are a barrier for single genome analysis and studies of families must invest substantial effort into eliminating a sea of false positives. These problems become exponentially more acute for large-scale sequencing efforts such as TOPmed, the Centers for Common Disease Genetics, and the All of Us program. Software efficiency is key to scalability for such projects. However, of equal importance is comprehensive, accurate discovery. Building upon more than a decade of software development experience and analyzing SV in diverse disease contexts, we have invested significant effort into understanding the causes of the insufficient accuracy for SV discovery. These efforts, together with our research and development experience in this area, give us unique insight into improving the accuracy and scalability of SV discovery. Our goal is to narrow the accuracy gap between SNP/INDEL variation and structural variation discovery. These developments will empower studies of human genomes in diverse contexts and will therefore have broad impact. Our goals are to: 1. Develop a deep learning model to correct systematic variation in sequence depth. This new machine learning model will correct systematic biases in DNA sequence depth and dramatically improve the discovery of deletions and duplications. 2. Improve the speed, scalability, and accuracy of SV detection and genotyping. Using new algorithms, we will bring the accuracy of SV detection much closer to that of SNP and INDEL discovery and allow accurate SV discovery to be deployed at scale. 3. Create a map of genomic constraint for SV from population-scale genome analysis. We will deploy our new methods to detect and genotype structural variation among tens of thousands of human genomes. The resulting SV map will empower the creation of a model of genomic constraint for SV and enable new software to predict deleterious SVs, especially in the noncoding genome.

Public Health Relevance

Single-nucleotide DNA changes paint an incomplete picture of a human?s genome. A more complete picture must include a genome's structural variation (SV), an important class of genome variation that includes copy number variants (CNVs) such as deletions and duplications. However, existing methods have poor accuracy. As the genetics community transitions to large-scale genome sequencing studies, there is an acute need for improved SV discovery methods. This proposal introduces a series of algorithmic and software innovations that will empower SV discovery, genotyping, and interpretation in large-scale human disease studies.

Agency
National Institute of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG010757-02
Application #
10153847
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Sofia, Heidi J
Project Start
2020-05-01
Project End
2024-02-29
Budget Start
2021-03-01
Budget End
2022-02-28
Support Year
2
Fiscal Year
2021
Total Cost
Indirect Cost
Name
University of Utah
Department
Genetics
Type
Schools of Medicine
DUNS #
009095365
City
Salt Lake City
State
UT
Country
United States
Zip Code
84112