Accurately detecting structural variation in the genome is a challenging task. Many approaches have been developed over the last few decades, yet it is estimated that tens of thousands of variants are still being missed in a given sample. Many of these are missed because short-read sequencing is limited in its ability to identify large variants. Although many of the missed variants lie within complex regions of the genome, some have been shown to be clinically relevant, which makes their discovery important. New platforms that sequence the genome using long reads show promise for overcoming many of these limitations, making it possible to identify the full spectrum of simple and complex structural variants. Because this technology is relatively young, new computational approaches that support the analysis of long-read sequencing data can aid in the discovery of the variants that are still being missed. Beyond detecting novel variation in samples with long-read sequencing data, computational approaches can also be developed to leverage these novel variant calls to reanalyze the hundreds of thousands of short-read datasets currently available. In this proposal, we plan to develop new computational approaches to identify novel structural variation in the genome.
In Aim 1, we will apply a recurrence approach, utilizing deep neural networks, to analyze long-read sequencing datasets.
In Aim 2, we will develop a tool to derive profiles of structural variants predicted in long reads, which can then be used to identify and genotype structural variant calls in short-read datasets. Together, these approaches will allow researchers to accurately characterize structural variation in both long- and short-read datasets.

Public Health Relevance

Structural variation has been implicated in numerous human diseases, but tens of thousands of variants in the genome are still being overlooked. The proposed research aims to detect novel variation by developing new computational tools to analyze data generated by state-of-the-art sequencing methods. These tools will aid in the discovery of variants associated with human health.

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Predoctoral Individual National Research Service Award (F31)
Project #
1F31HG010569-01
Application #
9755117
Study Section
Special Emphasis Panel (ZRG1)
Program Officer
Cubano, Luis Angel
Project Start
2019-04-01
Project End
2022-03-31
Budget Start
2019-04-01
Budget End
2020-03-31
Support Year
1
Fiscal Year
2019
Total Cost
Indirect Cost
Name
University of Michigan Ann Arbor
Department
Biostatistics & Other Math Sci
Type
Schools of Medicine
DUNS #
073133571
City
Ann Arbor
State
MI
Country
United States
Zip Code
48109