The whole genome sequencing of large cohorts of individuals is quickly becoming a common tool for researchers to investigate the genetic basis of many disease phenotypes. The primary goals are to discover the underlying genetic variation that cause or contribute to these diseases as well as to correctly identify these variants in a diagnostic setting. These differences typicall consist of single base changes (SNPs), but can also encompass larger, more complex chromosomal rearrangements in the form of structural variation (SV) which are much more difficult to detect even with modern sequencing technologies. A number of approaches have been published that have studied this problem, but even the largest scale endeavors have only focused on deletion events and reported a sensitivity of <70%. Complex chromosomal rearrangements are even less well studied. Thus, it is paramount that accurate methods are developed which can detect all types of SVs at high specificity from sequence data. This proposal aims to improve the overall ability of researchers to identify and analyze genetic variation from whole genome sequences. An important, and often overlooked, aspect of SV discovery is the fact that typical paired-end, read depth, and split read approaches will identify different sets of non-overlapping variants at varying degrees of accuracy.
In Aim 1, we will develop a unified SV discovery algorithm that can incorporate all of these different sources of information in a probabilistic fashion. Such a method would be useful for research, in particular with the identification of rare variants, as well as clinical applications which require a great del of accuracy and have thus far been limited to older karyotyping and microarray approaches. This would identify the majority of structural variants, however there are many regions in genomic sequences which are complex in nature, defined as consisting of multiple neighboring or overlapping chromosomal rearrangements that are challenging to resolve with typical SV detection approaches.
In Aim 2, we propose methods to resolve these complex regions and assess their frequency and impact. Furthermore, a crucial step in medical genetics is the comparison of identified genetic mutations to databases of known pathogenic and benign variants. This is currently problematic with SVs, as they have often been originally reported with varying degrees of breakpoint resolution that can hamper the correct assignment of the variant. This issue is compounded further in more complex regions with multiple breakpoints, for which simplistic comparison methods do not work well.
In Aim 3, we will develop and implement a system that describes and utilizes variant profiles to identify whether an individual's sequence data contains a variant of interest. Overall, this project will advance our understanding of the human genome as well as provide tools for use in the general research and clinical communities.

Public Health Relevance

The rearrangement of chromosomal material in the form of structural variation is directly responsible for many disease phenotypes, however our ability to detect and resolve these events from whole genome sequence data is currently limited. We propose a number of strategies for improving the detection and analysis of structural genomic variation between individuals and resolving their underlying structure and function. These approaches will have direct application to the clinical diagnosis of such events and the future of personalized genomics.

National Institute of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Project #
Application #
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Brooks, Lisa
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
University of Michigan Ann Arbor
Biostatistics & Other Math Sci
Schools of Medicine
Ann Arbor
United States
Zip Code
Zhao, Xuefang; Weber, Alexandra M; Mills, Ryan E (2017) A recurrence-based approach for validating structural variation using long-read sequencing technology. Gigascience 6:1-9
Hovelson, Daniel H; Liu, Chia-Jen; Wang, Yugang et al. (2017) Rapid, ultra low coverage copy number profiling of cell-free DNA as a precision oncology screening strategy. Oncotarget 8:89848-89866
Zhao, Xuefang; Emery, Sarah B; Myers, Bridget et al. (2016) Resolving complex structural genomic rearrangements using a randomized approach. Genome Biol 17:126
Chun, Sang Y; Rodriguez, Caitlin M; Todd, Peter K et al. (2016) SPECtre: a spectral coherence--based classifier of actively translated transcripts from ribosome profiling sequence data. BMC Bioinformatics 17:482
Sudmant, Peter H; Rausch, Tobias; Gardner, Eugene J et al. (2015) An integrated map of structural variation in 2,504 human genomes. Nature 526:75-81
1000 Genomes Project Consortium; Auton, Adam; Brooks, Lisa D et al. (2015) A global reference for human genetic variation. Nature 526:68-74
Dayama, Gargi; Emery, Sarah B; Kidd, Jeffrey M et al. (2014) The genomic landscape of polymorphic human nuclear mitochondrial insertions. Nucleic Acids Res 42:12640-9
Brand, Harrison; Pillalamarri, Vamsee; Collins, Ryan L et al. (2014) Cryptic and complex chromosomal aberrations in early-onset neuropsychiatric disorders. Am J Hum Genet 95:454-61
Park, Hansoo; Kim, Dohoon; Kim, Chun-Hyung et al. (2014) Increased genomic integrity of an improved protein-based mouse induced pluripotent stem cell method compared with current viral-induced strategies. Stem Cells Transl Med 3:599-609