The whole genome sequencing of large cohorts of individuals is quickly becoming a common tool for researchers to investigate the genetic basis of many disease phenotypes. The primary goals are to discover the underlying genetic variation that cause or contribute to these diseases as well as to correctly identify these variants in a diagnostic setting. These differences typicall consist of single base changes (SNPs), but can also encompass larger, more complex chromosomal rearrangements in the form of structural variation (SV) which are much more difficult to detect even with modern sequencing technologies. A number of approaches have been published that have studied this problem, but even the largest scale endeavors have only focused on deletion events and reported a sensitivity of <70%. Complex chromosomal rearrangements are even less well studied. Thus, it is paramount that accurate methods are developed which can detect all types of SVs at high specificity from sequence data. This proposal aims to improve the overall ability of researchers to identify and analyze genetic variation from whole genome sequences. An important, and often overlooked, aspect of SV discovery is the fact that typical paired-end, read depth, and split read approaches will identify different sets of non-overlapping variants at varying degrees of accuracy.
In Aim 1, we will develop a unified SV discovery algorithm that can incorporate all of these different sources of information in a probabilistic fashion. Such a method would be useful for research, in particular with the identification of rare variants, as well as clinical applications which require a great del of accuracy and have thus far been limited to older karyotyping and microarray approaches. This would identify the majority of structural variants, however there are many regions in genomic sequences which are complex in nature, defined as consisting of multiple neighboring or overlapping chromosomal rearrangements that are challenging to resolve with typical SV detection approaches.
In Aim 2, we propose methods to resolve these complex regions and assess their frequency and impact. Furthermore, a crucial step in medical genetics is the comparison of identified genetic mutations to databases of known pathogenic and benign variants. This is currently problematic with SVs, as they have often been originally reported with varying degrees of breakpoint resolution that can hamper the correct assignment of the variant. This issue is compounded further in more complex regions with multiple breakpoints, for which simplistic comparison methods do not work well.
In Aim 3, we will develop and implement a system that describes and utilizes variant profiles to identify whether an individual's sequence data contains a variant of interest. Overall, this project will advance our understanding of the human genome as well as provide tools for use in the general research and clinical communities.
The rearrangement of chromosomal material in the form of structural variation is directly responsible for many disease phenotypes, however our ability to detect and resolve these events from whole genome sequence data is currently limited. We propose a number of strategies for improving the detection and analysis of structural genomic variation between individuals and resolving their underlying structure and function. These approaches will have direct application to the clinical diagnosis of such events and the future of personalized genomics.