Analysis of structural variations (SVs) is important because SVs are a major source of genetic variation and account for a wide range of phenotypes in many species. To better understand their contribution to diversity, divergence, and a variety of phenotypic traits, we must address two critical issues in SV analysis: accurate SV characterization and an understanding of SV formation mechanisms. Without accurate SV calls, we may miss the SV events that account for the phenotypes. Without understanding their formation mechanisms, we may be unable to distinguish phenotype-associated SVs from other SVs. As sequencing technology has evolved, many new platforms with longer sequencing reads, such as PacBio, Oxford Nanopore, and 10X Genomics, have appeared and demonstrated great potential. However, the computational algorithms for SV analysis remain inadequate for organisms both with and without a reference genome, and SV mechanism analysis has been based merely on short (mostly <10 bp) breakpoint junction sequences because of technical limitations. As more such data are generated, there is an urgent need to fill this gap by developing more accurate and efficient algorithms for SV discovery and by establishing an innovative way to investigate SV formation mechanisms. The long-term goal of the laboratory is to comprehensively characterize all forms of SVs and understand their functional consequences and formation mechanisms. The goals for the next three years are to develop efficient algorithms for SV analysis in organisms both with and without a reference. We will focus on large insertions, inversions, and complex SVs, which are consistently underrepresented. For organisms with a reference, we will develop a de novo assembly evaluation method to optimize existing tools and/or develop new assembly methods. Given these toolkits, the goals for the following two years are to study SV formation mechanisms based on global genomic architecture.
Our central hypothesis is that hotspots or signatures around an SV locus, inherited from either the paternal or the maternal genome, confer susceptibility to rearrangement formation. We will test this hypothesis by investigating a global, haplotype-resolved picture of SVs using the new sequencing platforms. The research is expected to contribute a suite of robust methods that identify all forms of SVs from long-read sequencing data with high sensitivity and precision. This work is also expected to provide novel insights into SV formation mechanisms. The proposed work is innovative in that the proposed computational approach will greatly improve the sensitivity and precision of SV detection from long sequencing reads, both with and without a reference genome. In addition, the outcomes of this work may vertically advance research on SV mechanisms. The proposed research is significant because it will facilitate the discovery of pathogenic variations and the establishment of associations between genotype and phenotype. It may also popularize the use of new sequencing platforms to address novel scientific questions.
The proposed research is relevant to public health because comprehensively characterizing structural variations (SVs) using the new sequencing data and establishing SV mechanism analysis will enable the discovery of disease-causing mutations. Ultimately, the work is expected to make it possible to target SV mutations and interfere with SV-associated genes or genomic regions to improve health for all of us. The development of advanced analytic methods for genomic sequencing data from new technologies, aimed at understanding the mechanisms of genetic variation including SVs, is relevant to the part of NIH's mission that pertains to seeking fundamental knowledge that will enhance health and reduce illness.