Title: Representing structural haplotypes and complex genetic variation in pan-genome graphs.
PROJECT SUMMARY
A pan-genome graph (PGG) reference must faithfully reflect structural haplotypes that differ in copy number, order, and orientation, which are currently poorly represented in a linear reference sequence. This effort focuses on the most copy-number-variable and complex regions, including segmental duplications (SDs), inversions, short tandem repeats/variable-number tandem repeats (copy-number-variable repeats, CNVRs), and combinations thereof, which are frequently excluded or collapsed in reference genomes. The overarching goal of this project is to develop the tool infrastructure enabling the construction of whole-chromosome reference haplotypes that include all of these difficult classes of sequence. There are four specific aims. First, we will develop methods to construct PGGs from haplotype-phased de novo assemblies, ensuring that the graph reflects both copy number variation and repeat structure, including CNVRs and SDs. Second, we will develop software that expands SD assembly methods to facilitate the curation of SD loci in PGGs. We will use SD assembly to detect variants specific to individual copies of a duplication, called paralog-specific variants (PSVs), and provide software to reconstruct local haplotype paths through the PGG that describe the different copies. Third, we will design novel methods to exploit single-cell template strand DNA sequencing data (Strand-seq) mapped to PGGs in order to thread chromosome-length structural haplotypes through the graph. Our software tool will thus allow the physical resolution of haplotypes comprising the full spectrum of structural variation, including inversions and inverted duplications. By virtue of the PSVs, the structural haplotypes will also embed sequence-resolved SDs.
Fourth, we will develop a scalable open-source software framework to systematically assess how the inclusion of single-nucleotide variants, short indels, and structural variant classes in the PGG affects variant detection with short-read data. This will enable the optimization of the complexity encoded in the PGG for short-read variant detection. It will additionally provide a comprehensive view of polymorphic and fixed k-mers in human populations. We will develop tools to detect allele-specific k-mers and demonstrate how this enables the rapid genotyping of variants in the PGG based on the k-mer composition of a short-read dataset. Once the framework for enhanced genome representation is established, we will focus on improving efficiency, scalability, and computational ease to cater to the needs of a broad range of users in genetics and genome science. This proposal will ensure that the most complex regions of the human genome are encoded into the PGG and that underlying genetic variation is ultimately assessed for association with disease.
Advancing reference genome representations to comprehensively reflect the complement of genetic sequences found in humans is essential for mitigating current reference biases and for including a more complete set of variation in future disease studies. While this promises to be especially beneficial for analyzing difficult-to-characterize classes of variation, faithfully representing such variation in graphs has, paradoxically, received little attention so far, and corresponding tools are lacking. In this project, we will develop tools to construct such graphs, to thread chromosome-length reference haplotypes through them, and to leverage them for rapid variant detection from short-read sequencing data, enabling their immediate application in large-scale disease studies.