Genome-wide association studies (GWAS) have revealed thousands of loci associated with hundreds of complex human traits and diseases, but the biological mechanisms underlying most of these associations remain unclear. The majority of associated variants lie in noncoding regions and are presumed to influence the trait by regulating gene expression. Regulatory variants are often close to their target genes, but they may also be located up to hundreds of kilobases away in distal regulatory elements. Long-range haplotype phasing is therefore important for studying the effects of distal regulatory variants on genes and their subsequent influence on traits. Haplotype information is most commonly obtained through family data or the statistical phasing of genotypes from arrays or short-read whole-genome sequencing. New long-read sequencing technologies, however, determine phase directly by sequencing DNA fragments of 10 kilobases or longer. Statistical phasing methods can be applied to variants called from long reads to infer even longer haplotypes, but existing methods for phasing variants from long reads do not take advantage of the information available in large external reference panels, which can improve phasing for modestly sized samples. Long-read sequencing also improves detection of structural variants, including copy number variants (CNVs), which have been implicated in numerous diseases. The functional consequences of CNVs have nonetheless been understudied compared to those of single nucleotide variants (SNVs), because CNVs are absent from SNV genotyping arrays and are difficult to call from short-read sequence data. Long-read sequencing therefore enables a more comprehensive study of the effects of CNVs on gene expression and individual-level traits. The goal of this project is to develop statistical methods for the analysis of long-read sequence data.
In Specific Aim 1, we will extend existing methods for the statistical phasing of variants from genotype arrays or short reads to obtain long-range phasing of variants from long reads.
In Specific Aim 2, we will develop a framework that integrates phased genetic data with molecular profiles including gene expression and chromatin accessibility to study the regulatory effects of CNVs. We expect this research project to lead to improved methods for the analysis of long-read sequence data that can be used by the wider genetics community.
Many genetic variants influence human traits and diseases by regulating gene expression levels, but sometimes these variants are located far away from the genes that they influence. This can make it difficult for researchers to understand how a variant influences the trait or disease. The proposed project aims to provide researchers with methods to analyze sequencing data generated using cutting-edge technology, so that they can study how variants, including larger structural variants, interact with one another over long distances to influence gene expression and human health.