Genome-wide association studies (GWAS) have identified thousands of genetic loci associated with complex traits, but determining the causal variants, target genes, and biological mechanisms responsible for each signal has proven challenging. Furthermore, standard GWAS based on single nucleotide polymorphisms (SNPs) have been limited by failure to explain the majority of heritability for most traits studied and an inability to capture multi-allelic variants such as copy number variants (CNVs) and repeats not tagged by SNPs. We focus on the role of genetic variation at repetitive regions of the genome. Specifically, we consider two repeat types: short tandem repeats (STRs), consisting of repeated motifs of 1-6 bp; and variable number tandem repeats (VNTRs), with motifs of 7 bp or longer. We collectively refer to STRs and VNTRs as tandem repeats (TRs). TRs encompass approximately 2 million loci comprising over 3% of the genome. They exhibit rapid mutation rates and are one of the largest sources of genetic variation. Growing evidence suggests that TRs are likely to account for part of the "missing heritability" of GWAS. However, due to bioinformatic and experimental challenges in studying repeats, the genome-wide role of TRs in human traits remains mostly unexplored. We hypothesize that TR variants are key drivers of complex traits. We recently identified thousands of STRs predicted to causally regulate gene expression (termed expression STRs, or eSTRs) and revealed that eSTRs potentially act through a variety of mechanisms including modulating nucleosome positioning and DNA or RNA secondary structure. We additionally identified specific eSTRs likely underlying published GWAS signals for height and schizophrenia. Furthermore, other groups have recently discovered TRs as causal drivers of complex traits including malaria resistance, cancer risk, and bipolar disorder.
While these findings offer intriguing evidence that thousands of TRs contribute to human phenotypes, they have several limitations. These include: the limited range of TRs that can be accurately genotyped from next-generation sequencing (NGS); a lack of sufficiently large NGS datasets for performing association analyses for most traits; and limited understanding of the potential mechanisms by which TRs participate in gene regulation. Here, we leverage (i) our recently developed TR genotyping tools and (ii) our published haplotype panel allowing imputation of TRs into available SNP-array datasets, to systematically evaluate the contribution of TRs to gene regulation and complex traits in humans. We will first generate a comprehensive catalog of TRs associated with gene regulation (Aim 1) and establish a framework for validating TR effects using massively parallel reporter assays and genome editing (Aim 2). We will then impute more than 2 million TRs into large existing GWAS datasets, perform fine-mapping to identify TRs associated with a range of complex traits, and deeply characterize several TRs predicted to be causal drivers of GWAS signals (Aim 3). This project will fill an important gap in our knowledge of the genetic architecture of complex traits.
Genome-wide association studies (GWAS) have identified thousands of genetic loci associated with human traits, but have largely ignored the contribution of complex genetic variants such as repeats. This project aims to develop a framework for incorporating analysis of repetitive regions of the genome into existing GWAS datasets and to apply these tools to characterize the role of repeats in a variety of human phenotypes.