A major result from recent genome-wide association studies (GWAS) is that the majority of genetic variants driving common human diseases lie in regulatory, rather than protein-coding, regions. Massive efforts to map epigenomic features, such as the localization of histone modifications (HMs) and transcription factors (TFs), have paved the way toward understanding the regulatory genome. However, dissecting the impact of an individual non-coding variant remains an unsolved challenge. A variety of computational methods have been proposed, such as quantitative trait loci (QTL) studies and machine learning techniques, but these methods still do not provide conclusive information about the causality of any specific non-coding mutation and lack gold-standard experimental results for evaluation. Several techniques are used to experimentally test the impact of individual regulatory variants. For example, massively parallel reporter assays (MPRA) synthesize thousands of oligonucleotides encoding mutated versions of putative regulatory elements, which are placed in plasmids upstream of reporter genes. A major limitation is that the tested sequences are outside of their endogenous chromosomal locus, and hence the results are not necessarily physiologically relevant. CRISPR enables targeted editing of genomic DNA and is widely used, but studies of individual point mutations have been conducted primarily on a small scale and are usually limited to a handful of variants or to a single gene. The major throughput challenge in studying a specific variant using genome editing is in tying genotype to phenotype. Individual mutations are introduced with low efficiency, and thus the genotype or phenotype of interest must be enriched before assessing the impact of a mutation on a phenotype such as gene expression. Current enrichment methods either disrupt the physiological context or are low throughput.
Recent efforts overcame these challenges using pooled editing to analyze thousands of mutations simultaneously, but were limited to variants in protein-coding regions. This proposal aims to develop a novel technique that combines multiplexed genome editing of putative regulatory variants with chromatin immunoprecipitation sequencing (ChIP-seq) to simultaneously measure the impact of hundreds of non-coding variants on regulatory potential in their native genomic context. The key insight of the proposed approach is that mutations impacting epigenomic features can be measured both in genomic DNA and in phenotypic readouts such as ChIP-seq of TFs or HMs, avoiding the need for a selection step to connect genotypes with phenotypes. Aim 1 develops the pooled editing technique on a pilot set of previously validated regulatory variants. Aim 2 scales this approach to interrogate thousands of mutations at once. Aim 3 integrates experimental predictions with state-of-the-art machine learning methods to evaluate and optimize computational methods for regulatory variant effect prediction.
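As an illustration of the key insight above, one plausible readout compares how often an edited allele appears in ChIP-seq reads versus genomic DNA reads from the same pooled cells. The sketch below is an assumption for illustration only, not the proposal's stated analysis: the `AlleleCounts` type, the function name, the pseudocount of 0.5, and the example read counts are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class AlleleCounts:
    """Read counts overlapping one variant (hypothetical container)."""
    ref: int  # reads carrying the reference allele
    alt: int  # reads carrying the edited (alternate) allele


def log2_allelic_odds_ratio(chip: AlleleCounts,
                            gdna: AlleleCounts,
                            pseudocount: float = 0.5) -> float:
    """Log2 ratio of alt:ref odds in ChIP-seq reads to alt:ref odds in gDNA.

    The gDNA odds reflect how often the edit was introduced in the pool;
    the ChIP odds reflect how often the edited allele is found on bound or
    modified chromatin. A negative value suggests the edit reduces the
    TF or HM signal, a positive value suggests it increases it.
    The pseudocount (an assumed value) avoids division by zero.
    """
    chip_odds = (chip.alt + pseudocount) / (chip.ref + pseudocount)
    gdna_odds = (gdna.alt + pseudocount) / (gdna.ref + pseudocount)
    return math.log2(chip_odds / gdna_odds)


# Made-up counts: the edit is present in 20% of gDNA reads
# but only 5% of ChIP-seq reads at the same site.
gdna_counts = AlleleCounts(ref=800, alt=200)
chip_counts = AlleleCounts(ref=950, alt=50)
effect = log2_allelic_odds_ratio(chip_counts, gdna_counts)
print(f"log2 allelic odds ratio: {effect:.2f}")
```

Because both measurements come from the same pool, no selection step is needed to link an edit to its effect; a real analysis would also need replicate handling and a significance test, which this sketch omits.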
Recent studies have demonstrated that the majority of genetic changes in the population contributing to common human diseases, such as schizophrenia, heart disease, and diabetes, lie in regions of the genome that do not code for proteins, but rather regulate the expression of genes. Despite massive efforts to map regulatory regions across dozens of human cell types, it is still difficult to predict the effect of an individual non-coding mutation. This project develops a high-throughput genome editing technique to simultaneously measure the impact of hundreds of non-coding mutations on regulatory potential in their native genomic context, with the ultimate goal of interpreting genetic changes leading to human disease.