Genome-wide association studies (GWAS) and whole-genome sequencing of complex diseases have revealed a plethora of disease risk variants, most of which lie in noncoding regions of DNA without easily interpretable function. A major functional mechanism of noncoding variants is to alter chromatin accessibility to transcription factors (TFs), thereby influencing gene expression. Predicting the effects of noncoding variants on TF binding and gene expression on a large scale is thus important but remains challenging. Available computational tools for predicting regulatory variants largely rely on TF-binding motif models and/or local chromatin modification features. Here, we aim to develop a novel computational framework to address two major limitations of these methods. First, given that known disease-causal noncoding variants often reside outside of TF-binding motifs, how can we improve the prediction of TF-binding variants outside of motifs? For this, we plan to integrate TF ChIP-seq data with features that are important for TF binding but have not been considered in previous methods, in particular DNA breathing dynamics (AIM1). DNA breathing reflects local transient opening of the DNA double helix due to thermal fluctuations. We have shown that genetic variants can alter DNA breathing dynamics at nearby sites (up to a few hundred base pairs away), thereby affecting TF binding. Using TF ChIP-seq data, we will train models that predict specific TF-binding variants inside or outside TF motifs, incorporating DNA breathing dynamics with other features such as DNA shape and cooperative TF binding.
Second, given that chromatin features show only modest (<2-fold) enrichment of genetic variants associated with complex diseases or traits, how can we improve the prediction of regulatory variants? For this, we will build a computational model that uses allele-specific chromatin accessibility (ASCA; i.e., the two alleles of a heterozygous individual show read imbalance in chromatin accessibility assays) as a functional readout of a regulatory variant (AIM2). We have shown that neuronal ASCA SNPs are highly enriched for variants implicated by schizophrenia (SZ) GWAS. Using neuronal ASCA data, we will train models that predict variants with regulatory effects, taking advantage of our TF-specific classifiers (from AIM1). As a proof of concept, the models will be applied to a large SZ GWAS dataset to predict putative causal regulatory variants. We will validate the effects of the top-ranking predicted regulatory SZ variants on gene expression in a well-powered human induced pluripotent stem cell (hiPSC) sample by combining multiplex CRISPR-based SNP editing and single-cell RNA-seq analysis (AIM3). For SNPs showing the strongest regulatory effects, we will further use CRISPR editing to verify the SNP effect on gene expression and disease-relevant neuronal phenotypes. Accurately predicting TF-affecting noncoding variants will enable better understanding of the large number of noncoding variants implicated in complex disorders and help formulate testable biological hypotheses, ultimately facilitating the development of targeted therapeutics.
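To make the ASCA definition concrete, the following is a minimal sketch of how read imbalance at a single heterozygous SNP might be tested. It is an illustration only, not the project's pipeline: the function name `allelic_imbalance_p`, the 0.5 null proportion, and the example read counts are all assumptions, and a real analysis would also need to handle reference mapping bias, overdispersion, and multiple-testing correction.

```python
from math import exp, lgamma, log


def allelic_imbalance_p(ref_reads: int, alt_reads: int, p_null: float = 0.5) -> float:
    """Exact two-sided binomial p-value for allelic read imbalance.

    Hypothetical helper: under the null, each read at a heterozygous site
    comes from the reference allele with probability p_null (assumed 0.5).
    """
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0  # no reads, no evidence of imbalance

    def pmf(k: int) -> float:
        # Binomial probability computed in log space to avoid overflow
        # at high read depth.
        log_p = (
            lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p_null) + (n - k) * log(1.0 - p_null)
        )
        return exp(log_p)

    observed = pmf(ref_reads)
    # Sum the probability of every outcome at least as unlikely as the
    # observed one (small tolerance guards against rounding noise).
    total = sum(p for p in (pmf(k) for k in range(n + 1)) if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# Illustrative counts only (not real data): 42 reference vs 18 alternate reads.
p_value = allelic_imbalance_p(42, 18)
print(f"two-sided p-value: {p_value:.4g}")
```

A SNP with a small p-value after correction would be a candidate ASCA SNP under these assumptions; balanced counts such as 10 versus 10 give a p-value of 1.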
We will develop novel computational methods and a cost-effective functional validation approach to systematically infer the effect of disease-associated noncoding variants on transcription factor binding and gene expression. Identifying the functional noncoding variants that are associated with disease risk will help illuminate causal molecular mechanisms, facilitating the clinical translation of genetic findings into disease risk prediction and treatment.