Accurate predictions of transcription factor (TF)-DNA interactions across the human genome are critical for deciphering transcriptional regulatory networks in healthy and diseased cells, as well as for understanding the phenotypic effects of polymorphisms in non-coding genomic regions. However, the most widely used model of TF-DNA binding affinity, the position weight matrix (PWM), is known to provide only an approximation of the true sequence specificity of TFs, because it assumes independence among the base pairs in TF binding sites. More complex binding models have been proposed, but their improvement over PWMs was marginal, either because of limitations of the training data (i.e., strong biases, noise, artifacts, or confounding factors) or because the models were not flexible enough to capture complex dependencies in TF binding sites. As a result, current DNA binding models have a limited ability to predict the effects of non-coding genetic variation on TF binding, and they cannot be used to resolve functional differences between closely related TFs with similar DNA binding domains but distinct regulatory roles in the cell. The objective of this application is to overcome these limitations by generating high-quality data that will be used to train flexible statistical models that predict TF-DNA binding affinity with accuracy similar to that of experimental in vitro assays. The central hypothesis, based on preliminary results and previous work, is that both better affinity data and better statistical models are needed in order to predict TF-DNA interactions in human cells with significantly higher accuracy than current models. High-quality binding affinity data for 40 human TFs will be generated in Aim 1 using a unique combination of in vitro assays carefully designed to minimize bias and noise, thus making the data ideal for training complex models.
Novel TF-DNA binding models will be developed in Aim 2 using state-of-the-art statistical methods: support vector regression, nonparametric Bayes modeling, and conditional tensor factorization. The models will be tested experimentally in vitro, and by leveraging in vivo data from the ENCODE project.
In Aim 3, the new binding models will be used in two applications: 1) to predict the quantitative effects of non-coding single nucleotide polymorphisms on TF binding affinities and TF binding levels, and 2) to predict differential in vivo DNA binding of closely related TFs with similar DNA binding domains but distinct regulatory functions in the cell. Such applications are not possible using current models. Overall, we anticipate that the binding affinity models developed in this project will allow for much more accurate predictions of regulatory TF-DNA interactions than is possible using current models. This is significant because it will lead to a better understanding of gene regulatory programs and their misregulation during disease, including the cascade of events that links genetic variation to human disease.
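
As an illustration of the PWM independence assumption discussed in the abstract, the following is a minimal sketch, not the project's method, of how a PWM scores a binding site and why it predicts the same effect for a single-nucleotide variant regardless of the neighboring bases. The count matrix is made up for illustration, and the names build_pwm, score, and variant_effect are hypothetical.

```python
import math

BASES = "ACGT"

# Hypothetical count matrix for a 4-bp site (rows: positions; columns: A, C, G, T).
# These counts are invented for illustration and are not data from this project.
COUNTS = [
    [8, 1, 1, 0],
    [0, 9, 1, 0],
    [1, 0, 9, 0],
    [0, 0, 1, 9],
]


def build_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2-odds weights against a uniform background."""
    pwm = []
    for row in counts:
        total = sum(row) + pseudocount * len(row)
        pwm.append(
            [math.log2((c + pseudocount) / total / background) for c in row]
        )
    return pwm


def score(pwm, site):
    """PWM score of a site: a sum of independent per-position weights."""
    if len(site) != len(pwm):
        raise ValueError("site length must match PWM length")
    return sum(pwm[i][BASES.index(base)] for i, base in enumerate(site))


def variant_effect(pwm, ref_site, pos, alt_base):
    """Score change when the base at `pos` in `ref_site` becomes `alt_base`."""
    alt_site = ref_site[:pos] + alt_base + ref_site[pos + 1:]
    return score(pwm, alt_site) - score(pwm, ref_site)


PWM = build_pwm(COUNTS)

# The same A-to-C substitution at position 0, in two different contexts.
effect_in_context_1 = variant_effect(PWM, "ACGT", 0, "C")
effect_in_context_2 = variant_effect(PWM, "AGTT", 0, "C")
```

Because the score is a sum of per-position terms, the two effects computed above are equal: under a PWM, the predicted impact of a substitution cannot depend on the rest of the site. Models that capture dependencies among positions, such as those proposed in Aim 2, are not bound by this constraint.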

Public Health Relevance

The computational models and the experimental approaches developed in this project will lead to a better understanding of how transcription factors recognize their specific DNA sites across the genome, and how these interactions are disrupted by mutations or polymorphisms in the binding sites. Given that many genetic variations associated with complex human diseases are located in non-coding regions of the genome, our models can be used to accurately predict the effect of such variations on transcription factor-DNA binding, and to prioritize them for further investigation into the genetic causes of human diseases.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
5R01GM117106-04
Application #
9546780
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Ravichandran, Veerasamy
Project Start
2015-09-25
Project End
2020-08-31
Budget Start
2018-09-01
Budget End
2019-08-31
Support Year
4
Fiscal Year
2018
Total Cost
Indirect Cost
Name
Duke University
Department
Biostatistics & Other Math Sci
Type
Schools of Medicine
DUNS #
044387793
City
Durham
State
NC
Country
United States
Zip Code
27705
Shen, Ning; Zhao, Jingkang; Schipper, Joshua L et al. (2018) Divergence in DNA Specificity among Paralogous Transcription Factors Contributes to Their Differential In Vivo Binding. Cell Syst 6:470-483.e8
Afek, A; Tagliafierro, L; Glenn, O C et al. (2018) Toward deciphering the mechanistic role of variations in the Rep1 repeat site in the transcription regulation of SNCA gene. Neurogenetics 19:135-144
Shats, Igor; Deng, Michael; Davidovich, Adam et al. (2017) Expression level is a key determinant of E2F1-mediated cell fate. Cell Death Differ 24:626-637
Zhao, Jingkang; Li, Dongshunyi; Seo, Jungkyun et al. (2017) Quantifying the Impact of Non-coding Variants on Transcription Factor-DNA Binding. Res Comput Mol Biol 10229:336-352
Frank, Christopher L; Manandhar, Dinesh; Gordân, Raluca et al. (2016) HDAC inhibitors cause site-specific chromatin remodeling at PU.1-bound enhancers in K562 cells. Epigenetics Chromatin 9:15