Current techniques for genome analysis are mainly based on the one-dimensional DNA sequence, composed of the letters A, C, G, and T. However, proteins recognize DNA as a three-dimensional (3D) object. Nuances in DNA shape at single-nucleotide resolution play a crucial role in the binding specificity of transcription factors (TFs), including those involved in embryonic development and human cancer. This project involves the development of a battery of tools for genome analysis, through the integration of information derived from the DNA sequence and the 3D structure of DNA, or DNA shape. The basis for these novel tools is a high-throughput (HT) method for the prediction of multiple features of local DNA shape at the genomic scale. Data will be made available to the community in the UCSC Genome Browser track format through a web server interface. These tools will enable users to analyze the shape of DNA sequences of any number or length, including whole genomes, and the effect of DNA methylation. HT shape predictions will be validated based on X-ray crystallography, NMR spectroscopy, and hydroxyl radical cleavage data. Predictions will be combined with ORChID, an ENCODE project that infers DNA minor groove geometry from hydroxyl radical cleavage experiments. The HT method will be used to study how paralogous TFs select different target sites in vivo despite sharing core-binding motifs or having similar binding properties in vitro. To address this question, we will investigate the effect of flanking sequences on multiple structural features of TF binding sites (TFBSs). The initial focus of this study will be homeodomains and basic helix-loop-helix (bHLH) TFs. Other protein families will later be included and used to construct a comprehensive TFBS database that provides shape features for binding motifs derived from JASPAR and other motif databases. Structural effects of single nucleotide polymorphisms (SNPs) will also be analyzed.
Some SNPs are associated with deleterious effects, whereas others have no apparent effect. The HT shape prediction method will be used to predict the function of SNPs in non-coding regions based on DNA shape. We will correlate the quantitative effects of SNPs on DNA structure with expression quantitative trait loci (eQTLs) and genome-wide association study (GWAS) signals, in order to develop a predictive tool for the functional effect of SNPs. The HT shape prediction approach will also be used to design DNA sequences with different AT/GC contents but similar shapes. The relative contributions of sequence and shape to binding will be tested with analytic models including multiple linear regression (MLR) and support vector regression (SVR). For systems in which the integration of sequence and shape proves advantageous, novel motif-finding tools will be developed based on an extended alphabet that combines sequence with informative structural features, selected by machine learning and feature selection approaches. Sequence+shape motifs will be tested by motif scanning, compared to sequence-only motifs, and integrated into the MEME Suite. The goal of this sequence-shape integration is to increase the accuracy of finding in vivo TFBSs in the genome.
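
The comparison of sequence-only and sequence+shape MLR models mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names (`one_hot`, `fit_mlr`, `predict`, `r_squared`, `compare_models`) are hypothetical, the binding sites and affinities are synthetic, and the shape features are random placeholders standing in for real per-position shape predictions.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA sequence as a flat one-hot vector (4 values per position)."""
    vec = np.zeros((len(seq), len(BASES)))
    for i, base in enumerate(seq):
        vec[i, BASES.index(base)] = 1.0
    return vec.ravel()

def fit_mlr(X, y):
    """Least-squares fit of y ~ X plus an intercept; returns the weight vector."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return weights

def predict(X, weights):
    """Apply fitted weights (last entry is the intercept) to a feature matrix."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])
    return X1 @ weights

def r_squared(y, y_hat):
    """Coefficient of determination between observed and predicted values."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def compare_models(n_sites=200, site_len=6, seed=0):
    """Return in-sample R^2 for a sequence-only and a sequence+shape model."""
    rng = np.random.default_rng(seed)
    # Synthetic binding sites (assumption: random hexamers, not real TFBSs).
    seqs = ["".join(rng.choice(list(BASES), size=site_len)) for _ in range(n_sites)]
    X_seq = np.array([one_hot(s) for s in seqs])
    # PLACEHOLDER shape features: random numbers, one per position.
    # Real shape features would be predicted from the sequence itself.
    X_shape = rng.normal(size=(n_sites, site_len))
    # Synthetic "affinity" that depends on both feature sets plus noise.
    y = (X_seq @ rng.normal(size=X_seq.shape[1])
         + X_shape @ rng.normal(size=site_len)
         + rng.normal(scale=0.1, size=n_sites))
    X_both = np.hstack([X_seq, X_shape])
    r2_seq = r_squared(y, predict(X_seq, fit_mlr(X_seq, y)))
    r2_both = r_squared(y, predict(X_both, fit_mlr(X_both, y)))
    return r2_seq, r2_both

if __name__ == "__main__":
    r2_seq, r2_both = compare_models()
    print(f"sequence-only R^2: {r2_seq:.3f}; sequence+shape R^2: {r2_both:.3f}")
```

Because the two models are nested, the in-sample fit of the sequence+shape model can never be worse than that of the sequence-only model. A meaningful test of whether shape adds information would therefore compare the models on held-out data, for example by cross-validation.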

Public Health Relevance

Protein-DNA recognition is a critical yet poorly understood component of gene regulation. This proposal will connect the fields of DNA sequence and structure analysis, which so far have been developed in parallel but largely disconnected from each other. Integration of the one-dimensional DNA sequence at a genome-wide scale with the three-dimensional DNA structure at atomic resolution will lead to the development of novel genome analysis tools and will advance our understanding of genome function, leading to fundamentally new insights into the mechanisms of gene regulation and its impact on human disease.

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 5R01GM106056-03
Application #: 8998963
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Krasnewich, Donna M
Project Start: 2014-02-01
Project End: 2018-01-31
Budget Start: 2016-02-01
Budget End: 2017-01-31
Support Year: 3
Fiscal Year: 2016
Total Cost:
Indirect Cost:

Name: University of Southern California
Department: Biology
Type: Schools of Arts and Sciences
DUNS #: 072933393
City: Los Angeles
State: CA
Country: United States
Zip Code: 90032

Li, Richard Y; Di Felice, Rosa; Rohs, Remo et al. (2018) Quantum annealing versus classical machine learning applied to a simplified computational biology problem. npj Quantum Inf 4:
Rao, Satyanarayan; Chiu, Tsu-Pei; Kribelbauer, Judith F et al. (2018) Systematic prediction of DNA shape changes due to CpG methylation explains epigenetic effects on protein-DNA binding. Epigenetics Chromatin 11:6
Xin, Beibei; Rohs, Remo (2018) Relationship between histone modifications and transcription factor binding is protein family specific. Genome Res :
Wang, Xiaofei; Zhou, Tianyin; Wunderlich, Zeba et al. (2018) Analysis of Genetic Variation Indicates DNA Shape Involvement in Purifying Selection. Mol Biol Evol 35:1958-1967
Azad, Robert N; Zafiropoulos, Dana; Ober, Douglas et al. (2018) Experimental maps of DNA structure at nucleotide resolution distinguish intrinsic from protein-induced DNA deformations. Nucleic Acids Res 46:2636-2647
Chiu, Tsu-Pei; Rao, Satyanarayan; Mann, Richard S et al. (2017) Genome-wide prediction of minor-groove electrostatic potential enables biophysical modeling of protein-DNA binding. Nucleic Acids Res 45:12565-12576
Yang, Lin; Orenstein, Yaron; Jolma, Arttu et al. (2017) Transcription factor family-specific DNA shape readout revealed by quantitative specificity models. Mol Syst Biol 13:910
Li, Jinsen; Sagendorf, Jared M; Chiu, Tsu-Pei et al. (2017) Expanding the repertoire of DNA shape features for genome-scale studies of transcription factor binding. Nucleic Acids Res 45:12877-12887
Ma, Wenxiu; Yang, Lin; Rohs, Remo et al. (2017) DNA sequence+shape kernel enables alignment-free modeling of transcription factor binding. Bioinformatics 33:3003-3010
Sagendorf, Jared M; Berman, Helen M; Rohs, Remo (2017) DNAproDB: an interactive tool for structural analysis of DNA-protein complexes. Nucleic Acids Res 45:W89-W97

Showing the most recent 10 out of 32 publications