Title: Genome analysis based on the integration of DNA sequence and shape
PI: Rohs, Remo (USC); Co-I: Noble, William Stafford (UW); Co-I: Tullius, Thomas D. (BU)

PROJECT SUMMARY

Current techniques for genome analysis are mainly based on the one-dimensional DNA sequence, composed of the letters A, C, G, and T. However, proteins recognize DNA as a three-dimensional (3D) object. Nuances in DNA shape at single nucleotide resolution play a crucial role in the binding specificity of transcription factors (TFs), including those involved in embryonic development and human cancer. This project involves the development of a battery of tools for genome analysis, through the integration of information derived from the DNA sequence and the 3D structure of DNA, or "DNA shape". The basis for these novel tools is a high-throughput (HT) method for the prediction of multiple features of local DNA shape at the genomic scale. Data will be made available to the community in the UCSC Genome Browser track format through a web server interface. These tools will enable users to analyze the shape of any number or length of DNA sequences, including whole genomes and the effect of DNA methylation. HT shape predictions will be validated based on X-ray crystallography, NMR spectroscopy, and hydroxyl radical cleavage data. Predictions will be combined with ORChID, an ENCODE project that infers DNA minor groove geometry from hydroxyl radical cleavage experiments. The HT method will be used to study how paralogous TFs select different target sites in vivo despite sharing core-binding motifs or having similar binding properties in vitro. To study this question, we will investigate the effect of flanking sequences on multiple structural features of TF binding sites (TFBSs). The initial focus of this study will be homeodomains and basic helix-loop-helix (bHLH) TFs.
Other protein families will later be included and used to construct a comprehensive TFBS database that provides shape features for binding motifs derived from JASPAR and other motif databases. Structural effects of single nucleotide polymorphisms (SNPs) will also be analyzed. Some SNPs are associated with deleterious effects, whereas others have no apparent effect. The HT shape prediction method will be used to predict the function of SNPs in non-coding regions based on DNA shape. We will correlate quantitative effects of SNPs on DNA structure with expression quantitative trait loci (eQTLs) and genome-wide association study (GWAS) signals to develop a predictive tool for the functional effect of SNPs. The HT shape prediction approach will be used to design DNA sequences with different AT/GC contents but similar shapes. The relative contributions of sequence and shape to binding will be tested with analytic models including multiple linear regression (MLR) and support vector regression (SVR). For systems in which the integration of sequence and shape proves advantageous, novel motif-finding tools will be developed based on an extended alphabet that combines sequence with informative structural features, selected by machine learning and feature selection approaches. Sequence+shape motifs will be tested by motif scanning, compared with sequence-only motifs, and integrated into the MEME Suite. The goal of this sequence-shape integration is to increase the accuracy of finding in vivo TFBSs in the genome.
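
As an illustration of what combining sequence with structural features can look like as model input, the following minimal Python sketch concatenates a one-hot encoding of a DNA sequence with per-nucleotide shape values. It is not the project's method: the function names, the feature name `minor_groove_width`, and the numeric values are hypothetical placeholders, and the shape values are assumed to be supplied by a separate prediction step that is not shown here.

```python
from typing import Dict, List, Sequence

BASES = "ACGT"


def one_hot(seq: str) -> List[float]:
    """Encode each nucleotide as four 0/1 indicators in A, C, G, T order."""
    features: List[float] = []
    for base in seq.upper():
        if base not in BASES:
            raise ValueError(f"unexpected nucleotide: {base!r}")
        features.extend(1.0 if base == b else 0.0 for b in BASES)
    return features


def sequence_shape_features(
    seq: str, shape: Dict[str, Sequence[float]]
) -> List[float]:
    """Append per-position shape values to the one-hot sequence encoding.

    `shape` maps a feature name to one value per nucleotide. The values are
    provided by the caller (assumption); nothing is predicted here.
    """
    features = one_hot(seq)
    for name in sorted(shape):  # fixed order keeps vectors comparable
        values = shape[name]
        if len(values) != len(seq):
            raise ValueError(f"{name}: expected {len(seq)} values")
        features.extend(float(v) for v in values)
    return features


# Placeholder shape values for illustration only, not real predictions.
example = sequence_shape_features(
    "ACGT", {"minor_groove_width": [5.0, 4.5, 4.0, 4.5]}
)
```

A vector built this way could be passed to a regression model such as MLR or SVR alongside a sequence-only vector from `one_hot`, so that the two feature sets can be compared on the same binding data.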

Public Health Relevance

Protein-DNA recognition is a critical yet poorly understood component of gene regulation. This proposal will connect the fields of DNA sequence and structure analysis, which so far have been developed in parallel but largely disconnected from each other. Integration of the one-dimensional DNA sequence at a genome-wide scale with the three-dimensional DNA structure at atomic resolution will lead to the development of novel genome analysis tools and will advance our understanding of genome function, leading to fundamentally new insights into the mechanisms of gene regulation and its impact on human disease.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Krasnewich, Donna M
University of Southern California
Schools of Arts and Sciences
Los Angeles
United States
Li, Richard Y; Di Felice, Rosa; Rohs, Remo et al. (2018) Quantum annealing versus classical machine learning applied to a simplified computational biology problem. npj Quantum Inf 4:
Rao, Satyanarayan; Chiu, Tsu-Pei; Kribelbauer, Judith F et al. (2018) Systematic prediction of DNA shape changes due to CpG methylation explains epigenetic effects on protein-DNA binding. Epigenetics Chromatin 11:6
Xin, Beibei; Rohs, Remo (2018) Relationship between histone modifications and transcription factor binding is protein family specific. Genome Res :
Wang, Xiaofei; Zhou, Tianyin; Wunderlich, Zeba et al. (2018) Analysis of Genetic Variation Indicates DNA Shape Involvement in Purifying Selection. Mol Biol Evol 35:1958-1967
Azad, Robert N; Zafiropoulos, Dana; Ober, Douglas et al. (2018) Experimental maps of DNA structure at nucleotide resolution distinguish intrinsic from protein-induced DNA deformations. Nucleic Acids Res 46:2636-2647
Li, Jinsen; Sagendorf, Jared M; Chiu, Tsu-Pei et al. (2017) Expanding the repertoire of DNA shape features for genome-scale studies of transcription factor binding. Nucleic Acids Res 45:12877-12887
Ma, Wenxiu; Yang, Lin; Rohs, Remo et al. (2017) DNA sequence+shape kernel enables alignment-free modeling of transcription factor binding. Bioinformatics 33:3003-3010
Sagendorf, Jared M; Berman, Helen M; Rohs, Remo (2017) DNAproDB: an interactive tool for structural analysis of DNA-protein complexes. Nucleic Acids Res 45:W89-W97
Tangprasertchai, Narin S; Di Felice, Rosa; Zhang, Xiaojun et al. (2017) CRISPR-Cas9 Mediated DNA Unwinding Detected Using Site-Directed Spin Labeling. ACS Chem Biol 12:1489-1493
Li, Jun; Dantas Machado, Ana Carolina; Guo, Ming et al. (2017) Structure of the Forkhead Domain of FOXA2 Bound to a Complete DNA Consensus Site. Biochemistry 56:3745-3753

Showing the most recent 10 out of 32 publications