Title: Genome analysis based on the integration of DNA sequence and shape
PI: Rohs, Remo (USC); Co-I: Noble, William Stafford (UW); Co-I: Tullius, Thomas D. (BU)

PROJECT SUMMARY

Current techniques for genome analysis are mainly based on the one-dimensional DNA sequence, comprised of the letters A, C, G, and T. However, proteins recognize DNA as a three-dimensional (3D) object. Nuances in DNA shape at single nucleotide resolution play a crucial role in the binding specificity of transcription factors (TFs), including those involved in embryonic development and human cancer. This project involves the development of a battery of tools for genome analysis, through the integration of information derived from the DNA sequence and the 3D structure of DNA, or "DNA shape". The basis for these novel tools is a high-throughput (HT) method for the prediction of multiple features of local DNA shape at the genomic scale. Data will be made available to the community in the UCSC Genome Browser track format through a web server interface. These tools will enable users to analyze the shape of any number or length of DNA sequences, including whole genomes and the effect of DNA methylation. HT shape predictions will be validated based on X-ray crystallography, NMR spectroscopy, and hydroxyl radical cleavage data. Predictions will be combined with ORChID, an ENCODE project that infers DNA minor groove geometry from hydroxyl radical cleavage experiments. The HT method will be used to study how paralogous TFs select different target sites in vivo despite sharing core-binding motifs or having similar binding properties in vitro. To study this question, we will investigate the effect of flanking sequences on multiple structural features of TF binding sites (TFBSs). The initial focus of this study will be homeodomains and basic helix-loop-helix (bHLH) TFs.
Other protein families will later be included and used to construct a comprehensive TFBS database that provides shape features for binding motifs derived from JASPAR and other motif databases. Structural effects of single nucleotide polymorphisms (SNPs) will also be analyzed. Some SNPs are associated with deleterious functions, whereas others have no apparent effect. The HT shape prediction method will be used to predict the function of SNPs in non-coding regions based on DNA shape. We will correlate quantitative effects of SNPs on DNA structure with expression quantitative trait loci (eQTLs) and genome-wide association study (GWAS) signals, to develop a predictive tool for the functional effect of SNPs. The HT shape prediction approach will be used to design DNA sequences with different AT/GC contents but similar shapes. The relative contributions of sequence and shape to binding will be tested with analytic models including multiple linear regression (MLR) and support vector regression (SVR). For systems in which the integration of sequence and shape proves advantageous, novel motif finding tools will be developed based on an extended alphabet that combines sequence with informative structural features, selected by machine learning and feature selection approaches. Sequence+shape motifs will be tested by motif scanning, compared to sequence-only motifs, and integrated into the MEME Suite. The goal of this sequence-shape integration is to increase the accuracy of finding in vivo TFBSs in the genome.

Public Health Relevance

Protein-DNA recognition is a critical yet poorly understood component of gene regulation. This proposal will connect the fields of DNA sequence and structure analysis, which so far have been developed in parallel but largely disconnected from each other. Integration of the one-dimensional DNA sequence at a genome-wide scale with the three-dimensional DNA structure at atomic resolution will lead to the development of novel genome analysis tools and will advance our understanding of genome function, leading to fundamentally new insights into the mechanisms of gene regulation and its impact on human disease.

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Research Project (R01)
Project #: 1R01GM106056-01A1
Application #: 8632246
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Krasnewich, Donna M
Project Start: 2014-02-01
Project End: 2018-01-31
Budget Start: 2014-02-01
Budget End: 2015-01-31
Support Year: 1
Fiscal Year: 2014
Total Cost: $334,303
Indirect Cost: $108,370

Name: University of Southern California
Department: Biology
Type: Schools of Arts and Sciences
DUNS #: 072933393
City: Los Angeles
State: CA
Country: United States
Zip Code: 90089

Chiu, Tsu-Pei; Yang, Lin; Zhou, Tianyin et al. (2015) GBshape: a genome browser database for DNA shape annotations. Nucleic Acids Res 43:D103-9
Dantas Machado, Ana Carolina; Zhou, Tianyin; Rao, Satyanarayan et al. (2015) Evolving insights on how cytosine methylation affects protein-DNA binding. Brief Funct Genomics 14:61-73
Slattery, Matthew; Zhou, Tianyin; Yang, Lin et al. (2014) Absence of a simple code: how transcription factors read the genome. Trends Biochem Sci 39:381-99
Barozzi, Iros; Simonatto, Marta; Bonifacio, Silvia et al. (2014) Coregulation of transcription factor binding and nucleosome occupancy through DNA features of mammalian enhancers. Mol Cell 54:844-57