The regulation of transcription is central to the proper functioning of all cells. Identifying the DNA binding sites for all transcription factors (TFs) would greatly facilitate our understanding of regulatory networks and of the variations in gene expression, both normal and in disease states, that accompany genetic differences. New high-throughput technologies are generating data about the DNA binding specificity of transcription factors at a greatly increased rate, but good computational methods are required to maximize the biological information extracted from those data. In the previous funding period we developed new and improved methods for the analysis of three different types of high-throughput specificity data. In this proposal we will expand on those methods in several ways, including methods for analyzing additional types of data and the development of more complex models that are required to adequately represent the specificity of some factors. More complex models are needed for TFs whose specificity is not well represented by position weight matrices (PWMs), which impose the constraint that the positions within the binding site contribute independently to binding. We will develop models for TFs that allow for higher-order interactions, as well as for TFs that can bind in alternative modes and require multiple, independent models to represent them. The improved models will be compared to in vivo location analysis for TFs to better assess which binding sites are indirect or require cooperative binding with other factors. We will also take advantage of greatly increased data to develop improved recognition models that can predict the specificity of TFs based on the protein sequence and aid in the design of new factors with novel specificity. This will be done initially for homeodomain and zinc finger proteins, the two largest families of TFs in eukaryotic genomes and the ones with the most available specificity information.
We will also take advantage of the vast information available for bacterial genomes to develop specificity models for various bacterial TF families. A new experimental method will be employed to more comprehensively assess the non-independent interactions between protein residues and binding site base-pairs, which should lead to further improvements in recognition modeling. We continue to collaborate with experimental biologists, which helps them use our programs and further their research goals, and helps us identify the limitations of current methods and make improvements. We also have a new collaboration that seeks to improve methods for predicting specificity in protein-DNA interactions based on molecular modeling, combining our collaborators' expertise in thermodynamic and structural modeling with our extensive models of TF binding specificity.
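
To make the independence constraint of a PWM concrete, the following is a minimal Python sketch and not the project's own software. All matrix values, the 3-bp site length, and the names PWM, PAIR_TERMS, pwm_score and pairwise_score are hypothetical, chosen only for illustration. It scores a site as a sum of per-position terms, then adds one adjacent-dinucleotide correction as a simple stand-in for a higher-order interaction.

```python
# Illustrative sketch only: every number below is invented, not measured
# for any real transcription factor.

# Hypothetical energy-style PWM for a 3-bp site:
# one dict per position, one score per base (lower = better binding).
PWM = [
    {"A": 0.0, "C": 1.5, "G": 2.0, "T": 1.0},
    {"A": 2.0, "C": 0.0, "G": 1.0, "T": 1.5},
    {"A": 1.0, "C": 2.0, "G": 0.0, "T": 0.5},
]

# Hypothetical higher-order term: an extra contribution that applies only
# when a specific dinucleotide occurs at a specific pair of adjacent positions.
PAIR_TERMS = {(0, "AC"): -0.5}


def pwm_score(site: str) -> float:
    """Additive PWM score: each position contributes independently."""
    if len(site) != len(PWM):
        raise ValueError("site length must match the PWM length")
    return sum(column[base] for column, base in zip(PWM, site))


def pairwise_score(site: str) -> float:
    """PWM score plus corrections for adjacent-position interactions."""
    score = pwm_score(site)
    for i in range(len(site) - 1):
        score += PAIR_TERMS.get((i, site[i:i + 2]), 0.0)
    return score


if __name__ == "__main__":
    # Under the PWM, changing A to T at position 0 costs the same amount
    # regardless of the base at position 1.
    print(pwm_score("TCG") - pwm_score("ACG"))
    print(pwm_score("TAG") - pwm_score("AAG"))
    # With the pairwise term, that cost depends on the neighbouring base.
    print(pairwise_score("TCG") - pairwise_score("ACG"))
    print(pairwise_score("TAG") - pairwise_score("AAG"))
```

In the additive model the effect of a substitution at one position never depends on its neighbours, which is exactly the constraint described above. The pairwise variant relaxes it for adjacent positions only; richer models would need terms for non-adjacent positions or separate models for alternative binding modes.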

Public Health Relevance

Transcription factors control the expression of genes and are essential to the proper functioning of cells. Identifying the DNA sequences that they bind to can lead to a better understanding of the normal regulatory network and how it can be altered in genetic variation and disease. Recent technological advances have greatly increased the data about transcription factor binding sites, but good computer programs are required to maximize the biological information obtained from those experiments. We are developing improved computational methods to extract the most important information from high-throughput experiments with the goal of enhancing our understanding and modeling of normal control of gene expression and its variation. We are also using that information to help in the design of novel transcription factors with desired characteristics.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project (R01)
Study Section
Special Emphasis Panel (ZRG1-GGG-M (50))
Program Officer
Pazin, Michael J
Washington University
Schools of Medicine
Saint Louis
United States
Ruan, Shuxiang; Stormo, Gary D (2018) Comparison of discriminative motif optimization using matrix and DNA shape-based models. BMC Bioinformatics 19:86
Chang, Yiming K; Zuo, Zheng; Stormo, Gary D (2018) Quantitative profiling of BATF family proteins/JUNB/IRF hetero-trimers using Spec-seq. BMC Mol Biol 19:5
Hu, Caizhen; Malik, Vikas; Chang, Yiming Kenny et al. (2017) Coop-Seq Analysis Demonstrates that Sox2 Evokes Latent Specificities in the DNA Recognition by Pax6. J Mol Biol 429:3626-3634
Roy, Basab; Zuo, Zheng; Stormo, Gary D (2017) Quantitative specificity of STAT1 and several variants. Nucleic Acids Res 45:8199-8207
Xiao, Shu; Lu, Jia; Sridhar, Bharat et al. (2017) SMARCAD1 Contributes to the Regulation of Naive Pluripotency by Interacting with Histone Citrullination. Cell Rep 18:3117-3128
Zuo, Zheng; Roy, Basab; Chang, Yiming Kenny et al. (2017) Measuring quantitative effects of methylation on transcription factor-DNA binding affinity. Sci Adv 3:eaao1799
Ruan, Shuxiang; Stormo, Gary D (2017) Inherent limitations of probabilistic models for protein-DNA binding specificity. PLoS Comput Biol 13:e1005638
Ruan, Shuxiang; Swamidass, S Joshua; Stormo, Gary D (2017) BEESEM: estimation of binding energy models using HT-SELEX data. Bioinformatics 33:2288-2295
Chang, Yiming K; Srivastava, Yogesh; Hu, Caizhen et al. (2017) Quantitative profiling of selective Sox/POU pairing on hundreds of sequences in parallel by Coop-seq. Nucleic Acids Res 45:832-845
Stormo, Gary D; Roy, Basab (2016) DNA Structure Helps Predict Protein Binding. Cell Syst 3:216-218

Showing the most recent 10 out of 109 publications