The regulation of transcription is central to the proper functioning of all cells. Identifying the DNA binding sites for all transcription factors (TFs) would greatly facilitate our understanding of regulatory networks and of the variations in gene expression, both normal and in disease states, that accompany genetic differences. New high-throughput technologies are generating data about the DNA binding specificity of transcription factors at a greatly increased rate, but good computational methods are required to maximize the biological information extracted from those data. In the previous funding period we developed new and improved methods for the analysis of three different types of high-throughput specificity data. In this proposal we will expand on those methods in several ways, including methods for analyzing additional types of data and the development of more complex models that are required to adequately represent the specificity of some factors. More complex models are needed for TFs whose specificity is not well represented by position weight matrices (PWMs), which impose the constraint that the positions within the binding site contribute independently to binding. We will develop models for TFs that allow for higher-order interactions, as well as for TFs that can bind in alternative modes and require multiple, independent models to represent them. The improved models will be compared to in vivo location analysis for TFs to better assess which binding sites are indirect or require cooperative binding with other factors. We will also take advantage of the greatly increased volume of data to develop improved recognition models that can predict the specificity of TFs based on the protein sequence and aid in the design of new factors with novel specificity. This will be done initially for homeodomain and zinc finger proteins, the two largest families of TFs in eukaryotic genomes and the ones with the most available specificity information.
We will also take advantage of the vast information available for bacterial genomes to develop specificity models for various bacterial TF families. A new experimental method will be employed to more comprehensively assess the non-independent interactions between protein residues and binding site base-pairs, which should lead to further improvements in recognition modeling. We will continue to collaborate with experimental biologists; this helps them use our programs to further their research goals, and it helps us identify the limitations of the current methods and make improvements. We also have a new collaboration that seeks to improve upon methods for predicting specificity in protein-DNA interactions based on molecular modeling, combining our collaborators' expertise in thermodynamic and structural modeling with our extensive models of TF binding specificity.
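The independence constraint that the abstract attributes to PWMs can be illustrated with a minimal sketch. It is not the project's software: the function names, the toy counts, the uniform background, the pseudocount, and the `pair_terms` correction are all hypothetical and chosen only to show how an additive score differs from one that includes a single interaction term between two positions.

```python
import math

BASES = "ACGT"

def log_odds_pwm(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2-odds weights (uniform background assumed)."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + pseudocount * len(BASES)
        pwm.append({b: math.log2((column[b] + pseudocount) / total / background)
                    for b in BASES})
    return pwm

def score_site(pwm, site):
    """Additive PWM score: every position contributes independently of the others."""
    if len(site) != len(pwm):
        raise ValueError("site length must match PWM length")
    return sum(column[base] for column, base in zip(pwm, site))

def score_with_pairs(pwm, pair_terms, site):
    """PWM score plus corrections for specific base pairs at two positions.

    pair_terms maps (i, j, base_i, base_j) to an extra weight; this is an
    illustrative stand-in for a higher-order model, not a published one.
    """
    score = score_site(pwm, site)
    for (i, j, base_i, base_j), weight in pair_terms.items():
        if site[i] == base_i and site[j] == base_j:
            score += weight
    return score

# Made-up counts for a 3-bp site, for illustration only.
toy_counts = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 1, "C": 1, "G": 7, "T": 1},
]
toy_pwm = log_odds_pwm(toy_counts)

# Hypothetical interaction: A at position 0 together with C at position 1.
toy_pairs = {(0, 1, "A", "C"): 0.5}
```

Under the additive model, changing the base at one position shifts the score by the same amount regardless of the bases elsewhere; the pair term breaks that property only for the specified combination, which is the kind of dependence a plain PWM cannot express.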

Public Health Relevance

Transcription factors control the expression of genes and are essential to the proper functioning of cells. Identifying the DNA sequences that they bind to can lead to a better understanding of the normal regulatory network and how it can be altered in genetic variation and disease. Recent technological advances have greatly increased the data about transcription factor binding sites, but good computer programs are required to maximize the biological information obtained from those experiments. We are developing improved computational methods to extract the most important information from high-throughput experiments with the goal of enhancing our understanding and modeling of normal control of gene expression and its variation. We are also using that information to help in the design of novel transcription factors with desired characteristics.

National Institutes of Health (NIH)
Research Project (R01)
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Good, Peter J
Washington University
Schools of Medicine
Saint Louis
United States
Zuo, Zheng; Stormo, Gary D (2014) High-resolution specificity from DNA sequencing highlights alternative modes of Lac repressor binding. Genetics 198:1329-43
Gupta, Ankit; Christensen, Ryan G; Bell, Heather A et al. (2014) An improved predictive recognition model for Cys(2)-His(2) zinc finger proteins. Nucleic Acids Res 42:4800-12
Visvikis, Orane; Ihuegbu, Nnamdi; Labed, Sid A et al. (2014) Innate host defense requires TFEB-mediated transcription of cytoprotective and antimicrobial genes. Immunity 40:896-909
Patel, Ronak Y; Stormo, Gary D (2014) Discriminative motif optimization based on perceptron training. Bioinformatics 30:941-8
Zhu, Cong; Gupta, Ankit; Hall, Victoria L et al. (2013) Using defined finger-finger interfaces as units of assembly for constructing zinc-finger nucleases. Nucleic Acids Res 41:2455-65
Lin, Huawen; Miller, Michelle L; Granas, David M et al. (2013) Whole genome sequencing identifies a deletion in protein phosphatase 2A that affects its stability and localization in Chlamydomonas reinhardtii. PLoS Genet 9:e1003841
Enuameh, Metewo Selase; Asriyan, Yuna; Richards, Adam et al. (2013) Global analysis of Drosophila Cys(2)-His(2) zinc finger proteins reveals a multitude of novel recognition motifs and binding determinants. Genome Res 23:928-40
Spivak, Aaron T; Stormo, Gary D (2012) ScerTF: a comprehensive database of benchmarked position weight matrices for Saccharomyces species. Nucleic Acids Res 40:D162-8
Ihuegbu, Nnamdi E; Stormo, Gary D; Buhler, Jeremy (2012) Fast, sensitive discovery of conserved genome-wide motifs. J Comput Biol 19:139-47
Gupta, Ankit; Christensen, Ryan G; Rayla, Amy L et al. (2012) An optimized two-finger archive for ZFN-mediated gene targeting. Nat Methods 9:588-90

Showing the most recent 10 out of 86 publications