Understanding how genes are turned on and off, and how their precise levels of expression are regulated, is critical to describing the connection between genetic variations and human health. Ongoing community-wide efforts promise to catalog vast amounts of data about genomic states in a variety of conditions, including specific disease states. Such catalogs are expected to help identify key regulators of condition-specific gene expression. However, the ultimate dream of 'reading' the DNA sequence and accurately predicting expression levels in any given cell is likely to remain elusive. We propose to develop advanced computational tools that will help biologists and genome scientists realize this final goal of predicting gene expression levels from sequence.
The first and main goal of this proposal is to build a software system that will help a biologist model how gene expression relates to regulatory sequences. Here, 'model' refers to describing the relationship between sequence and expression in a quantitative language, with a very high level of accuracy. The proposed software system, to be called 'GEM' (Gene Expression Modeling), will consolidate our efforts in this direction over the last five years, and will also incorporate novel biochemical aspects into the model. In a departure from the norm in this field, the proposed software will present to the biologist all models consistent with the collected data, and not just the single most agreeable model. In other words, the scientist will get to see all possible interpretations of their data in terms of gene regulatory interactions in the cell.
The second aim is devoted to presenting the model to the scientist in easily interpretable formats, including a variety of visual representations. The goal here is to connect the typically quantitative and abstract form of the above-mentioned models to the more tangible notions the biologist has about gene regulation mechanisms.
The third aim of this proposal is to help the biologist improve the models created in Aim 1, either by hypothesizing the existence of hitherto unknown regulators of the gene, or by generating additional data. The software system will use rigorous statistical methods and objective criteria to help the biologist decide which experiments should be most productive in advancing their understanding of the gene regulatory system.
All specific aims will be evaluated on four important regulatory systems from insects and mammals.

Public Health Relevance

Understanding how genes are turned on and off, and how their precise levels of expression are regulated, is critical to describing the connection between genetic variations and human health. We propose to build a software system to help a biologist build quantitative descriptions of the molecular interactions that control gene expression. Such descriptions will help us realize the goal of predicting the impact of DNA mutations on gene expression levels, and consequently on an individual's predisposition or response to disease conditions.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Ravichandran, Veerasamy
University of Illinois Urbana-Champaign
Biostatistics & Other Math Sci
Biomed Engr/Col Engr/Engr Sta
United States
Khoueiry, Pierre; Girardot, Charles; Ciglar, Lucia et al. (2017) Uncoupling evolutionary changes in DNA sequence, transcription factor occupancy and enhancer activity. Elife 6:
Samee, Md Abul Hassan; Lydiard-Martin, Tara; Biette, Kelly M et al. (2017) Quantitative Measurement and Thermodynamic Modeling of Fused Enhancers Support a Two-Tiered Mechanism for Interpreting Regulatory DNA. Cell Rep 21:236-245
Yang, Wei; Sinha, Saurabh (2017) A novel method for predicting activity of cis-regulatory modules, based on a diverse training set. Bioinformatics 33:1-7
Peng, Pei-Chen; Sinha, Saurabh (2016) Quantitative modeling of gene expression using DNA shape features of binding sites. Nucleic Acids Res 44:e120
Samee, Md Abul Hassan; Lim, Bomyi; Samper, Núria et al. (2015) A Systematic Ensemble Approach to Thermodynamic Modeling of Gene Expression from Sequence Data. Cell Syst 1:396-407