We propose to produce computationally predicted and experimentally improved single-base-pair resolution maps of genome regulatory elements and their higher-level architectures using ENCODE consortium data. To accomplish this goal, we will pursue four Aims:
Aim 1 will discover genome regulatory elements at single-base-pair resolution by simultaneously modeling ChIP-seq data, DNase-seq data, and genome sequence to identify where regulators bind to the genome, along with explanatory DNA sequence motifs;
Aim 2 will use integrative analysis to learn probabilistic models of enhancer grammars that include symbol spacing models;
Aim 3 will develop active learning methods that precisely design synthetic enhancer sequences for constructing Enhancer Grammar Activity Models (EGAMs), which explain the consequences of different forms of enhancer grammar on gene regulation, and will also learn regulatory factors that are associated with unlinked motifs;
Aim 4 will discover regulatory networks that describe how chromatin and gene expression states are established based on regulator activity, and will relate human disease-associated genomic variation to potential disease mechanisms. The results of our Aims will be validated with both experimental and computational studies.

Public Health Relevance

We will develop and use new methods to understand the language of the genome - the words and sentences of symbols that describe how cells function in both health and disease. Because the language is complicated, we will use new experimental methods to write and test thousands of genomic sentences for function in a dish. Our ultimate goal is to improve human health by understanding how disease-related changes in our genome cause things to go wrong.

National Institutes of Health (NIH)
Research Project--Cooperative Agreements (U01)
Study Section: Special Emphasis Panel (ZHG1)
Program Officer: Gilchrist, Daniel A
Massachusetts Institute of Technology
Organized Research Units
United States
Hrvatin, Siniša; Deng, Francis; O'Donnell, Charles W et al. (2014) MARIS: method for analyzing RNA following intracellular sorting. PLoS One 9:e89459
Hashimoto, Tatsunori B; Edwards, Matthew D; Gifford, David K (2014) Universal count correction for high-throughput sequencing. PLoS Comput Biol 10:e1003494
Sherwood, Richard I; Hashimoto, Tatsunori; O'Donnell, Charles W et al. (2014) Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape. Nat Biotechnol 32:171-8
Mahony, Shaun; Edwards, Matthew D; Mazzoni, Esteban O et al. (2014) An integrated model of multiple-condition ChIP-Seq data reveals predeterminants of Cdx2 binding. PLoS Comput Biol 10:e1003501