Genome-wide studies have demonstrated that trans-acting factors, including transcription factors, chromatin regulators and other chromatin-associated factors, are frequently mutated in cancer, reaffirming that aberrant gene regulation is a key mechanism in oncogenesis. How these trans-acting factors regulate transcription on a genome-wide basis is poorly understood, motivating an ever-increasing number of ChIP-seq and DNase-seq experiments to map genome-wide transcription factor binding (cistrome) and chromatin status (epigenome). Novel and significant biological insights have been gained by integrating ChIP-seq and DNase-seq data with other published ChIP-seq and DNase-seq data sets as well as expression profiles. Most cancer biologists, however, find computational analysis and integration of cistrome and epigenome data to be the major bottleneck of such studies because they lack informatics expertise and infrastructure. The objective of this proposal is to develop informatics technologies that improve the acquisition, analysis, integration and reuse of ChIP-seq and DNase-seq data so that experimental cancer biologists can model transcriptional and epigenetic gene regulation in cancer research. Specifically, we propose to develop informatics technologies to address three critical aspects of epigenome and cistrome data analysis. First, we will implement software to automate data collection, processing and quality control, enabling diverse types of unpublished and public ChIP-seq and DNase-seq data to be analyzed and converted into statistics and formats that can be readily used for integrative analysis. Second, we will develop systems to allow gene expression data to be interpreted with cistrome and epigenome data in order to elucidate regulatory mechanisms. Third, we will develop tools to quickly and accurately identify informative public datasets and to infer combinatorial rules of regulation and interactions.
Finally, we will develop the infrastructure and interface to host the algorithms and tools developed in the first three aims, and provide experimental cancer biologists with a flexible and intuitive user experience. We will design our software to interact easily with complementary software systems and databases. The software developed in this proposal will be freely available as open source, and we will work with our collaborators and users to improve its functions and user interface.

Public Health Relevance

Decades of research have shown that cancer is essentially a disease of aberrant gene regulation. Although there are powerful new genomic technologies to study gene regulation, the resulting high-throughput data create significant computational challenges for experimental cancer biologists. This project will develop comprehensive informatics technologies, including algorithms, a database, and computing infrastructure, to model gene regulation in mammalian systems. The technologies we propose will allow cancer biologists, without programming expertise or informatics resources, to conduct exploratory and integrated analyses, search and reuse relevant public data, interpret results, and generate hypotheses on the mechanism of gene regulation in different cancer systems. The research team has an excellent track record in both computational algorithm development and innovative cancer research, so the project is expected to generate a valuable resource to accelerate many cancer gene regulation studies.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Research Project--Cooperative Agreements (U01)
Study Section: Special Emphasis Panel (ZCA1-SRLB-4 (O1))
Program Officer: Li, Jerry
Dana-Farber Cancer Institute
United States