How trans-acting factors regulate genome-wide gene expression in cancer is poorly understood, motivating an increasing number of ChIP-seq, DNase-seq, and ATAC-seq (collectively referred to as "cistrome") experiments to map genome-wide transcription factor binding sites and chromatin status. Significant biological insights have been gained through the computational analysis of cistrome data, especially when integrated with other published cistrome and gene expression data sets. However, most cancer biologists find computational analysis and integration of cistrome and epigenome data to be the most limiting bottleneck in their cancer gene regulation studies, because they lack the informatics expertise and computational infrastructure needed to handle the extraordinary volume of publicly available data. We have previously developed the Cistrome Analysis Pipeline (AP) and Cistrome Data Browser (DB) to overcome this challenge. The objective of this proposal is to expand the functionality of Cistrome AP and DB to improve the collection, management, analysis, integration, visualization, and dissemination of cistrome and related data types. A flexible and intuitive user experience will empower experimental cancer biologists to create more insightful models of transcriptional and epigenetic gene regulation in cancer research. Specifically, we propose to improve and extend the infrastructure and interface of our existing Cistrome Analysis Pipeline and Cistrome Data Browser by developing informatics technologies that address four critical aspects of cistrome data analysis. First, we will design, develop, and deploy software through a user-friendly interface to improve automated data collection, processing, and annotation. This will enable unpublished and public cistrome data to be jointly analyzed and converted into formats and statistical expressions that can be used for integrative analysis.
Second, we will develop methods that use all available cistrome data to impute TF binding/cell-type combinations that are not represented in public repositories. Third, we will develop systems that integrate gene expression data with cistrome data to elucidate regulatory mechanisms. Fourth, we will develop interactive web-based tools for visualizing hundreds of cistrome samples at multiple resolutions. Finally, we will engage in outreach activities to improve Cistrome functions, user interface, and interoperability with other tools, and to promote the use of Cistrome data in cancer research.

Public Health Relevance

Cancer is essentially a disease of aberrant gene regulation, and the powerful new genomic technologies for studying gene regulation have produced large volumes of data that impose difficult computational challenges on experimental cancer biologists. We propose to develop comprehensive open-source informatics technologies to model cancer gene regulation. These technologies will allow cancer biologists, without programming expertise or informatics resources, to conduct exploratory and integrated analyses, to search and reuse relevant public data, and to interpret results and generate hypotheses on the mechanisms of gene regulation in different cancer systems.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Resource-Related Research Projects--Cooperative Agreements (U24)
Study Section
Special Emphasis Panel (ZCA1)
Program Officer
Li, Jerry
Dana-Farber Cancer Institute
United States