Massively high-throughput DNA sequencing is rapidly changing the study of gene regulation in cancer. Large-scale efforts such as the NIH-funded Encyclopedia of DNA Elements (ENCODE) have exploited sequencing to map genome-wide chromatin features in human cancer cell lines using transformative technologies such as Chromatin Immunoprecipitation sequencing (ChIP-seq) and DNase I hypersensitivity sequencing (DHS-seq), and have made great strides toward a comprehensive database of gene regulatory elements in the human genome. The majority of cancer genomics projects focusing on patient samples use DNA methylation profiling, and we and others have shown that integrating these methylation profiles with ENCODE data can enable the identification of biologically relevant epigenomic changes. However, the software tools required are not readily available to most cancer biologists. The reference maps themselves require domain knowledge of gene regulatory features that is beyond the scope of many clinical research groups, and the publicly available datasets are too often the result of heterogeneous and frequently shifting analysis pipelines. We will develop automated tools for unifying the various gene regulatory databases, and develop powerful yet user-friendly methylation workflows using the open-source R/Bioconductor framework and our open-source, web-based Galaxy system. Standard workflows will use the methods we have developed for the TCGA project to import and analyze large numbers of raw methylation data files from either the Illumina Infinium or Bisulfite-seq platforms. We will also allow import of arbitrary sample metadata so that users can perform two-way or multi-way comparisons between cancer subtypes or clinical covariates.
Our workflows will be driven by the most current understanding of the chromatin landscape, which includes using histone modifications and DNase hypersensitivity data to define focal chromatin state, and Hi-C (nuclear conformation) and replication timing data to define nuclear topological domains. Recent work by our lab and others suggests that methylation changes at cis-regulatory elements such as enhancers and insulators are driven primarily by binding of individual transcription factors, and thus reflect direct targeting of genes by specific transcriptional networks. We will use combined ChIP-seq and DNA binding motif analyses available from ENCODE to analyze user methylation data at the level of the individual protein/DNA interaction site. Finally, because the success of this effort will be measured by the degree of adoption within the cancer genomics community, we will engage several large-scale cancer genomics groups to act as beta testers and help us improve our workflows.
Accumulating evidence suggests that cancer is often a disease driven by epigenetic defects. DNA methylation profiling is the most powerful technology for identifying epigenetic defects in patient populations, and the most exciting new discoveries have been made by incorporating data from massive public databases of gene regulation. Many innovative software tools have been developed for this purpose by our laboratories and others, but they are difficult or impossible for non-programmers to use. We will use this grant to extend these tools and develop them into simple, web-based workflows aimed at clinical cancer researchers.