The NIH Roadmap Epigenomics Program has produced reference epigenomic maps derived from a variety of human primary cells and tissues, including pluripotent cell types and in vitro differentiated forms, highly purified primary cells, and a range of fetal and adult tissues. The goal of the proposed project is to develop, validate, and apply unsupervised machine learning methods to the joint analysis of these epigenomic maps along with (1) data generated by the NIH ENCODE Consortium, (2) a variety of publicly available data sets that characterize the three-dimensional structure of DNA in the nucleus, and (3) information about evolutionary conservation, represented by cross-species DNA alignments.
The first aim of the project will use data imputation methods to carry out virtual functional genomics experiments. The proposed method is based on techniques developed in the context of recommender systems, but is extended to model dependencies along the genomic axis. By simultaneously analyzing the pattern of biochemical activity across a range of cell types and assay types, the proposed imputation method will accurately predict the results of an assay that has not yet been carried out, such as ChIP-seq for a particular histone modification in a particular cell type. We will systematically apply this method to Roadmap Epigenomics and ENCODE data, filling in missing experiments in the matrix of cell types and assay types. The remaining three specific aims extend and apply our existing system for semi-automated genome annotation, Segway, which integrates a wide variety of functional genomics data into a human-interpretable labeling of genomic elements. These analyses will be performed on real data as well as on the virtual experiments from Aim 1. We propose a novel, graph-based regularization scheme and show how this approach allows Segway to perform integrated analysis of data across cell types and to incorporate 3D genome architecture information from assays such as Hi-C. We also propose a post-processing method that exploits patterns of evolutionary conservation to identify functionally important labels in the resulting annotations. The primary deliverables will include novel software for imputation and annotation, as well as publicly available sets of virtual experiments and genome annotations.
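The imputation idea in Aim 1 can be illustrated with a small sketch. This is not the proposed method, only a generic recommender-style low-rank factorization under stated assumptions: the data are arranged as a tensor of cell types by assay types by genomic positions, whole experiments (cell type, assay type pairs) are either observed or missing, and dependencies along the genomic axis are approximated by a simple penalty on differences between adjacent positions. The function name `impute_experiments`, all parameter names, and the synthetic data are hypothetical.

```python
import numpy as np

def impute_experiments(data, observed, rank=4, steps=300, lr=0.05,
                       l2=1e-3, smooth=1e-2, seed=0):
    """Fill in missing (cell type, assay) tracks by low-rank factorization.

    data:     float array, shape (cells, assays, positions); values at
              unobserved experiments are ignored (they may be NaN).
    observed: bool array, shape (cells, assays); True if the experiment exists.
    Returns the full predicted tensor and the objective value at each step.
    """
    rng = np.random.default_rng(seed)
    n_cells, n_assays, n_pos = data.shape
    C = rng.normal(scale=0.3, size=(n_cells, rank))   # cell-type factors
    A = rng.normal(scale=0.3, size=(n_assays, rank))  # assay-type factors
    G = rng.normal(scale=0.3, size=(n_pos, rank))     # genomic-position factors

    mask = observed[:, :, None]
    target = np.where(mask, data, 0.0)   # zero out missing tracks (and NaNs)
    n_entries = observed.sum() * n_pos
    history = []

    for _ in range(steps):
        pred = np.einsum("ck,ak,pk->cap", C, A, G)
        resid = np.where(mask, pred - target, 0.0)    # error on observed tracks only
        d = np.diff(G, axis=0)                        # adjacent-position differences
        history.append(
            0.5 * (resid ** 2).sum() / n_entries
            + 0.5 * l2 * ((C ** 2).sum() + (A ** 2).sum() + (G ** 2).sum())
            + 0.5 * smooth * (d ** 2).sum()
        )
        # Gradients of the objective with respect to each factor matrix.
        gC = np.einsum("cap,ak,pk->ck", resid, A, G) / n_entries + l2 * C
        gA = np.einsum("cap,ck,pk->ak", resid, C, G) / n_entries + l2 * A
        gG = np.einsum("cap,ck,ak->pk", resid, C, A) / n_entries + l2 * G
        gG[:-1] -= smooth * d   # smoothness penalty: assumed stand-in for
        gG[1:] += smooth * d    # dependencies along the genomic axis
        C -= lr * gC
        A -= lr * gA
        G -= lr * gG

    return np.einsum("ck,ak,pk->cap", C, A, G), history

# Synthetic example: 6 cell types, 5 assays, 40 positions, one missing experiment.
demo_rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0 * np.pi, 40)
true_G = np.stack([np.sin(x), np.cos(x)], axis=1)
true_C = demo_rng.normal(size=(6, 2))
true_A = demo_rng.normal(size=(5, 2))
signal = np.einsum("ck,ak,pk->cap", true_C, true_A, true_G)

observed = np.ones((6, 5), dtype=bool)
observed[2, 3] = False            # the "virtual experiment" to be imputed
signal[2, 3, :] = np.nan          # its values are never seen during fitting

imputed, history = impute_experiments(signal, observed)
virtual_track = imputed[2, 3]     # predicted signal for the missing experiment
```

Here `virtual_track` plays the role of a virtual experiment: a predicted signal track for a cell type and assay combination that was never measured. A realistic system would need to handle genome-scale data, signal transformations, and held-out validation, none of which this sketch attempts.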
The NIH has recently expended substantial effort to generate raw data that characterize the human epigenome across a variety of cell types. This proposal uses machine learning methods to help make sense of this large collection of epigenomic maps, combining the maps with data generated by the NIH ENCODE Consortium, information about the 3D structure of DNA, and information about evolutionary conservation. The project will produce novel computational methods as well as two primary analysis products: virtual experiments for combinations of assays and cell types that have not yet been carried out, and annotations that identify various types of biochemical and functional activity along the human genome.