The DNA sequence of the human genome tells us the composition of the proteins that make up healthy cells, as well as the altered compositions that give rise to diseased cells. How protein production is controlled through regulation of the genes that encode those proteins is of critical importance in both healthy and diseased cells. Knowing precisely where gene regulatory proteins bind and how they are organized throughout the genome, including their interactions with each other, informs us as to how genes are regulated and mis-regulated. Because there are potentially thousands of different kinds of regulatory proteins, and thousands of different human cell types and environmental responses that are the product of various subsets of those proteins, the entire "universe" of gene regulatory events is substantial and consequently costly to identify. One of the main bottlenecks in the analysis of genomic data is the lack of efficient and scalable visualization approaches. The PEGR open-source platform will provide programmatic access to any number of sequenced human cell datasets, from any stage of NGS processing, with the pipeline analysis results available for high-throughput machine learning testing and development. This project will empower discovery through the automated analysis and visualization of results from both small- and large-scale datasets. The platform architecture will include the following features: 1) a secure, cloud-based metadata management system that instills best practices of experimental rigor, reproducibility, and data sharing; 2) automated Galaxy-based epigenomic data processing pipelines that provide easy-to-use "wizards" for standardized processing of common epigenomic data types; and 3) an easily deployable, open-source software package as a means to disseminate data, tools, and discoveries via cloud services.
Proteins that bind to the human genome control the genes that govern human health. Precisely identifying how these proteins are positioned along the genome informs us how cells normally behave and how they will respond to stress, disease, and therapies. This project aims to develop a novel integration of computational tools for analyzing high-resolution protein-DNA binding assays and characterizing spatial dependencies between genomic events.