The DNA sequence of the human genome tells us the composition of the proteins that make up healthy cells, as well as the altered compositions that give rise to diseased cells. How protein production is controlled through regulation of the genes that encode them is of critical importance in both healthy and diseased cells. Knowing precisely where gene regulatory proteins bind and how they are organized throughout the genome, including their interactions with each other, informs us as to how genes are regulated and mis-regulated. Since there are potentially thousands of different kinds of regulatory proteins, and thousands of different human cell types and environmental responses that are the product of various subsets of those proteins, the entire "universe" of gene regulatory events is quite substantial and consequently quite costly to identify. One of the main bottlenecks in the analysis of genomic data is the lack of efficient and scalable visualization approaches. The PEGR open source platform will provide programmatic access to any number of sequenced human cell datasets, from any stage of NGS processing, with the pipeline analysis results available for high-throughput machine learning testing and development. This project will empower discovery through the automated analysis and visualization of results from both small- and large-scale datasets. The architecture will include the following features: 1) a secure, cloud-based metadata management system that instills best practices of experimental rigor, reproducibility, and data sharing; 2) automated Galaxy-based epigenomic data processing pipelines that provide easy-to-use "wizards" for standardized processing of common epigenomic data types; and 3) an easily deployable, open source software package as a means to disseminate data, tools, and discoveries via cloud services.

Public Health Relevance

Proteins that bind to the human genome control the genes that govern human health. Precisely identifying their positional organization informs us of how cells normally behave and how they will respond to stress, disease, and therapies. This project aims to develop a novel integration of computational tools for analyzing high-resolution protein-DNA binding assays and characterizing spatial dependencies between genomic events.

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Research Project (R01)
Project #
3R01GM125722-03S1
Application #
10166093
Study Section
Program Officer
Krasnewich, Donna M
Project Start
2018-01-19
Project End
2021-12-31
Budget Start
2020-01-01
Budget End
2020-12-31
Support Year
3
Fiscal Year
2020
Total Cost
Indirect Cost
Name
Pennsylvania State University
Department
Biochemistry
Type
Schools of Arts and Sciences
DUNS #
003403953
City
University Park
State
PA
Country
United States
Zip Code
16802