The identity and health of a particular type of cell depend on the proteins that carry out its functions, and how those functions change depends on the cell's environment. Many of a cell's responses are due to changes in which proteins are produced; cells turn protein production on or off when a particular group of proteins, called regulatory proteins, interacts with the DNA. Regulatory proteins may act alone or in groups to turn genes on and off, and the timing and combinations of these binding events can be very complicated. Experiments can show where a specific protein attaches to the DNA, giving clues that regulation might be occurring; with improving laboratory methods, the DNA binding locations (profiles) of more and more proteins can be found very precisely. This research will develop specialized computational and statistical methods for analyzing all of these data together, so that we can see which combinations of proteins act together to regulate genes. These results will reveal much more about how gene regulation works in functioning cells. All products of this research will be made freely available and accessible to other researchers and the public. Undergraduate and graduate students will be trained to use these bioinformatics research techniques, and strong efforts will be made to recruit students from under-represented groups. Exercises based on the research methods and results will be created for undergraduate bioinformatics students and for workshop participants, along with lessons suitable for high school students studying genetic regulation in their biology courses.
Current bioinformatics analysis techniques do not fully capture the structural information provided by the shape of read distributions produced in high-resolution genomic assays. For example, careful analysis of cross-linking patterns in collections of ChIP-exo datasets can potentially reveal which proteins interact with one another in higher-order protein-DNA complexes. This project aims to develop a suite of shape-aware machine-learning tools for the analysis of high-resolution protein-DNA binding data that will: 1) deconvolve distinct genomic interaction modes from a single dataset; 2) detect and correct experimental artifacts and biases that arise in these assays; 3) characterize the organization of higher-order protein-DNA complexes across multiple data types; and 4) detect changes in genomic event locations and interaction modes across multiple experimental conditions. The project will therefore enable integrative models of diverse protein-DNA complexes, directly advancing our understanding of gene regulation in a wide variety of organisms.