The health and function of any given cell type depend on sets of proteins that interact with one another and with the genome’s DNA to regulate gene activity. Experimental approaches can profile where specific proteins attach to the genome, providing insight into regulatory relationships between proteins and genes in a given cell type. Such experiments are expensive and laborious, however, and provide limited insight into gene regulatory activities in the large portion of the genome that is composed of repetitive DNA. This project will develop new machine learning software methods that predict where regulatory proteins bind to the genome in settings that are currently difficult to assay. Specifically, the investigators will train neural network methods to recognize features of gene regulatory sites from existing experimental data and transfer that knowledge to predict gene regulatory sites in the genomes of other species, in repetitive DNA regions, and in other cell types. These software methods will therefore unlock new layers of insight into gene regulation in healthy and diseased cells. All software produced by this project will be made freely available and accessible to the general research community. This project directly supports computationally intensive training and research opportunities in machine learning for graduate and undergraduate students who are working at the interface of computer science and biology. Strong efforts will be made to recruit students from under-represented groups. The education goals of this project will support the development of broader education initiatives in bioinformatics and genomics. This project will develop discovery-oriented bioinformatics research modules for use in teaching genetics and developmental biology concepts in high-school science classes. These research modules will be implemented in collaboration with Pennsylvania high-school science teachers and students and will offer a new way to engage students in inquiry-based science. The PI will also develop curriculum proposals for a new degree program in bioinformatics at Penn State University.

This project will develop neural network-based transfer learning approaches that predict transcription factor (TF) binding sites across three domains where TF binding activities are difficult to assay. Aim 1 will focus on predicting TF binding sites across species. Neural networks will be trained on observed TF binding data from one species and used to predict where the same TF binds in the same cell type in other species. A new domain adaptation strategy will be developed that addresses systematic biases resulting from shifts in the genomic makeup of different species. Transferring TF binding information across species will enable the study of regulatory evolution and innovation in many species without the need for expensive TF ChIP-seq experiments. Aim 2 will apply related domain adaptation approaches to predict TF binding sites within transposable elements and other repetitive regions. In this application, neural networks will be trained on observed TF binding data from uniquely mappable portions of the genome and will be applied to impute binding sites from partial signals in low-mappability regions. Predicting TF binding in low-mappability regions will provide a new way to study the regulatory contributions of transposable elements and other currently ignored parts of the genome. Finally, Aim 3 will predict where a TF would bind if it were expressed in a new chromatin environment. This last application differs from approaches that aim to impute unobserved TF binding signals from concurrent chromatin features; the goal is rather to use information from a preexisting chromatin environment to predict the future binding patterns of an induced TF. Developing the first principled approach for predicting where a TF would bind in new chromatin environments will be the first step towards predicting which regulatory perturbations can be used to transform cellular phenotypes. Predictions from all three aims will be tested in ongoing collaborations focused on understanding TF-driven cell identity specification in hematopoiesis and neuronal differentiation, thus providing new insights into how TFs select their regulatory targets during development. The results of the project will be available from http://mahonylab.org.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Application #: 2045500
Program Officer: Peter McCartney
Project Start:
Project End:
Budget Start: 2021-06-01
Budget End: 2026-05-31
Support Year:
Fiscal Year: 2020
Total Cost: $503,308
Indirect Cost:
Name: Pennsylvania State University
Department:
Type:
DUNS #:
City: University Park
State: PA
Country: United States
Zip Code: 16802