Simulation is a fundamental tool in modern science and engineering. It enables experiments that would otherwise be physically impossible, allows for testing under rigorously controlled conditions, and is an essential technique for ensuring reproducibility in the scientific process. In protein evolution, simulation is the key to understanding processes that act on time scales that exceed a human life span. Twenty years of research have produced a rich repertoire of algorithms and software for simulating protein sequence evolution. However, current models do not capture the processes by which multidomain proteins evolve. Multidomain proteins are mosaics of sequence fragments that encode protein modules: structural or functional units called domains. Because of their modular nature, multidomain proteins play central roles in cellular communication with other cells and with the environment. In human health, multidomain families are implicated in tissue repair, apoptosis, inflammation response, and innate immunity. Since roughly 30% of bacterial proteins and 50% of vertebrate proteins contain two or more domains, the critical benefits of simulation currently cannot be exploited for at least a third of all proteins on this planet.

This project addresses this gap through the design and implementation of software for simulating multidomain evolution. It advances research infrastructure through the development and distribution of the first evolutionary simulator for multidomain proteins. It also contributes to building a broadly inclusive scientific workforce through research experiences for undergraduates and classroom teaching. Undergraduate researchers will be recruited from women in Carnegie Mellon's undergraduate program and from groups that are underrepresented in STEM through educational activities at the University of Puerto Rico.

Simulation is an essential technique in molecular evolution, where the processes of interest typically act on time scales that are beyond the reach of human endeavor. Simulation is essential for comparison and validation of evolutionary algorithms and software, for evaluating competing hypotheses, and for assessing how a system will respond to perturbations. Despite the growth of sophisticated methodology for simulating the evolution of amino acid sequences, no simulators currently exist that model the processes of domain insertion, duplication, and deletion by which multidomain protein families evolve. Although these processes have the potential to generate any combination of domains, only a tiny fraction of possible domain combinations is observed in nature, suggesting that domain order and co-occurrence are stringently constrained. The goal of this project is the development of algorithms and prototype software for multidomain simulation based on realistic models of these constraints. The simulator benefits from a novel combination of Markov chain Monte Carlo technology and data-driven estimates of event probabilities. The data-driven event module enables quick and easy redeployment of the simulator for use in different taxonomic and protein function contexts. Repeated cycles of empirical testing and simulator redesign will promote better understanding of nature's protein design principles, more precise models and methods, and more accurate evolutionary inferences. The resulting software will be available at www.cs.cmu.edu/_durand/Lab/multidomain.html.
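
The event model described above can be illustrated with a toy sketch. This is not the project's simulator and not a Markov chain Monte Carlo sampler; it is a simple forward simulation that draws insertion, duplication, and deletion events from fixed probabilities and rejects any result that violates an adjacency constraint. The domain names, the event probabilities, the allowed pairs, and all function names are hypothetical placeholders; in the project, event probabilities are estimated from data.

```python
import random

# Hypothetical event probabilities (the project estimates these from data).
EVENT_PROBS = {"insertion": 0.3, "duplication": 0.4, "deletion": 0.3}
# Hypothetical domain vocabulary and permitted adjacent pairs (illustrative only).
DOMAINS = ["A", "B", "C"]
ALLOWED_PAIRS = {("A", "A"), ("A", "B"), ("B", "B"), ("B", "C")}


def is_allowed(architecture):
    """Return True if every adjacent domain pair is in ALLOWED_PAIRS."""
    return all(pair in ALLOWED_PAIRS
               for pair in zip(architecture, architecture[1:]))


def propose(architecture, rng):
    """Apply one randomly chosen event and return the new architecture."""
    events = list(EVENT_PROBS)
    weights = list(EVENT_PROBS.values())
    event = rng.choices(events, weights=weights)[0]
    new = list(architecture)
    if event == "insertion":
        # Insert a random domain at a random position.
        new.insert(rng.randrange(len(new) + 1), rng.choice(DOMAINS))
    elif event == "duplication":
        # Tandem duplication: copy a domain next to itself.
        i = rng.randrange(len(new))
        new.insert(i + 1, new[i])
    elif len(new) > 1:
        # Deletion, never removing the last remaining domain.
        del new[rng.randrange(len(new))]
    return new


def simulate(start, steps, seed=0):
    """Evolve a domain architecture for a fixed number of proposed events."""
    if not start:
        raise ValueError("start architecture must contain at least one domain")
    rng = random.Random(seed)
    current = list(start)
    for _ in range(steps):
        candidate = propose(current, rng)
        # Keep the candidate only if it satisfies the adjacency constraint.
        if is_allowed(candidate):
            current = candidate
    return current


if __name__ == "__main__":
    print(simulate(["A", "B"], steps=200, seed=1))
```

Because every accepted architecture passes `is_allowed`, the simulated protein always respects the constraint set as long as the starting architecture does. Swapping in a different `EVENT_PROBS` table mirrors, in a very reduced form, the idea of redeploying a data-driven event module for another taxonomic or functional context.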

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Type: Standard Grant (Standard)
Application #: 1759943
Program Officer: Peter McCartney
Budget Start: 2018-04-15
Budget End: 2022-03-31
Fiscal Year: 2017
Total Cost: $742,448

Institution
Name: Carnegie-Mellon University
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213