In developing variants of natural proteins with improved properties and activities, protein engineers are confronted with large, complex design spaces. The degrees of freedom for producing variants mirror nature but can be specifically targeted experimentally, choosing parent proteins, replacements for some amino acids (site-directed mutation), and locations for crossing over between parents (site-directed recombination). A set of choices, constituting a design, can be evaluated by multiple disparate criteria, including consistency with evolutionary information, energetic favorability with respect to a three-dimensional structure, and incorporation of specific characteristics distinguishing functional subclasses. Unfortunately, the different evaluation metrics may be complementary or even contradictory, and the prior information on which they are based is incomplete, so that the metrics are only more or less accurate in predicting the real-life quality of the designs.
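As a rough illustration of how these degrees of freedom define a design, the sketch below builds the library implied by one set of choices for site-directed recombination: a set of parents and a set of crossover locations. The function name `recombination_library`, the toy sequences, and the breakpoint positions are all hypothetical, and the sketch assumes the parents are already aligned to equal length. It enumerates every variant, which is only practical for a toy case.

```python
from itertools import product


def recombination_library(parents, breakpoints):
    """Enumerate the chimeras implied by one design (hypothetical helper).

    parents: aligned sequences of equal length.
    breakpoints: sorted interior positions where crossovers are allowed.
    Each fragment between consecutive breakpoints may come from any parent.
    """
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("parents must be aligned to the same length")
    bounds = [0, *breakpoints, length]
    # For each fragment, collect that fragment as it appears in each parent.
    fragments = [
        [p[start:end] for p in parents]
        for start, end in zip(bounds, bounds[1:])
    ]
    # One variant per way of choosing a parent for every fragment.
    return ["".join(choice) for choice in product(*fragments)]


parents = ["MKTAYIAK", "MRSAYLGK"]  # toy sequences, not real proteins
library = recombination_library(parents, breakpoints=[3, 6])
print(len(library))  # 2 parents ** 3 fragments = 8
```

The library size grows as the number of parents raised to the number of fragments, so realistic designs quickly become too large to list member by member.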

The overall goal of this project is to develop efficient methods to characterize complex protein design spaces and optimize high-quality designs for experimental evaluation. A combinatorial protein engineering approach will be pursued, experimentally constructing a library of related variants and assaying them for properties of interest. Potential scores will evaluate a possible library (without explicitly enumerating its members) with respect to prior information from sequence, structure, and functional subclass. To account for disparate evaluation metrics, design algorithms will focus on identifying Pareto optimal designs: those for which no other design is at least as good with respect to all desired criteria and strictly better with respect to at least one. To account for incomplete prior information, design algorithms will trade off between exploitation of the prior information and broader exploration of the design space, seeking to identify a diverse set of designs, each with a diverse set of variants. Markov chain Monte Carlo sampling algorithms will characterize the overall design space by generating choices for the degrees of freedom and evaluating the designs with the potential scores, using the scores and diversity metrics to guide exploration of the space. Exact algorithms will focus more precisely on regions of interest, dividing and conquering the design space and employing combinatorial optimization algorithms to identify Pareto optimal designs.
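The sampling idea can be sketched with a plain Metropolis sampler over breakpoint designs. This is a minimal sketch under stated assumptions, not the project's algorithm: the sequence length `SEQ_LEN`, the scoring function `toy_score` (which simply prefers evenly sized fragments), the proposal move, and the temperature are all invented for illustration, and no diversity metric is included.

```python
import math
import random

SEQ_LEN = 60  # hypothetical protein length


def is_valid(design):
    """Breakpoints must be strictly increasing and strictly inside the sequence."""
    bounds = (0, *design, SEQ_LEN)
    return all(a < b for a, b in zip(bounds, bounds[1:]))


def toy_score(design):
    """Stand-in score (higher is better): penalize uneven fragment lengths."""
    bounds = (0, *design, SEQ_LEN)
    target = SEQ_LEN / (len(design) + 1)
    return -sum((b - a - target) ** 2 for a, b in zip(bounds, bounds[1:]))


def propose(design, rng):
    """Shift one breakpoint by one position; keep the design if the move is invalid."""
    i = rng.randrange(len(design))
    moved = list(design)
    moved[i] += rng.choice((-1, 1))
    candidate = tuple(moved)
    return candidate if is_valid(candidate) else design


def metropolis(initial, steps, temperature=5.0, seed=0):
    """Return the list of (design, score) states visited by the chain."""
    rng = random.Random(seed)
    current, current_score = initial, toy_score(initial)
    samples = []
    for _ in range(steps):
        candidate = propose(current, rng)
        candidate_score = toy_score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse designs with
        # probability exp(delta / temperature).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
        samples.append((current, current_score))
    return samples


samples = metropolis(initial=(5, 10, 15), steps=2000)
visited = {design for design, _ in samples}
print(len(visited), max(score for _, score in samples))
```

Raising the temperature makes the chain accept more score-lowering moves, which is one simple way to shift the balance from exploitation toward exploration.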

The design space approach provides a powerful new mechanism to address protein engineering applications, enabling the engineer to explicitly evaluate and optimize for trade-offs among important criteria and considerations. Interactive tools will help engineers navigate through the regions of interest, visualize designs and perform "what-if" analyses, and compare and contrast Pareto optimal designs. A design space repository will enable sharing of analyses and underlying data. The tools and repository will support protein engineering for a range of activities in the national interest, including biosensors, production of novel biological therapeutics, and novel enzymes for green chemical synthesis, energy extraction, and bioremediation. As part of the project, the mechanism will be used to engineer soluble and robust cytochrome P450s that use inexpensive, non-toxic hydrogen peroxide to hydroxylate steroids and multi-ring compounds that mimic estrogenic (feminizing) steroids in the environment, without the need for living cells or protein cofactors. Such enzymes would be valuable as tools for chemical synthesis, waste treatment, and bioremediation.

This project provides an ideal venue for cross-disciplinary training of students, illustrating how computational techniques can be fruitfully integrated with experimentation to answer important biological questions. Aspects of the project will be used in both undergraduate and graduate courses, from an introductory biology course to an advanced bioinformatics course. The project itself will provide opportunities for interdisciplinary research training for graduate and undergraduate students, including those from underrepresented groups.

Project Report

The focus of this project is to develop and apply new methods for uncovering good designs in complex design spaces characterized by high degrees of freedom and multiple quality measures. The motivating context is protein design, where the degrees of freedom are choices of site-directed mutations and site-directed recombination breakpoints, and the quality measures are derived from analysis of protein sequence, structure, and function. Recognizing that different quality measures may be complementary or even competing, the goal adopted here is to identify the Pareto optimal designs: those for which no other design is at least as good with respect to all criteria and strictly better with respect to at least one. The project yielded a portfolio of powerful new computational methods for efficiently assessing diverse criteria and optimizing them in different design settings. It also demonstrated the utility of these methods in a wide range of protein design applications, impacting a broad set of life science researchers and engineers. Two general algorithmic approaches were developed to uncover all and only the Pareto optimal designs, without explicitly considering the massive number of designs that they dominate. These algorithms were designed and instantiated for a variety of protein design contexts, including optimizing sets of enzymes for a combination of improved activity and overall diversity, optimizing interacting proteins so as to balance strength and specificity of interaction, and optimizing therapeutic proteins in order to mitigate undesired immune recognition while maintaining stability and function. In addition to optimizing individual proteins, the methods were extended to optimize entire combinatorial libraries of proteins and peptides, mixing and matching mutations or gene fragments to yield sets of variants that are, overall, enriched in desired combinations of properties of interest.
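To make the notion of Pareto optimality concrete, the following sketch filters a small, explicitly listed set of designs down to its non-dominated members. It is a brute-force illustration only: it compares every pair of designs, which is exactly the enumeration the project's algorithms are described as avoiding. The design names and the two score values per design are made up, and higher scores are assumed to be better on both criteria.

```python
def dominates(a, b):
    """True if a is at least as good as b on every criterion
    and strictly better on at least one (higher is better)."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(designs):
    """Keep the designs that no other design dominates.

    designs: dict mapping a design name to its tuple of scores.
    """
    return {
        name: scores
        for name, scores in designs.items()
        if not any(dominates(other, scores) for other in designs.values())
    }


# Hypothetical designs scored on two criteria.
designs = {
    "A": (0.9, 0.2),
    "B": (0.6, 0.6),
    "C": (0.2, 0.9),
    "D": (0.5, 0.5),  # dominated by B
    "E": (0.1, 0.1),  # dominated by every other design
}
front = pareto_front(designs)
print(sorted(front))  # ['A', 'B', 'C']
```

Designs A, B, and C each represent a different trade-off between the two criteria, so none can be discarded without knowing how the engineer weighs them.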

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1017231
Program Officer: Sylvia Spengler
Budget Start: 2010-09-15
Budget End: 2014-08-31
Fiscal Year: 2010
Total Cost: $331,802
Name: Dartmouth College
City: Hanover
State: NH
Country: United States
Zip Code: 03755