Enzymes are proteins that catalyze reactions. They are used in a range of applications, such as detergents that remove stains, sensors that detect blood sugar levels, and industrial processes such as the production of high-fructose corn syrup. Enzymes can be engineered to improve their performance using a method called directed evolution (DE), a process of repeated gene sequence mutation and enzyme screening that determines how each mutated sequence changes performance. A side-product of DE experiments is an abundance of unused data. This project will incorporate machine learning (ML) into the DE workflow. The unused DE data, together with other data, will train the ML algorithm to recognize, and ultimately predict, protein structures that are useful in creating the desired enzyme activity. The objective is to make novel and useful enzymes more rapidly and at lower cost.

The diversity of life arises in great measure from the ability of proteins to evolve and adapt. The permutations of the 20 proteinogenic amino acids allow for supra-astronomical numbers of possible protein sequences, the vast majority of which do not fold or encode a useful function. We propose that machine learning (ML) models can use information gained from directed evolution (DE) experiments to improve the efficiency of searching this space for functional proteins. While techniques such as DE implicitly rely on an underlying structure of the functional landscape of protein sequence space, explicitly modeling this structure would allow for far more efficient search algorithms. We recently demonstrated a data-driven ML approach to guiding DE experiments that accounts for the epistatic nature of protein mutations and enables multiple beneficial mutations to be incorporated in a single generation of mutation and screening. We aim to further develop this workflow to address multiple tasks simultaneously, specifically to predict enzyme activity across multiple substrates. To accomplish this, we propose to incorporate information about multiple substrates into ML-guided directed evolution and to use validated encodings for proteins and substrates that were developed for separate predictive tasks. However, these encodings are not optimized to work together for ML. We therefore propose to develop new encodings that describe the components of an enzymatic system in a cohesive and synergistic manner. Finally, while directed evolution has successfully adapted enzymes for human applications, this process currently requires expert knowledge and intensive trial and error for each engineering task. The requirement for expert knowledge is clearest when approaching the formidable challenge of finding starting activity for DE. Using ML, we will build a model that can predict the non-natural carbene/nitrene transfer activities of P450s against target substrates, based on what is known of the natural substrate(s) of these enzymes.
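
As a rough illustration of three ideas in the paragraph above (the size of sequence space, a protein encoding, and epistasis), the following minimal Python sketch may help. It is not the project's method: the function names, the four-site example, and the fitness values are hypothetical, one-hot encoding stands in for whatever encodings the project develops, and epistasis is computed with the common additive definition (double mutant minus both single mutants plus wild type).

```python
from typing import List

# The 20 proteinogenic amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_space_size(n_positions: int) -> int:
    """Count the sequences possible when each position may hold any of 20 residues."""
    return len(AMINO_ACIDS) ** n_positions


def one_hot_encode(variant: str) -> List[int]:
    """Encode a variant as a flat vector with 20 entries per position.

    One-hot encoding is an assumption made for illustration; it is only one
    of many possible protein encodings.
    """
    vector: List[int] = []
    for residue in variant:
        row = [0] * len(AMINO_ACIDS)
        row[AMINO_ACIDS.index(residue)] = 1  # raises ValueError on unknown letters
        vector.extend(row)
    return vector


def pairwise_epistasis(f_wt: float, f_a: float, f_b: float, f_ab: float) -> float:
    """Additive-scale epistasis: deviation of the double mutant from additivity.

    Zero means the two mutations combine additively; a nonzero value means
    the effect of one mutation depends on the presence of the other.
    """
    return f_ab - f_a - f_b + f_wt


if __name__ == "__main__":
    # Even a hypothetical four-site library has 20**4 possible variants.
    print(sequence_space_size(4))

    # A four-residue variant becomes an 80-dimensional feature vector.
    print(len(one_hot_encode("ACDE")))

    # Hypothetical fitness values, chosen only to show a nonzero result.
    print(pairwise_epistasis(f_wt=1.0, f_a=1.5, f_b=1.2, f_ab=2.5))
```

Because the space grows as 20 raised to the number of mutated positions, exhaustive screening quickly becomes impractical, and nonzero epistasis means single-mutant measurements alone cannot predict combinations. Both points motivate fitting a model to screened variants and using it to prioritize which unscreened variants to test.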

This award is co-funded by the Cellular and Biochemical Engineering Program in the Division of Chemical, Bioengineering, Environmental and Transport Systems and the Systems and Synthetic Biology Program in the Division of Molecular and Cellular Biosciences.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

California Institute of Technology
United States