Therapeutic antibodies, specialized enzymes for drug manufacturing, small-molecule drug screening agents, and other proteins have been instrumental in advancing biotechnology and medicine. Protein therapeutics alone represent a rapidly growing $100+ billion market with broad applications in the treatment of cancer, inflammatory and metabolic diseases, and numerous other disorders. Most of the antibodies and other protein therapeutics developed in the last several years have been engineered, leading to improvements in important properties such as efficacy, binding affinity, expression, stability, and immunogenicity. However, improving protein properties through sequence modification remains a challenging task. Artificial intelligence (AI), which has been enormously successful in several fields (e.g., image recognition, self-driving cars, natural language processing), is now being applied to protein engineering and has the potential to transform this field as well. AI and machine learning (ML) can take advantage of large and diverse datasets to identify correlations, predict beneficial mutations, and explore novel protein sequences in ways that are not possible using other techniques. Other advantages include the ability to optimize multiple protein properties simultaneously and to explore sequence space more efficiently. In Phases I and II of this project, we developed the ProtaBank database as a central repository to store, organize, and annotate protein mutation data spanning a broad range of properties. ProtaBank is the largest database, and the only one, actively collecting such a comprehensive set of sequence mutation data, and it is growing rapidly due to the wealth of data being generated with advanced automation and next-generation sequencing techniques. ProtaBank's depth and breadth make it an ideal data source for training ML models.
This proposal aims to create the ProtaBank AI Platform, which will enable scientists to apply AI and ML tools to the data in ProtaBank to engineer proteins. The platform will provide fully customizable computational tools and will draw on protein-specific knowledge to properly prepare data for use with ML models. An interface to popular ML frameworks will be provided so that scientists can use these techniques to discover new predictive algorithms and enhance their ability to design proteins with the desired properties.
Specific aims include: (1) integrating peer-validated ML methods and proprietary technology for protein engineering into the ProtaBank AI Platform, (2) developing dynamic ML dataset creation tools, (3) expanding and improving the ProtaBank database by reaching out to scientists to contribute data, (4) enhancing our data deposition tools, and (5) integrating ProtaBank with the Protein Data Bank structure database and other databases.
Protein engineering has enabled significant advances in health care by playing a key role in the development of antibodies and other protein therapeutics (e.g., for the treatment of cancer, inflammatory and metabolic diseases, and other disorders), highly selective enzymes for drug manufacturing, and novel proteins for use in diagnostics and the identification of new small-molecule drugs. This project will apply the power of artificial intelligence (AI) to accelerate the engineering of proteins with new and improved properties. AI approaches can capitalize on the large amounts of protein mutation data being generated and stored in our recently developed ProtaBank protein mutation database to transform the way in which protein therapeutics and reagents are discovered and developed.