Engineered proteins such as therapeutic antibodies, specialized enzymes for drug manufacturing, and proteins used to identify new small molecule drugs are making significant contributions to improving health care. Protein therapeutics alone represent a rapidly growing $100+ billion market with broad applications in the treatment of cancer, metabolic diseases, and other disorders. These advances have been made possible, in part, by free and easy access to data in the form of nucleotide sequences (GenBank) and protein structures (Protein Data Bank, PDB). Both of these databases have grown exponentially and continue to organize and structure data in a manner that would be hard for individual groups or companies to maintain on their own. A new type of data is emerging in the protein engineering community that is not stored in GenBank or the PDB: engineered protein sequences and their associated experimental assay data. The protein engineering community is at a relatively early stage of development compared to the sequence or structure determination communities. Thus, the time is ripe to develop a database that organizes the data from protein engineering studies into a cohesive and comprehensive dataset. We will call this database PEBank. In Phase I, PEBank development will include: (1) drafting a specification for Version 1.0, with feedback from representatives of GenBank and the PDB, that describes the types of data to be stored and lays out the organizational hierarchy of the data; (2) implementing a prototype of Version 1.0 of PEBank and garnering feedback from the protein engineering community; (3) implementing a cloud-based version of PEBank; and (4) creating web-based utilities for depositing, viewing, and analyzing data.
In Phase II, we will continue development of PEBank by: (1) creating a version that will allow write privileges and hosting it on Amazon Web Services; (2) providing support for PEBank users; (3) developing a secure limited-access version of PEBank that will hold customer-specific proprietary data; (4) developing tools that will validate the integrity of the data and policies to handle invalid data; (5) developing web-enabled search tools to extract data from PEBank; (6) testing data deposit and viewing, and making PEBank available to the academic community; and (7) developing advanced analysis tools for finding statistical correlations between various data elements. We will also begin to use the analysis tools and PEBank data to optimize the predictive capability of our computational protein design software; this will include improving the underlying score functions and developing dynamic design tools that integrate database interrogation with the sequence optimization process. When complete, PEBank will give protein engineers around the world access to protein engineering data in a standard format that can be easily searched and shared; these data can be used to inform their designs and to develop more predictive protein design tools, thus accelerating the development of new and improved proteins for therapeutic, diagnostic, and other health-related applications.
Engineered proteins such as therapeutic antibodies, specialized enzymes for drug manufacturing, and proteins used to identify new small molecule drugs are making significant contributions to improving health care. The goal of the proposed research is to create a comprehensive, web-enabled database, called PEBank, to store and organize the wealth of data that are generated by protein engineering projects. PEBank will allow scientists around the world to access protein engineering data in a consistent format to inform their protein engineering projects and develop better methods for engineering proteins relevant to human health.