This proposal is entitled "Binding-Site Modeling with Multiple-Instance Machine-Learning." One of the most challenging and longest-studied problems in computer-aided drug design has been affinity prediction of small-molecule ligands for their cognate protein targets. Despite decades of work, quantitative structure-activity relationship (QSAR) prediction approaches still suffer from poor accuracy, especially when predicting outside of closely related series of molecules. Even with high-quality structures of target proteins, approaches grounded in physics are also far from robust and accurate enough for reliable use in drug lead optimization. This proposal will build upon a foundation in multiple-instance machine learning applied to computer-aided drug design problems and develop a robust, accurate, and practically applicable affinity prediction methodology. The methodology requires only ligand structures and associated activity data for training, and it induces a virtual protein binding site composed of molecular fragments. The virtual binding pocket (or "pocketmol") is used in conjunction with a scoring function developed originally for molecular docking. The pocketmol configuration is chosen such that the optimal conformation and alignment of each training ligand (optimal according to the docking scoring function) yields a score close to that ligand's known experimental value. Feasibility has been demonstrated in papers involving both membrane-bound receptors and enzymes. However, multiple challenges remain and are the subject of the proposed research. There are three key issues. First, there exist many pocketmols that satisfy the requirements of fitting the training data, so general solutions must be developed to address the inductive bias of the learning procedure as well as model selection after the procedure. Second, since any particular model is the product of a learning process, it will have some domain of applicability, with some new molecules likely to be predicted well and others poorly. Further, the model will be better informed by learning with certain new molecules but not others. We must develop solutions for estimating confidence of predictions for new molecules as well as for identifying particular molecules that will be highly informative. Third, the operational application of these methods involves model building, guided chemical synthesis, and iterative refinement of models. Convincing validation will require application on temporal series of molecules synthesized for multiple targets of pharmaceutical interest. The proposed work will develop novel methods to address these challenges and will establish extensive validation on multiple pharmaceutically relevant temporal series of small molecules that were the subject of real-world lead-optimization exercises.
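
The multiple-instance idea in the abstract (each ligand is scored by its best conformation and alignment, and the model is tuned so that this best score matches the measured activity) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the proposal's method: each pose is represented by a hypothetical feature vector, a linear function stands in for the docking scoring function, the weight vector stands in for the pocketmol configuration, and the data are synthetic.

```python
import numpy as np

def ligand_score(pose_features, weights):
    """Score a ligand as the maximum score over its candidate poses.

    pose_features: (n_poses, n_features) array, one row per pose (hypothetical).
    weights: (n_features,) array standing in for the pocketmol configuration.
    """
    return float(np.max(pose_features @ weights))

def mil_loss(bags, activities, weights):
    """Mean squared deviation between best-pose scores and measured activities."""
    preds = np.array([ligand_score(bag, weights) for bag in bags])
    return float(np.mean((preds - activities) ** 2))

def fit(bags, activities, n_features, steps=500, lr=0.05, seed=0):
    """Subgradient descent: each step uses only the currently best pose per ligand."""
    rng = np.random.default_rng(seed)
    w = rng.normal(scale=0.1, size=n_features)
    w_init = w.copy()
    for _ in range(steps):
        grad = np.zeros_like(w)
        for bag, y in zip(bags, activities):
            scores = bag @ w
            best_pose = bag[np.argmax(scores)]  # the pose that defines the ligand score
            grad += 2.0 * (scores.max() - y) * best_pose
        w -= lr * grad / len(bags)
    return w_init, w

# Synthetic data: 40 ligands, 5 poses each, 3 features per pose (all made up).
rng = np.random.default_rng(1)
true_w = np.array([1.0, -2.0, 0.5])
bags = [rng.normal(size=(5, 3)) for _ in range(40)]
activities = np.array([ligand_score(bag, true_w) for bag in bags])

w_init, w_fit = fit(bags, activities, n_features=3)
initial_loss = mil_loss(bags, activities, w_init)
final_loss = mil_loss(bags, activities, w_fit)
print(f"loss before fitting: {initial_loss:.3f}, after fitting: {final_loss:.3f}")
```

The sketch also hints at the first issue raised in the abstract: because only the best pose of each ligand constrains the fit, different parameter settings can reproduce the training activities about equally well, so inductive bias and model selection matter.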

Public Health Relevance

The dominant mode of therapeutic discovery involves the design of me-too drugs that are very similar in structure and effect to existing drugs. To address the unmet medical needs of an aging population, novel therapeutics must be developed, and this will require much more creativity in the design process. The proposed research will develop a predictive computational framework to aid in the active design of structurally novel drug molecules during the lead-optimization stage of drug discovery.

National Institutes of Health (NIH)
National Institute of General Medical Sciences (NIGMS)
Research Project (R01)
Study Section
Biodata Management and Analysis Study Section (BDMA)
Program Officer
Brazhnik, Paul
University of California San Francisco
Schools of Pharmacy
San Francisco
United States
Spitzer, Russell; Cleves, Ann E; Varela, Rocco et al. (2014) Protein function annotation by local binding site surface similarity. Proteins 82:679-94
Yera, Emmanuel R; Cleves, Ann E; Jain, Ajay N (2014) Prediction of off-target drug effects through data fusion. Pac Symp Biocomput :160-71