This project at the J. Craig Venter Institute is developing a client-cloud application for protein-protein docking on the Azure platform. The project is designing a self-scaling computational docking server that is deployed in the cloud and driven by a cross-platform desktop molecular viewer. The Azure infrastructure is being explored for numerical computations with coarse-grained parallelism. A fault-tolerant cloud component is being built to expose both programmatic and Web form interfaces. Economically optimal scaling strategies are being studied to address the fluctuating demand within each user session, using instance pooling across sessions and predictive instance deployment. PyMOL, a popular 3D molecular viewer, is being extended with protein-protein docking operations that the cloud component can execute in parallel, making interactive, user-driven simulation protocols possible. By using a cloud platform, the project aims to make interactive protein-protein docking simulations available to any group of biologists irrespective of location, without forcing them to maintain the substantial local computing resources that such numerically intensive methods require. The docking simulations are used to uncover the details of the protein-protein interactions that form the basis of functions in any living organism. The project's code and executables are being released under an open-source license.
This project has developed a protein-protein docking package, PRODDL, and a framework for executing it on a wide range of traditional parallel computational platforms as well as computational clouds such as Microsoft Azure.

The project addressed the following intellectual merits:

Proteins act as molecular machines in all living organisms, and many of their functions depend on proteins interacting with each other to form protein complexes. Predicting the shape of these complexes can provide key insights for biology. Because such simulation procedures are computationally expensive, they have to be carried out in parallel on hundreds of computing nodes. That requirement greatly complicates the setup and management of the simulations. This project has designed a free, open-source implementation that:

- Can run in a wide range of parallel computing environments, including commercial clouds, where users are freed from the upfront investment in hardware
- Makes it easy to set up a large parallel computation and drive it to conclusion despite possible failures of individual computational units
- Can be launched and controlled over the Internet through three different mechanisms:
  - Web form interface;
  - sequences of typed commands;
  - graphical molecular viewer

The project has also applied its techniques for setting up and executing workflows to a related discipline: annotating genomes from mass spectrometry data.

The broader impacts of the project:

The docking tools developed by the project have been deployed on a public computational Web server based on the Galaxy environment, which is widely used in the biological community. Together with flexible options for controlling the computations remotely, the protein docking tool gives biologists the ability to iteratively refine their hypotheses about protein interactions with minimal setup and expense.
A fast protein docking procedure that can be controlled from a graphical molecular viewer is also a valuable classroom tool.

Software products (all open-source):

- PRODDL-C: client-cloud service framework for parallel distributed execution of computational workflows in the Azure Cloud Services and on Linux compute clusters or workstations. https://bitbucket.org/andreyto/proddl-c
- Protein Protein Docking with Discriminative Learning v.2 (PRODDL2). https://bitbucket.org/andreyto/proddl2
- PGP: Parallel Proteogenomics Pipeline. Automated proteogenomics pipeline designed to run in a range of different parallel computing environments. https://bitbucket.org/andreyto/proteogenomics
- PRODDL2 tool deployed on a public computational Web server. http://mgtaxa.jcvi.org/