Our current capacity to generate chemical and structural biological data far exceeds our capability to meaningfully assimilate it. The data describe molecules and biological macromolecules and their associated properties. A principle common to all chemical and biological macromolecular entities is that they are composed of objects related by energetic interactions. A natural representation of all such entities is a graph composed of nodes related by edges. We have developed powerful, scalable techniques that operate on graph databases for efficient similarity searching (Closure-tree), identification of statistically significant subgraphs (GraphRank), and query specification (GraphQL). These techniques apply directly to chemical and structural biological data. We have demonstrated the validity of the approach in prior work, and its feasibility in our Phase 1 research. The overall goal of this project is to deliver powerful, innovative problem-solving tools to medicinal chemists, structural biologists, and drug discovery researchers synthesizing ever-increasing amounts of chemical, biochemical, structural biological, cell biological, and clinical data. Phase 1 of this project is ongoing and highly successful. We have demonstrated that the Closure-tree and GraphRank algorithms are effective on chemical compound databases of realistic, industrial size. We have developed methods to exploit our knowledge of the nature of chemical databases. Using these methods, we have reduced similarity query time by over an order of magnitude. We have identified several specific aims to pursue in Phase 2 of our research.
We have rapidly established a professional software development and research infrastructure and developed the tools necessary to support progress toward our goal: solving important problems that hinder medicinal chemists and structural biologists conducting modern drug discovery research for the development of new therapeutics. We will pursue four specific aims in our Phase 2 research. (1) We will develop specific additional functionality for Closure-tree and GraphRank, and integrate GraphQL into our chemical and structural bioinformatics tool set. The results of this aim will be used to (2) develop methods and functionality to represent chemical, structural biology, systems biology, and glycobiology data as graphs. Building on these results, we will (3) apply our tool set to specific relevant research problems such as HIV-1 protease inhibition, avian flu neuraminidase inhibition, and p53-protein interactions. Finally, we will (4) assemble a state-of-the-art chemical and structural biological informatics tool set with detailed documentation and relevant case studies. The outcome of this research will be powerful, innovative new tools in the hands of medicinal chemists, structural biologists, and modern drug discovery researchers in academia and the pharmaceutical industry. The tools address significant obstacles in the drug development process; they will enable new discoveries and greatly advance the practice of cheminformatic and structural biological data analysis. The market analysis described in our commercialization plan shows a growing market for our tools and identifies their competitive advantages. Application of our techniques will have significant impact on the interpretation of structural biological data, on pharmaceutical research and modern drug discovery chemistry, and on human health care through the design of new drugs.
Graph-based representation of chemical compounds results in a more accurate realization of the chemical space. The use of recent techniques in graph querying and mining will enable data analysis that can scale to millions of compounds. The developed system will integrate information on chemical compounds with biological activity and protein interaction networks, thus enabling cheaper and faster drug discovery.
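
As a purely illustrative sketch of what representing a compound as a graph can look like, the Python fragment below stores atoms as labeled nodes and bonds as labeled edges, then compares two compounds by the overlap of their bond labels. The `MolGraph` class and the `edge_jaccard` function are hypothetical names introduced here for illustration. The Jaccard comparison is a deliberately simple stand-in and is not the Closure-tree, GraphRank, or GraphQL method named in this summary.

```python
from collections import Counter


class MolGraph:
    """Hypothetical minimal molecular graph: atoms are nodes, bonds are edges."""

    def __init__(self, atoms, bonds):
        # atoms: element symbols indexed by position, e.g. ["C", "C", "O"]
        # bonds: (i, j, order) tuples joining atom indices i and j
        self.atoms = list(atoms)
        self.bonds = [(min(i, j), max(i, j), order) for i, j, order in bonds]

    def edge_labels(self):
        # Multiset of bond descriptors, e.g. ("C", "O", 2) for a C=O double bond
        labels = Counter()
        for i, j, order in self.bonds:
            a, b = sorted((self.atoms[i], self.atoms[j]))
            labels[(a, b, order)] += 1
        return labels


def edge_jaccard(g1, g2):
    # Illustrative similarity only: multiset Jaccard index over labeled bonds
    e1, e2 = g1.edge_labels(), g2.edge_labels()
    shared = sum((e1 & e2).values())  # per-label minimum counts
    total = sum((e1 | e2).values())   # per-label maximum counts
    return shared / total if total else 1.0


# Heavy-atom graphs with hydrogens omitted (an assumption made for brevity)
ethanol = MolGraph(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)])
acetaldehyde = MolGraph(["C", "C", "O"], [(0, 1, 1), (1, 2, 2)])

print(edge_jaccard(ethanol, acetaldehyde))
```

The two compounds share the C-C single bond and differ in the C-O bond order, so the score falls strictly between 0 and 1. A production system would index graph structure instead of comparing bond multisets pairwise.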