To develop the POC software, we used a rapid-prototyping, cloud-based approach built on Python code running in a Docker container in Amazon's Elastic Compute Cloud (EC2). We used Git for distributed source code control, distributed project management, and code deployment. We implemented a blackboard-like software module (Orangeboard) that provides a knowledge-graph object model (including information about the source database and edge types for seven different types of relationships) and the ability to load the graph into Neo4j using a high-performance bulk-transfer method (parameterized Cypher) and protocol (Bolt). We implemented Python classes that provide RESTful querying functionality for 14 different knowledge sources (Monarch/BioLink, DisGeNET, Disease Ontology, GeneProf, miRBase, miRGate, MyGene.info, OMIM, Pathway Commons 2, Pharos, Human Phenotype Ontology, Reactome, Monarch/SciGraph, and UniProt). To accelerate knowledge graph expansion, we implemented client-side HTTP request/response caching as well as non-persistent method-level caching in Python. We implemented a BioNetExpander class that can iteratively expand a knowledge graph (in Orangeboard) from one or more seed nodes. This approach is flexible with respect to future types of queries and can accommodate future selective rules for node extension. Using BioNetExpander, we are able to expand a knowledge graph from 21 seed diseases to 20,000 nodes and 800,000 relationships in an hour. To enable path scoring, we implemented a Python class that obtains the topological characteristics and metadata of a given path in the Neo4j graph. We then implemented Python scripts that use Cypher to query the Neo4j knowledge graph for paths between genetic conditions and the 21 diseases (Q1) and for the 1,000 drug/disease pairs (Q2).
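The iterative expansion with method-level caching described above can be illustrated with a minimal sketch. This is not the BioNetExpander or Orangeboard code: the function names `query_neighbors` and `expand`, the node identifiers, and the in-memory `TOY_SOURCE` dictionary (standing in for the RESTful knowledge sources) are all hypothetical, and the breadth-first strategy is an assumption.

```python
from functools import lru_cache

# Hypothetical toy data standing in for the remote knowledge sources.
TOY_SOURCE = {
    "disease:A": [("gene:1", "associated_with"), ("phenotype:X", "has_phenotype")],
    "gene:1": [("protein:P", "encodes"), ("pathway:W", "participates_in")],
    "protein:P": [("pathway:W", "participates_in")],
}


@lru_cache(maxsize=None)
def query_neighbors(node_id):
    """Return (neighbor, edge_type) pairs for a node.

    lru_cache gives non-persistent, in-process caching, so a node that is
    reached more than once is only looked up once.
    """
    return tuple(TOY_SOURCE.get(node_id, ()))


def expand(seeds, max_depth):
    """Expand a graph breadth-first from the seed nodes for max_depth rounds."""
    nodes = set(seeds)
    edges = set()
    frontier = list(seeds)
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for neighbor, edge_type in query_neighbors(node):
                edges.add((node, edge_type, neighbor))
                if neighbor not in nodes:
                    nodes.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = expand(["disease:A"], max_depth=2)
    print(len(nodes), "nodes,", len(edges), "relationships")
```

A selective rule for node extension could be added in this sketch as a predicate checked before a neighbor is appended to `next_frontier`.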
We benchmarked the path-finding performance of this system and found that a typical shortest-paths query with two fixed endpoints takes 50 ms; this approach should therefore have low query-response latency. To leverage PubMed abstract co-occurrence information in scoring a path in the knowledge graph, we used high-performance software (from Dr. Liang Huang's lab) for indexing PubMed, and we are in the process of obtaining Normalized Google Distance (NGD) scores for pairs of genetic conditions and diseases (for Q1) and for pairs of drugs and diseases (for Q2). With the knowledge graph in hand, we are in the process of refining our path-scoring approaches for Q1 and Q2 in preparation for the POC demo.
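For reference, the standard definition of Normalized Google Distance can be computed from co-occurrence counts as in the sketch below. The function name, its parameter names, and the choice to return infinity when a count is zero are assumptions made for illustration; the counts would come from a PubMed index, which is not shown here.

```python
import math


def normalized_google_distance(count_x, count_y, count_xy, total_docs):
    """Compute NGD from document counts.

    count_x, count_y: number of abstracts mentioning each term
    count_xy:         number of abstracts mentioning both terms
    total_docs:       total number of abstracts indexed (assumed to be
                      larger than either single-term count)
    """
    if count_x <= 0 or count_y <= 0 or count_xy <= 0:
        # Assumption: terms that never co-occur are treated as infinitely distant.
        return math.inf
    log_x = math.log(count_x)
    log_y = math.log(count_y)
    log_xy = math.log(count_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(total_docs) - min(log_x, log_y))
```

Under this definition, a value near 0 indicates terms that almost always appear together, and larger values indicate weaker association.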

Agency: National Institutes of Health (NIH)
Institute: National Center for Advancing Translational Sciences (NCATS)
Project #: 1OT2TR002520-01
Application #: 9613383
Study Section: Special Emphasis Panel (ZTR1)
Program Officer: Colvis, Christine
Project Start: 2017-12-29
Project End: 2019-12-28
Budget Start: 2017-12-29
Budget End: 2019-12-28
Support Year: 1
Fiscal Year: 2018
Total Cost:
Indirect Cost:
Name: Oregon State University
Department:
Type: Schools of Public Health
DUNS #: 053599908
City: Corvallis
State: OR
Country: United States
Zip Code: 97331