The vast amounts of biomedical information that accumulate in modern databases (such as MEDLINE of the National Library of Medicine) impose a great difficulty for efficient wide surveying by researchers who try to evaluate new information considering existing biomedical literature even when advanced information search engines are used. Automated hypotheses generation systems are designed to help scientists to overcome these difficulties and accelerate their research. The pandemic situation with COVID-19 is precisely one of the cases when such systems can play an extremely important role in coping with the coronavirus. Using two different AI approaches, we have developed two systems to discover plausible hypotheses in the biomedical domain. In this project, we will will deploy the COVID-19 customized hypothesis generation and knowledge discovery system, massively run it on any relevant to this research queries, and publish the results (including trained AI models, and discovered information) in the open domain for broad scientific community with the goal to accelerate the COVID-19 research. This work focuses heavily on addressing fundamental knowledge discovery questions by modeling and formulating scientific hypotheses using the publicly available information in the biomedical domain. However, in general, these methods are not restricted to any specific information domain, i.e., they can be broadly used to discover knowledge in texts. Although our experimental work will be related to COVID-19, the methods can be applied with some reservations to any literature-based analysis. For example, in the Materials Science Initiative, one of the goals is to establish a systematic understanding of the material properties and discover new materials which can be done by analyzing using the massive corpus of papers. In the legal world, identifying related patents can be done using a similar hypothesis modeling methodology.
In the heart of the proposed approach lies a big multi-modal and multi-relational semantic knowledge network of all biomedical objects extracted from a variety of heterogeneous databases of the National Library of Medicine. These objects include but are not limited to scientific papers, abstracts, keywords, phrases, elements of thesaurus, genes, proteins, mutations, pathways, diseases, and diagnoses. We will leverage two systems, namely MOLIERE and AGATHA, that are based on structural and deep learning, respectively. We will customize them using the rapidly updated dataset of new papers that has not been yet processed by the National Library of Medicine but already exists in the open domain such as in various preprint archives and reports. The MOLIERE system is based on the network analysis techniques applied on the graph constructed using the low-dimensional embedding of the papers with the result interpretation methods that are based on the probabilistic topic modeling. The AGATHA system processes texts at much finer granularity, and creates a semantic knowledge network using more accurate embedding techniques followed by the deep learning training for knowledge discovery. Two systems complement each other. While the AGATHA is of higher quality, the MOLIERE is more interpretable. A combination of both will be leveraged in this research.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.