Scientific disciplines generate huge volumes of research publications, which are of tremendous value but far exceed researchers' capacity to digest and analyze. There is a critical need to automatically transform research text, with the help of widely available general knowledge bases, into structured information networks on which advanced search and analytics tools can be built, helping researchers and practitioners quickly locate knowledge, make inferences, and even generate new scientific hypotheses.
This project aims to develop a new data-to-network-to-knowledge (D2N2K) paradigm that transforms massive, unstructured but interconnected research text into actionable knowledge by integrating semi-structured and unstructured data. First, organized heterogeneous information networks (henceforth called StructNet) are constructed; then, powerful mining mechanisms are developed on these networks. With a focus on the biomedical sciences, the project investigates the principles, methodologies, and algorithms for (i) constructing relatively structured heterogeneous information networks (called MediNet) by mining biomedical research corpora via attribute extraction, relation typing, and claim mining, and (ii) exploring and mining the resulting networks via graph OLAP and task-guided embedding. The project develops an extensible framework to facilitate literature-based scientific research. The study of MediNet construction and exploration not only impacts biomedical research but also consolidates the data-to-network-to-knowledge methodology, which can be readily transferred to other domains to automatically transform their massive unstructured text data into structured, actionable knowledge.