Flexible NLP system for MEDLINE information extraction

Daraselia, Nikolai

Abstract

This Small Business Innovation Research Phase I project focuses on the development of a fully automatic system for extracting protein function information from MEDLINE abstracts and converting it into the form of a conceptual graph. All existing protein function databases depend on human experts, who cannot keep up with the exponential growth of protein function information freely available in MEDLINE. There is an urgent need for an automatic system capable of extracting protein function information from the literature. The proposed system will be based on advanced natural language processing (NLP) technologies, which offer a fast and reliable way to extract information about protein function from human-readable sources. To this end, we have developed and tested MedScan, a prototype of such a system that parses scientific abstracts and converts protein function information into the form of a conceptual graph. It consists of a preprocessor module that selects candidate sentences from MEDLINE, an NLP module that uses a proprietary linguistic model to parse the selected sentences, and an information extraction module that uses an ontology developed for this purpose to extract and validate protein function information. The results of the MedScan evaluation indicate that it is a feasible candidate for the proposed task. In Phase II, the software system will be developed to help researchers quickly access, search, and navigate MEDLINE content, and to visualize and analyze large volumes of protein function data. We will also extend our approach to other areas, including pharmacogenomics and the extraction of clinically relevant information.
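
The three-module architecture described in the abstract (preprocessor, NLP parser, ontology-based extractor) can be illustrated with a small sketch. This is not MedScan's implementation: the linguistic model is proprietary and is not described here, so a simple token-window match stands in for the parser. The protein list, the relation vocabulary, and all function names (`select_candidates`, `parse`, `extract`, `run_pipeline`) are hypothetical and chosen only for illustration.

```python
import re
from typing import List, NamedTuple, Tuple

# Hypothetical toy ontology (an assumption, not MedScan's): known protein
# names and a mapping from surface verbs to relation types.
PROTEINS = {"RAF1", "MEK1", "TP53", "MDM2"}
RELATIONS = {"activates": "activation", "inhibits": "inhibition", "binds": "binding"}


class Edge(NamedTuple):
    """One edge of a conceptual graph: agent --relation--> target."""
    agent: str
    relation: str
    target: str


def select_candidates(abstract: str) -> List[str]:
    """Preprocessor: keep sentences that mention at least two known proteins."""
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    return [
        s for s in sentences
        if len(set(re.findall(r"[A-Za-z0-9]+", s)) & PROTEINS) >= 2
    ]


def parse(sentence: str) -> List[Tuple[str, str, str]]:
    """Stand-in for the NLP module: emit (left, verb, right) token triples
    around every known relation verb. A real parser would build a full
    syntactic or semantic structure instead."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return [
        (tokens[i - 1], tokens[i].lower(), tokens[i + 1])
        for i in range(1, len(tokens) - 1)
        if tokens[i].lower() in RELATIONS
    ]


def extract(triples: List[Tuple[str, str, str]]) -> List[Edge]:
    """Information extraction: validate triples against the ontology and
    keep only those whose agent and target are both known proteins."""
    return [
        Edge(agent, RELATIONS[verb], target)
        for agent, verb, target in triples
        if agent in PROTEINS and target in PROTEINS
    ]


def run_pipeline(abstract: str) -> List[Edge]:
    """Chain the three stages and return the graph as a list of edges."""
    edges: List[Edge] = []
    for sentence in select_candidates(abstract):
        edges.extend(extract(parse(sentence)))
    return edges


if __name__ == "__main__":
    text = "RAF1 activates MEK1. The weather was fine. MDM2 inhibits TP53 in vitro."
    for edge in run_pipeline(text):
        print(edge)
```

Keeping the stages as separate functions mirrors the modular design in the abstract: the sentence filter, the parser, and the ontology check can each be replaced independently, for example by swapping the token-window match for a full linguistic parser.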