This Small Business Innovation Research (SBIR) Phase I project focuses on developing a fully automatic system that extracts protein function information from MEDLINE abstracts and converts it into a conceptual graph. All existing protein function databases depend on human experts, who cannot keep up with the exponential growth of protein function information freely available in MEDLINE. There is an urgent need for an automatic system capable of extracting protein function information from the literature. The proposed system will be based on advanced natural language processing (NLP) technologies, which offer a fast and reliable way to extract information about protein function from human-readable sources. To this end, we have developed and tested MedScan, a prototype of such a system that parses scientific abstracts and converts protein function information into a conceptual graph. It consists of a preprocessor module that selects candidate sentences from MEDLINE, an NLP module that uses a proprietary linguistic model to parse the selected sentences, and an information extraction module that uses an ontology we developed to extract and validate protein function information. The results of the MedScan evaluation indicate that it is a feasible candidate for the proposed task. In Phase II, the software system will be developed to help researchers quickly access, search, and navigate MEDLINE content, and to visualize and analyze large volumes of protein function data. We will also extend our approach to other areas, including pharmacogenomics and the extraction of clinically relevant information.
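
The three-module design described above (preprocessor, NLP parser, information extraction) can be illustrated with a minimal Python sketch. It is not MedScan's implementation: the protein dictionary, the verb-to-relation ontology, and every function name here are hypothetical, and a naive three-token window stands in for the proprietary linguistic model, which the abstract does not describe.

```python
import re
from typing import Dict, List, NamedTuple, Tuple

# Hypothetical toy resources; the real dictionary and ontology are not given in the abstract.
PROTEINS = {"MDM2", "p53", "AKT1", "GSK3B"}
ONTOLOGY = {  # surface verb -> relation type (assumed for illustration)
    "binds": "Binding",
    "phosphorylates": "Phosphorylation",
    "inhibits": "Inhibition",
}


class Relation(NamedTuple):
    agent: str
    relation: str
    target: str


def select_candidates(abstract: str) -> List[str]:
    """Preprocessor: keep sentences that name at least two known proteins."""
    sentences = re.split(r"(?<=[.!?])\s+", abstract.strip())
    keep = []
    for sentence in sentences:
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        if len(tokens & PROTEINS) >= 2:
            keep.append(sentence)
    return keep


def parse(sentence: str) -> List[Tuple[str, ...]]:
    """Stand-in for the NLP module: every window of three adjacent tokens."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]


def extract(triples: List[Tuple[str, ...]]) -> List[Relation]:
    """Information extraction: keep triples validated by dictionary and ontology."""
    found = []
    for agent, verb, target in triples:
        if agent in PROTEINS and target in PROTEINS and verb in ONTOLOGY:
            found.append(Relation(agent, ONTOLOGY[verb], target))
    return found


def to_graph(relations: List[Relation]) -> Dict[str, List[Tuple[str, str]]]:
    """Store relations as a simple adjacency map: agent -> [(relation, target)]."""
    graph: Dict[str, List[Tuple[str, str]]] = {}
    for rel in relations:
        graph.setdefault(rel.agent, []).append((rel.relation, rel.target))
    return graph


def run_pipeline(abstract: str) -> Dict[str, List[Tuple[str, str]]]:
    """Chain the three stages over one abstract."""
    relations: List[Relation] = []
    for sentence in select_candidates(abstract):
        relations.extend(extract(parse(sentence)))
    return to_graph(relations)
```

For example, `run_pipeline("MDM2 binds p53 in vitro. The weather was nice. AKT1 phosphorylates GSK3B.")` drops the middle sentence at the preprocessing stage and yields one edge for each of the other two. A real parser would also have to handle passive voice, negation, and protein name synonyms, none of which this sketch attempts.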

Agency
National Institutes of Health (NIH)
Institute
National Institute of General Medical Sciences (NIGMS)
Type
Small Business Innovation Research Grants (SBIR) - Phase I (R43)
Project #
1R43GM067276-01A1
Application #
6693928
Study Section
Special Emphasis Panel (ZRG1-SSS-2 (10))
Program Officer
Ikeda, Richard A
Project Start
2003-08-01
Project End
2004-01-31
Budget Start
2003-08-01
Budget End
2004-01-31
Support Year
1
Fiscal Year
2003
Total Cost
$100,000
Indirect Cost
Name
Ariadne Genomics, Inc.
Department
Type
DUNS #
118057202
City
Rockville
State
MD
Country
United States
Zip Code
20850
Egorov, Sergei; Yuryev, Anton; Daraselia, Nikolai (2004) A simple and practical dictionary-based approach for identification of proteins in Medline abstracts. J Am Med Inform Assoc 11:174-8