A Document Processing System

Wilbur, Willy

Abstract

A system of C++ programs has been developed to find closely related documents in MEDLINE and to perform machine learning on sets of documents. The system has a number of unique features: 1) It is built on a number of C++ classes and is highly modular, so alterations to the system are relatively simple to make. 2) The system currently processes PubMed data by extracting it from the Sybase repositories through a C++ interface to Sybase. However, a change in the interface portion of the system would allow it to be applied to any large database consisting of discrete textual records. 3) Data processed by the system is stored in compressed file structures. These structures are updatable, so new data can be added continually as it becomes available. 4) Documents are compared with each other using a Bayesian form of analysis. 5) The latest work on this system has added the ability to generate themes using an EM algorithm approach. The code has also recently been multithreaded, and memory-mapping capabilities have been added, to speed up processing. The system described here is now used not only to process all of MEDLINE for our research purposes, but also by other groups inside and outside the NLM to produce related documents for arbitrary pieces of text. The system is currently proving useful in testing different retrieval parameters and methods on the PubMed Health records.
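
The modularity described in point 2 of the abstract, where only the interface portion would need to change to support a different database, can be illustrated with a small C++ sketch. This is not the project's code: the names TextRecord, RecordSource, InMemorySource and countTerms are hypothetical, and the whitespace tokenization is an assumption made only to keep the example short.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type: one discrete textual record with an identifier.
struct TextRecord {
    std::string id;
    std::string text;
};

// Hypothetical abstract interface standing in for the "interface portion".
// A database-backed source would implement the same method.
class RecordSource {
public:
    virtual ~RecordSource() = default;
    // Writes the next record into `out`; returns false when none remain.
    virtual bool next(TextRecord& out) = 0;
};

// Stand-in source that serves records from memory instead of a database.
class InMemorySource : public RecordSource {
public:
    explicit InMemorySource(std::vector<TextRecord> records)
        : records_(std::move(records)) {}

    bool next(TextRecord& out) override {
        if (pos_ >= records_.size()) return false;
        out = records_[pos_++];
        return true;
    }

private:
    std::vector<TextRecord> records_;
    std::size_t pos_ = 0;
};

// Processing code depends only on RecordSource, so swapping the source
// does not require any change here. Counts whitespace-delimited tokens
// across all records (an assumed, deliberately simple tokenization).
std::map<std::string, int> countTerms(RecordSource& source) {
    std::map<std::string, int> counts;
    TextRecord record;
    while (source.next(record)) {
        std::istringstream tokens(record.text);
        std::string token;
        while (tokens >> token) ++counts[token];
    }
    return counts;
}
```

Under these assumptions, moving to a different repository would mean writing another class derived from RecordSource, while countTerms and any later processing stay untouched.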

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Investigator-Initiated Intramural Research Projects (ZIA)
Project #: 1ZIALM000022-20
Application #: 8344939
Support Year: 20
Fiscal Year: 2011
Total Cost: $79,948

Institution

Name: National Library of Medicine

Related projects


NIH 2015 ZIA LM	A Document Processing System Wilbur, Willy / National Library of Medicine
NIH 2014 ZIA LM	A Document Processing System Wilbur, Willy / National Library of Medicine
NIH 2013 ZIA LM	A Document Processing System Wilbur, Willy / National Library of Medicine	$102,732
NIH 2012 ZIA LM	A Document Processing System Wilbur, Willy / National Library of Medicine	$86,768
NIH 2011 ZIA LM	A Document Processing System Wilbur, Willy / National Library of Medicine	$79,948
NIH 2010 ZIA LM	A Document Processing System Wilbur, Willy / National Library of Medicine	$176,283
NIH 2009 ZIA LM	A Document Processing System Wilbur, Willy / National Library of Medicine	$202,712

Publications

Kim, Sun; Lu, Zhiyong; Wilbur, W John (2015) Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 16:57

Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:

Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel et al. (2013) An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database (Oxford) 2013:bas056
