Recently we have been involved in four subprojects which use natural language processing techniques: 1) The presence of unrecognized abbreviations in text hinders indexing algorithms and adversely affects information retrieval and extraction. Automatic abbreviation definition identification can help resolve these issues. However, abbreviations and their definitions identified by an automatic process are of uncertain validity. Due to the size of databases such as MEDLINE only a small fraction of abbreviation-definition pairs can be examined manually. An automatic way to estimate the accuracy of abbreviation-definition pairs extracted from text is needed. We have proposed an abbreviation definition identification algorithm that employs a variety of strategies to identify the most probable abbreviation definition. In addition our algorithm produces an accuracy estimate, pseudo-precision, for each strategy without using a human-judged gold standard. The pseudo-precisions determine the order in which the algorithm applies the strategies in seeking to identify the definition of an abbreviation. The results are generally a couple of percentage points better than the Schwartz-Hearst algorithm and also allow one to enforce a threshold for those applications where high precision is critical. In recent work we are extending this approach using machine learning to apply it to more abbreviation instances. 2) We are studying paraphrases in MEDLINE abstracts. These come about because an author is describing some entity of interest and uses a phrase like """"""""drug abuse"""""""" and then needing to describe the same entity again a sentence or two latter does not wish to use exactly the same wording again and may use a variant of the phrase such as """"""""drug use"""""""" which in the context of """"""""drug abuse"""""""" has substantially the same meaning. 3) An author disambiguation algorithm has been developed which relies on machine learning based on the assumption that if an author name is infrequent in the data it probably represents the same person in for all documents where it is found. This gives us positive instances. Negative instances are sampled from pairs of documents that have no author in common. Such positive and negative data allows us to do machine learning on all aspects of the document other than the name in question. This allows us to learn how to weight this data for best performance in distinguishing the positive and negative instances from each other. This learning is then applied in individual name cases or spaces to determine which author document pairs represent the same author.

Project Start
Project End
Budget Start
Budget End
Support Year
12
Fiscal Year
2010
Total Cost
$450,501
Indirect Cost
Name
National Library of Medicine
Department
Type
DUNS #
City
State
Country
Zip Code
Kim, Sun; Lu, Zhiyong; Wilbur, W John (2015) Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 16:57
Kim, Sun; Liu, Haibin; Yeganova, Lana et al. (2015) Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach. J Biomed Inform 55:23-30
Comeau, Donald C; Liu, Haibin; Islamaj Do?an, Rezarta et al. (2014) Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus. Database (Oxford) 2014:
Comeau, Donald C; Batista-Navarro, Riza Theresa; Dai, Hong-Jie et al. (2014) BioC interoperability track overview. Database (Oxford) 2014:
Islamaj Do?an, Rezarta; Comeau, Donald C; Yeganova, Lana et al. (2014) Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database (Oxford) 2014:
Liu, Wanli; Islamaj Do?an, Rezarta; Kwon, Dongseop et al. (2014) BioC implementations in Go, Perl, Python and Ruby. Database (Oxford) 2014:
Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:
Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel et al. (2013) An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database (Oxford) 2013:bas056
Wilbur, W John; Smith, Larry (2013) A Study of the Morpho-Semantic Relationship in Medline. Open Inf Syst J 6:1-12
Kim, Sun; Kim, Won; Wei, Chih-Hsuan et al. (2012) Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information. Database (Oxford) 2012:bas042

Showing the most recent 10 out of 16 publications