Recently we have been involved in four subprojects which use natural language processing techniques:

1) We have developed a machine learning algorithm for identifying abbreviation definitions in text that makes use of what we term naturally labeled data. Positive training examples are naturally occurring potential abbreviation-definition pairs in text. Negative training examples are generated by randomly pairing potential abbreviations with unrelated potential definitions. The machine learner is trained to distinguish between these two sets of examples, and the learned feature weights are then used to identify the full form of an abbreviation. This approach does not require manually labeled training data. We evaluated the algorithm on the Ab3P, BIOADI and Medstract corpora. We achieve an F-measure of 91.36% on the Ab3P corpus and 87.13% on the BIOADI corpus, both superior to the results reported by the Ab3P and BIOADI systems. Moreover, we outperform these systems in terms of recall, which is one of our goals.

2) We are studying paraphrases in MEDLINE abstracts. These arise when an author describes an entity of interest with a phrase such as "drug abuse" and then, needing to refer to the same entity a sentence or two later, avoids repeating the exact wording and uses a variant such as "drug use", which in the context of "drug abuse" has substantially the same meaning.

3) An author disambiguation algorithm has been developed that relies on machine learning. It is based on the assumption that if an author name is infrequent in the data, it probably represents the same person in all documents where it is found. This gives us positive instances. Negative instances are sampled from pairs of documents that have no author in common. Such positive and negative data allow us to do machine learning on all aspects of the document other than the name in question, and so to learn how to weight these aspects for best performance in distinguishing the positive instances from the negative ones. This learning is then applied in individual name cases or spaces to determine which author-document pairs represent the same author.
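The construction of positive and negative instances described for author disambiguation can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function name `build_training_pairs`, the `max_name_count` threshold used to decide that a name is "infrequent", and the toy documents are all hypothetical. The sketch only builds the labeled document pairs; computing features from the rest of each document and fitting a classifier are left out.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_training_pairs(docs, max_name_count=2, n_negative=None, seed=0):
    """Build naturally labeled document pairs.

    docs: dict mapping a document id to the set of author names on it.
    max_name_count: assumed cutoff; a name seen in at most this many
        documents is treated as infrequent (hypothetical parameter).
    Returns (positives, negatives) as sorted lists of (doc_id, doc_id).
    """
    rng = random.Random(seed)

    # Index documents by author name.
    name_to_docs = defaultdict(list)
    for doc_id, authors in docs.items():
        for name in authors:
            name_to_docs[name].append(doc_id)

    # Positive pairs: two documents sharing an infrequent name are
    # assumed to be by the same person.
    positives = set()
    for ids in name_to_docs.values():
        if 2 <= len(ids) <= max_name_count:
            positives.update(combinations(sorted(ids), 2))

    # Negative pairs: sampled from document pairs with no author in common.
    candidates = [
        (a, b)
        for a, b in combinations(sorted(docs), 2)
        if not (docs[a] & docs[b])
    ]
    if n_negative is None:
        n_negative = len(positives)
    k = min(n_negative, len(candidates))
    negatives = rng.sample(candidates, k)

    return sorted(positives), sorted(negatives)


# Toy data (invented for illustration only).
toy_docs = {
    "d1": {"Zyxwv Q", "Smith J"},
    "d2": {"Zyxwv Q", "Lee K"},
    "d3": {"Smith J", "Patel R"},
    "d4": {"Chen L"},
    "d5": {"Smith J"},
}
positives, negatives = build_training_pairs(toy_docs)
```

Here the rare name "Zyxwv Q" links d1 and d2 as a positive pair, while the more common "Smith J" is not used as evidence. Negatives are balanced against positives by default, which is a design choice of this sketch rather than something the text specifies.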

National Library of Medicine
Kim, Sun; Lu, Zhiyong; Wilbur, W John (2015) Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 16:57
Kim, Sun; Liu, Haibin; Yeganova, Lana et al. (2015) Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach. J Biomed Inform 55:23-30
Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:
Comeau, Donald C; Liu, Haibin; Islamaj Doğan, Rezarta et al. (2014) Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus. Database (Oxford) 2014:
Comeau, Donald C; Batista-Navarro, Riza Theresa; Dai, Hong-Jie et al. (2014) BioC interoperability track overview. Database (Oxford) 2014:
Islamaj Doğan, Rezarta; Comeau, Donald C; Yeganova, Lana et al. (2014) Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database (Oxford) 2014:
Liu, Wanli; Islamaj Doğan, Rezarta; Kwon, Dongseop et al. (2014) BioC implementations in Go, Perl, Python and Ruby. Database (Oxford) 2014:
Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel et al. (2013) An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database (Oxford) 2013:bas056
Wilbur, W John; Smith, Larry (2013) A Study of the Morpho-Semantic Relationship in Medline. Open Inf Syst J 6:1-12
Kim, Sun; Kim, Won; Wei, Chih-Hsuan et al. (2012) Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information. Database (Oxford) 2012:bas042
