Recently we have been involved in three subprojects that use natural language processing techniques:

1) We have developed a machine learning algorithm for identifying abbreviation definitions in text that makes use of what we term naturally labeled data. Positive training examples are naturally occurring potential abbreviation-definition pairs in text. Negative training examples are generated by randomly pairing potential abbreviations with unrelated potential definitions. The machine learner is trained to distinguish between these two sets of examples, and the learned feature weights are then used to identify the full form of an abbreviation. This approach does not require manually labeled training data. We evaluated the algorithm on the Ab3P, BIOADI and Medstract corpora. We achieve an F-measure of 91.36% on the Ab3P corpus and 87.13% on the BIOADI corpus, both superior to the results reported for the Ab3P and BIOADI systems. Moreover, we outperform these systems in recall, which is one of our goals.

2) We are studying paraphrases in MEDLINE abstracts. These arise when an author describes an entity of interest with a phrase such as "drug abuse" and then, needing to refer to the same entity a sentence or two later, avoids repeating the exact wording and instead uses a variant such as "drug use", which in the context of "drug abuse" has substantially the same meaning.

3) We have developed an author disambiguation algorithm that relies on machine learning. It is based on the assumption that if an author name is infrequent in the data, it probably represents the same person in all documents where it is found. This gives us positive instances. Negative instances are sampled from pairs of documents that have no author in common. Such positive and negative data allow us to do machine learning on all aspects of the document other than the name in question, and thus to learn how to weight these aspects for best performance in distinguishing positive from negative instances. This learning is then applied in individual name cases or spaces to determine which author-document pairs represent the same author.
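
The construction of training data for author disambiguation described above can be illustrated with a minimal sketch. It is not the project's implementation: the function name `build_training_pairs`, the frequency cutoff `max_name_count`, and the uniform sampling of negatives are assumptions made for illustration, since the abstract does not specify how "infrequent" is defined or how negative pairs are sampled.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_training_pairs(docs, max_name_count=2, n_negative=None, seed=0):
    """Build naturally labeled document pairs.

    docs: dict mapping a document id to the set of author names on it.
    max_name_count: hypothetical cutoff; a name seen in at most this many
        documents is treated as infrequent.
    Returns (positives, negatives) as lists of (doc_id, doc_id) tuples.
    """
    # Index documents by author name.
    name_to_docs = defaultdict(list)
    for doc_id, authors in docs.items():
        for name in authors:
            name_to_docs[name].append(doc_id)

    # Positive pairs: two documents sharing an infrequent name are assumed
    # to share the same person.
    positives = set()
    for name, ids in name_to_docs.items():
        if 2 <= len(ids) <= max_name_count:
            positives.update(combinations(sorted(ids), 2))

    # Negative pairs: sampled from document pairs with no author in common.
    candidates = [
        (a, b)
        for a, b in combinations(sorted(docs), 2)
        if not docs[a] & docs[b]
    ]
    rng = random.Random(seed)
    n = len(positives) if n_negative is None else n_negative
    negatives = rng.sample(candidates, min(n, len(candidates)))

    return sorted(positives), negatives
```

A classifier would then be trained on features of each document pair that exclude the name in question (for example, coauthors, affiliations, or title words; these feature choices are likewise assumptions), and its learned weights applied to document pairs that share a common, ambiguous name.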

National Library of Medicine
Comeau, Donald C; Batista-Navarro, Riza Theresa; Dai, Hong-Jie et al. (2014) BioC interoperability track overview. Database (Oxford) 2014:
Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:
Liu, Wanli; Islamaj Dogan, Rezarta; Kwon, Dongseop et al. (2014) BioC implementations in Go, Perl, Python and Ruby. Database (Oxford) 2014:
Comeau, Donald C; Liu, Haibin; Islamaj Dogan, Rezarta et al. (2014) Natural language processing pipelines to annotate BioC collections with an application to the NCBI disease corpus. Database (Oxford) 2014:
Islamaj Dogan, Rezarta; Comeau, Donald C; Yeganova, Lana et al. (2014) Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database (Oxford) 2014:
Rzhetsky, Andrey; Shatkay, Hagit; Wilbur, W John (2009) How to get the most out of your curation effort. PLoS Comput Biol 5:e1000391
Smith, Larry H; Wilbur, W John (2009) The value of parsing as feature generation for gene mention recognition. J Biomed Inform 42:895-904
Krallinger, Martin; Morgan, Alexander; Smith, Larry et al. (2008) Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol 9 Suppl 2:S1
Sohn, Sunghwan; Comeau, Donald C; Kim, Won et al. (2008) Abbreviation definition identification based on automatic precision estimates. BMC Bioinformatics 9:402