Recently we have been involved in four subprojects that use natural language processing techniques:
1) The presence of unrecognized abbreviations in text hinders indexing algorithms and adversely affects information retrieval and extraction. Automatic abbreviation definition identification can help resolve these issues. However, abbreviations and their definitions identified by an automatic process are of uncertain validity. Due to the size of databases such as MEDLINE, only a small fraction of abbreviation-definition pairs can be examined manually. An automatic way to estimate the accuracy of abbreviation-definition pairs extracted from text is needed. We have proposed an abbreviation definition identification algorithm that employs a variety of strategies to identify the most probable abbreviation definition. In addition, our algorithm produces an accuracy estimate, pseudo-precision, for each strategy without using a human-judged gold standard. The pseudo-precisions determine the order in which the algorithm applies the strategies in seeking to identify the definition of an abbreviation. The results are generally a couple of percentage points better than those of the Schwartz-Hearst algorithm, and the pseudo-precisions also allow one to enforce a threshold for applications where high precision is critical.
2) A significant fraction of queries in PubMed are multiterm queries, and PubMed generally handles them as a Boolean conjunction of the terms. However, analysis of queries in PubMed indicates that many such queries are meaningful phrases rather than simply collections of terms. We have examined whether it makes a difference, in terms of retrieval quality, if such queries are interpreted as a phrase or as a conjunction of query terms and, if it does, what the optimal way of searching with such queries is.
To address the question, we developed an automated retrieval evaluation method, based on machine learning techniques, that enables us to evaluate and compare various retrieval outcomes. We show that classes of records that contain all the search terms, but not the phrase, differ qualitatively from the class of records containing the phrase. We also show that the difference is systematic, depending on the proximity of the query terms to each other within the record. Based on these results, one can establish the best retrieval order for the records. Our findings are consistent with studies in proximity searching. The important insight here for indexing is that in some cases where the words of a phrase occur in text, but not as the phrase, the phrase may still be an appropriate concept to use in indexing the text.
3) We have developed a spell checking algorithm that achieves quite accurate correction (87%) and handles one or two edits, and more edits if the string to be corrected is sufficiently long. It also handles words that are fragmented or merged. Where queries consist of more than a single token, the algorithm attempts to use the additional information as context to aid the correction process. The algorithm is based on the noisy channel model of spelling correction and makes use of statistics on misspellings gathered from approximately one million misspelling incidents in the PubMed log files. These incidents were identified as cases where a user entered a query and then, within five minutes, corrected that query to another term that is close in edit distance and has at least ten times as many hits in the PubMed database. These statistics are used not only in the actual correction process but also to simulate misspellings in real words and phrases, in order to discover the regions of validity of the correction method and to estimate its accuracy.
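The log-based criterion for a misspelling incident can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in PubMed: the tuple layout, the function names, and the edit-distance cutoff of 2 are hypothetical choices made for the example, while the five-minute window and the tenfold hit ratio come from the description above.

```python
from datetime import datetime, timedelta


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_misspelling_incident(first, second, max_edits=2,
                            window=timedelta(minutes=5), hit_ratio=10):
    """Decide whether two consecutive queries from one user look like a
    misspelling followed by its correction.

    first, second: (query, timestamp, hit_count) tuples (assumed layout).
    max_edits: assumed cutoff for "close in edit distance".
    """
    q1, t1, hits1 = first
    q2, t2, hits2 = second
    if q1 == q2:
        return False
    if not (timedelta(0) <= t2 - t1 <= window):
        return False
    if edit_distance(q1, q2) > max_edits:
        return False
    # The correction must retrieve at least ten times as many records.
    return hits2 > 0 and hits2 >= hit_ratio * hits1


if __name__ == "__main__":
    t0 = datetime(2010, 1, 1, 12, 0, 0)
    typo = ("diabtes", t0, 12)  # hit counts are made-up example values
    fixed = ("diabetes", t0 + timedelta(minutes=2), 500000)
    print(is_misspelling_incident(typo, fixed))
```

Pairs that pass such a filter could then be aligned character by character to count which edits occur, which is the kind of statistic a noisy channel error model needs.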
Additional work was done on the vocabulary of the PubMed database to remove frequent misspellings and improve performance. The algorithm is implemented in the PubMed search engine, where it frequently makes over 200,000 suggestions in a day, and about 45% of these suggestions are accepted by users. The algorithm is efficient, adding only about 25% to the average query response time, and much of this overhead is incurred only for misspelled queries. The algorithm could be improved by using more context around the sites of errors within words, and by learning to make better use of the context supplied by queries consisting of multiple tokens. In both cases, however, such an effort must consider how to maintain efficiency given the huge vocabulary of phrases (>14 million) and individual words (>2.5 million) recognized by the search engine. Phonetic encodings could also be used to improve the handling of some of the errors that currently challenge the system. However, preliminary calculations suggest that it would be difficult to achieve a major improvement this way.
4) We explored a syntactic approach to sentence compression in the biomedical domain, grounded in the context of result presentation for related article search in the PubMed search engine. By automatically trimming inessential fragments of article titles, a system can effectively display more results in the same amount of space. Our implemented prototype operates by applying a sequence of syntactic trimming rules over the parse trees of article titles. Two separate studies were conducted using a corpus of manually compressed examples from MEDLINE: an automatic evaluation using BLEU and a summative evaluation involving human assessors.
Experiments show that a syntactic approach to sentence compression is effective in the biomedical domain and that the presentation of compressed article titles supports accurate interest judgments, that is, decisions by users as to whether an article is worth examining in more detail.