Recently we have been involved in four subprojects that use natural language processing techniques:

1) The presence of unrecognized abbreviations in text hinders indexing algorithms and adversely affects information retrieval and extraction. Automatic abbreviation definition identification can help resolve these issues. However, abbreviations and their definitions identified by an automatic process are of uncertain validity. Because of the size of databases such as MEDLINE, only a small fraction of abbreviation-definition pairs can be examined manually, so an automatic way to estimate the accuracy of abbreviation-definition pairs extracted from text is needed. We have proposed an abbreviation definition identification algorithm that employs a variety of strategies to identify the most probable abbreviation definition. In addition, our algorithm produces an accuracy estimate, pseudo-precision, for each strategy without using a human-judged gold standard. The pseudo-precisions determine the order in which the algorithm applies the strategies in seeking to identify the definition of an abbreviation. The results are generally a couple of percentage points better than those of the Schwartz-Hearst algorithm, and the estimates also allow one to enforce a threshold for those applications where high precision is critical.

2) A significant fraction of queries in PubMed are multiterm queries, and PubMed generally handles them as a Boolean conjunction of the terms. However, analysis of queries in PubMed indicates that many such queries are meaningful phrases rather than simply collections of terms. We have examined whether it makes a difference, in terms of retrieval quality, if such queries are interpreted as a phrase or as a conjunction of query terms and, if it does, what the optimal way of searching with such queries is.
To address the question, we developed an automated retrieval evaluation method, based on machine learning techniques, that enables us to evaluate and compare various retrieval outcomes. We show that classes of records that contain all the search terms, but not the phrase, differ qualitatively from the class of records containing the phrase. We also show that the difference is systematic, depending on the proximity of query terms to each other within the record. Based on these results, one can establish the best retrieval order for the records. Our findings are consistent with studies in proximity searching. The important insight here for indexing is that in some cases where the words of a phrase occur in text, but not as the phrase, the phrase may still be an appropriate concept to use in indexing the text.

3) We have developed a spell checking algorithm that corrects quite accurately (87%) and handles one or two edits, and more edits if the string to be corrected is sufficiently long. It also handles words that are fragmented or merged. Where queries consist of more than a single token, the algorithm attempts to use the additional information as context to aid the correction process. The algorithm is based on the noisy channel model of spelling correction and makes use of statistics on misspellings gathered from approximately one million misspelling incidents in the PubMed log files. These incidents were identified as cases where a user entered a query and then within five minutes corrected that query to another term that is close in edit distance and has at least ten times as many hits in the PubMed database. These statistics are used not only in the actual correction process but also to simulate misspellings in real words and phrases, in order to discover the regions of validity of the correction method and to estimate its accuracy.
Additional work was done on the vocabulary of the PubMed database to remove frequent misspellings and improve performance. The algorithm is implemented in the PubMed search engine, where it frequently makes over 200,000 suggestions in a day, and about 45% of these suggestions are accepted by users. The algorithm is efficient, adding only about 25% to the average query response time for users, and much of this is seen only for misspelled queries. The algorithm might be improved by using more context around the sites of errors within words, or by learning to make better use of the context supplied by queries consisting of multiple tokens. In both cases, such an effort must consider how to maintain efficiency in light of the huge vocabulary of phrases (>14 million) and individual words (>2.5 million) recognized by the search engine. Phonetic encodings could also be used to improve the handling of some of the errors that currently challenge the system; however, preliminary calculations suggest it would be difficult to make a major improvement this way.

4) We explored a syntactic approach to sentence compression in the biomedical domain, grounded in the context of result presentation for related article search in the PubMed search engine. By automatically trimming inessential fragments of article titles, a system can effectively display more results in the same amount of space. Our implemented prototype operates by applying a sequence of syntactic trimming rules over the parse trees of article titles. Two separate studies were conducted using a corpus of manually compressed examples from MEDLINE: an automatic evaluation using BLEU and a summative evaluation involving human assessors. Experiments show that a syntactic approach to sentence compression is effective in the biomedical domain and that the presentation of compressed article titles supports accurate interest judgments, that is, decisions by users as to whether an article is worth examining in more detail.
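
As an illustration of the misspelling-incident criterion described in subproject 3 (a query corrected within five minutes to a term that is close in edit distance and has at least ten times as many hits), the following is a minimal Python sketch. It is not the PubMed implementation. The names `Query` and `is_misspelling_incident`, the edit-distance cutoff of 2, and the handling of zero-hit queries are assumptions made for the example; the text says only that the two queries are "close in edit distance".

```python
from dataclasses import dataclass

MAX_GAP_SECONDS = 5 * 60  # "within five minutes" (from the text)
MIN_HIT_RATIO = 10        # "at least ten times as many hits" (from the text)
MAX_EDIT_DISTANCE = 2     # assumption: the text does not give a cutoff


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute), two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


@dataclass
class Query:
    """Hypothetical log record: query text, time in seconds, hit count."""
    text: str
    timestamp: float
    hits: int


def is_misspelling_incident(first: Query, second: Query) -> bool:
    """True if `second` looks like a user's correction of `first`."""
    gap = second.timestamp - first.timestamp
    if not 0 <= gap <= MAX_GAP_SECONDS:
        return False
    if first.text == second.text:
        return False
    if edit_distance(first.text, second.text) > MAX_EDIT_DISTANCE:
        return False
    # Assumption: a zero-hit first query is treated as having one hit,
    # so the corrected query still needs at least MIN_HIT_RATIO hits.
    return second.hits >= MIN_HIT_RATIO * max(first.hits, 1)
```

For example, under these assumptions a query `diabetis` with 120 hits followed 40 seconds later by `diabetes` with 500,000 hits would count as an incident, while the same pair separated by more than five minutes would not.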

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000090-10
Application #: 7735077
Support Year: 10
Fiscal Year: 2008
Total Cost: $224,159
Name: National Library of Medicine
Country: United States
Yu, Hong; Kim, Won; Hatzivassiloglou, Vasileios et al. (2007) Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. J Biomed Inform 40:150-9
Wilbur, W John; Kim, Won; Xie, Natalie (2006) Spelling correction in the PubMed search engine. Inf Retr Boston 9:543-564
Wilbur, W John; Rzhetsky, Andrey; Shatkay, Hagit (2006) New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics 7:356
Kim, Won; Wilbur, W John (2005) A strategy for assigning new concepts in the MEDLINE database. AMIA Annu Symp Proc :395-9
Smith, L; Rindflesch, T; Wilbur, W J (2004) MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20:2320-1
Smith, L; Wilbur, W J (2004) Retrieving definitional content for ontology development. Comput Biol Chem 28:387-91
Yeganova, L; Smith, L; Wilbur, W J (2004) Identification of related gene/protein names based on an HMM of name variations. Comput Biol Chem 28:97-107
Smith, L; Yeganova, L; Wilbur, W J (2003) Hidden Markov models and optimized sequence alignments. Comput Biol Chem 27:77-84
Aronson, A R; Bodenreider, O; Chang, H F et al. (2000) The NLM Indexing Initiative. Proc AMIA Symp :17-21
Kim, W; Wilbur, W J (2000) Corpus-based statistical screening for phrase identification. J Am Med Inform Assoc 7:499-511
