Automatic Bayesian Methods In Text Retrieval

Wilbur, Willy

Abstract

Current work on the project focuses on developing an improved Bayesian classification model and new approaches to active learning with a Bayesian model. 1) We have developed term-based active learning methods, which provide a different approach to active learning, and have shown that they are in many cases more effective than simple uncertainty sampling or error reduction sampling. 2) The PubMed database presents a unique challenge because of its very large size of over 25 million records. Because of this size, few machine learning methods can be applied with a reasonable turnaround time. One method that can be applied efficiently is Naive Bayes, but it performs poorly when the classes to be distinguished differ markedly in size, and such an imbalance is common for the problems one wishes to study in PubMed. In this situation we have discovered that a training set much smaller than the whole set can be selected by an active-learning-inspired method. The result is an almost 200% improvement in the performance of Naive Bayes in classifying documents for MeSH term assignment. The results are significantly better than those of a KNN method, and there is the added advantage that the optimal training sets defined in this way can be used as training sets for more sophisticated machine learning methods, with even better results than those obtained from Naive Bayes. 3) We compute the documents related to a document using a probability calculation based on two Poisson distributions, one for the terms in a document that are more central to the document's content and one for the terms that are more peripheral. These are combined into a probability estimate of the importance of a term in a document based on its relative frequency in the document. This probability estimate is combined with the global IDF weight of a term to account for that term's importance in computing the similarity between two documents.
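The two-Poisson weighting in item 3 can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names, the per-token rates `lam_central` and `lam_peripheral`, the equal prior, the plain `log(N/df)` form of IDF, and the toy document frequencies are all assumptions chosen only to show how a posterior "central term" probability can be combined with a global weight.

```python
import math
from collections import Counter

def central_probability(k, length, lam_central, lam_peripheral, prior_central=0.5):
    """Posterior probability that a term seen k times in a document of
    `length` tokens came from the 'central' Poisson rather than the
    'peripheral' one. Rates are expected occurrences per token (assumed)."""
    # log of [P(k | peripheral) * (1 - prior)] / [P(k | central) * prior];
    # the k! and length**k factors of the two Poisson pmfs cancel.
    log_odds_against = (
        math.log((1.0 - prior_central) / prior_central)
        + k * math.log(lam_peripheral / lam_central)
        + (lam_central - lam_peripheral) * length
    )
    if log_odds_against > 700:  # exp() would overflow; posterior is ~0
        return 0.0
    return 1.0 / (1.0 + math.exp(log_odds_against))

def term_weights(tokens, doc_freq, n_docs, lam_central=0.02, lam_peripheral=0.01):
    """Local posterior times global IDF for each term with a known document frequency."""
    counts = Counter(tokens)
    weights = {}
    for term, k in counts.items():
        if term not in doc_freq:
            continue  # no global statistics for this term; skip it
        idf = math.log(n_docs / doc_freq[term])
        local = central_probability(k, len(tokens), lam_central, lam_peripheral)
        weights[term] = local * idf
    return weights

def similarity(weights_a, weights_b):
    """Sum of weight products over the terms the two documents share."""
    shared = weights_a.keys() & weights_b.keys()
    return sum(weights_a[t] * weights_b[t] for t in shared)

# Hypothetical collection statistics and two toy documents.
n_docs = 1000
doc_freq = {"poisson": 20, "model": 200, "retrieval": 50, "ranking": 40, "the": 1000}
doc_a = "poisson poisson model retrieval the".split()
doc_b = "poisson model ranking the the".split()
w_a = term_weights(doc_a, doc_freq, n_docs)
w_b = term_weights(doc_b, doc_freq, n_docs)
print(similarity(w_a, w_b))
```

In this sketch a term that occurs more often than the peripheral rate predicts receives a local weight closer to 1, while a term present in every document (here "the") contributes nothing because its IDF is zero.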
We have known since this approach was developed that it works well. In the last several years, data have become available in the TREC Genomics Track that have allowed us to test this approach by comparing it with the results of the BM25 formula developed by Robertson and colleagues. We find a small but statistically significant advantage for our probabilistic approach. 4) We are currently working on a problem that arises when several different kinds of documents appear in a dataset and one wants to compute neighboring documents for each document. A simple application of the same approach used to find related citations in PubMed does not produce good results. Analysis of the problem shows that many records contain words that are not keyed to the actual focus of the record and that these words mislead the neighboring process. In some cases this is due to a common author of records who uses certain word forms frequently in their writing, even on very different subjects. In other cases the problem seems to appear when two different drugs have sections on side effects that are quite generic and overlap heavily. We have found our best results with a completely automatic approach that examines how related each word in the body of a record is to the words in the record's title and removes all words whose relatedness falls below a certain low threshold. 5) Some of our latest work uses concepts that appear in multiple article titles to produce document clusters. These are then analyzed using naive Bayesian classification methods to ascertain their significance. Those that are significant are extended using the same Bayesian technique. The result is a set of concepts, each represented by a document cluster. This proves to be an effective way to produce significant clusters for relatively small data sets that are difficult to cluster by more standard methods.
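The title-based word filtering described in item 4 can be sketched as follows. The abstract does not say how relatedness between a body word and the title is measured, so this example assumes a simple stand-in: the Jaccard overlap of the sets of records in which two words occur, with a body word scored by its best match against any title word. The function names, the threshold value, and the toy records are hypothetical.

```python
from collections import defaultdict

def index_records(records):
    """Map each word to the set of record ids in which it occurs (title or body)."""
    postings = defaultdict(set)
    for rid, (title, body) in enumerate(records):
        for word in set(title) | set(body):
            postings[word].add(rid)
    return postings

def jaccard(a, b):
    """Overlap of two record-id sets; 0.0 when both are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def filter_body(title, body, postings, threshold):
    """Keep body words whose best relatedness to any title word reaches the threshold."""
    kept = []
    for word in body:
        score = max(
            (jaccard(postings.get(word, set()), postings.get(t, set())) for t in title),
            default=0.0,
        )
        if score >= threshold:
            kept.append(word)
    return kept

# Toy records: (title tokens, body tokens). "furthermore" stands in for a
# word form used across unrelated subjects.
records = [
    (["aspirin", "stroke"], ["aspirin", "stroke", "prevention", "furthermore"]),
    (["aspirin", "prevention"], ["aspirin", "prevention", "trial"]),
    (["protein", "folding"], ["protein", "folding", "furthermore"]),
    (["gene", "expression"], ["gene", "expression", "furthermore"]),
    (["sleep", "apnea"], ["sleep", "apnea", "furthermore"]),
]
postings = index_records(records)
title, body = records[0]
filtered = filter_body(title, body, postings, threshold=0.3)
print(filtered)
```

The filtered body, rather than the full record text, would then be passed to the neighbor computation; any other word-to-title relatedness score could be substituted for the Jaccard overlap assumed here.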
6) We have implemented a distributional semantics approach modeled somewhat after the work of Lin and Pantel and have found it useful in finding synonyms for terms. However, the output is not of a quality that can be used effectively for most purposes without human review. We believe the model could be improved if p-values could be computed in addition to scores, and we are working on an approach to assign such values.
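As a rough illustration of the distributional approach in item 6, the sketch below scores word pairs with a similarity of the form usually attributed to Lin: pointwise mutual information between words and their context features, with the shared information divided by the total information of the two words. Whether the project uses this exact variant is not stated, so the measure, the function names, and the toy counts are assumptions.

```python
import math
from collections import Counter

def positive_pmi(counts):
    """counts: word -> {context feature -> co-occurrence count}.
    Returns word -> {feature -> PMI}, keeping only positive values."""
    total = sum(c for feats in counts.values() for c in feats.values())
    word_total = {w: sum(feats.values()) for w, feats in counts.items()}
    feat_total = Counter()
    for feats in counts.values():
        feat_total.update(feats)
    pmi = {}
    for w, feats in counts.items():
        pmi[w] = {}
        for f, c in feats.items():
            value = math.log(c * total / (word_total[w] * feat_total[f]))
            if value > 0:
                pmi[w][f] = value
    return pmi

def lin_similarity(pmi, w1, w2):
    """Information in shared features divided by the total information of both words."""
    f1, f2 = pmi.get(w1, {}), pmi.get(w2, {})
    denom = sum(f1.values()) + sum(f2.values())
    if denom == 0:
        return 0.0
    shared = f1.keys() & f2.keys()
    return sum(f1[f] + f2[f] for f in shared) / denom

# Hypothetical word-by-context counts.
counts = {
    "tumor": {"malignant": 4, "resect": 3, "size": 2},
    "tumour": {"malignant": 3, "resect": 2, "growth": 1},
    "aspirin": {"dose": 5, "take": 3, "size": 1},
}
pmi = positive_pmi(counts)
for other in ("tumour", "aspirin"):
    print(other, lin_similarity(pmi, "tumor", other))
```

A score of this kind ranks candidate synonyms but carries no significance estimate, which is the gap the p-value work mentioned above is meant to address.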

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Investigator-Initiated Intramural Research Projects (ZIA)
Project #: 1ZIALM000021-24
Application #: 9160905
Support Year: 24
Fiscal Year: 2015

Institution

Name: National Library of Medicine

Related projects


NIH 2015 ZIA LM	Automatic Bayesian Methods In Text Retrieval Wilbur, Willy / National Library of Medicine
NIH 2014 ZIA LM	Automatic Bayesian Methods In Text Retrieval Wilbur, Willy / National Library of Medicine
NIH 2013 ZIA LM	Automatic Bayesian Methods In Text Retrieval Wilbur, Willy / National Library of Medicine	$82,185
NIH 2012 ZIA LM	Automatic Bayesian Methods In Text Retrieval Wilbur, Willy / National Library of Medicine	$86,768
NIH 2011 ZIA LM	Automatic Bayesian Methods In Text Retrieval Wilbur, Willy / National Library of Medicine	$79,948
NIH 2010 ZIA LM	Automatic Bayesian Methods In Text Retrieval Wilbur, Willy / National Library of Medicine	$137,109
NIH 2009 ZIA LM	Automatic Bayesian Methods In Text Retrieval Wilbur, Willy / National Library of Medicine	$128,999

Publications

Yeganova, Lana; Kim, Won; Kim, Sun et al. (2014) Retro: concept-based clustering of biomedical topical sets. Bioinformatics 30:3240-8

Wilbur, W John; Kim, Won (2009) The Ineffectiveness of Within-Document Term Frequency in Text Classification. Inf Retr Boston 12:509-525

Lu, Zhiyong; Kim, Won; Wilbur, W John (2009) Evaluating relevance ranking strategies for MEDLINE retrieval. J Am Med Inform Assoc 16:32-6

Sohn, Sunghwan; Kim, Won; Comeau, Donald C et al. (2008) Optimal training sets for Bayesian prediction of MeSH assignment. J Am Med Inform Assoc 15:546-53
