Current work on the project focuses on developing an improved Bayesian classification model and new approaches to active learning with a Bayesian model.
1) The literature on the Naive Bayes machine learning method has long used two models: the multivariate Bernoulli model (MBM) and the multinomial model (MM). The MBM records only the presence or absence of a feature, while the MM counts the number of times a feature appears in a record. Several comparisons of the two approaches have been made in text categorization, and the MM has usually won out. We believe the reason is that the MBM has not been properly optimized. In fact, we have found that in text categorization the local term frequency contributes virtually nothing to the performance of the MM. In support of this contention, we have developed a simplified form of the MM that ignores local term frequency (but is still much closer to the MM than to the MBM), and we find that it performs essentially the same as the MM. Nor do we find an advantage for local term frequency in text categorization with several other models, including the SVM. The advantage of ignoring local term frequency is that it greatly simplifies data storage and calculation when applying the Naive Bayes approach to a very large database such as PubMed. One aspect of this work that calls for further investigation is what happens when the records are long and the local frequencies can therefore be much larger. We do not have the final answer, but our initial work with the TREC genomics data (160,000 full-text documents) suggests that local term frequencies give a small advantage in some models, in no case more than about a 3% improvement in break-even point.
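To illustrate the comparison in item 1, the following is a minimal Python sketch and not the project's implementation. It assumes add-one smoothing, pre-tokenized records, and a toy corpus invented for illustration; the names `train_multinomial_nb` and `classify` are hypothetical. Setting `binarize=True` counts each term at most once per record, which is one plausible way to discard local term frequency while keeping the multinomial form. The exact simplified model used in the project is not specified here.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, binarize=False, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing.

    docs: list of token lists; labels: list of class labels.
    binarize=True counts each term at most once per record, so local
    term frequency is ignored (an assumption about the simplified model).
    """
    vocab = {t for doc in docs for t in doc}
    class_docs = Counter(labels)
    term_counts = {c: Counter() for c in class_docs}
    for doc, c in zip(docs, labels):
        tokens = set(doc) if binarize else doc
        term_counts[c].update(tokens)

    model = {"vocab": vocab, "binarize": binarize, "classes": {}}
    n_docs = len(docs)
    for c, counts in term_counts.items():
        # Smoothed denominator: total term mass in the class plus alpha per vocabulary term.
        total = sum(counts.values()) + alpha * len(vocab)
        model["classes"][c] = {
            "log_prior": math.log(class_docs[c] / n_docs),
            "log_prob": {t: math.log((counts[t] + alpha) / total) for t in vocab},
        }
    return model


def classify(model, doc):
    """Return the class with the highest log posterior score for doc."""
    tokens = set(doc) if model["binarize"] else doc
    best, best_score = None, -math.inf
    for c, params in model["classes"].items():
        score = params["log_prior"]
        for t in tokens:
            if t in model["vocab"]:  # terms unseen in training are skipped
                score += params["log_prob"][t]
        if score > best_score:
            best, best_score = c, score
    return best


# Toy data, invented for illustration only.
train_docs = [
    ["gene", "gene", "protein"],
    ["protein", "binding", "gene"],
    ["market", "stock", "stock"],
    ["stock", "price", "market"],
]
train_labels = ["bio", "bio", "fin", "fin"]

mm_model = train_multinomial_nb(train_docs, train_labels, binarize=False)
binary_model = train_multinomial_nb(train_docs, train_labels, binarize=True)
query_doc = ["gene", "protein", "protein"]
mm_label = classify(mm_model, query_doc)
binary_label = classify(binary_model, query_doc)
```

In the binarized variant each record contributes only its set of distinct terms, so storage reduces to a list of term identifiers per record with no counts attached.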
2) We have developed term-based active learning methods, which provide a different approach to active learning, and have shown that they are in many cases more effective than simple uncertainty sampling or error reduction sampling.
3) The PubMed database presents a unique challenge because of its very large size, over 18 million records. At this scale few machine learning methods can be applied with a reasonable turnaround time. One method that can be applied efficiently is Naive Bayes, but it performs poorly when the classes to be distinguished differ markedly in size, and such an imbalance is common for the problems one wishes to study in PubMed. In this situation we have discovered that a training set much smaller than the whole set can be selected by an active-learning-inspired method. The result is an almost 200% improvement in the performance of Naive Bayes in classifying documents for MeSH term assignment. The results are significantly better than those of a KNN method, and there is the added advantage that the optimal training sets defined in this way can serve as training sets for more sophisticated machine learning methods, with even better results than those obtained from Naive Bayes.
4) We compute the documents related to a given document using a probability calculation based on two Poisson distributions, one for the terms that are more central to the document's content and one for the terms that are more peripheral. These are combined into a probability estimate of the importance of a term in a document based on its relative frequency in the document. This probability estimate is combined with the global IDF weight of the term to account for that term's importance in computing the similarity between two documents. We have known from the time this approach was developed that it worked well.
In the last several years, data have become available in the TREC genomics track that allow us to test this approach by comparing it with the results of the BM25 formula developed by Robertson and colleagues. We find a small but statistically significant advantage for our probabilistic approach.
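For reference, the BM25 baseline named in item 4 can be sketched as follows. This is the commonly published Okapi BM25 form, not the configuration used in the comparison, and it does not implement the two-Poisson model. The parameter values `k1=1.2` and `b=0.75`, the non-negative IDF variant, the function name `bm25_score`, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_score(query, doc, doc_freq, n_docs, avg_len, k1=1.2, b=0.75):
    """Score one document against a query with the Okapi BM25 formula.

    query, doc: token lists; doc_freq: term -> number of documents
    containing it; n_docs: corpus size; avg_len: mean document length.
    k1 and b are common default values, assumed here.
    """
    tf = Counter(doc)
    # Document-length normalization term shared by every query term.
    length_norm = k1 * (1.0 - b + b * len(doc) / avg_len)
    score = 0.0
    for term in set(query):
        f = tf.get(term, 0)
        if f == 0:
            continue  # terms absent from the document contribute nothing
        n_t = doc_freq.get(term, 0)
        # Non-negative IDF variant (assumed); rarer terms get larger weights.
        idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
        score += idf * f * (k1 + 1.0) / (f + length_norm)
    return score


# Toy corpus, invented for illustration only.
corpus = [
    ["gene", "expression", "gene"],
    ["protein", "binding", "site"],
    ["stock", "market", "price"],
]
doc_freq = Counter(t for d in corpus for t in set(d))
avg_len = sum(len(d) for d in corpus) / len(corpus)
query = ["gene", "expression"]
scores = [bm25_score(query, d, doc_freq, len(corpus), avg_len) for d in corpus]
```

For related-document retrieval, the terms of the source document would play the role of `query`, and candidates would be ranked by descending score.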

Support Year: 18
Fiscal Year: 2009
Total Cost: $128,999
Name: National Library of Medicine
Yeganova, Lana; Kim, Won; Kim, Sun et al. (2014) Retro: concept-based clustering of biomedical topical sets. Bioinformatics 30:3240-8
Wilbur, W John; Kim, Won (2009) The Ineffectiveness of Within-Document Term Frequency in Text Classification. Inf Retr Boston 12:509-525
Lu, Zhiyong; Kim, Won; Wilbur, W John (2009) Evaluating relevance ranking strategies for MEDLINE retrieval. J Am Med Inform Assoc 16:32-6
Sohn, Sunghwan; Kim, Won; Comeau, Donald C et al. (2008) Optimal training sets for Bayesian prediction of MeSH assignment. J Am Med Inform Assoc 15:546-53