Current work on the project focuses on developing an improved Bayesian classification model and new approaches to active learning with a Bayesian model.
1) We have developed term-based active learning methods, which provide a different approach to active learning, and have shown that they are in many cases more effective than simple uncertainty sampling or error reduction sampling.
2) The PubMed database presents a unique challenge because of its very large size of over 19 million records. Because of this size, few machine learning methods can be applied with a reasonable turn-around time. One method that can be applied efficiently is Naive Bayes, but it performs poorly when the classes to be distinguished differ markedly in size, and such an imbalance is common for the problems one wishes to study in PubMed. For this situation we have discovered that a training set much smaller than the whole set can be selected by an active-learning-inspired method. The result is an almost 200% improvement in the performance of Naive Bayes in classifying documents for MeSH term assignment. The results are significantly better than those of a KNN method, and there is the added advantage that the optimal training sets defined in this way can be used as training sets for more sophisticated machine learning methods, with even better results than those obtained from Naive Bayes.
3) We compute the documents related to a given document using a probability calculation based on two Poisson distributions, one for the terms that are more central to the document's content and one for the terms that are more peripheral. These are combined into a probability estimate of the importance of a term in a document based on its relative frequency in the document. This probability estimate is combined with the global IDF weight of the term to account for that term's importance in computing the similarity between two documents.
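The two-Poisson weighting described in item 3 can be illustrated with a minimal sketch. It is not the project's implementation: the rate parameters, the prior, the function names, and the rule that multiplies the two local weights by the IDF are all assumptions chosen for illustration. The local weight is the posterior probability that a term with count k in a document of length l was generated by the "central" Poisson rather than the "peripheral" one; the factorial terms cancel, so only a log-odds expression remains.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def central_probability(k, doc_len, rate_central=0.02,
                        rate_peripheral=0.01, prior=0.5):
    """Posterior probability that a term is central to a document.

    Assumes the count k follows Poisson(rate_central * doc_len) for central
    terms and Poisson(rate_peripheral * doc_len) for peripheral terms.
    The default rates and prior are illustrative, not values from the project.
    """
    log_odds = (k * math.log(rate_central / rate_peripheral)
                - (rate_central - rate_peripheral) * doc_len
                + math.log(prior / (1.0 - prior)))
    return _sigmoid(log_odds)


def idf(num_docs, doc_freq):
    """Global inverse document frequency weight of a term."""
    return math.log(num_docs / doc_freq)


def similarity(counts_a, counts_b, doc_freq, num_docs):
    """Score two documents by summing, over shared terms, the product of
    the two local weights and the global IDF weight (an assumed rule)."""
    len_a = sum(counts_a.values())
    len_b = sum(counts_b.values())
    score = 0.0
    for term in counts_a.keys() & counts_b.keys():
        w_a = central_probability(counts_a[term], len_a)
        w_b = central_probability(counts_b[term], len_b)
        score += w_a * w_b * idf(num_docs, doc_freq[term])
    return score
```

With the central rate larger than the peripheral rate, the local weight rises with the term count and falls with document length, so a term repeated in a short document counts for more than the same term mentioned once in a long one.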
We have known from the time this approach was developed that it worked well. In the last several years, data have become available in the TREC genomics track that allowed us to test this approach by comparing its results with those of the BM25 formula developed by Robertson and colleagues. We find a small but statistically significant advantage for our probabilistic approach.
4) We are currently working on a problem that arises when several different kinds of documents appear in a dataset and one wants to compute neighboring documents for each document. In this situation, scores may be higher within groups than between groups, so that the scores do not give a true picture of relatedness. This can happen because terminology that is rarely used outside a group may be common within it. We have developed a Bayesian method to test for this kind of terminology and remove it. The removal process requires some care, which we exercise by testing candidate terms for how specific they are to biology and how useful they are to the users of PubMed. Together, these tests have allowed us to remove group-specific terminology and improve scoring so that it can be used successfully for computing neighbors.
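For reference, the BM25 baseline named in item 3 can be sketched as follows. This is the commonly published form of the formula, not the exact configuration used in the TREC genomics comparison: the IDF variant, the defaults k1 = 1.2 and b = 0.75, and the function names are assumptions made for illustration.

```python
import math


def bm25_idf(num_docs, doc_freq):
    """Robertson/Sparck Jones style IDF (one common variant)."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(query_terms, doc_counts, doc_freq, num_docs,
               avg_doc_len, k1=1.2, b=0.75):
    """Score one document against a set of query terms with BM25.

    doc_counts maps term -> count in the document; doc_freq maps
    term -> number of documents in the collection containing the term.
    """
    doc_len = sum(doc_counts.values())
    # Length normalization: longer-than-average documents are penalized.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    score = 0.0
    for term in query_terms:
        tf = doc_counts.get(term, 0)
        if tf == 0:
            continue
        score += bm25_idf(num_docs, doc_freq[term]) * tf * (k1 + 1.0) / (tf + norm)
    return score
```

Both schemes let repeated occurrences of a term contribute with diminishing returns; BM25 does so through the saturating ratio tf / (tf + norm), whereas the approach in item 3 does so through a probability estimate.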

National Library of Medicine
Yeganova, Lana; Kim, Won; Kim, Sun et al. (2014) Retro: concept-based clustering of biomedical topical sets. Bioinformatics 30:3240-8
Wilbur, W John; Kim, Won (2009) The Ineffectiveness of Within-Document Term Frequency in Text Classification. Inf Retr Boston 12:509-525
Lu, Zhiyong; Kim, Won; Wilbur, W John (2009) Evaluating relevance ranking strategies for MEDLINE retrieval. J Am Med Inform Assoc 16:32-6
Sohn, Sunghwan; Kim, Won; Comeau, Donald C et al. (2008) Optimal training sets for Bayesian prediction of MeSH assignment. J Am Med Inform Assoc 15:546-53