Current work on the project focuses on developing an improved Bayesian classification model and new approaches to active learning with a Bayesian model.
1) We have developed term-based active learning methods, which provide a different approach to active learning, and have shown that they are in many cases more effective than simple uncertainty sampling or error reduction sampling.
2) The PubMed database presents a unique challenge because of its very large size of over 19 million records. Because of this size, few machine learning methods can be applied with a reasonable turnaround time. One method that can be applied efficiently is Naive Bayes, but it performs poorly when the classes to be distinguished differ markedly in size. Such an imbalance is common for the problems one wishes to study in PubMed. In this situation we have discovered that a training set much smaller than the whole set can be selected by a method inspired by active learning. The result is an almost 200% improvement in the performance of Naive Bayes in classifying documents for MeSH term assignment. The results are significantly better than those of a KNN method, and there is the added advantage that the optimal training sets defined in this way can be used as training sets for more sophisticated machine learning methods, with even better results than those obtained from Naive Bayes.
3) We compute the documents related to a document using a probability calculation based on two Poisson distributions, one for the terms in a document that are more central to the document's content and one for the terms that are more peripheral. These are combined into a probability estimate of the importance of a term in a document based on its relative frequency in the document. This probability estimate is combined with the global IDF weight of the term to account for that term's importance in computing the similarity between two documents.
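The two-Poisson term weighting described above can be illustrated with a minimal Python sketch. It is not the project's implementation: the Poisson rates, the prior, the function names, and the choice to multiply the probability by IDF and sum products over shared terms are all assumptions made for illustration.

```python
import math
from collections import Counter

def central_probability(count, doc_length,
                        rate_central=0.02,     # hypothetical rate for central terms
                        rate_peripheral=0.01,  # hypothetical rate for peripheral terms
                        prior_central=0.5):    # hypothetical prior
    """Posterior probability that a term occurring `count` times in a document
    of `doc_length` tokens was generated by the 'central' Poisson component."""
    # Log odds of central vs. peripheral; the factorial and doc_length**count
    # factors of the two Poisson probabilities cancel.
    log_odds = (math.log(prior_central / (1.0 - prior_central))
                + count * math.log(rate_central / rate_peripheral)
                - (rate_central - rate_peripheral) * doc_length)
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

def term_weights(tokens, doc_freq, num_docs):
    """Assumed combination: local probability times global IDF."""
    counts = Counter(tokens)
    length = len(tokens)
    return {
        term: central_probability(c, length) * math.log(num_docs / doc_freq[term])
        for term, c in counts.items()
        if doc_freq.get(term, 0) > 0
    }

def similarity(tokens_a, tokens_b, doc_freq, num_docs):
    """Assumed similarity: sum over shared terms of the product of weights."""
    wa = term_weights(tokens_a, doc_freq, num_docs)
    wb = term_weights(tokens_b, doc_freq, num_docs)
    return sum(wa[t] * wb[t] for t in wa.keys() & wb.keys())
```

In this sketch a term that occurs more often than the peripheral rate predicts for a document of that length receives a probability closer to 1, and rare terms across the collection are boosted by IDF.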
We have known from the time this approach was developed that it worked well. In the last several years data has become available in the TREC Genomics Track that has allowed us to test this approach by comparing it with the results of the BM25 formula developed by Robertson and colleagues. We find a small but statistically significant advantage for our probabilistic approach.
4) We are currently working on a problem that arises when several different kinds of documents appear in a dataset and one wants to compute neighboring documents for each document. A simple application of the same approach used to find related citations in PubMed does not produce good results. Analysis of the problem shows that many records contain words that are not keyed to the actual focus of the record, and that these words mislead the neighboring process. In some cases this is due to a common author who uses certain word forms frequently, even when writing on very different subjects. In other cases the problem seems to appear when two different drugs have sections on side effects that are quite generic and overlap heavily. To deal with this problem we tried several approaches to filtering out useless words. First, we compared the frequency of words in the English Gigaword Corpus with their frequency in PubMed documents in an effort to remove non-biomedical terminology. In the same way we also compared the words in a large set of PubMed user queries to see what terminology may be important to users. By filtering out words that are neither very medical nor very important to users we saw some benefit in computing neighbors. A second approach involved Bayesian calculations to determine which words seemed to be specific to a particular source of PubMed Health documents. This also allowed us to filter out terms coming from boilerplate specific to how documents were created in different sources.
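The first filtering approach, comparing word frequencies across corpora, might be sketched as follows. This is only an illustration under stated assumptions: the smoothed log frequency ratio, the two thresholds, and the function name `select_vocabulary` are hypothetical and are not taken from the project.

```python
import math
from collections import Counter

def select_vocabulary(domain_counts, general_counts, query_counts,
                      ratio_threshold=1.0,  # hypothetical cutoff on the log ratio
                      min_query_count=3,    # hypothetical cutoff on query usage
                      smoothing=1.0):
    """Keep domain words that are either markedly more frequent in the domain
    corpus than in the general corpus, or that users often put in queries."""
    vocab_size = len(set(domain_counts) | set(general_counts))
    domain_total = sum(domain_counts.values()) + smoothing * vocab_size
    general_total = sum(general_counts.values()) + smoothing * vocab_size

    kept = set()
    for word in domain_counts:
        # Add-one style smoothing avoids division by zero for unseen words.
        p_domain = (domain_counts.get(word, 0) + smoothing) / domain_total
        p_general = (general_counts.get(word, 0) + smoothing) / general_total
        is_medical = math.log(p_domain / p_general) >= ratio_threshold
        is_user_term = query_counts.get(word, 0) >= min_query_count
        if is_medical or is_user_term:
            kept.add(word)
    return kept
```

Words outside the returned set would be ignored when computing neighbors. The thresholds would need tuning on real data.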
The above approaches, however, required a certain amount of hand supervision to avoid mistakes in filtering words. We have since found improved results with a completely automatic approach, which examines how related each word in the body of a record is to the words in the record's title. Words whose relatedness to the title falls below a certain low threshold are removed.
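A minimal sketch of the title-based filter is given below. The text does not say how relatedness is measured, so this example assumes pointwise mutual information (PMI) between a word appearing in a record's body and a word appearing in its title, estimated from document co-occurrence counts. The function names and the threshold are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def build_stats(records):
    """Collect document frequencies from (title_tokens, body_tokens) pairs."""
    title_df, body_df, pair_df = Counter(), Counter(), Counter()
    n = 0
    for title, body in records:
        n += 1
        title_set, body_set = set(title), set(body)
        title_df.update(title_set)
        body_df.update(body_set)
        pair_df.update(product(body_set, title_set))
    return title_df, body_df, pair_df, n

def pmi(body_word, title_word, stats):
    """PMI between body_word occurring in a body and title_word in the title."""
    title_df, body_df, pair_df, n = stats
    joint = pair_df.get((body_word, title_word), 0)
    if joint == 0:
        return float("-inf")
    return math.log(joint * n / (body_df[body_word] * title_df[title_word]))

def filter_body(title, body, stats, threshold=0.1):
    """Keep body words whose best PMI with any title word reaches the threshold."""
    title_words = set(title)
    return [w for w in body
            if any(pmi(w, t, stats) >= threshold for t in title_words)]
```

In a toy collection where a generic side-effect word such as "nausea" appears in every record, its PMI with any title word is zero, so it is removed, while words that co-occur selectively with the title survive.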

Support Year: 22
Fiscal Year: 2013
Total Cost: $82,185
Name: National Library of Medicine
Yeganova, Lana; Kim, Won; Kim, Sun et al. (2014) Retro: concept-based clustering of biomedical topical sets. Bioinformatics 30:3240-8
Wilbur, W John; Kim, Won (2009) The Ineffectiveness of Within-Document Term Frequency in Text Classification. Inf Retr Boston 12:509-525
Lu, Zhiyong; Kim, Won; Wilbur, W John (2009) Evaluating relevance ranking strategies for MEDLINE retrieval. J Am Med Inform Assoc 16:32-6
Sohn, Sunghwan; Kim, Won; Comeau, Donald C et al. (2008) Optimal training sets for Bayesian prediction of MeSH assignment. J Am Med Inform Assoc 15:546-53