Current work on the project focuses on developing an improved Bayesian classification model and new approaches to active learning with a Bayesian model. 1) We have developed term-based active learning methods, which provide a different approach to active learning, and have shown that they are in many cases more effective than simple uncertainty sampling or error reduction sampling. 2) The PubMed database presents a unique challenge because of its very large size of over 25 million records. Because of this size, few machine learning methods can be applied with a reasonable turnaround time. One method that can be applied efficiently is Naive Bayes, but it performs poorly when the classes to be distinguished differ markedly in size, and such an imbalance is common for the problems one wishes to study in PubMed. In this situation we have discovered that a training set much smaller than the whole set can be selected by an active-learning-inspired method. The result is an almost 200% improvement in the performance of Naive Bayes in classifying documents for MeSH term assignment. The results are significantly better than those of a KNN method, and there is the added advantage that the optimal training sets defined in this way can be used as training sets for more sophisticated machine learning methods, with even better results than those obtained from Naive Bayes. 3) We compute the documents related to a given document using a probability calculation based on two Poisson distributions, one for the terms that are more central to the document's content and one for the terms that are more peripheral. These are combined into a probability estimate of the importance of a term in a document based on its relative frequency in the document. This probability estimate is combined with the global IDF weight of the term to account for that term's importance in computing the similarity between two documents.
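As a minimal sketch of the calculation described in item 3, the two Poisson distributions can be combined by Bayes' rule into the probability that a term is central to a document, given how often it occurs there. Everything specific here is an assumption made for illustration and is not taken from the project: the function names, the two rate values, the equal prior, and the choice to score a pair of documents by summing the product of the two probabilities and the IDF weight over shared terms.

```python
import math
from collections import Counter


def central_probability(count, doc_len,
                        rate_central=0.02,     # hypothetical per-token rate for central terms
                        rate_peripheral=0.01,  # hypothetical per-token rate for peripheral terms
                        prior_central=0.5):    # hypothetical prior that a term is central
    """Posterior probability that a term occurring `count` times in a document
    of `doc_len` tokens was generated by the 'central' Poisson component."""
    # log of Poisson(count; rate_peripheral * doc_len) / Poisson(count; rate_central * doc_len);
    # the count! and doc_len**count factors cancel in the ratio.
    log_ratio = (count * math.log(rate_peripheral / rate_central)
                 - (rate_peripheral - rate_central) * doc_len)
    prior_odds = (1.0 - prior_central) / prior_central
    return 1.0 / (1.0 + prior_odds * math.exp(log_ratio))


def similarity(doc_a, doc_b, idf):
    """Score two tokenized documents by summing, over shared terms, the product
    of the two local probabilities and the global IDF weight (assumed form)."""
    counts_a, counts_b = Counter(doc_a), Counter(doc_b)
    score = 0.0
    for term in counts_a.keys() & counts_b.keys():
        w_a = central_probability(counts_a[term], len(doc_a))
        w_b = central_probability(counts_b[term], len(doc_b))
        score += w_a * w_b * idf.get(term, 0.0)
    return score
```

With these placeholder rates, a term that occurs more often in a document of fixed length receives a higher probability of being central, and two documents that share no terms receive a score of zero.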
We have known since this approach was developed that it works well. In the last several years, data have become available in the TREC Genomics Track that have allowed us to test this approach by comparing it with the results of the BM25 formula developed by Robertson and colleagues. We find a small but statistically significant advantage for our probabilistic approach. 4) We are currently working on a problem that arises when several different kinds of documents appear in a dataset and one wants to compute neighboring documents for each document. A simple application of the same approach used to find related citations in PubMed does not produce good results. Analysis of the problem shows that many records contain words that are not keyed to the actual focus of the record and that these words mislead the neighboring process. In some cases this is due to a common author of records who uses certain word forms frequently in their writing, even on very different subjects. In other cases the problem appears when two different drugs have sections on side effects that are quite generic and overlap substantially. We have found our best results with a completely automatic approach that examines how related each word in the body of a record is to the words in the record's title and removes all words whose relatedness falls below a certain low threshold. 5) Some of our latest work uses concepts that appear in multiple article titles to produce document clusters. These are then analyzed using naive Bayesian classification methods to ascertain their significance. Those that are significant are extended using the same Bayesian technique. The result is a set of concepts, each represented by a document cluster. This proves to be an effective way to produce significant clusters from relatively small data sets that are difficult to cluster by more standard methods.
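The title-based filtering described in item 4 can be sketched roughly as follows. The relatedness measure is not specified in this summary, so the sketch substitutes a simple stand-in: the Jaccard overlap between the sets of records in which two words occur. The function names, the threshold value, and the rule of keeping a body word when its best match against any title word reaches the threshold are all hypothetical.

```python
def build_doc_index(records):
    """Map each word to the set of record ids in which it occurs.
    Each record is a (title_tokens, body_tokens) pair."""
    index = {}
    for rec_id, (title, body) in enumerate(records):
        for word in set(title) | set(body):
            index.setdefault(word, set()).add(rec_id)
    return index


def relatedness(word_a, word_b, index):
    """Stand-in relatedness: Jaccard overlap of the records containing each word."""
    docs_a = index.get(word_a, set())
    docs_b = index.get(word_b, set())
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0


def filter_body(title, body, index, threshold=0.4):  # threshold is illustrative
    """Keep only body words whose best relatedness to a title word meets the threshold."""
    kept = []
    for word in body:
        best = max((relatedness(word, t, index) for t in title), default=0.0)
        if best >= threshold:
            kept.append(word)
    return kept


# Toy data: generic side-effect words shared by two different drugs.
records = [
    (["aspirin"], ["aspirin", "pain", "nausea"]),
    (["ibuprofen"], ["ibuprofen", "pain", "nausea"]),
    (["aspirin"], ["aspirin", "platelet"]),
]
index = build_doc_index(records)
filtered = filter_body(records[0][0], records[0][1], index)
```

In this toy data the generic side-effect words occur equally under both drugs, so they are only weakly related to either title word and are dropped, while a word that occurs only alongside the title word is kept. The neighboring computation would then run on the filtered records.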
6) We have implemented a distributional semantics approach modeled somewhat after the work of Lin and Pantel and have found it useful in finding synonyms for terms. However, the method does not produce output of sufficient quality to be used for most purposes without human review. We believe the model could be improved if p-values could be computed in addition to scores, and we are working on an approach to assign such values.
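To illustrate the general idea behind item 6, the following sketch scores two terms as similar when they occur in similar contexts. It is a simplified stand-in and not the Lin and Pantel measure or the project's implementation: contexts are neighboring words within a fixed window, counts are reweighted by positive pointwise mutual information, and vectors are compared by cosine similarity. The function names and the window size are assumptions.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Count, for each word, the words that occur within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def ppmi(vectors):
    """Reweight co-occurrence counts by positive pointwise mutual information."""
    total = sum(sum(c.values()) for c in vectors.values())
    word_totals = {w: sum(c.values()) for w, c in vectors.items()}
    context_totals = Counter()
    for c in vectors.values():
        context_totals.update(c)
    weighted = {}
    for w, c in vectors.items():
        weighted[w] = {}
        for ctx, n in c.items():
            pmi = math.log(n * total / (word_totals[w] * context_totals[ctx]))
            if pmi > 0:
                weighted[w][ctx] = pmi
    return weighted


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


sentences = [
    ["the", "tumor", "was", "removed"],
    ["the", "neoplasm", "was", "removed"],
    ["patients", "received", "aspirin", "daily"],
]
weights = ppmi(context_vectors(sentences))
score_synonyms = cosine(weights["tumor"], weights["neoplasm"])
score_unrelated = cosine(weights["tumor"], weights["aspirin"])
```

In this toy corpus the two candidate synonyms occur in identical contexts and so score far higher than the unrelated pair. Scores of this kind only rank candidates, which is consistent with the need for human review noted above.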
Yeganova, Lana; Kim, Won; Kim, Sun et al. (2014) Retro: concept-based clustering of biomedical topical sets. Bioinformatics 30:3240-8
Wilbur, W John; Kim, Won (2009) The Ineffectiveness of Within-Document Term Frequency in Text Classification. Inf Retr Boston 12:509-525
Lu, Zhiyong; Kim, Won; Wilbur, W John (2009) Evaluating relevance ranking strategies for MEDLINE retrieval. J Am Med Inform Assoc 16:32-6
Sohn, Sunghwan; Kim, Won; Comeau, Donald C et al. (2008) Optimal training sets for Bayesian prediction of MeSH assignment. J Am Med Inform Assoc 15:546-53