An automatic approach to applying Bayesian methods in text retrieval has been developed. This is a form of relevance weighting of search terms, but it departs from the usual approach in two complementary ways.

First, the usual approach assigns relevance weights to the search terms of a single query based on the documents that are, and those that are not, relevant to that query. This generally involves a small number of relevant documents and hence a statistical sample too small to support globally significant inferences about the value of the terms involved. We modify the usual approach by averaging the importance of a term over all the queries in which it occurs, and we study the case in which the set of queries is the set of documents, so that the global term relevance weight is a well-defined concept.

Second, the usual approach requires human judgments of the relevance of documents to queries, which has limited the method to test sets where the relevance relation is known or to relevance feedback situations. We instead replace the relevance relation with the relation of high-scoring query-document pairs under the vector cosine method of retrieval. Because the latter is an automatic method, we are able to generate the required statistics automatically. While this approach will undoubtedly contain more error than human relevance judgments, the larger sample size afforded by global weighting helps to offset the problem. Local weighting is introduced in an ad hoc manner, and the resulting retrieval is found to be somewhat superior to vector cosine retrieval.

Two problems remain with the model just described: first, it does not incorporate local term weighting in a natural Bayesian manner; second, it does not provide a correction for document length.
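The pseudo-relevance step described above — replacing human judgments with high-scoring query-document pairs under the cosine measure — can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation; the `threshold` parameter and raw term-frequency vectors are assumptions for the sake of the example.

```python
import math
from collections import Counter

def cosine(query_tf, doc_tf):
    # Cosine of the angle between two sparse term-frequency vectors.
    dot = sum(query_tf[t] * doc_tf.get(t, 0) for t in query_tf)
    qn = math.sqrt(sum(v * v for v in query_tf.values()))
    dn = math.sqrt(sum(v * v for v in doc_tf.values()))
    return dot / (qn * dn) if qn and dn else 0.0

def pseudo_relevant_pairs(queries, docs, threshold=0.5):
    # Stand-in for human relevance judgments: keep the
    # (query, document) pairs that score highly under cosine.
    pairs = []
    for qi, q in enumerate(queries):
        qtf = Counter(q.split())
        for di, d in enumerate(docs):
            if cosine(qtf, Counter(d.split())) >= threshold:
                pairs.append((qi, di))
    return pairs
```

When the query set is taken to be the document set itself, as in the abstract, `queries` and `docs` are the same collection and the resulting pairs supply the sample from which global term statistics are gathered.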
We have developed a new model, based on cluster concepts, that remedies both problems while keeping the model completely Bayesian. It performs at the same basic level as the model already described, in which local weights are treated ad hoc. It does, however, allow one to see the actual log odds predictions of relevance; these exceed the observed log odds of relevance by 13:1, which gives an interesting perspective on term dependency.
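The log-odds relevance weighting underlying both models can be illustrated with a standard smoothed weight of the Robertson–Spärck Jones form, averaged over queries to give a global weight. This is a sketch under assumed notation, not the project's exact formula: `r` of `R` (pseudo-)relevant documents contain the term, and `n` of `N` documents contain it overall.

```python
import math

def log_odds_weight(r, R, n, N):
    # Smoothed log-odds weight for a term: log[p(1-q) / (q(1-p))],
    # where p = P(term | relevant) and q = P(term | non-relevant).
    p = (r + 0.5) / (R + 1.0)
    q = (n - r + 0.5) / (N - R + 1.0)
    return math.log(p * (1 - q) / (q * (1 - p)))

def global_weight(per_query_counts, N):
    # Global weighting: average the per-query weight of a term
    # over all queries in which it occurs.
    ws = [log_odds_weight(r, R, n, N) for (r, R, n) in per_query_counts]
    return sum(ws) / len(ws)
```

A term concentrated in the relevant set gets a positive weight, one concentrated in the non-relevant set a negative weight; averaging over many queries is what makes the small per-query samples usable.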

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000021-03
Application #: 3759303
Support Year: 3
Fiscal Year: 1994
Name: National Library of Medicine
Country: United States
Publications
Wilbur, W John; Kim, Won (2009) The Ineffectiveness of Within-Document Term Frequency in Text Classification. Inf Retr Boston 12:509-525
Lu, Zhiyong; Kim, Won; Wilbur, W John (2009) Evaluating relevance ranking strategies for MEDLINE retrieval. J Am Med Inform Assoc 16:32-6
Lin, Jimmy; Wilbur, W John (2007) PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics 8:423
Wilbur, W John; Kim, Won; Xie, Natalie (2006) Spelling correction in the PubMed search engine. Inf Retr Boston 9:543-564
Kim, W; Wilbur, W J (2001) Amino acid residue environments and predictions of residue type. Comput Chem 25:411-22
Aronson, A R; Bodenreider, O; Chang, H F et al. (2000) The NLM Indexing Initiative. Proc AMIA Symp :17-21
Wilbur, W J (2000) Boosting naive Bayesian learning on a large subset of MEDLINE. Proc AMIA Symp :918-22
Wilbur, W J; Neuwald, A F (2000) A theory of information with special application to search problems. Comput Chem 24:33-42
Wilbur, W J; Hazard Jr, G F; Divita, G et al. (1999) Analysis of biomedical text for chemical names: a comparison of three methods. Proc AMIA Symp :176-80