Bayesian models of document retrieval have a long theoretical history in the field but have only recently proved practical. Our current online retrieval system is partially Bayesian. We have also developed a fully Bayesian model, based on cluster concepts, that incorporates document length and local term frequency without departing from the Bayesian framework. It performs at the same basic level as the partially Bayesian model, in which local weights are treated ad hoc, but it also allows one to see the actual log odds predictions of relevance. These exceed the observed log odds of relevance by 13.1, which gives an interesting perspective on term dependency.

A new model based on the Bayesian approach has been developed that has interesting connections with the vector models of G. Salton. Theoretical details have been worked out. Documents must be indexed by the "real" objects that they refer to, and these real objects become nodes in a system of multiple hierarchies called a specificity network. Each hierarchy is produced by a specificity operator and forms a tree of objects, with the most general object at the root and increasingly specific objects toward the leaves. The objects that populate the nodes are represented by textual terms or phrases, and a single object may have many representatives. Programs are to be written to create and store these structures, and the stored data will eventually be used to make indexing semiautomatic. Two documents are to be rated for similarity according to the relatedness of the real objects that they reference. The approach is to be tested using the human-judged material that was produced for probability scaling of the raw scores of the online retrieval system.
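The specificity network described above can be pictured with a small data-structure sketch. The abstract does not define the relatedness measure or the storage layout, so everything here is an assumption made for illustration: the class name `SpecificityNetwork`, the hierarchy label `is_a`, the example objects, and in particular the scoring rule (the inverse of one plus the tree path length through the nearest common ancestor, maximized over hierarchies, then averaged over best matches between two documents).

```python
from dataclasses import dataclass, field


@dataclass
class SpecificityNetwork:
    """Hypothetical container: several trees over one shared set of objects."""
    # hierarchy name -> {more specific object: its more general parent}
    parents: dict = field(default_factory=dict)
    # lower-cased textual term or phrase -> the object it represents
    representatives: dict = field(default_factory=dict)

    def add_edge(self, hierarchy, general, specific):
        self.parents.setdefault(hierarchy, {})[specific] = general

    def add_representative(self, term, obj):
        # Many terms may map to the same object.
        self.representatives[term.lower()] = obj

    def lookup(self, term):
        return self.representatives.get(term.lower())

    def ancestors(self, hierarchy, obj):
        """Return [obj, parent, ..., root] within one hierarchy."""
        tree = self.parents.get(hierarchy, {})
        chain = [obj]
        while chain[-1] in tree:
            chain.append(tree[chain[-1]])
        return chain

    def relatedness(self, a, b):
        """Assumed measure: 1 / (1 + path length via nearest common ancestor)."""
        if a == b:
            return 1.0
        best = 0.0
        for hierarchy in self.parents:
            depth_b = {n: i for i, n in enumerate(self.ancestors(hierarchy, b))}
            for i, node in enumerate(self.ancestors(hierarchy, a)):
                if node in depth_b:
                    best = max(best, 1.0 / (1.0 + i + depth_b[node]))
                    break  # the first hit is the nearest common ancestor
        return best


def document_similarity(net, objs_a, objs_b):
    """Assumed rule: symmetric average of each object's best match."""
    if not objs_a or not objs_b:
        return 0.0

    def one_way(src, dst):
        return sum(max(net.relatedness(x, y) for y in dst) for x in src) / len(src)

    return 0.5 * (one_way(objs_a, objs_b) + one_way(objs_b, objs_a))


# Illustrative data only; these objects and terms are not from the project.
net = SpecificityNetwork()
net.add_edge("is_a", "disease", "infection")
net.add_edge("is_a", "infection", "pneumonia")
net.add_edge("is_a", "disease", "neoplasm")
net.add_representative("lung infection", "pneumonia")
net.add_representative("pneumonitis", "pneumonia")
```

In this sketch, two documents that reference the same object score 1.0, and the score falls as the referenced objects sit farther apart in every hierarchy they share. Any other monotone function of tree distance would serve equally well as a placeholder.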

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000021-04
Application #: 5203619
Support Year: 4
Fiscal Year: 1995
Name: National Library of Medicine
Country: United States

Wilbur, W John; Kim, Won (2009) The Ineffectiveness of Within-Document Term Frequency in Text Classification. Inf Retr Boston 12:509-525
Lu, Zhiyong; Kim, Won; Wilbur, W John (2009) Evaluating relevance ranking strategies for MEDLINE retrieval. J Am Med Inform Assoc 16:32-6
Lin, Jimmy; Wilbur, W John (2007) PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics 8:423
Wilbur, W John; Kim, Won; Xie, Natalie (2006) Spelling correction in the PubMed search engine. Inf Retr Boston 9:543-564
Kim, W; Wilbur, W J (2001) Amino acid residue environments and predictions of residue type. Comput Chem 25:411-22
Aronson, A R; Bodenreider, O; Chang, H F et al. (2000) The NLM Indexing Initiative. Proc AMIA Symp :17-21
Wilbur, W J (2000) Boosting naïve Bayesian learning on a large subset of MEDLINE. Proc AMIA Symp :918-22
Wilbur, W J; Neuwald, A F (2000) A theory of information with special application to search problems. Comput Chem 24:33-42
Wilbur, W J; Hazard Jr, G F; Divita, G et al. (1999) Analysis of biomedical text for chemical names: a comparison of three methods. Proc AMIA Symp :176-80