A software system has been developed to find closely related documents in Medline. It consists of more than thirty-five programs written in C, plus a number of utility programs. The system has several distinctive features: 1) It is highly modular, so alterations are relatively simple to make. 2) It currently operates on Medline data in ASN.1 format, but a change to the interface portion of the system would allow it to be applied to any large database of discrete textual records. 3) It is designed with a degree of protection against data loss from operating system crashes or power outages. 4) All data processed by the system is stored permanently in inverted file structures and related forms. These structures are updateable, so new data can be added continually as it becomes available. 5) Documents are compared with each other using a Bayesian form of analysis, and the statistics on which the relevance weighting of terms is based are derived from previous document comparisons. These statistics are updated with each new processing cycle. 6) The probability that two documents are related is computed by scaling the raw scores against a set of document pairs that human judges have assessed for relatedness. This scale is recalculated each time term weights are updated, and it is calculated separately for documents with abstracts and documents without. The most glaring failures of the system, as identified on the test set used for scaling document similarity, have been analyzed. The analysis shows that a significant part of the problems may stem from the common occurrence, within a single document, of several areas that differ in level of description and in uniformity of description.
The numerical representation of these descriptive areas within a document does not in general correlate with their importance in defining document content.
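The scaling step in feature 6 can be illustrated with a minimal sketch. The abstract does not say how raw scores are mapped to probabilities, so the approach below (binning raw scores and estimating the fraction of related pairs per bin, with add-one smoothing) is an assumption, as are the names `fit_scale`, `fit_scales`, `probability`, the bin edges, and the toy judged pairs. The `has_abstract` flag is likewise a hypothetical simplification of "documents with as opposed to documents without abstracts".

```python
from bisect import bisect_right
from collections import defaultdict


def fit_scale(judged_pairs, edges):
    """Estimate P(related | score bin) from human-judged pairs.

    judged_pairs: iterable of (raw_score, is_related) tuples.
    edges: ascending bin boundaries (a hypothetical choice, not from the source).
    Returns len(edges) + 1 probabilities, one per bin.
    """
    related = [0] * (len(edges) + 1)
    total = [0] * (len(edges) + 1)
    for score, is_related in judged_pairs:
        b = bisect_right(edges, score)  # index of the bin containing score
        total[b] += 1
        related[b] += 1 if is_related else 0
    # Add-one smoothing: an empty bin yields 0.5 instead of dividing by zero.
    return [(r + 1) / (t + 2) for r, t in zip(related, total)]


def fit_scales(judged, edges):
    """Fit one scale per group; judged holds (score, is_related, has_abstract)."""
    groups = defaultdict(list)
    for score, is_related, has_abstract in judged:
        groups[has_abstract].append((score, is_related))
    return {flag: fit_scale(pairs, edges) for flag, pairs in groups.items()}


def probability(scale, edges, raw_score):
    """Look up the estimated probability of relatedness for a raw score."""
    return scale[bisect_right(edges, raw_score)]


# Toy judged set, invented purely for illustration.
EDGES = [1.0, 2.0]
JUDGED = [
    (0.5, False, True), (0.7, False, True),
    (1.5, True, True), (1.6, False, True),
    (2.5, True, True), (2.8, True, True),
    (0.4, False, False), (1.2, True, False), (2.2, True, False),
]

scales = fit_scales(JUDGED, EDGES)
print(probability(scales[True], EDGES, 2.5))   # scale for documents with abstracts
print(probability(scales[False], EDGES, 2.5))  # scale for documents without
```

In this sketch, refitting after each term-weight update amounts to calling `fit_scales` again on the rescored judged pairs, which mirrors the recalculation the abstract describes.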
Wilbur, W John; Kim, Won (2003) The dimensions of indexing. AMIA Annu Symp Proc :714-0