A system of C programs has been developed to find closely related documents in Medline. The system has several distinguishing features: 1) It is highly modular, so alterations are relatively simple to make. 2) It currently operates on Medline data in ASN.1 format, but changing the interface portion of the system would allow it to be applied to any large database of discrete textual records. 3) It is designed with a degree of protection against data loss from operating system crashes or power outages. 4) All data processed by the system is stored permanently in inverted files and related structures. These structures are updatable, so new data can be added continually as it becomes available. 5) Documents are compared with one another using a Bayesian form of analysis, and the statistics on which the relevance weighting of terms is based are derived from previous document comparisons. These statistics are updated with each new cycle of processing. The batch system described here now serves as the source for an online retrieval system that will let a user type in or import text and use it as a query to search the database by Bayesian retrieval. This capability is coupled with the neighboring computed by the batch system and the Boolean capabilities of the Entrez retrieval system. The plan is to provide a versatile, general-access facility for the part of Medline relevant to molecular biology. Work on this system is ongoing.
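The idea of finding a document's neighbors by weighting shared terms can be illustrated with a minimal C sketch. It is not the system's actual code. The names `Doc`, `N_TERMS`, `doc_freq`, `term_odds`, `pair_score`, and `nearest_neighbor` are hypothetical, and the weight is a simple document-frequency odds ratio used as a stand-in: the system described above derives its term statistics from previous document comparisons, which this sketch does not attempt to reproduce.

```c
#include <assert.h>
#include <stddef.h>

#define N_TERMS 4   /* hypothetical vocabulary size for the sketch */

/* A document as a binary term-incidence vector: has[t] != 0 if term t occurs. */
typedef struct {
    unsigned char has[N_TERMS];
} Doc;

/* Count, for each term, how many documents contain it. */
void doc_freq(const Doc *docs, size_t n_docs, size_t df[N_TERMS])
{
    for (size_t t = 0; t < N_TERMS; t++) {
        df[t] = 0;
        for (size_t d = 0; d < n_docs; d++)
            if (docs[d].has[t])
                df[t]++;
    }
}

/* Stand-in weight (an assumption, not the paper's statistic):
 * smoothed odds that a document does NOT contain the term.
 * Rare terms get odds above 1, very common terms get odds below 1. */
double term_odds(size_t df, size_t n_docs)
{
    assert(df <= n_docs);
    return ((double)(n_docs - df) + 0.5) / ((double)df + 0.5);
}

/* Score a pair of documents as the product of the odds of the terms
 * they share. A real system would more likely sum log-odds. */
double pair_score(const Doc *a, const Doc *b,
                  const size_t df[N_TERMS], size_t n_docs)
{
    double score = 1.0;
    for (size_t t = 0; t < N_TERMS; t++)
        if (a->has[t] && b->has[t])
            score *= term_odds(df[t], n_docs);
    return score;
}

/* Return the index of the best-scoring neighbor of docs[q],
 * or n_docs if there is no other document. */
size_t nearest_neighbor(const Doc *docs, size_t n_docs, size_t q)
{
    size_t df[N_TERMS];
    size_t best = n_docs;
    double best_score = 0.0;

    assert(q < n_docs);
    doc_freq(docs, n_docs, df);
    for (size_t d = 0; d < n_docs; d++) {
        if (d == q)
            continue;
        double s = pair_score(&docs[q], &docs[d], df, n_docs);
        if (best == n_docs || s > best_score) {
            best = d;
            best_score = s;
        }
    }
    return best;
}
```

In this sketch a term that occurs in every document lowers a pair's score rather than raising it, so sharing a rare term counts for more than sharing a common one. Recomputing `df` whenever documents are added loosely mirrors the updatable statistics mentioned above.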
Wilbur, W John; Kim, Won (2003) The dimensions of indexing. AMIA Annu Symp Proc :714-0