A system of C++ programs has been developed for finding closely related documents in Medline and for performing machine learning on sets of documents. The system has several distinguishing features: 1) It is built from a number of C++ classes and is highly modular, so alterations to the system are relatively simple to make. 2) The system currently operates on PubMed data, which it extracts from the Sybase repositories through a C++ interface to Sybase. However, changing the interface portion of the system would allow it to be applied to any large database consisting of discrete textual records. 3) All data processed by the system is stored in permanent form in structures such as inverted files. These structures are updatable, so new data can be added continually as it becomes available. 4) Documents are compared with each other using a Bayesian form of analysis, and the statistics on which the relevance weighting of terms is based are derived from previous document comparisons. These statistics are updated with each new cycle of processing. The latest work on this system has involved a study of the optimal form for the retrieval algorithm.

Keywords: C++ language, database, documents, relevance weighting of terms
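
To make features 3 and 4 concrete, the following is a minimal sketch of an updatable inverted file with log-odds term weighting for comparing documents. It is an illustration only: the abstract does not give the system's class names or its weighting formula, so the class `InvertedIndex`, its methods `add`, `weight`, and `neighbors`, and the choice of a Robertson/Sparck Jones-style weight without relevance information are all assumptions standing in for the project's Bayesian statistics. The sketch also keeps everything in memory, whereas the system described above stores its structures permanently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical class for illustration; not the project's actual code.
class InvertedIndex {
public:
    // Add one document. Can be called at any time, so the structure
    // stays updatable as new records arrive.
    void add(int doc_id, const std::vector<std::string>& terms) {
        assert(docs_.count(doc_id) == 0);  // each id is added once in this sketch
        std::set<std::string> unique(terms.begin(), terms.end());
        for (const auto& term : unique) postings_[term].insert(doc_id);
        docs_.emplace(doc_id, std::move(unique));
    }

    // Assumed stand-in weight: log((N - n + 0.5) / (n + 0.5)), where N is the
    // number of documents and n the number containing the term. Because it is
    // computed from current counts, it changes as documents are added.
    double weight(const std::string& term) const {
        auto it = postings_.find(term);
        double n = (it == postings_.end())
                       ? 0.0
                       : static_cast<double>(it->second.size());
        double N = static_cast<double>(docs_.size());
        return std::log((N - n + 0.5) / (n + 0.5));
    }

    // Score every other document that shares at least one term with doc_id
    // by summing the weights of the shared terms.
    std::map<int, double> neighbors(int doc_id) const {
        std::map<int, double> scores;
        auto doc = docs_.find(doc_id);
        if (doc == docs_.end()) return scores;
        for (const auto& term : doc->second) {
            double w = weight(term);
            for (int other : postings_.at(term)) {
                if (other != doc_id) scores[other] += w;
            }
        }
        return scores;
    }

    std::size_t size() const { return docs_.size(); }

private:
    std::map<std::string, std::set<int>> postings_;  // term -> documents
    std::map<int, std::set<std::string>> docs_;      // document -> terms
};
```

In this sketch a term found in few documents contributes more to a match than a term found in many, and a document added later becomes visible to `neighbors` immediately. A fuller version would replace `weight` with statistics gathered from earlier document comparisons, as the abstract describes.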

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000022-08
Application #: 6290480
Study Section: Special Emphasis Panel (CBB)
Support Year: 8
Fiscal Year: 1999

Name: National Library of Medicine
Country: United States

Wilbur, W John; Kim, Won (2003) The dimensions of indexing. AMIA Annu Symp Proc :714-0