Many methods have been investigated for clustering sets of documents in the hope of improving retrieval. Unfortunately, these have generally failed to improve retrieval capability. Part of the problem is that a given document often involves more than one subject, so the documents cannot be cleanly assigned to mutually exclusive categories. To overcome this problem we have developed methods designed to identify a theme among a set of documents. The theme need not encompass the whole of any document; it only needs to exist in some subset of the documents to be identifiable. Some of these same documents may participate in the definition of several themes. The method of finding themes is based on the EM (expectation-maximization) algorithm and uses an iterative procedure that converges to themes. The method has been implemented and tested and found to be successful. More recently the method has been improved, and we have applied it to detect themes in a database of 52k documents on the subject of AIDS and in over 200k documents dealing with lymphocytes. Most recently we finished applying it to all of MEDLINE, defining 107,863 themes. These are now being tested to see how they may be used to enhance retrieval in MEDLINE and possibly other databases.
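
The abstract does not spell out the model behind the theme-finding procedure, so the following is only an illustrative sketch of how an EM iteration of this general kind might look, not the project's actual method. It assumes a two-component Bernoulli mixture (a "theme" component and a "background" component) over the terms present in each document. The function name `find_theme`, the seed-document initialization, the smoothing constant, and the toy corpus are all hypothetical choices made for the example.

```python
import math

def _sigmoid(x):
    # Numerically safe logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def find_theme(docs, seed_ids, n_iter=25, smooth=0.5):
    """Hypothetical sketch: EM for a theme-vs-background Bernoulli mixture.

    docs     -- list of sets of terms, one set per document
    seed_ids -- indices of documents assumed to start the theme
    Returns (weights, term_scores): each document's membership weight
    in the theme and each term's log ratio of theme to background rate.
    """
    vocab = sorted(set().union(*docs))
    n = len(docs)
    # Initial soft memberships: seeds lean toward the theme.
    resp = [0.9 if i in seed_ids else 0.1 for i in range(n)]
    p, q = {}, {}
    for _ in range(n_iter):
        # M-step: re-estimate the prior and the per-term rates.
        r_sum = sum(resp)
        prior = min(max(r_sum / n, 1e-6), 1.0 - 1e-6)
        for t in vocab:
            n_t = sum(1 for d in docs if t in d)
            r_t = sum(r for r, d in zip(resp, docs) if t in d)
            p[t] = (r_t + smooth) / (r_sum + 2 * smooth)            # theme rate
            q[t] = (n_t - r_t + smooth) / (n - r_sum + 2 * smooth)  # background rate
        # E-step: posterior probability that each document is in the theme.
        new_resp = []
        for d in docs:
            log_odds = math.log(prior / (1.0 - prior))
            for t in vocab:
                if t in d:
                    log_odds += math.log(p[t] / q[t])
                else:
                    log_odds += math.log((1.0 - p[t]) / (1.0 - q[t]))
            new_resp.append(_sigmoid(log_odds))
        resp = new_resp
    term_scores = {t: math.log(p[t] / q[t]) for t in vocab}
    return resp, term_scores

# Toy corpus (invented for illustration); document 2 mixes two subjects.
DOCS = [
    {"hiv", "aids", "virus"},
    {"hiv", "aids", "therapy"},
    {"hiv", "virus", "therapy", "lymphocyte"},
    {"lymphocyte", "tcell", "antigen"},
    {"lymphocyte", "tcell", "receptor"},
    {"tcell", "antigen", "receptor"},
]
weights, term_scores = find_theme(DOCS, seed_ids={0})
```

Because each theme in this sketch is fitted independently against a background, running it again with a different seed would yield another set of membership weights, so one document could carry a high weight in several themes. That mirrors the property described above, though the real procedure may achieve it differently.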

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000089-06
Application #: 6988463
Study Section: (CBB)
Support Year: 6
Fiscal Year: 2004
Institution Name: National Library of Medicine
Country: United States
Kim, Won; Wilbur, W John (2005) A strategy for assigning new concepts in the MEDLINE database. AMIA Annu Symp Proc :395-9
Shatkay, H; Edwards, S; Wilbur, W J et al. (2000) Genes, themes and microarrays: using information retrieval for large-scale gene analysis. Proc Int Conf Intell Syst Mol Biol 8:317-28