1) Many methods have been investigated for clustering sets of documents in the hope of improving retrieval. Unfortunately, these have generally failed to improve retrieval capability. Part of the problem is clearly that a given document often involves more than one subject, so the documents cannot be cleanly partitioned into mutually exclusive categories. To overcome this problem, we have developed methods designed to identify a theme among a set of documents. The theme need not encompass the whole of any document; it only needs to exist in some subset of the documents to be identifiable. Some of these same documents may participate in the definition of several themes. One method of finding themes is based on the expectation-maximization (EM) algorithm and uses an iterative procedure that converges to themes. The method has been implemented and tested and found to be successful.
2) A second approach can be based on the singular value decomposition and is essentially a vector approach.
3) We are also investigating other methods to extract higher-level features. One method we are currently studying is to perform machine learning with a support vector machine (SVM) or other classifier and score the documents based on this learning. The pool adjacent violators (PAV) algorithm can then be applied to the resulting scores, and the resulting score function can be discretized without significant loss of information. This allows us to use the results as features that can be individually weighted in another classifier.
4) We have developed a new algorithm called the periodic random orbiter algorithm (PROBE), which can be applied to minimize any convex loss function. We have applied it to the MeSH classification problem, and it seems to work very well, and better than the alternatives, on such a large problem.
5) We are currently studying ways to apply stochastic gradient descent (SGD) to large training sets to achieve better efficiency than can be obtained by more conventional methods.
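
The score-then-discretize step described in item 3 can be illustrated with a minimal sketch. It assumes that classifier scores and binary relevance labels are already available; the toy scores below merely stand in for SVM decision values, and the names `pav_bins` and `bin_index` are hypothetical, not part of the project's software. PAV fits a non-decreasing step function of label frequency against score, and each step of that function then serves as one discrete feature value.

```python
from bisect import bisect_left
from collections import defaultdict


def pav_bins(scores, labels):
    """Pool adjacent violators: fit a non-decreasing step function.

    Returns a list of (upper_score_bound, mean_label) pairs, one per step.
    """
    # Aggregate tied scores first so each distinct score is one starting block.
    agg = defaultdict(lambda: [0.0, 0])
    for s, y in zip(scores, labels):
        agg[s][0] += y
        agg[s][1] += 1

    blocks = []  # each block is [label_sum, count, upper_score_bound]
    for s in sorted(agg):
        total, n = agg[s]
        blocks.append([total, n, s])
        # Merge backwards while the previous block's mean is not below
        # the current block's mean (this also pools equal-mean neighbours).
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]):
            total, n, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += n
            blocks[-1][2] = upper
    return [(upper, total / n) for total, n, upper in blocks]


def bin_index(bins, score):
    """Map a raw score to the index of the PAV step that contains it."""
    uppers = [upper for upper, _ in bins]
    # Scores above the last bound fall into the last step.
    return min(bisect_left(uppers, score), len(bins) - 1)


# Toy data (assumed for illustration): classifier scores and true labels.
scores = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5]
labels = [0, 0, 1, 0, 0, 1, 1, 1]

bins = pav_bins(scores, labels)
features = [bin_index(bins, s) for s in scores]
print(bins)
print(features)
```

Because every score inside a step receives the same fitted value, replacing the raw score by its step index discards nothing that the fitted function carries, which is one way to read the claim that discretization loses little information. The step indices can then be one-hot encoded and weighted individually by a downstream classifier.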

National Library of Medicine