1) Many methods for clustering sets of documents have been investigated in the hope of improving retrieval. Unfortunately, these have generally failed to improve retrieval capability. Part of the problem is that a given document often involves more than one subject, so the documents cannot be cleanly assigned to mutually exclusive categories. To overcome this problem, we have developed methods designed to identify a theme among a set of documents. The theme need not encompass the whole of any document; it only needs to exist in some subset of the documents to be identifiable. Some of these same documents may participate in the definition of several themes. One method of finding themes is based on the EM algorithm and uses an iterative procedure that converges to themes. The method has been implemented and tested, and found to be successful.

2) A second approach can be based on the singular value decomposition and is essentially a vector approach.

3) We are also investigating other methods to extract higher-level features. One method of interest is sparse coding, which is the basis of self-taught learning.
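
As a rough illustration of the vector approach based on the singular value decomposition, the following Python sketch factors a small document-term matrix and treats the leading singular directions as candidate themes. It is not the project's implementation: the function name `svd_themes`, the toy terms, and the counts are all assumptions made up for this example. The only point it makes is that a document can carry nonzero weight on more than one theme.

```python
import numpy as np


def svd_themes(doc_term, n_themes):
    """Truncated SVD of a (n_docs, n_terms) matrix.

    Returns doc_weights with shape (n_docs, n_themes) and
    theme_terms with shape (n_themes, n_terms).
    Hypothetical helper for illustration only.
    """
    u, s, vt = np.linalg.svd(doc_term, full_matrices=False)
    # Scale the left singular vectors by the singular values so that
    # doc_weights @ theme_terms approximates the original matrix.
    doc_weights = u[:, :n_themes] * s[:n_themes]
    theme_terms = vt[:n_themes]
    return doc_weights, theme_terms


# Toy corpus (assumed data): rows are documents, columns are term counts.
terms = ["gene", "protein", "drug", "dose"]
doc_term = np.array([
    [2.0, 2.0, 0.0, 0.0],  # mostly about genes and proteins
    [0.0, 0.0, 3.0, 3.0],  # mostly about drugs and dosing
    [1.0, 1.0, 1.0, 1.0],  # touches both subjects
])

doc_weights, theme_terms = svd_themes(doc_term, n_themes=2)

for k, direction in enumerate(theme_terms):
    # Rank terms by the magnitude of their coefficient in this direction;
    # the sign of a singular vector is arbitrary.
    order = np.argsort(-np.abs(direction))
    print(f"theme {k}:", [terms[i] for i in order[:2]])

print("document weights on each theme:")
print(np.round(doc_weights, 3))
```

Because singular vectors are orthogonal and their signs are arbitrary, the directions found this way need not line up with intuitive subjects; the sketch shows only the mechanics of the decomposition.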

Support Year: 12
Fiscal Year: 2010
Total Cost: $470,088
Name: National Library of Medicine
Kim, Sun; Lu, Zhiyong; Wilbur, W John (2015) Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 16:57
Kim, Sun; Liu, Haibin; Yeganova, Lana et al. (2015) Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach. J Biomed Inform 55:23-30
Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:
Wilbur, W John; Kim, Won (2014) Stochastic Gradient Descent and the Prediction of MeSH for PubMed Records. AMIA Annu Symp Proc 2014:1198-207
Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel et al. (2013) An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database (Oxford) 2013:bas056
Wilbur, W John; Smith, Larry (2013) A Study of the Morpho-Semantic Relationship in Medline. Open Inf Syst J 6:1-12
Névéol, Aurélie; Wilbur, W John; Lu, Zhiyong (2012) Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE. Database (Oxford) 2012:bas026
Kim, Sun; Wilbur, W John (2012) Thematic clustering of text documents using an EM-based approach. J Biomed Semantics 3 Suppl 3:S6
Wilbur, W John; Kim, Won (2011) Improving a gold standard: treating human relevance judgments of MEDLINE document pairs. BMC Bioinformatics 12 Suppl 3:S5
Kim, Won; Wilbur, W John (2011) Improving a Gold Standard: Treating Human Relevance Judgments of MEDLINE Document Pairs. Proc Int Conf Mach Learn Appl 2010:491-498

Showing the most recent 10 out of 14 publications