General and Semi-supervised Machine Learning Applied to Bioinformatics

Wilbur, Willy

Abstract

1) Many methods have been investigated for clustering sets of documents in the hope of improving retrieval. Unfortunately, these have generally failed to improve retrieval capability. Part of the problem is that a given document often involves more than one subject, so documents cannot be cleanly assigned to one category to the exclusion of others. To overcome this problem we have developed methods designed to identify a theme among a set of documents. The theme need not encompass the whole of any document; it only needs to exist in some subset of the documents in order to be identifiable. Some of these same documents may participate in the definition of several themes. One method of finding themes is based on the EM algorithm and uses an iterative procedure that converges to themes. The method has been implemented and tested and found to be successful.

2) A second approach can be based on the singular value decomposition and is essentially a vector approach.

3) We are also investigating other methods to extract higher-level features. One method we are currently studying is to perform machine learning with an SVM or other classifier and score the documents based on this learning. The Pool Adjacent Violators (PAV) algorithm can then be applied to the resulting scores, and the resulting score function can be discretized without significant loss of information. This allows us to use the results as features that can be individually weighted in another classifier.

4) We have developed a new algorithm called the periodic random orbiter algorithm (PROBE), which can be used to minimize any convex loss function. We have applied it to the MeSH classification problem, where it appears to work very well and better than the alternatives on a problem of this size.

5) Stochastic Gradient Descent (SGD) has gained popularity for solving large-scale supervised machine learning problems. It provides a rapid method for minimizing a number of loss functions and is applicable to Support Vector Machine (SVM) and logistic regression optimization. However, SGD does not provide a convenient stopping criterion. Generally, an optimal number of iterations over the data is determined using held-out data. We have compared stopping predictions based on held-out data with simply stopping at a fixed number of iterations, and found that the latter works as well as the former for a number of commonly studied text classification problems. In particular, fixed stopping works well for MeSH predictions on PubMed records. We also surveyed the published algorithms for SVM learning on large data sets and chose three (PROBE, SVMperf, and Liblinear) to compare with SGD run for a fixed number of iterations. We find that SGD with a fixed number of iterations performs as well as these alternative methods and is much faster to compute. As an application, we have made SGD-SVM predictions for all MeSH terms and used the PAV algorithm to convert these predictions to probabilities. Such probabilistic predictions lead to ranked MeSH term predictions superior to previously published results on two test sets.

6) We are also investigating methods to create features for machine learning using dependency parses and syntactic parse trees.
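The pipeline described in item 5 (an SVM trained by SGD for a fixed number of passes, followed by PAV to turn raw scores into probabilities) can be illustrated with a small sketch. This is not the project's implementation: the function names, the Pegasos-style step size, the regularization constant, the omission of a bias term, and the toy data are all assumptions made for illustration.

```python
import bisect
import random


def score(w, x):
    """Linear decision value w . x (no bias term, to keep the sketch short)."""
    return sum(wj * xj for wj, xj in zip(w, x))


def sgd_svm(X, y, epochs=5, lam=0.01, seed=0):
    """Hinge-loss linear SVM trained by SGD for a fixed number of epochs.

    Labels in y are +1 / -1. The 1 / (lam * t) step size is an assumed,
    Pegasos-style schedule, not one taken from the abstract.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):  # fixed stopping: no held-out data is consulted
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            violated = y[i] * score(w, X[i]) < 1.0
            # L2 shrinkage: (1 - eta * lam) simplifies to (1 - 1 / t).
            w = [(1.0 - 1.0 / t) * wj for wj in w]
            if violated:  # hinge loss is active, so take a gradient step
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def pav(scores, labels):
    """Pool Adjacent Violators: non-decreasing step map from score to P(label = 1).

    Labels are 0 / 1. Returns (upper_score_bound, probability) pairs.
    """
    blocks = []  # each block is [sum_of_labels, count, highest_score]
    for s, lab in sorted(zip(scores, labels)):
        blocks.append([float(lab), 1, s])
        # Pool while the previous block's mean is not below the newest block's.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]):
            total, count, high = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = high
    return [(high, total / count) for total, count, high in blocks]


def calibrated_probability(steps, s):
    """Probability of the first block whose upper bound is >= s (clamped)."""
    uppers = [high for high, _ in steps]
    k = min(bisect.bisect_left(uppers, s), len(steps) - 1)
    return steps[k][1]


if __name__ == "__main__":
    # Hypothetical toy data: two features, labels +1 / -1.
    X = [[2.0, 1.0], [1.5, 0.5], [1.0, 1.5],
         [-1.0, -1.5], [-2.0, -0.5], [-1.5, -1.0]]
    y = [1, 1, 1, -1, -1, -1]
    w = sgd_svm(X, y, epochs=5)
    scores = [score(w, x) for x in X]
    steps = pav(scores, [1 if lab > 0 else 0 for lab in y])
    for x, s in zip(X, scores):
        print(x, round(s, 3), calibrated_probability(steps, s))
```

Because PAV pools adjacent scores into blocks with a single probability each, the fitted function is already a step function, which is in the spirit of the discretization mentioned in item 3: the block index could serve as a categorical feature for a second classifier.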

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Investigator-Initiated Intramural Research Projects (ZIA)
Project #: 1ZIALM000089-17
Application #: 9160914
Support Year: 17
Fiscal Year: 2015

Institution

Name: National Library of Medicine

Related projects


NIH 2015 ZIA LM	General and Semi-supervised Machine Learning Applied to Bioinformatics Wilbur, Willy / National Library of Medicine
NIH 2014 ZIA LM	General and Semi-supervised Machine Learning Applied to Bioinformatics Wilbur, Willy / National Library of Medicine
NIH 2013 ZIA LM	General and Semi-supervised Machine Learning Applied to Bioinformatics Wilbur, Willy / National Library of Medicine	$575,296
NIH 2012 ZIA LM	General and Semi-supervised Machine Learning Applied to Bioinformatics Wilbur, Willy / National Library of Medicine	$563,995
NIH 2011 ZIA LM	General and Semi-supervised Machine Learning Applied to Bioinformatics Wilbur, Willy / National Library of Medicine	$599,613
NIH 2010 ZIA LM	General and Semi-supervised Machine Learning Applied to Bioinformatics Wilbur, Willy / National Library of Medicine	$470,088
NIH 2009 ZIA LM	General and Semi-supervised Machine Learning Applied to Bioinformatics Wilbur, Willy / National Library of Medicine	$221,141

Publications

Kim, Sun; Lu, Zhiyong; Wilbur, W John (2015) Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 16:57

Kim, Sun; Liu, Haibin; Yeganova, Lana et al. (2015) Extracting drug-drug interactions from literature using a rich feature-based linear kernel approach. J Biomed Inform 55:23-30

Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:

Wilbur, W John; Kim, Won (2014) Stochastic Gradient Descent and the Prediction of MeSH for PubMed Records. AMIA Annu Symp Proc 2014:1198-207

Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel et al. (2013) An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database (Oxford) 2013:bas056

Wilbur, W John; Smith, Larry (2013) A Study of the Morpho-Semantic Relationship in Medline. Open Inf Syst J 6:1-12

Névéol, Aurélie; Wilbur, W John; Lu, Zhiyong (2012) Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE. Database (Oxford) 2012:bas026

Kim, Sun; Wilbur, W John (2012) Thematic clustering of text documents using an EM-based approach. J Biomed Semantics 3 Suppl 3:S6

Wilbur, W John; Kim, Won (2011) Improving a gold standard: treating human relevance judgments of MEDLINE document pairs. BMC Bioinformatics 12 Suppl 3:S5

Kim, Won; Wilbur, W John (2011) Improving a Gold Standard: Treating Human Relevance Judgments of MEDLINE Document Pairs. Proc Int Conf Mach Learn Appl 2010:491-498

Showing the most recent 10 out of 14 publications
