Free Text Gene Name Recognition

Wilbur, Willy

Abstract

1) We have become convinced that more information about the different types of entities that can occur in MEDLINE sentences can be used to improve name recognition. This led us to design a set of semantic categories and to fill these categories with actual names harvested from databases and web sites. We call the result SEMCAT. It currently recognizes seventy-five categories and contains about four million name strings distributed over those categories. We have experimented with probabilistic context-free grammars (PCFGs) and Markov models of text strings in an attempt to learn how to recognize the entities in different categories. However, the best approach we have found for distinguishing gene/protein names from non-gene/protein names is a new algorithm we term a priority model. Every token that occurs in any SEMCAT name has two probabilities associated with it. The first is the probability that the token indicates a gene/protein name, and the second measures how reliable the token is as an indicator. With this model, given a phrase, one can compute an estimate of the probability that the phrase is a gene/protein name. With the priority model we achieve an F score of 96%, compared with 95% for our best PCFG approach (with Lorrie Tanabe).

The top performance for gene mention recognition in BioCreative II was achieved by Rie Ando of IBM, who introduced a technique called alternating structure optimization. This approach sets up many auxiliary labeling problems that resemble named entity tagging but simply try to predict the occurrence of names or tokens from the surrounding textual context. Once the SVM weight vectors for these auxiliary problems have been learned, one performs a singular value decomposition and subtracts from each vector its first h components in the decomposition. This subtraction is used only to decrease the penalty in the regularization term of the cost function. The weight vectors are then relearned, and the process is repeated until convergence. The final result is a set of h components of the decomposition of the many weight vectors, which one uses to enhance learning on the actual named entity recognition task. This is somewhat complicated and difficult to use. We are studying how we might use a similar approach, but with a simpler method of applying the auxiliary learning to improve named entity recognition. One problem is how to combine such auxiliary learning with the SEMCAT data. We are currently working to improve this model by finding a way to apply it to more than two classes at a time.

2) We recently co-chaired the BioCreative III Workshop, in which the main competitive tasks were: to find gene mentions in a full-text article, map them to their GenBank identifiers, and score them for reliability; to classify PubMed records as likely to represent articles containing information on protein-protein interactions; and to find the text in full papers that describes the method used by an experimenter to experimentally verify a protein-protein interaction. We organized the first of these tasks and participated in the second. In the second task we used the priority model to locate protein mentions, and it proved very successful and competitive with other approaches.

3) We are currently working to develop more general methods of finding high-value articles for protein-protein interactions (PPI) based on their abstracts. This effort involves not only more powerful ranking methods but also ways to display evidence to the user for quick evaluation.
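The priority model described under 1) attaches two numbers to each token: the probability that it indicates a gene/protein name, and its reliability as an indicator. The abstract does not state how these are combined into a phrase-level estimate, so the following is only a minimal sketch under an assumed rule: tokens further to the right take priority, each one deciding the outcome with probability equal to its reliability and otherwise deferring to the tokens on its left. The function name, the default for unseen tokens, and all numeric values are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical per-token parameters: token -> (p, q)
#   p: probability that the token indicates a gene/protein name
#   q: reliability of the token as an indicator
TokenParams = Dict[str, Tuple[float, float]]


def phrase_probability(
    tokens: List[str],
    params: TokenParams,
    default: Tuple[float, float] = (0.5, 0.0),  # assumed values for unseen tokens
) -> float:
    """Estimate the probability that a phrase is a gene/protein name.

    Assumed combination rule (not given in the abstract):
        prob_1 = p_1
        prob_i = q_i * p_i + (1 - q_i) * prob_(i-1)
    so a reliable token late in the phrase dominates the estimate.
    """
    if not tokens:
        raise ValueError("phrase must contain at least one token")
    # The leftmost token is the fallback when no later token decides.
    prob = params.get(tokens[0], default)[0]
    for tok in tokens[1:]:
        p, q = params.get(tok, default)
        prob = q * p + (1.0 - q) * prob
    return prob


# Illustrative values only; real values would be learned from SEMCAT names.
example_params: TokenParams = {
    "human": (0.30, 0.10),
    "insulin": (0.80, 0.50),
    "receptor": (0.95, 0.90),
}
score = phrase_probability(["human", "insulin", "receptor"], example_params)
print(round(score, 4))
```

Under this assumed rule a highly reliable final token such as "receptor" largely determines the result, while an unseen token (reliability 0 by default) leaves the running estimate unchanged.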
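The singular value decomposition step in the alternating structure optimization description above can be illustrated as follows. This is a sketch of that single step only, assuming the auxiliary weight vectors are stacked as columns of a matrix; it is not the full alternating procedure and not Ando's implementation. The function name and array layout are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def shared_structure(W: np.ndarray, h: int):
    """Extract h shared components from auxiliary weight vectors.

    W has shape (n_features, n_tasks), one column per auxiliary problem
    (an assumed layout). Returns:
      theta:    (h, n_features) matrix whose rows are the first h left
                singular vectors of W
      residual: W with its projection onto those h components subtracted,
                i.e. the part that would still be penalized by the
                regularization term
    """
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    theta = U[:, :h].T                      # orthonormal rows
    residual = W - theta.T @ (theta @ W)    # remove the first h components
    return theta, residual


# Toy example with random weights standing in for learned SVM vectors.
rng = np.random.default_rng(0)
W_aux = rng.normal(size=(6, 4))
theta, residual = shared_structure(W_aux, h=2)
print(theta.shape, residual.shape)
```

In the full method this step would alternate with relearning the weight vectors until convergence, and the rows of `theta` would then supply additional features for the target named entity recognition task.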

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Investigator-Initiated Intramural Research Projects (ZIA)
Project #: 1ZIALM000093-11
Application #: 8344950
Support Year: 11
Fiscal Year: 2011
Total Cost: $179,884

Institution

Name: National Library of Medicine

Related projects


NIH 2015 ZIA LM	Free Text Gene Name Recognition Wilbur, Willy / National Library of Medicine
NIH 2014 ZIA LM	Free Text Gene Name Recognition Wilbur, Willy / National Library of Medicine
NIH 2013 ZIA LM	Free Text Gene Name Recognition Wilbur, Willy / National Library of Medicine	$369,833
NIH 2012 ZIA LM	Free Text Gene Name Recognition Wilbur, Willy / National Library of Medicine	$195,229
NIH 2011 ZIA LM	Free Text Gene Name Recognition Wilbur, Willy / National Library of Medicine	$179,884
NIH 2010 ZIA LM	Free Text Gene Name Recognition Wilbur, Willy / National Library of Medicine	$195,870
NIH 2009 ZIA LM	Free Text Gene Name Recognition Wilbur, Willy / National Library of Medicine	$221,141

Publications

Kim, Sun; Lu, Zhiyong; Wilbur, W John (2015) Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 16:57

Comeau, Donald C; Batista-Navarro, Riza Theresa; Dai, Hong-Jie et al. (2014) BioC interoperability track overview. Database (Oxford) 2014:

Islamaj Doğan, Rezarta; Comeau, Donald C; Yeganova, Lana et al. (2014) Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database (Oxford) 2014:

Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:

Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel et al. (2013) An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database (Oxford) 2013:bas056

Kim, Sun; Kim, Won; Wei, Chih-Hsuan et al. (2012) Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information. Database (Oxford) 2012:bas042

Kim, Sun; Kwon, Dongseop; Shin, Soo-Yong et al. (2012) PIE the search: searching PubMed literature for protein interaction information. Bioinformatics 28:597-8

Krallinger, Martin; Vazquez, Miguel; Leitner, Florian et al. (2011) The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics 12 Suppl 8:S3

Kim, Sun; Wilbur, W John (2011) Classifying protein-protein interaction articles using word and syntactic features. BMC Bioinformatics 12 Suppl 8:S9

Arighi, Cecilia N; Lu, Zhiyong; Krallinger, Martin et al. (2011) Overview of the BioCreative III Workshop. BMC Bioinformatics 12 Suppl 8:S1

Showing the most recent 10 out of 14 publications
