1) We have become convinced that more information about the different types of entities that occur in MEDLINE sentences can be used to improve name recognition. This has led us to design a set of semantic categories and to fill these categories with actual names harvested from databases and web sites. We call the result SEMCAT. It currently comprises seventy-five categories and contains about four million name strings distributed over those categories. We have experimented with probabilistic context-free grammars (PCFGs) and Markov models of text strings in an attempt to learn how to recognize the entities in different categories. However, the best approach we have found for distinguishing gene/protein names from all other names is a new algorithm we term a priority model. Every token that occurs in any SEMCAT name has two probabilities associated with it. The first is the probability that the token indicates a gene/protein name, and the second measures how reliable the token is as an indicator. With this model, given a phrase, one can compute an estimate of the probability that the phrase is a gene/protein name. We find that the priority model achieves an F score of 96%, compared with 95% for our best PCFG approach (with Lorrie Tanabe).

The top performance for gene mention recognition in BioCreative II was achieved by Rie Ando of IBM, who introduced a technique called alternating structure optimization. This approach sets up many auxiliary labeling problems that resemble named entity tagging but simply try to predict the occurrence of names or tokens from the surrounding textual context. Once the SVM weight vectors for these auxiliary problems have been learned, one performs a singular value decomposition and subtracts from each vector its first h components in the decomposition. This subtraction is used only to decrease the penalty in the regularization term of the cost function. The weight vectors are then relearned, and the process is repeated until convergence. The final result is a set of h components of the decomposition of the many weight vectors. One uses these components to enhance learning on the actual named entity recognition task. The procedure is complicated and difficult to use. We are studying how to use a similar approach, but with a simpler method of applying the auxiliary learning to improve named entity recognition. One open problem is how to combine such auxiliary learning with the SEMCAT data. We are also working to improve the priority model by finding a way to apply it to more than two classes at a time.

2) We recently co-chaired the BioCreative III Workshop, in which the main competitive tasks were: to find gene mentions in a full-text article, map them to their GenBank identifiers, and score them for reliability; to classify PubMed records as likely to represent articles containing information on protein-protein interactions; and to find the text in full papers that describes the method an experimenter used to verify a protein-protein interaction. We organized the first of these tasks and participated in the second. In the second task we used the priority model to locate protein mentions, and it proved very successful and competitive with other approaches.

3) We are currently working to develop more general methods of finding high-value articles for protein-protein interactions (PPI) based on their abstracts. This effort involves not only more powerful ranking methods but also ways to display evidence to the user for quick evaluation.
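
As a minimal sketch of how the two per-token probabilities in the priority model of item 1 could be combined into a phrase-level estimate, the Python below assumes one particular combination rule that the text does not spell out: tokens are read left to right, and each later token overrides the running estimate with a weight equal to its reliability. The function name, the (p, q) parameter table, and the numeric values are hypothetical illustrations, not SEMCAT data.

```python
from typing import Dict, List, Tuple

# Hypothetical parameter table: token -> (p, q), where
#   p = probability that the token indicates a gene/protein name
#   q = reliability of the token as an indicator
TokenParams = Dict[str, Tuple[float, float]]


def phrase_probability(
    tokens: List[str],
    params: TokenParams,
    default: Tuple[float, float] = (0.5, 0.0),
) -> float:
    """Estimate the probability that a phrase is a gene/protein name.

    Assumed rule: start from the first token's p, then for each later
    token mix its p into the running estimate with weight q.
    Unknown tokens fall back to `default` (an assumption), whose q of 0
    leaves the running estimate unchanged.
    """
    if not tokens:
        raise ValueError("phrase must contain at least one token")
    prob, _ = params.get(tokens[0], default)
    for token in tokens[1:]:
        p, q = params.get(token, default)
        prob = q * p + (1.0 - q) * prob
    return prob


# Illustrative values only.
example_params: TokenParams = {
    "human": (0.40, 0.20),
    "insulin": (0.90, 0.50),
    "receptor": (0.95, 0.80),
}
estimate = phrase_probability(["human", "insulin", "receptor"], example_params)
# estimate is approximately 0.89
```

Under this assumed rule a highly reliable token near the end of the phrase dominates the estimate, which is one way to read the word "priority"; other combination rules are possible.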

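The alternating procedure described in item 1 can be sketched roughly as follows. This is an illustration under stated assumptions, not Ando's implementation: ridge regression with squared loss stands in for the SVMs so that each auxiliary problem has a closed-form solution, a fixed number of iterations replaces a convergence test, and the function name and parameters are hypothetical. The sketch requires NumPy.

```python
import numpy as np


def aso_shared_structure(X, Y, h, lam=1.0, n_iter=10, eps=1e-8):
    """Learn h shared directions from many auxiliary prediction problems.

    X: (n, d) feature matrix shared by all auxiliary problems.
    Y: (n, m) targets, one column per auxiliary problem.
    Returns theta of shape (h, d) with orthonormal rows.
    Requires h <= min(d, m).
    """
    n, d = X.shape
    gram = X.T @ X / n
    xty = X.T @ Y / n
    theta = np.zeros((h, d))  # start with no shared structure
    for _ in range(n_iter):
        # Regularize only the part of each weight vector that lies
        # outside the span of theta; the projection onto theta is
        # "subtracted" from the penalty. eps keeps the system solvable.
        penalty = lam * (np.eye(d) - theta.T @ theta) + eps * np.eye(d)
        # Squared-loss weight vectors for all auxiliary problems: (d, m).
        U = np.linalg.solve(gram + penalty, xty)
        # The top-h left singular vectors become the new shared structure.
        left, _, _ = np.linalg.svd(U, full_matrices=False)
        theta = left[:, :h].T
    return theta
```

On the target named entity task, one might then append the projected features `X @ theta.T` to the original features before training the tagger; that is one way of using the h components, stated here as an assumption.
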
National Library of Medicine
Kim, Sun; Lu, Zhiyong; Wilbur, W John (2015) Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 16:57
Comeau, Donald C; Batista-Navarro, Riza Theresa; Dai, Hong-Jie et al. (2014) BioC interoperability track overview. Database (Oxford) 2014:
Islamaj Doğan, Rezarta; Comeau, Donald C; Yeganova, Lana et al. (2014) Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database (Oxford) 2014:
Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:
Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel et al. (2013) An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database (Oxford) 2013:bas056
Kim, Sun; Kim, Won; Wei, Chih-Hsuan et al. (2012) Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information. Database (Oxford) 2012:bas042
Kim, Sun; Kwon, Dongseop; Shin, Soo-Yong et al. (2012) PIE the search: searching PubMed literature for protein interaction information. Bioinformatics 28:597-8
Krallinger, Martin; Vazquez, Miguel; Leitner, Florian et al. (2011) The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics 12 Suppl 8:S3
Kim, Sun; Wilbur, W John (2011) Classifying protein-protein interaction articles using word and syntactic features. BMC Bioinformatics 12 Suppl 8:S9
Arighi, Cecilia N; Lu, Zhiyong; Krallinger, Martin et al. (2011) Overview of the BioCreative III Workshop. BMC Bioinformatics 12 Suppl 8:S1
