Named entity recognition is an important problem in the semantic processing of natural language text. It appears to be inherently more difficult in biology than in business applications or news story analysis, the domains addressed by the MUC conferences. Our interest stems from its potential importance for indexing and retrieving information about a particular gene or protein. However, truly high-quality named entity recognition in biology would have many applications as a starting point for semantic analysis. In past work on this problem we developed ABGene, a tagger for gene/protein name recognition in text, and subsequently a database of 20,000 sentences annotated for the occurrence of gene/protein names. The first 15,000 of these sentences formed the basis of the gene/protein mention recognition task for the BioCreative I Workshop held in 2004. After the BioCreative I Workshop the whole 20,000-sentence corpus was revised by 1) removing tokenization and instead providing the text of the original sentence; 2) changing the annotations to be character-based instead of token-based; 3) revising the annotation guidelines to deal with some of the problems that had become apparent in the Workshop; and 4) correcting some erroneous annotations that had come to our attention. The resulting data has become known as the GENETAG corpus. It has at least one unique property: many of the annotated entities have alternative annotations associated with them, so that more than one answer is correct for a particular entity. We believe this is important because many entities can be annotated in more than one way, and for quite a number there is no single clearly correct answer. In 2005 we were invited to co-organize BioCreative II and to be responsible for the gene mention recognition task.
For this purpose we distributed the first 15,000 sentences of GENETAG as practice and training data and used the last 5,000 sentences for testing. Whereas 14 teams participated in BioCreative I, 21 teams participated in BioCreative II. The top F score obtained on the gene/protein mention task in BioCreative I was 83.2%, while the top score in BioCreative II was 87.2%. Because the annotation guidelines changed somewhat and some of the data were corrected, one cannot say definitively how much progress this represents, but it does suggest progress. Conditional random fields were much more commonly used in BioCreative II, and new approaches to the use of unannotated data also appeared. We analyzed the annotations provided by all the participants and applied a conditional random fields approach to learn how to combine all of their predictions into an improved prediction, using 200-fold cross-validation. We were able to achieve a balanced F score of 90.7%. This indicates that there is still room for improvement in how individual systems perform on the problem of gene/protein mention detection. (with Larry Smith and Lorrie Tanabe). We have become convinced that more information about the different types of entities that can occur in MEDLINE sentences can be used to improve name recognition. This has led us to design a set of semantic categories and to attempt to fill these categories with actual names harvested from databases and web sites. We call the result SEMCAT. It currently recognizes seventy-five categories and contains about four million name strings distributed over those categories. We have experimented with probabilistic context-free grammars (PCFGs) and Markov models of text strings in an attempt to learn how to recognize the entities in different categories. However, the best approach we have found for distinguishing gene/protein names from all other names is a new algorithm we term a priority model.
Every token associated with any name in SEMCAT has two probabilities associated with it. The first is the probability that the token indicates that it is part of a gene/protein name, and the second measures how reliable the token is as an indicator. With this model, given a phrase, one can compute an estimate of the probability that the phrase is a gene/protein name. We find that with the priority model we can achieve an F score of 96%, compared with 95% for our best PCFG approach. (with Lorrie Tanabe). The top performance for gene mention recognition in BioCreative II was achieved by Rie Ando of IBM, who introduced a technique called alternating structure optimization. This approach sets up many auxiliary labeling problems that resemble named entity tagging but simply try to predict the occurrence of names or tokens from the surrounding textual context. Once the SVM solution weight vectors for these auxiliary problems have been learned, one performs a singular value decomposition and subtracts from each vector its first h components in the decomposition. This subtraction is used only to decrease the penalty in the regularization term of the cost function. The weight vectors are then relearned, and the process is repeated until convergence. The final result is a set of h components of the decomposition of the many weight vectors. One uses these components to enhance the learning on the actual named entity recognition task. The procedure is somewhat complicated and difficult to use. We are studying how we might use a similar approach, but with a simpler method of applying the auxiliary learning to improve named entity recognition. One open problem is how to combine such auxiliary learning with the SEMCAT data.
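To make the priority model described above more concrete, the following is a minimal sketch of one way the two per-token probabilities could be combined into a phrase-level estimate. The abstract does not give the combination rule, so the right-to-left weighting used here is an assumption: each later token overrides the running estimate in proportion to its reliability. The function name, the parameter table, and the default for unseen tokens are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical per-token parameters: token -> (p, q)
#   p: probability that the token indicates a gene/protein name
#   q: reliability of the token as an indicator
TokenParams = Dict[str, Tuple[float, float]]


def phrase_probability(
    tokens: List[str],
    params: TokenParams,
    default: Tuple[float, float] = (0.5, 0.0),  # assumption: unseen tokens are uninformative
) -> float:
    """Estimate the probability that a phrase is a gene/protein name.

    Assumed rule (not stated in the abstract): tokens further to the right
    take priority, each one replacing a fraction q of the running estimate
    with its own p.
    """
    if not tokens:
        raise ValueError("phrase must contain at least one token")
    # Start from the first token's p value.
    estimate = params.get(tokens[0], default)[0]
    # Each later token blends in its own p, weighted by its reliability q.
    for token in tokens[1:]:
        p, q = params.get(token, default)
        estimate = q * p + (1.0 - q) * estimate
    return estimate


if __name__ == "__main__":
    # Illustrative values only; real values would be learned from SEMCAT names.
    example_params = {"human": (0.3, 0.2), "p53": (0.95, 0.9)}
    print(phrase_probability(["human", "p53"], example_params))
```

Under this assumed rule a highly reliable final token such as a gene symbol dominates the estimate, while an unreliable token leaves the estimate from the preceding tokens largely unchanged.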
We recently co-chaired the BioCreative III Workshop, in which the main competitive tasks were to find gene mentions in a full-text article, map them to their GenBank identifiers, and score them for reliability; to classify PubMed records as likely to represent articles containing information on protein-protein interactions; and to find the text in full papers that describes the experimental method used to verify a protein-protein interaction. We organized the first of these tasks and participated in the second. In the second task we used the priority model to locate protein mentions, and it proved very successful and competitive with other approaches.

National Library of Medicine
Kim, Sun; Lu, Zhiyong; Wilbur, W John (2015) Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 16:57
Comeau, Donald C; Batista-Navarro, Riza Theresa; Dai, Hong-Jie et al. (2014) BioC interoperability track overview. Database (Oxford) 2014:
Islamaj Doğan, Rezarta; Comeau, Donald C; Yeganova, Lana et al. (2014) Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database (Oxford) 2014:
Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:
Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel et al. (2013) An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database (Oxford) 2013:bas056
Kim, Sun; Kim, Won; Wei, Chih-Hsuan et al. (2012) Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information. Database (Oxford) 2012:bas042
Kim, Sun; Kwon, Dongseop; Shin, Soo-Yong et al. (2012) PIE the search: searching PubMed literature for protein interaction information. Bioinformatics 28:597-8
Krallinger, Martin; Vazquez, Miguel; Leitner, Florian et al. (2011) The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics 12 Suppl 8:S3
Kim, Sun; Wilbur, W John (2011) Classifying protein-protein interaction articles using word and syntactic features. BMC Bioinformatics 12 Suppl 8:S9
Arighi, Cecilia N; Lu, Zhiyong; Krallinger, Martin et al. (2011) Overview of the BioCreative III Workshop. BMC Bioinformatics 12 Suppl 8:S1
