1) I have been a co-organizer of the BioCreative Workshops since 2005 and have taken part in BioCreative II (2007), BioCreative III (2010), and the BioCreative-2012 Workshop (2012). The overall goal of the BioCreative Workshops is to promote the development of text mining and text processing tools that are useful to researchers and database curators in the biological sciences.

Our contribution to BioCreative III was to organize the Gene Normalization (GN) task. This included selecting documents to annotate, overseeing the annotation process, evaluating participants' submissions, writing a full description of the task, and presenting the results at the conference. We introduced two innovations in the task. First, we used the Tap-k measure developed by John Spouge and his group. Tap-k is a performance measure that can best be characterized as a truncated mean average precision with a penalty term for retrieving useless records below the last useful hit. Second, we introduced an EM algorithm to estimate the correct answers based on all participant entries for the task. The EM algorithm allowed us to evaluate participants' predictions over a much larger set of full text documents (507) than we could provide gold standard human judgments for. We also provided a set of 50 full text documents for which we had human annotations. When the results of the EM algorithm evaluation were compared with the gold standard results, the system rankings were quite close, which allowed us to conclude that the automatic method of assessment was successful in singling out the top performing systems.

We also entered the triage task for detecting papers suitable for curation of protein-protein interactions. For this task we used the priority model to identify gene/protein names and used parsing to derive dependency relations between proteins and other text elements. These relations, together with text words, were used as features. Machine learning was then applied to this representation, and we turned in the best performance on the task.

We recently helped organize the BioCreative-2012 Workshop, associated with the Biocuration 2012 Conference, and also participated in Task I, which involved producing a triage system for the CTD database. Our method was based on the approach used for the GN task of BioCreative III, but we identified genes, proteins, and diseases differently, using a semantic classifier, and we added features based on an LDA analysis of the CTD database. This approach obtained the best triage results on the task.

2) We recently co-chaired the BioCreative III Workshop, in which the main competitive tasks were: to find gene mentions in a full text article, map them to their GenBank identifiers, and score them as to reliability; to classify PubMed records as likely to represent articles containing information on protein-protein interactions; and to find the text in full papers that describes the method used by an experimenter to verify a protein-protein interaction. We organized the first of these tasks and participated in the second. In the second task we used the priority model to locate protein mentions, and it proved very successful and competitive with other approaches.

3) We are currently working to develop more general methods of finding high value articles for PPI based on their abstracts. This effort involves not only more powerful ranking methods but also ways to display evidence so that the user can evaluate an article quickly.
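
The characterization of Tap-k given under item 1), a truncated average precision with a penalty for useless records retrieved below the last useful hit, can be illustrated with a small sketch. This is a simplified illustration only and is not the published definition from John Spouge's group. The function name `tap_like_score`, the truncation rule (stop at the k-th irrelevant record), and the normalization by the number of relevant records plus one are all assumptions made for the example.

```python
from typing import Sequence


def tap_like_score(ranked_relevance: Sequence[bool],
                   total_relevant: int,
                   k: int) -> float:
    """Illustrative truncated average precision with a tail penalty.

    ranked_relevance: relevance flags of the retrieved records, best first.
    total_relevant:   number of relevant records that exist for the query.
    k:                the list is cut off at the k-th irrelevant record
                      (an assumption made for this sketch).
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")
    if k < 1:
        raise ValueError("k must be at least 1")

    hits = 0             # relevant records seen so far
    errors = 0           # irrelevant records seen so far
    seen = 0             # records examined before truncation
    precision_sum = 0.0  # sum of precision values at each relevant record

    for is_relevant in ranked_relevance:
        seen += 1
        if is_relevant:
            hits += 1
            precision_sum += hits / seen
        else:
            errors += 1
            if errors == k:
                break  # truncate the list here

    if seen == 0:
        return 0.0

    # Precision at the end of the (truncated) list. Irrelevant records that
    # follow the last relevant one lower this term, which acts as the penalty.
    final_precision = hits / seen
    return (precision_sum + final_precision) / (total_relevant + 1)


if __name__ == "__main__":
    clean = tap_like_score([True, False, True], total_relevant=2, k=5)
    padded = tap_like_score([True, False, True, False, False],
                            total_relevant=2, k=5)
    print(clean, padded)  # the padded list scores lower than the clean one
```

In this sketch, two result lists with the same useful hits in the same positions receive different scores when one of them continues with useless records, which is the behavior the penalty term is meant to capture.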

National Library of Medicine
Kim, Sun; Lu, Zhiyong; Wilbur, W John (2015) Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 16:57
Comeau, Donald C; Batista-Navarro, Riza Theresa; Dai, Hong-Jie et al. (2014) BioC interoperability track overview. Database (Oxford) 2014:
Islamaj Doğan, Rezarta; Comeau, Donald C; Yeganova, Lana et al. (2014) Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database (Oxford) 2014:
Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:
Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel et al. (2013) An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database (Oxford) 2013:bas056
Kim, Sun; Kim, Won; Wei, Chih-Hsuan et al. (2012) Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information. Database (Oxford) 2012:bas042
Kim, Sun; Kwon, Dongseop; Shin, Soo-Yong et al. (2012) PIE the search: searching PubMed literature for protein interaction information. Bioinformatics 28:597-8
Krallinger, Martin; Vazquez, Miguel; Leitner, Florian et al. (2011) The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics 12 Suppl 8:S3
Kim, Sun; Wilbur, W John (2011) Classifying protein-protein interaction articles using word and syntactic features. BMC Bioinformatics 12 Suppl 8:S9
Arighi, Cecilia N; Lu, Zhiyong; Krallinger, Martin et al. (2011) Overview of the BioCreative III Workshop. BMC Bioinformatics 12 Suppl 8:S1

Showing the most recent 10 out of 14 publications