1) I have been a co-organizer of the BioCreative Workshops since 2005 and have taken part in BioCreative II (2007), BioCreative III (2010), and the BioCreative-2012 Workshop (2012). The overall goal of the BioCreative Workshops is to promote the development of text mining and text processing tools that are useful to researchers and database curators in the biological sciences. We recently helped organize the BioCreative-2012 Workshop, held in association with the Biocuration 2012 Conference, and also participated in Task I, which involved producing a triage system for the CTD database. Our method was based on the approach used for the GN task of BioCreative III, but it identified genes, proteins, and diseases with a semantic classifier and added features derived from an LDA analysis of the CTD database. It obtained the best triage results on the task.

2) We recently co-chaired the BioCreative III Workshop, in which the main competitive tasks were to find gene mentions in a full-text article, map them to their GenBank identifiers, and score them for reliability; to classify PubMed records as likely to represent articles containing information on protein-protein interactions; and to find the text in full papers that describes the method used to experimentally verify a protein-protein interaction. We organized the first of these tasks. Our contributions included selecting documents to annotate, overseeing the annotation process, evaluating participants' submissions, writing a full description of the task, and presenting the results at the conference. We also entered the triage task for detecting papers suitable for curation of protein-protein interactions. For this task we used the priority model to identify gene/protein names and used parsing to derive dependency relations between proteins and other text elements; these relations, together with text words, served as features. Machine learning was then applied to this representation, and we turned in the best performance on the task.

3) We are currently working to develop more general methods of finding high-value articles for PPI based on their abstracts. This effort involves not only more powerful ranking methods but also ways to display evidence so that the user can evaluate an article quickly.

4) We are also investigating an approach to named entity recognition for a large number of biologically important entity types.

5) We have begun a project called BioC, an effort to create a general XML format defined by a DTD, together with software to read and write this format in C++, Java, Python, and possibly other languages. The idea is to use this common currency to make software modules that are useful for natural language processing more interoperable. The project is in its beginning stage, but we have already defined the DTD and have software to read and write the format in the languages mentioned, as well as significant NLP modules that use it. The approach is being featured in the BioCreative IV Workshop.
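
To illustrate the idea of a shared XML format read and written by independent modules, here is a minimal Python sketch using only the standard library. It is not the BioC software itself. The element names (`collection`, `document`, `passage`, `annotation`, `infon`, `location`), the sample text, and the helper names `read_annotations` and `add_annotation` are assumptions chosen for illustration; the actual DTD may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample collection; element names are illustrative assumptions,
# not a statement of the actual DTD.
SAMPLE = """<collection>
  <source>PubMed</source>
  <document>
    <id>12345</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 interacts with BARD1.</text>
      <annotation id="T1">
        <infon key="type">Gene</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_annotations(xml_text):
    """Return (document id, entity type, mention text, offset) tuples."""
    root = ET.fromstring(xml_text)
    found = []
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        for passage in doc.iter("passage"):
            for ann in passage.iter("annotation"):
                ann_type = ann.findtext("infon[@key='type']")
                loc = ann.find("location")
                offset = int(loc.get("offset")) if loc is not None else None
                found.append((doc_id, ann_type, ann.findtext("text"), offset))
    return found


def add_annotation(xml_text, ann_id, ann_type, offset, length):
    """Add one annotation to the first passage and return the new XML string."""
    root = ET.fromstring(xml_text)
    passage = root.find("./document/passage")
    passage_text = passage.findtext("text")
    base = int(passage.findtext("offset"))  # passage start within the document
    ann = ET.SubElement(passage, "annotation", {"id": ann_id})
    infon = ET.SubElement(ann, "infon", {"key": "type"})
    infon.text = ann_type
    ET.SubElement(ann, "location", {"offset": str(offset), "length": str(length)})
    mention = ET.SubElement(ann, "text")
    start = offset - base
    mention.text = passage_text[start:start + length]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # One module writes an annotation; another reads the result back.
    updated = add_annotation(SAMPLE, "T2", "Gene", 21, 5)
    for row in read_annotations(updated):
        print(row)
```

Because both functions exchange only the XML string, a tagger written in one language could hand its output to a downstream module written in another, which is the kind of interoperability the project aims for.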

National Library of Medicine
Kim, Sun; Lu, Zhiyong; Wilbur, W John (2015) Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 16:57
Comeau, Donald C; Batista-Navarro, Riza Theresa; Dai, Hong-Jie et al. (2014) BioC interoperability track overview. Database (Oxford) 2014:
Islamaj Dogan, Rezarta; Comeau, Donald C; Yeganova, Lana et al. (2014) Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database (Oxford) 2014:
Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:
Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel et al. (2013) An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database (Oxford) 2013:bas056
Kim, Sun; Kim, Won; Wei, Chih-Hsuan et al. (2012) Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information. Database (Oxford) 2012:bas042
Kim, Sun; Kwon, Dongseop; Shin, Soo-Yong et al. (2012) PIE the search: searching PubMed literature for protein interaction information. Bioinformatics 28:597-8
Krallinger, Martin; Vazquez, Miguel; Leitner, Florian et al. (2011) The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics 12 Suppl 8:S3
Kim, Sun; Wilbur, W John (2011) Classifying protein-protein interaction articles using word and syntactic features. BMC Bioinformatics 12 Suppl 8:S9
Arighi, Cecilia N; Lu, Zhiyong; Krallinger, Martin et al. (2011) Overview of the BioCreative III Workshop. BMC Bioinformatics 12 Suppl 8:S1
