1) I have been a co-organizer of the BioCreative Workshops since 2005 and have taken part in BioCreative II (2007), BioCreative III (2010), the BioCreative-2012 Workshop (2012), and BioCreative IV (2013). The overall goal of the BioCreative Workshops is to promote the development of text mining and text processing tools that are useful to researchers and database curators in the biological sciences. We recently helped organize the BioCreative-2012 Workshop, held in association with the Biocuration 2012 Conference, and also participated in Task I, which involved producing a triage system for the Comparative Toxicogenomics Database (CTD). Our system built on the one we used for the gene normalization (GN) task of BioCreative III, but it identified genes, proteins, and diseases with a semantic classifier and added features derived from an LDA analysis of the CTD. It obtained the best triage results on the task.
2) We recently co-chaired the BioCreative III Workshop, in which the main competitive tasks were: to find gene mentions in a full-text article, map them to their GenBank identifiers, and score them as to reliability; to classify PubMed records as likely to represent articles containing information on protein-protein interactions (PPI); and to find the text in full papers that describes the method an experimenter used to verify a protein-protein interaction. We organized the first of these tasks. Our contribution included selecting documents to annotate, overseeing the annotation process, evaluating participants' submissions, writing up a full description of the task, and presenting results at the conference. We also entered the triage task for detecting papers suitable for curation of protein-protein interactions.
For this task we used the priority model to identify gene/protein names and used parsing to derive dependency relations between proteins and other text elements; these relations, together with text words, served as features. Machine learning was then applied to this representation, and we turned in the best performance on the task.
3) We are currently working to develop more general methods of finding high-value articles for PPI based on their abstracts. This effort involves not only more powerful ranking methods, but also ways to display evidence to the user for quick evaluation.
4) We are also investigating an approach to named entity recognition for a large number of biologically important entity types.
5) We have begun a project called BioC, an effort to create a general XML format defined by a DTD, together with software to read and write this format. Currently this approach has been implemented in C++, Java, Python, Perl, Ruby, and Go. The idea is to use this common currency to make software modules that are useful for natural language processing (NLP) more interoperable. The project is in its early stages, but we already have software to read and write the format in the languages mentioned, significant NLP processing modules using this approach, and over 25 gold-standard annotated NLP data sets available in the format. The approach was featured in the BioCreative IV Workshop, and a proposal to feature it again in the upcoming BioCreative V Workshop has been developed.
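As a rough illustration of the read/write round trip that a shared XML format like BioC makes possible, the following is a minimal sketch using only the Python standard library. It is not the BioC project's own software. The element names (collection, document, passage, annotation, infon, location) reflect our understanding of the BioC DTD and should be checked against the published DTD. The function names, document id, source/date/key values, and annotation ids are hypothetical and chosen only for the example.

```python
import xml.etree.ElementTree as ET


def build_collection(doc_id, passage_text, annotations):
    """Build a small BioC-style collection (illustrative, not DTD-validated).

    annotations: list of (ann_id, entity_type, offset, length) tuples;
    this tuple layout is an assumption made for the example.
    """
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "example"        # placeholder value
    ET.SubElement(collection, "date").text = "2015-01-01"       # placeholder value
    ET.SubElement(collection, "key").text = "example.key"       # placeholder value
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = passage_text
    for ann_id, entity_type, offset, length in annotations:
        ann = ET.SubElement(passage, "annotation", id=ann_id)
        ET.SubElement(ann, "infon", key="type").text = entity_type
        ET.SubElement(ann, "location", offset=str(offset), length=str(length))
        # Store the annotated surface string so readers need not re-slice the text.
        ET.SubElement(ann, "text").text = passage_text[offset:offset + length]
    return collection


def read_annotations(xml_string):
    """Return (document id, annotation type, annotated text) for every annotation."""
    root = ET.fromstring(xml_string)
    found = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for ann in document.iter("annotation"):
            ann_type = next(
                (i.text for i in ann.findall("infon") if i.get("key") == "type"),
                None,
            )
            found.append((doc_id, ann_type, ann.findtext("text")))
    return found


# Write a collection to a string, then read it back as another module might.
text = "BRCA1 interacts with BARD1."
collection = build_collection(
    "12345", text, [("T1", "gene", 0, 5), ("T2", "gene", 21, 5)]
)
xml_string = ET.tostring(collection, encoding="unicode")
print(read_annotations(xml_string))
```

Because the writer and the reader in this sketch agree only on the XML structure, either side could in principle be replaced by a module written in another language, which is the kind of interoperability the project aims for.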
Kim, Sun; Lu, Zhiyong; Wilbur, W John (2015) Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 16:57
Comeau, Donald C; Batista-Navarro, Riza Theresa; Dai, Hong-Jie et al. (2014) BioC interoperability track overview. Database (Oxford) 2014:
Islamaj Doğan, Rezarta; Comeau, Donald C; Yeganova, Lana et al. (2014) Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database (Oxford) 2014:
Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:
Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel et al. (2013) An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database (Oxford) 2013:bas056
Kim, Sun; Kim, Won; Wei, Chih-Hsuan et al. (2012) Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information. Database (Oxford) 2012:bas042
Kim, Sun; Kwon, Dongseop; Shin, Soo-Yong et al. (2012) PIE the search: searching PubMed literature for protein interaction information. Bioinformatics 28:597-8
Kim, Sun; Wilbur, W John (2011) Classifying protein-protein interaction articles using word and syntactic features. BMC Bioinformatics 12 Suppl 8:S9
Arighi, Cecilia N; Lu, Zhiyong; Krallinger, Martin et al. (2011) Overview of the BioCreative III Workshop. BMC Bioinformatics 12 Suppl 8:S1
Lu, Zhiyong; Kao, Hung-Yu; Wei, Chih-Hsuan et al. (2011) The gene normalization task in BioCreative III. BMC Bioinformatics 12 Suppl 8:S2