1) I have been a co-organizer of the BioCreative Workshops since 2005 and have taken part in BioCreative II (2007), BioCreative III (2010), the BioCreative-2012 Workshop (2012), and BioCreative IV (2013). My group is taking part in BioCreative V (2015), which has not yet taken place. The overall goal of the BioCreative Workshops is to promote the development of text mining and text processing tools that are useful to researchers and database curators in the biological sciences.

2) We are working to develop more general methods of finding high-value articles for protein-protein interactions (PPI) based on their abstracts. This effort involves not only more powerful ranking methods but also ways to display evidence so that the user can evaluate it quickly.

3) We are also investigating an approach to named entity recognition for a large number of biologically important entity types. We have found certain general patterns that can be used to find genes and other entity types with higher reliability than a general conditional random field (CRF) achieves. This is ongoing research that promises to yield more useful general patterns.

4) We have begun a project called BioC, an effort to create a general XML format, defined by a DTD, together with software to read and write this format. The approach has so far been implemented in C++, Java, Python, Perl, Ruby, and Go. The idea is to use this common currency to make software modules for natural language processing more interoperable. The project is in its early stages, but we already have software to read and write the format in the languages mentioned, significant NLP processing modules that use it, and over 25 gold-standard annotated NLP data sets available in the format. The approach was featured in the BioCreative IV Workshop and forms the basis of the BioC Collaborative Track at BioCreative V, which will take place shortly. This track has received contributions from eight teams besides our own and has built a user interface that displays annotated articles to BioGRID curators to assist them in their work.
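The read/write round trip that BioC is meant to standardize can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not the project's own library: the element names (collection, document, passage, annotation, infon, location) are assumed here to follow the general shape of the BioC DTD, the helper names build_collection and read_annotations are hypothetical, and the sample sentence and offsets are made up for the example.

```python
import xml.etree.ElementTree as ET


def build_collection(doc_id, text, annotations):
    """Build a BioC-style collection holding one document with one passage.

    `annotations` is a list of (entity_type, offset, length) tuples.
    Element names are an assumption about the BioC layout, not a spec.
    """
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "example"
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text
    for i, (entity_type, offset, length) in enumerate(annotations):
        ann = ET.SubElement(passage, "annotation", id=str(i))
        # Key/value metadata such as the entity type lives in <infon> elements.
        ET.SubElement(ann, "infon", key="type").text = entity_type
        ET.SubElement(ann, "location", offset=str(offset), length=str(length))
        ET.SubElement(ann, "text").text = text[offset:offset + length]
    return collection


def read_annotations(xml_string):
    """Return (document id, entity type, annotated text) for each annotation."""
    root = ET.fromstring(xml_string)
    results = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for ann in document.iter("annotation"):
            entity_type = ann.findtext("infon[@key='type']")
            results.append((doc_id, entity_type, ann.findtext("text")))
    return results


# Made-up sample data: one sentence with two gene mentions.
sample_text = "BRCA1 interacts with BARD1."
collection = build_collection(
    "doc-1", sample_text, [("Gene", 0, 5), ("Gene", 21, 5)]
)
xml_string = ET.tostring(collection, encoding="unicode")
parsed = read_annotations(xml_string)
```

Because every module reads and writes the same structure, a tool written in one language can consume annotations produced by a tool written in another, which is the interoperability the project aims for.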

Support Year: 15
Fiscal Year: 2015
Name: National Library of Medicine
Kim, Sun; Lu, Zhiyong; Wilbur, W John (2015) Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 16:57
Comeau, Donald C; Batista-Navarro, Riza Theresa; Dai, Hong-Jie et al. (2014) BioC interoperability track overview. Database (Oxford) 2014:
Islamaj Doğan, Rezarta; Comeau, Donald C; Yeganova, Lana et al. (2014) Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database (Oxford) 2014:
Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:
Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel et al. (2013) An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database (Oxford) 2013:bas056
Kim, Sun; Kim, Won; Wei, Chih-Hsuan et al. (2012) Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information. Database (Oxford) 2012:bas042
Kim, Sun; Kwon, Dongseop; Shin, Soo-Yong et al. (2012) PIE the search: searching PubMed literature for protein interaction information. Bioinformatics 28:597-8
Kim, Sun; Wilbur, W John (2011) Classifying protein-protein interaction articles using word and syntactic features. BMC Bioinformatics 12 Suppl 8:S9
Arighi, Cecilia N; Lu, Zhiyong; Krallinger, Martin et al. (2011) Overview of the BioCreative III Workshop. BMC Bioinformatics 12 Suppl 8:S1
Lu, Zhiyong; Kao, Hung-Yu; Wei, Chih-Hsuan et al. (2011) The gene normalization task in BioCreative III. BMC Bioinformatics 12 Suppl 8:S2
