1) I have been a co-organizer of the BioCreative Workshops since 2005 and have taken part in BioCreative II (2007), BioCreative III (2010), and the BioCreative-2012 Workshop (2012). The overall goal of the BioCreative Workshops is to promote the development of text mining and text processing tools that are useful to researchers and database curators in the biological sciences. We recently helped organize the BioCreative-2012 Workshop, held in association with the Biocuration 2012 Conference, and also participated in Task I, which involved producing a triage system for the Comparative Toxicogenomics Database (CTD). Our system built on the one we used for the gene normalization (GN) task of BioCreative III, but it identified genes, proteins, and diseases with a semantic classifier and added features derived from an LDA analysis of the CTD database. It obtained the best triage results on the task.
2) We recently co-chaired the BioCreative III Workshop, in which the main competitive tasks were: to find gene mentions in a full-text article, map them to their GenBank identifiers, and score them for reliability; to classify PubMed records as likely to represent articles containing information on protein-protein interactions (PPI); and to find the text in full papers that describes the method used by an experimenter to verify a protein-protein interaction. We organized the first of these tasks. Our contribution to it included selecting documents to annotate, overseeing the annotation process, evaluating participants' submissions, writing a full description of the task, and presenting the results at the conference. We also entered the triage task for detecting papers suitable for curation of protein-protein interactions. For this task we used the priority model to identify gene/protein names and used parsing to derive dependency relations between proteins and other text elements; these relations, together with the text words, served as features. Machine learning was then applied to this representation, and we turned in the best performance on the task.
3) We are currently working to develop more general methods of finding high-value articles for PPI based on their abstracts. This effort involves not only more powerful ranking methods but also ways to display the supporting evidence so that the user can evaluate it quickly.
4) We are also investigating an approach to named entity recognition for a large number of biologically important entity types.
5) We have begun a project called BioC, an effort to create a general XML format defined by a DTD, together with software to read and write this format in C++, Java, Python, and possibly other languages. The idea is to use this format as a common currency that makes software modules for natural language processing more interoperable. The project is in its beginning stage, but the DTD is already defined, the software to read and write the format exists in the languages mentioned, and significant NLP processing modules have been built using this approach. The approach is being featured in the BioCreative IV Workshop.
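To make the BioC idea in item 5 concrete, the following is a minimal sketch of writing and reading a BioC-style XML document in Python using only the standard library. It is not the project's own software. The element names (`collection`, `document`, `passage`, `annotation`, `infon`, `location`) follow our reading of the published BioC DTD and should be checked against it; the function names, the tuple layout, and the `source`, `date`, and `key` values are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET


def build_collection(doc_id, text, annotations):
    """Build a BioC-style <collection> with one document and one passage.

    annotations: list of (offset, length, entity_type) tuples.
    This tuple layout is an assumption made for the sketch.
    """
    collection = ET.Element("collection")
    # Placeholder metadata values, not taken from any real corpus.
    ET.SubElement(collection, "source").text = "example"
    ET.SubElement(collection, "date").text = "2013-01-01"
    ET.SubElement(collection, "key").text = "example.key"

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id

    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text

    for i, (offset, length, entity_type) in enumerate(annotations):
        ann = ET.SubElement(passage, "annotation", id=str(i))
        ET.SubElement(ann, "infon", key="type").text = entity_type
        ET.SubElement(ann, "location", offset=str(offset), length=str(length))
        # Store the covered text so a reader can verify the offsets.
        ET.SubElement(ann, "text").text = text[offset:offset + length]
    return collection


def to_xml(collection):
    """Serialize the element tree to a Unicode XML string."""
    return ET.tostring(collection, encoding="unicode")


def read_annotations(xml_string):
    """Return (doc_id, entity_type, offset, length, text) for each annotation."""
    root = ET.fromstring(xml_string)
    result = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for ann in document.iter("annotation"):
            location = ann.find("location")
            result.append((
                doc_id,
                ann.findtext("infon[@key='type']"),
                int(location.get("offset")),
                int(location.get("length")),
                ann.findtext("text"),
            ))
    return result


if __name__ == "__main__":
    sentence = "BRCA1 interacts with BARD1."
    xml_string = to_xml(build_collection("doc1", sentence, [(0, 5, "gene"), (21, 5, "gene")]))
    for row in read_annotations(xml_string):
        print(row)
```

Because one module only writes the XML and another only reads it, a gene tagger and a downstream PPI classifier could exchange data without sharing any in-memory types, which is the interoperability the format is meant to provide.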
Comeau, Donald C; Batista-Navarro, Riza Theresa; Dai, Hong-Jie et al. (2014) BioC interoperability track overview. Database (Oxford) 2014:
Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:
Islamaj Dogan, Rezarta; Comeau, Donald C; Yeganova, Lana et al. (2014) Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database (Oxford) 2014:
Smith, Larry H; Wilbur, W John (2009) The value of parsing as feature generation for gene mention recognition. J Biomed Inform 42:895-904
Rzhetsky, Andrey; Shatkay, Hagit; Wilbur, W John (2009) How to get the most out of your curation effort. PLoS Comput Biol 5:e1000391
Smith, Larry; Tanabe, Lorraine K; Ando, Rie Johnson nee et al. (2008) Overview of BioCreative II gene mention recognition. Genome Biol 9 Suppl 2:S2