Free Text Gene Name Recognition

Wilbur, Willy

Abstract

1) I have been a co-organizer of the BioCreative Workshops since 2005 and have taken part in BioCreative II (2007), BioCreative III (2010), the BioCreative-2012 Workshop (2012), and BioCreative IV (2013). The overall goal of the BioCreative Workshops is to promote the development of text mining and text processing tools that are useful to researchers and database curators in the biological sciences. We recently helped organize the BioCreative-2012 Workshop, held in association with the Biocuration 2012 Conference, and also participated in Task I, which involved producing a triage system for the CTD database. Our system was based on the approach used for the GN task of BioCreative III, but it identified genes, proteins, and diseases with a semantic classifier, and it added features based on an LDA analysis of the CTD database. It obtained the best triage results on the task.

2) We recently co-chaired the BioCreative III Workshop, which had three main competitive tasks: to find gene mentions in a full-text article, map them to their GenBank identifiers, and score them as to reliability; to classify PubMed records as likely to represent articles containing information on protein-protein interactions; and to find the text in full papers that describes the method used by an experimenter to verify a protein-protein interaction. We organized the first of these tasks. Our contribution included selecting documents to annotate, overseeing the annotation process, evaluating participants' submissions, writing up a full description of the task, and presenting results at the conference. We also entered the triage task for detecting papers suitable for curation of protein-protein interactions. For this task we used the priority model to identify gene/protein names and used parsing to derive dependency relations between proteins and other text elements; these relations, together with text words, served as features. Machine learning was then applied to this representation, and we turned in the best performance on the task.

3) We are currently working to develop more general methods of finding high-value articles for protein-protein interactions (PPI) based on their abstracts. This effort involves not only more powerful ranking methods, but also ways to display evidence so that a user can evaluate it quickly.

4) We are also investigating an approach to named entity recognition for a large number of biologically important entity types.

5) We have begun a project called BioC, an effort to create a general XML format defined by a DTD, together with software to read and write this format. Currently this approach has been implemented in C++, Java, Python, Perl, Ruby, and Go. The idea is to use this common currency to make software modules that are useful for natural language processing more interoperable. The project is in its early stages, but we already have software to read and write the format in the languages mentioned, significant NLP processing modules using this approach, and over 25 gold-standard NLP annotated data sets available in the format. The approach was featured in the BioCreative IV Workshop, and a proposal to feature it again in the upcoming BioCreative V Workshop has been developed.
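
As a rough illustration of the kind of XML exchange format described for BioC, the following Python sketch writes and reads a small BioC-style document using only the standard library. The element names (collection, document, passage, annotation, infon, location) are assumed from the published BioC DTD and should be checked against it. The helper functions build_collection and read_annotations, the document id, the source/date/key values, and the sample sentence are hypothetical and are not part of the project's own libraries.

```python
import xml.etree.ElementTree as ET


def build_collection(doc_id, text, mentions):
    """Build a BioC-style collection holding one document with one passage.

    mentions is a list of (offset, length, entity_type) tuples into text.
    Element names are assumed from the BioC DTD, not taken from this record.
    """
    collection = ET.Element("collection")
    ET.SubElement(collection, "source").text = "example"       # hypothetical
    ET.SubElement(collection, "date").text = "2014-01-01"      # hypothetical
    ET.SubElement(collection, "key").text = "example.key"      # hypothetical

    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id

    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text

    # Each mention becomes a stand-off annotation pointing into the passage.
    for i, (offset, length, entity_type) in enumerate(mentions):
        annotation = ET.SubElement(passage, "annotation", id=str(i))
        ET.SubElement(annotation, "infon", key="type").text = entity_type
        ET.SubElement(annotation, "location",
                      offset=str(offset), length=str(length))
        ET.SubElement(annotation, "text").text = text[offset:offset + length]
    return collection


def read_annotations(xml_string):
    """Return (document id, entity type, mention text) for every annotation."""
    root = ET.fromstring(xml_string)
    results = []
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for annotation in document.iter("annotation"):
            entity_type = annotation.findtext("infon[@key='type']")
            results.append((doc_id, entity_type, annotation.findtext("text")))
    return results


# Round trip: serialize a toy document, then read the annotations back.
sentence = "BRCA1 interacts with BARD1."
collection = build_collection("doc-1", sentence, [(0, 5, "Gene"), (21, 5, "Gene")])
xml_string = ET.tostring(collection, encoding="unicode")
print(read_annotations(xml_string))
```

Because annotations are stored as offsets alongside the passage text, a module written in one language can add annotations that a module in another language reads back without re-tokenizing the text, which is the interoperability the abstract has in mind.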

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Investigator-Initiated Intramural Research Projects (ZIA)
Project #: 1ZIALM000093-14
Application #: 8943225
Support Year: 14
Fiscal Year: 2014

Institution

Name: National Library of Medicine

Related projects


NIH 2015 ZIA LM	Free Text Gene Name Recognition Wilbur, Willy / National Library of Medicine
NIH 2014 ZIA LM	Free Text Gene Name Recognition Wilbur, Willy / National Library of Medicine
NIH 2013 ZIA LM	Free Text Gene Name Recognition Wilbur, Willy / National Library of Medicine	$369,833
NIH 2012 ZIA LM	Free Text Gene Name Recognition Wilbur, Willy / National Library of Medicine	$195,229
NIH 2011 ZIA LM	Free Text Gene Name Recognition Wilbur, Willy / National Library of Medicine	$179,884
NIH 2010 ZIA LM	Free Text Gene Name Recognition Wilbur, Willy / National Library of Medicine	$195,870
NIH 2009 ZIA LM	Free Text Gene Name Recognition Wilbur, Willy / National Library of Medicine	$221,141

Publications

Kim, Sun; Lu, Zhiyong; Wilbur, W John (2015) Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 16:57

Comeau, Donald C; Batista-Navarro, Riza Theresa; Dai, Hong-Jie et al. (2014) BioC interoperability track overview. Database (Oxford) 2014:

Islamaj Doğan, Rezarta; Comeau, Donald C; Yeganova, Lana et al. (2014) Finding abbreviations in biomedical literature: three BioC-compatible modules and four BioC-formatted corpora. Database (Oxford) 2014:

Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:

Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel et al. (2013) An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database (Oxford) 2013:bas056

Kim, Sun; Kim, Won; Wei, Chih-Hsuan et al. (2012) Prioritizing PubMed articles for the Comparative Toxicogenomic Database utilizing semantic information. Database (Oxford) 2012:bas042

Kim, Sun; Kwon, Dongseop; Shin, Soo-Yong et al. (2012) PIE the search: searching PubMed literature for protein interaction information. Bioinformatics 28:597-8

Krallinger, Martin; Vazquez, Miguel; Leitner, Florian et al. (2011) The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics 12 Suppl 8:S3

Kim, Sun; Wilbur, W John (2011) Classifying protein-protein interaction articles using word and syntactic features. BMC Bioinformatics 12 Suppl 8:S9

Arighi, Cecilia N; Lu, Zhiyong; Krallinger, Martin et al. (2011) Overview of the BioCreative III Workshop. BMC Bioinformatics 12 Suppl 8:S1

Showing the most recent 10 out of 14 publications
