Free Text Gene Name Recognition

Wilbur, Willy

Abstract

Currently we are pursuing two projects designed to make progress on the problem of gene/protein name recognition:

1) We have produced a set of 20,000 sentences in which all occurrences of gene/protein names are marked up with the character offsets of the beginning and end of each name in the sentence. The sentences were taken as random samples from restricted classes of MEDLINE abstracts. Half were chosen as likely to contain gene/protein names and half as unlikely to contain them. Since there is ambiguity in marking names, alternative markings are listed as correct answers where this is thought to be appropriate. Three fourths of these sentences formed the basis for a task in the BioCreAtIvE1 (Critical Assessment of Information Extraction in Biology) Workshop held in Granada, Spain in 2004. Twelve teams attempted to design systems that could correctly tag the gene/protein names in the sentences. Several teams obtained precisions and recalls in the low 80% range. A number of different approaches were successful, and these results suggest ways in which gene/protein name tagging may be improved. The 20,000 sentences forming the basis of this work have been re-edited and a number of errors corrected. The 15,000 sentences which formed the basis of BioCreAtIvE1 are currently being used for the training phase of BioCreAtIvE2, and the last 5,000 sentences, which have never been released, will form the testing material for BioCreAtIvE2, which is planned for early 2007.

2) We have become convinced that more information about the different types of entities that can occur in sentences in MEDLINE can be used to improve name recognition. This has led us to design a set of semantic categories and to attempt to fill these categories with actual names that can be harvested from databases and from web sites. We call the result SEMCAT. It currently recognizes seventy-five categories and contains about five million name strings distributed over those categories.
We have experimented with probabilistic context free grammars and Markov models of text strings in an attempt to learn how to recognize the entities in different categories. In order to improve performance we have developed a new model, termed a Priority Model, for name recognition. This model allows us to categorize names as gene/protein names with an F-score of 0.96, better than what we were able to achieve with either a language model or a probabilistic context free grammar. We are currently using this model to create features for use in a conditional random fields approach to gene/protein name recognition and are achieving an F-score of about 0.83.
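
The abstract reports precision, recall, and F-score for name tagging against a corpus in which each name is marked by character offsets and may have alternative acceptable markings. The following is a minimal sketch of how such scoring could work, not the workshop's actual evaluation procedure. The half-open `(start, end)` offset convention, the function names `score_sentence` and `prf`, and the example sentence are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Assumed convention: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def score_sentence(predicted: List[Span],
                   gold: List[Set[Span]]) -> Tuple[int, int, int]:
    """Count true positives, false positives, and false negatives.

    Each gold name is a set of acceptable spans (the primary marking plus
    any alternatives). A predicted span counts as correct if it equals any
    acceptable span of a gold name that has not been matched already.
    """
    matched = set()
    tp = 0
    for span in predicted:
        for i, acceptable in enumerate(gold):
            if i not in matched and span in acceptable:
                matched.add(i)
                tp += 1
                break
    fp = len(predicted) - tp   # predictions matching no gold name
    fn = len(gold) - tp        # gold names that were never matched
    return tp, fp, fn


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and balanced F-score from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical sentence: "The p53 protein binds MDM2."
# "p53" or "p53 protein" are both accepted for the first name.
gold = [{(4, 7), (4, 15)}, {(22, 26)}]
predicted = [(4, 15), (0, 3)]   # one correct span, one spurious span

counts = score_sentence(predicted, gold)
print(counts, prf(*counts))
```

In this sketch a prediction must match an acceptable span exactly; partial overlaps earn no credit, which is one possible design choice among several.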

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000093-07
Application #: 7594470
Support Year: 7
Fiscal Year: 2007
Total Cost: $194,193

Institution

Name: National Library of Medicine
Country: United States

Related projects


NIH 2008 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine	$224,159
NIH 2007 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine	$194,193
NIH 2006 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine
NIH 2005 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine
NIH 2004 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine
NIH 2003 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine
NIH 2002 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine
NIH 2001 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine

Publications

Yu, Hong; Kim, Won; Hatzivassiloglou, Vasileios et al. (2007) Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. J Biomed Inform 40:150-9

Tanabe, Lorraine; Thom, Lynne H; Matten, Wayne et al. (2006) SemCat: semantically categorized entities for genomics. AMIA Annu Symp Proc :754-8

Tanabe, Lorraine; Xie, Natalie; Thom, Lynne H et al. (2005) GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics 6 Suppl 1:S3

Smith, L; Rindflesch, T; Wilbur, W J (2004) MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20:2320-1

Yeganova, L; Smith, L; Wilbur, W J (2004) Identification of related gene/protein names based on an HMM of name variations. Comput Biol Chem 28:97-107

Tanabe, Lorraine; Wilbur, W John (2004) Generation of a large gene/protein lexicon by morphological pattern analysis. J Bioinform Comput Biol 1:611-26

Rzhetsky, Andrey; Iossifov, Ivan; Koike, Tomohiro et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform 37:43-53

Smith, L; Yeganova, L; Wilbur, W J (2003) Hidden Markov models and optimized sequence alignments. Comput Biol Chem 27:77-84

Yu, Hong; Hatzivassiloglou, Vasileios; Friedman, Carol et al. (2002) Automatic extraction of gene and protein synonyms from MEDLINE and journal articles. Proc AMIA Symp :919-23

Tanabe, Lorraine; Wilbur, W John (2002) Tagging gene and protein names in biomedical text. Bioinformatics 18:1124-32
