Free Text Gene Name Recognition

Wilbur, Willy

Abstract

We are currently pursuing two projects aimed at the problem of gene/protein name recognition.

1) We have produced a set of 20,000 sentences in which every occurrence of a gene/protein name is marked with the word offsets of the name's beginning and end in the sentence. The sentences were drawn as random samples from restricted classes of MEDLINE abstracts: half were chosen as likely to contain gene/protein names and half as unlikely to contain them. Because marking names involves ambiguity, alternative markings are listed as correct answers where this is thought to be appropriate. Three fourths of these names formed the basis for a task in the recent BioCreAtIvE (Critical Assessment of Information Extraction in Biology) Workshop held in Granada, Spain, this year. Twelve teams attempted to design systems that could correctly tag the gene/protein names in the sentences. Several teams obtained precision and recall in the low 80% range. A number of different approaches were successful, and these results suggest ways in which we can improve ABGene.

2) We have become convinced that more information about the different types of entities that can occur in MEDLINE sentences can be used to improve name recognition. This has led us to design a set of semantic categories and to attempt to fill these categories with actual names harvested from databases and web sites. We call the result SEMCAT. It currently recognizes seventy-five categories and contains about four million name strings distributed over those categories. We are experimenting with probabilistic context-free grammars and Markov models of text strings in an attempt to learn how to recognize the entities in different categories.
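
To illustrate what a Markov model of text strings could look like for category recognition, here is a minimal sketch. It is not the project's implementation: it assumes a first-order, character-level model with add-one smoothing, one model per category, and it assigns a name to the category whose model gives it the highest log-probability. The category labels (`gene_protein`, `disease`), the handful of training names, the smoothing vocabulary size, and the `CharMarkovModel` and `classify` names are all hypothetical choices made for the example.

```python
import math
from collections import defaultdict

# Boundary markers so the model also learns how names start and end.
START, END = "^", "$"


class CharMarkovModel:
    """First-order character Markov model with add-one smoothing.

    Hypothetical helper for illustration only.
    """

    def __init__(self, names, alphabet_size=128):
        # alphabet_size is an assumed vocabulary size used for smoothing.
        self.alphabet_size = alphabet_size
        self.pair_counts = defaultdict(int)     # (previous char, char) -> count
        self.context_counts = defaultdict(int)  # previous char -> count
        for name in names:
            chars = START + name.lower() + END
            for prev, cur in zip(chars, chars[1:]):
                self.pair_counts[(prev, cur)] += 1
                self.context_counts[prev] += 1

    def log_prob(self, name):
        """Return the smoothed log-probability of a name under this model."""
        chars = START + name.lower() + END
        total = 0.0
        for prev, cur in zip(chars, chars[1:]):
            numerator = self.pair_counts.get((prev, cur), 0) + 1
            denominator = self.context_counts.get(prev, 0) + self.alphabet_size
            total += math.log(numerator / denominator)
        return total


def classify(name, models):
    """Pick the category whose model scores the name highest."""
    return max(models, key=lambda category: models[category].log_prob(name))


# Toy training lists; a real resource would hold far more names per category.
training = {
    "gene_protein": ["BRCA1", "TP53", "p53", "CDK2", "IL-2", "STAT3", "MAPK1"],
    "disease": ["diabetes", "hepatitis", "asthma", "leukemia", "arthritis"],
}
models = {category: CharMarkovModel(names) for category, names in training.items()}

print(classify("CDK4", models))
print(classify("bronchitis", models))
```

With name lists this small the scores are dominated by smoothing, so the sketch only shows the mechanics; a higher-order model or a different smoothing scheme may well behave better on real data.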

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000093-04
Application #: 6988466
Study Section: (CBB)

Support Year: 4
Fiscal Year: 2004

Institution

Name: National Library of Medicine
Country: United States

Related projects


NIH 2008 Z01 LM	Free Text Gene Name Recognition	Wilbur, Willy John / National Library of Medicine	$224,159
NIH 2007 Z01 LM	Free Text Gene Name Recognition	Wilbur, Willy John / National Library of Medicine	$194,193
NIH 2006 Z01 LM	Free Text Gene Name Recognition	Wilbur, Willy John / National Library of Medicine
NIH 2005 Z01 LM	Free Text Gene Name Recognition	Wilbur, Willy John / National Library of Medicine
NIH 2004 Z01 LM	Free Text Gene Name Recognition	Wilbur, Willy John / National Library of Medicine
NIH 2003 Z01 LM	Free Text Gene Name Recognition	Wilbur, Willy John / National Library of Medicine
NIH 2002 Z01 LM	Free Text Gene Name Recognition	Wilbur, Willy John / National Library of Medicine
NIH 2001 Z01 LM	Free Text Gene Name Recognition	Wilbur, Willy John / National Library of Medicine

Publications

Yu, Hong; Kim, Won; Hatzivassiloglou, Vasileios et al. (2007) Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. J Biomed Inform 40:150-9

Tanabe, Lorraine; Thom, Lynne H; Matten, Wayne et al. (2006) SemCat: semantically categorized entities for genomics. AMIA Annu Symp Proc :754-8

Tanabe, Lorraine; Xie, Natalie; Thom, Lynne H et al. (2005) GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics 6 Suppl 1:S3

Smith, L; Rindflesch, T; Wilbur, W J (2004) MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20:2320-1

Yeganova, L; Smith, L; Wilbur, W J (2004) Identification of related gene/protein names based on an HMM of name variations. Comput Biol Chem 28:97-107

Tanabe, Lorraine; Wilbur, W John (2004) Generation of a large gene/protein lexicon by morphological pattern analysis. J Bioinform Comput Biol 1:611-26

Rzhetsky, Andrey; Iossifov, Ivan; Koike, Tomohiro et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform 37:43-53

Smith, L; Yeganova, L; Wilbur, W J (2003) Hidden Markov models and optimized sequence alignments. Comput Biol Chem 27:77-84

Yu, Hong; Hatzivassiloglou, Vasileios; Friedman, Carol et al. (2002) Automatic extraction of gene and protein synonyms from MEDLINE and journal articles. Proc AMIA Symp :919-23

Tanabe, Lorraine; Wilbur, W John (2002) Tagging gene and protein names in biomedical text. Bioinformatics 18:1124-32
