Free Text Gene Name Recognition

Wilbur, Willy

Abstract

We have begun developing a system to recognize gene and protein names in natural language text. The system currently consists of two modules. The first is a naive Bayes text classifier trained on more than 130,000 documents that contain known gene names; these documents are contrasted with the remainder of the text in PubMed, and the classifier learns the features that distinguish the two sets. The second is the Brill tagger, which we have modified to run on biologically oriented text. We have also trained the tagger to assign a GENE tag to single-word gene names, learning several hundred additional rules for this purpose. Several post-processing filters are applied to the tagger output, for example to identify multi-word gene names. We are currently evaluating the system's performance in recognizing gene names in a test set of text. We plan to continue work on this system and to incorporate new approaches into the basic system to improve it further.
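The classifier module described above can be illustrated with a minimal sketch. This is not the project's implementation: it assumes a multinomial naive Bayes model with add-one smoothing and whitespace tokenization, and the function names (train_naive_bayes, score) and the four toy documents are hypothetical stand-ins for the 130,000 gene-name documents and the rest of PubMed.

```python
import math
from collections import Counter


def train_naive_bayes(pos_docs, neg_docs, alpha=1.0):
    """Return (prior log-odds, per-term log-odds weights).

    pos_docs: texts known to contain gene names (assumed toy data).
    neg_docs: background texts standing in for the rest of the collection.
    alpha: additive smoothing constant (an assumption, not from the source).
    """
    pos_counts = Counter(t for d in pos_docs for t in d.lower().split())
    neg_counts = Counter(t for d in neg_docs for t in d.lower().split())
    vocab = set(pos_counts) | set(neg_counts)

    # Smoothed token totals for each class.
    pos_total = sum(pos_counts.values()) + alpha * len(vocab)
    neg_total = sum(neg_counts.values()) + alpha * len(vocab)

    # Each weight is log P(term | gene docs) - log P(term | background).
    weights = {
        t: math.log((pos_counts[t] + alpha) / pos_total)
        - math.log((neg_counts[t] + alpha) / neg_total)
        for t in vocab
    }
    prior = math.log(len(pos_docs) / len(neg_docs))
    return prior, weights


def score(text, prior, weights):
    """Sum the log-odds of known terms; terms outside the vocabulary are ignored."""
    return prior + sum(weights.get(t, 0.0) for t in text.lower().split())


# Hypothetical toy corpora for illustration only.
gene_docs = [
    "the brca1 gene encodes a protein",
    "expression of the p53 gene was reduced",
]
other_docs = [
    "patients received the drug daily",
    "the survey covered rural clinics",
]

prior, weights = train_naive_bayes(gene_docs, other_docs)
gene_like = score("mutation in the brca1 gene", prior, weights)
background_like = score("patients received the drug", prior, weights)
```

A positive score suggests the text resembles the gene-name documents, and a negative score suggests it resembles the background. In a pipeline like the one described, such a score could help decide which text is worth passing to the tagger, though the abstract does not say how the two modules are combined.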

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000093-01
Application #: 6546821
Study Section: (CBB)

Project Start
Project End
Budget Start
Budget End
Support Year: 1
Fiscal Year: 2001
Total Cost
Indirect Cost

Institution

Name: National Library of Medicine
Department
Type
DUNS #

City
State
Country: United States
Zip Code

Related projects


NIH 2008 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine	$224,159
NIH 2007 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine	$194,193
NIH 2006 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine
NIH 2005 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine
NIH 2004 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine
NIH 2003 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine
NIH 2002 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine
NIH 2001 Z01 LM	Free Text Gene Name Recognition Wilbur, Willy John / National Library of Medicine

Publications

Yu, Hong; Kim, Won; Hatzivassiloglou, Vasileios et al. (2007) Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. J Biomed Inform 40:150-9

Tanabe, Lorraine; Thom, Lynne H; Matten, Wayne et al. (2006) SemCat: semantically categorized entities for genomics. AMIA Annu Symp Proc :754-8

Tanabe, Lorraine; Xie, Natalie; Thom, Lynne H et al. (2005) GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics 6 Suppl 1:S3

Yeganova, L; Smith, L; Wilbur, W J (2004) Identification of related gene/protein names based on an HMM of name variations. Comput Biol Chem 28:97-107

Tanabe, Lorraine; Wilbur, W John (2004) Generation of a large gene/protein lexicon by morphological pattern analysis. J Bioinform Comput Biol 1:611-26

Rzhetsky, Andrey; Iossifov, Ivan; Koike, Tomohiro et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform 37:43-53

Smith, L; Rindflesch, T; Wilbur, W J (2004) MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20:2320-1

Smith, L; Yeganova, L; Wilbur, W J (2003) Hidden Markov models and optimized sequence alignments. Comput Biol Chem 27:77-84

Yu, Hong; Hatzivassiloglou, Vasileios; Friedman, Carol et al. (2002) Automatic extraction of gene and protein synonyms from MEDLINE and journal articles. Proc AMIA Symp :919-23

Tanabe, Lorraine; Wilbur, W John (2002) Tagging gene and protein names in biomedical text. Bioinformatics 18:1124-32
