We have begun developing a system to recognize gene or protein names in natural language text. The system currently consists of two modules. The first is a naive Bayes text classifier trained on over 130k documents that contain known gene names; it learns the difference between these documents and the remainder of the text in PubMed. The second is the Brill tagger, which we have modified to run on biological text. As an additional step, we have taught the tagger to assign a GENE tag to gene names consisting of a single word, and several hundred additional rules have been learned for this purpose. Several processing steps are then applied as filters after the tagger, for example to identify multi-term gene names. We are currently evaluating how well the system recognizes gene names in a test set of text. We plan to continue work on this system and to incorporate new approaches into it to improve it further.
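The first module's idea of learning the difference between gene-related documents and all other text can be illustrated with a minimal naive Bayes sketch. This is not the project's implementation: the function names (`train_naive_bayes`, `score`), the whitespace tokenization, the add-one smoothing, and the four toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_naive_bayes(pos_docs, neg_docs):
    """Learn per-token log-odds weights from two document collections.

    pos_docs: documents assumed to contain known gene names.
    neg_docs: documents standing in for the remainder of the text.
    """
    pos_counts = Counter(tok for doc in pos_docs for tok in doc.lower().split())
    neg_counts = Counter(tok for doc in neg_docs for tok in doc.lower().split())
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values())
    neg_total = sum(neg_counts.values())

    weights = {}
    for tok in vocab:
        # Add-one (Laplace) smoothing keeps unseen tokens from giving log(0).
        p_pos = (pos_counts[tok] + 1) / (pos_total + len(vocab))
        p_neg = (neg_counts[tok] + 1) / (neg_total + len(vocab))
        weights[tok] = math.log(p_pos / p_neg)

    # Prior log-odds from the relative sizes of the two collections.
    prior = math.log(len(pos_docs) / len(neg_docs))
    return weights, prior


def score(doc, weights, prior):
    """Return the log-odds that doc is gene-related; tokens never seen in training add 0."""
    return prior + sum(weights.get(tok, 0.0) for tok in doc.lower().split())


# Hypothetical toy data, not drawn from PubMed.
GENE_DOCS = [
    "brca1 mutation increases expression of the p53 gene",
    "the protein kinase gene is expressed in yeast",
]
OTHER_DOCS = [
    "patients received treatment in the hospital ward",
    "the survey measured patient satisfaction with care",
]

weights, prior = train_naive_bayes(GENE_DOCS, OTHER_DOCS)
```

A positive score from `score(text, weights, prior)` suggests the text resembles the gene-related collection, and a negative score suggests the opposite. In a real setting the token weights would be estimated from the full document sets rather than a handful of sentences.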

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000093-01
Application #: 6546821
Study Section: (CBB)
Project Start:
Project End:
Budget Start:
Budget End:
Support Year: 1
Fiscal Year: 2001
Total Cost:
Indirect Cost:

Name: National Library of Medicine
Department:
Type:
DUNS #:
City:
State:
Country: United States
Zip Code:

Yu, Hong; Kim, Won; Hatzivassiloglou, Vasileios et al. (2007) Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. J Biomed Inform 40:150-9
Tanabe, Lorraine; Thom, Lynne H; Matten, Wayne et al. (2006) SemCat: semantically categorized entities for genomics. AMIA Annu Symp Proc :754-8
Tanabe, Lorraine; Xie, Natalie; Thom, Lynne H et al. (2005) GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics 6 Suppl 1:S3
Yeganova, L; Smith, L; Wilbur, W J (2004) Identification of related gene/protein names based on an HMM of name variations. Comput Biol Chem 28:97-107
Tanabe, Lorraine; Wilbur, W John (2004) Generation of a large gene/protein lexicon by morphological pattern analysis. J Bioinform Comput Biol 1:611-26
Rzhetsky, Andrey; Iossifov, Ivan; Koike, Tomohiro et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform 37:43-53
Smith, L; Rindflesch, T; Wilbur, W J (2004) MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20:2320-1
Smith, L; Yeganova, L; Wilbur, W J (2003) Hidden Markov models and optimized sequence alignments. Comput Biol Chem 27:77-84
Yu, Hong; Hatzivassiloglou, Vasileios; Friedman, Carol et al. (2002) Automatic extraction of gene and protein synonyms from MEDLINE and journal articles. Proc AMIA Symp :919-23
Tanabe, Lorraine; Wilbur, W John (2002) Tagging gene and protein names in biomedical text. Bioinformatics 18:1124-32