We have begun developing a system to recognize gene or protein names in natural language text. The system currently consists of two modules. The first is a naive Bayes text classifier trained on over 500,000 documents that contain known gene names; these documents are compared with the remainder of the text in PubMed, and the classifier learns the difference between the two sets. The second is the Brill tagger, which we have modified to run on biologically oriented text. We have additionally trained the tagger to assign a GENE tag to gene names consisting of a single word, and several hundred additional rules have been learned for this purpose. After tagging, several processing steps are applied as filters to identify multi-term gene names, among other cases. We are currently evaluating how well the system recognizes gene names in a test set of text. We plan to continue work on this system and to incorporate new approaches into it to improve its performance further.
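The first module's idea, learning what distinguishes gene-name documents from the rest of PubMed, can be illustrated with a minimal sketch. This is not the project's implementation: the function names `train_naive_bayes` and `score`, the add-one smoothing, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def train_naive_bayes(pos_docs, neg_docs):
    """Learn per-word log-odds weights from two sets of tokenized documents.

    pos_docs: documents known to contain gene names (hypothetical input).
    neg_docs: documents drawn from the remaining text (hypothetical input).
    """
    pos_counts = Counter(w for doc in pos_docs for w in doc)
    neg_counts = Counter(w for doc in neg_docs for w in doc)
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values())
    neg_total = sum(neg_counts.values())
    v = len(vocab)
    # Add-one smoothing (an assumption) keeps unseen words from producing log(0).
    weights = {
        w: math.log((pos_counts[w] + 1) / (pos_total + v))
        - math.log((neg_counts[w] + 1) / (neg_total + v))
        for w in vocab
    }
    # Log ratio of the class priors, estimated from document counts.
    prior = math.log(len(pos_docs) / len(neg_docs))
    return weights, prior


def score(doc, weights, prior):
    """Return the log-odds that a tokenized document is gene-related.

    Words outside the training vocabulary contribute nothing.
    """
    return prior + sum(weights.get(w, 0.0) for w in doc)


# Toy data, invented purely for illustration.
gene_docs = [["brca1", "mutation", "gene"], ["p53", "gene", "expression"]]
other_docs = [["patient", "surgery", "outcome"], ["clinical", "trial", "outcome"]]

weights, prior = train_naive_bayes(gene_docs, other_docs)
print(score(["gene", "expression"], weights, prior))
print(score(["patient", "outcome"], weights, prior))
```

A positive score means the words in a document are, on balance, more typical of the gene-name set than of the comparison set; a negative score means the opposite.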
Yu, Hong; Kim, Won; Hatzivassiloglou, Vasileios et al. (2007) Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. J Biomed Inform 40:150-9
Tanabe, Lorraine; Thom, Lynne H; Matten, Wayne et al. (2006) SemCat: semantically categorized entities for genomics. AMIA Annu Symp Proc :754-8
Tanabe, Lorraine; Xie, Natalie; Thom, Lynne H et al. (2005) GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics 6 Suppl 1:S3
Smith, L; Rindflesch, T; Wilbur, W J (2004) MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20:2320-1
Yeganova, L; Smith, L; Wilbur, W J (2004) Identification of related gene/protein names based on an HMM of name variations. Comput Biol Chem 28:97-107
Tanabe, Lorraine; Wilbur, W John (2004) Generation of a large gene/protein lexicon by morphological pattern analysis. J Bioinform Comput Biol 1:611-26
Rzhetsky, Andrey; Iossifov, Ivan; Koike, Tomohiro et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform 37:43-53
Smith, L; Yeganova, L; Wilbur, W J (2003) Hidden Markov models and optimized sequence alignments. Comput Biol Chem 27:77-84
Yu, Hong; Hatzivassiloglou, Vasileios; Friedman, Carol et al. (2002) Automatic extraction of gene and protein synonyms from MEDLINE and journal articles. Proc AMIA Symp :919-23
Tanabe, Lorraine; Wilbur, W John (2002) Tagging gene and protein names in biomedical text. Bioinformatics 18:1124-32