We are currently pursuing two projects designed to make progress on the problem of gene/protein name recognition:

1) We have produced a set of 20,000 sentences in which every occurrence of a gene/protein name is marked with the character offsets of the name's beginning and end in the sentence. The sentences were drawn as random samples from restricted classes of MEDLINE abstracts: half were chosen as likely to contain gene/protein names and half as unlikely to contain them. Because marking names involves some ambiguity, alternative markings are listed as correct answers where this is thought to be appropriate. Three fourths of these sentences formed the basis for a task in the BioCreAtIvE1 (Critical Assessment of Information Extraction in Biology) Workshop held in Granada, Spain in 2004. Twelve teams attempted to design systems that could correctly tag the gene/protein names in the sentences. Several teams obtained precision and recall in the low 80% range. A number of different approaches were successful, and these results suggest ways in which gene/protein name tagging can be improved. The 20,000 sentences forming the basis of this work have since been re-edited and a number of errors corrected. The 15,000 sentences that formed the basis of BioCreAtIvE1 are currently being used for the training phase of BioCreAtIvE2, and the last 5,000 sentences, which have never been released, will form the testing material for BioCreAtIvE2, which is planned for early 2007.

2) We have become convinced that more information about the different types of entities that can occur in MEDLINE sentences can be used to improve name recognition. This has led us to design a set of semantic categories and to attempt to fill these categories with actual names harvested from databases and web sites. We call the result SEMCAT. It currently recognizes seventy-five categories and contains about five million name strings distributed over those categories. We have experimented with probabilistic context-free grammars and Markov models of text strings in an attempt to learn how to recognize the entities in different categories. To improve performance, we have developed a new model, termed a Priority Model, for name recognition. This model allows us to categorize names as gene/protein names with an F-score of 0.96, which is better than what we were able to achieve with either a language model or a probabilistic context-free grammar. We are currently using this model to create features for use in a conditional random fields approach to gene/protein name recognition, and are achieving an F-score of about 0.83.
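To make the evaluation terms above concrete, the following is a minimal sketch of how tagged names might be scored against a corpus that stores character offsets and accepts alternative markings. It is not the BioCreAtIvE scoring software. The half-open offset convention, the example sentence, and the function names `count_matches` and `precision_recall_f` are all assumptions introduced for illustration.

```python
from typing import List, Set, Tuple

# (start, end) character offsets into the sentence, end exclusive.
# This offset convention is an assumption made for the sketch.
Span = Tuple[int, int]


def count_matches(gold: List[Set[Span]], predicted: Set[Span]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives).

    Each element of `gold` is the set of acceptable markings for one name:
    the primary marking plus any alternatives listed as correct.
    A gold name counts as found if any one of its markings was predicted.
    """
    used: Set[Span] = set()
    for alternatives in gold:
        for span in sorted(alternatives):
            if span in predicted and span not in used:
                used.add(span)  # each prediction can satisfy only one gold name
                break
    tp = len(used)
    fp = len(predicted) - tp   # predictions matching no gold marking
    fn = len(gold) - tp        # gold names with no matching prediction
    return tp, fp, fn


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and their harmonic mean (the balanced F-score)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score


# Hypothetical example sentence, not taken from the corpus.
sentence = "The p53 tumor suppressor binds MDM2 in vitro."
gold = [
    {(4, 7), (4, 24)},  # "p53" or, alternatively, "p53 tumor suppressor"
    {(31, 35)},         # "MDM2"
]
predicted = {(4, 24), (8, 13)}  # one accepted alternative, one spurious span ("tumor")

tp, fp, fn = count_matches(gold, predicted)
precision, recall, f_score = precision_recall_f(tp, fp, fn)
```

In this example the tagger is credited for the alternative marking "p53 tumor suppressor", penalized once for the spurious span "tumor", and penalized once for missing "MDM2", so precision, recall, and F-score are each 0.5.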

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000093-06
Application #: 7316268
Study Section: (CBB)
Support Year: 6
Fiscal Year: 2006

Name: National Library of Medicine
Country: United States

Yu, Hong; Kim, Won; Hatzivassiloglou, Vasileios et al. (2007) Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. J Biomed Inform 40:150-9
Tanabe, Lorraine; Thom, Lynne H; Matten, Wayne et al. (2006) SemCat: semantically categorized entities for genomics. AMIA Annu Symp Proc :754-8
Tanabe, Lorraine; Xie, Natalie; Thom, Lynne H et al. (2005) GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics 6 Suppl 1:S3
Yeganova, L; Smith, L; Wilbur, W J (2004) Identification of related gene/protein names based on an HMM of name variations. Comput Biol Chem 28:97-107
Tanabe, Lorraine; Wilbur, W John (2004) Generation of a large gene/protein lexicon by morphological pattern analysis. J Bioinform Comput Biol 1:611-26
Rzhetsky, Andrey; Iossifov, Ivan; Koike, Tomohiro et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform 37:43-53
Smith, L; Rindflesch, T; Wilbur, W J (2004) MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20:2320-1
Smith, L; Yeganova, L; Wilbur, W J (2003) Hidden Markov models and optimized sequence alignments. Comput Biol Chem 27:77-84
Yu, Hong; Hatzivassiloglou, Vasileios; Friedman, Carol et al. (2002) Automatic extraction of gene and protein synonyms from MEDLINE and journal articles. Proc AMIA Symp :919-23
Tanabe, Lorraine; Wilbur, W John (2002) Tagging gene and protein names in biomedical text. Bioinformatics 18:1124-32