We are currently pursuing two projects designed to make progress on the problem of gene/protein name recognition.

1) We have produced a set of 20,000 sentences in which every occurrence of a gene/protein name is marked up with the word offsets of the beginning and end of the name in the sentence. The sentences were taken as random samples from restricted classes of MEDLINE abstracts. Half were chosen as likely to contain gene/protein names and half as unlikely to contain such names. Since there is ambiguity in marking names, alternative markings are listed as correct answers where this is thought to be appropriate. Three fourths of these names formed the basis for a task in the recent BioCreAtIvE (Critical Assessment of Information Extraction in Biology) Workshop held in Granada, Spain this year. Twelve teams attempted to design systems that could correctly tag the gene/protein names in the sentences. Several teams obtained precision and recall in the low 80% range. A number of different approaches were successful, and these results suggest ways in which we can improve ABGene.

2) We have become convinced that more information about the different types of entities that can occur in MEDLINE sentences can be used to improve name recognition. This has led us to design a set of semantic categories and to attempt to fill these categories with actual names harvested from databases and from web sites. We call the result SEMCAT. It currently recognizes seventy-five categories and contains about four million name strings distributed over those categories. We are experimenting with probabilistic context-free grammars and Markov models of text strings in an attempt to learn how to recognize the entities in different categories.
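To make the markup scheme of the first project concrete, the following is a minimal sketch of how tagged names might be scored against annotations that give word offsets and list alternative markings as correct answers. It is not the scorer used at the workshop. The names `Span`, `score_sentence`, and `precision_recall`, the inclusive offset convention, and the example sentence are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical convention: (first word offset, last word offset), both inclusive.
Span = Tuple[int, int]


def score_sentence(gold: Dict[Span, Set[Span]],
                   predicted: List[Span]) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives).

    `gold` maps each primary marking to its set of acceptable alternatives.
    A prediction counts as correct if it equals the primary marking or any
    alternative, and each gold name can be matched at most once.
    """
    matched: Set[Span] = set()
    tp = fp = 0
    for span in predicted:
        hit = next(
            (g for g, alts in gold.items()
             if g not in matched and (span == g or span in alts)),
            None,
        )
        if hit is None:
            fp += 1
        else:
            matched.add(hit)
            tp += 1
    fn = len(gold) - len(matched)
    return tp, fp, fn


def precision_recall(counts: Iterable[Tuple[int, int, int]]) -> Tuple[float, float]:
    """Pool per-sentence counts and compute precision and recall."""
    counts = list(counts)
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative sentence (not from the corpus):
# words: The(0) p53(1) tumor(2) suppressor(3) protein(4) binds(5) MDM2(6) .(7)
gold = {
    (1, 4): {(1, 1), (1, 3)},  # "p53 tumor suppressor protein" plus alternatives
    (6, 6): set(),             # "MDM2", no alternatives listed
}
predicted = [(1, 1), (5, 6)]   # "p53" is accepted; "binds MDM2" is not

counts = score_sentence(gold, predicted)
print(counts)                      # (1, 1, 1)
print(precision_recall([counts]))  # (0.5, 0.5)
```

In this sketch a boundary error such as "binds MDM2" is penalized twice, once as a false positive and once as a missed name, which is a common but not universal choice for exact-match scoring.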

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000093-05
Application #: 7148042
Study Section: (CBB)
Support Year: 5
Fiscal Year: 2005
Name: National Library of Medicine
Country: United States

Yu, Hong; Kim, Won; Hatzivassiloglou, Vasileios et al. (2007) Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. J Biomed Inform 40:150-9
Tanabe, Lorraine; Thom, Lynne H; Matten, Wayne et al. (2006) SemCat: semantically categorized entities for genomics. AMIA Annu Symp Proc :754-8
Tanabe, Lorraine; Xie, Natalie; Thom, Lynne H et al. (2005) GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics 6 Suppl 1:S3
Smith, L; Rindflesch, T; Wilbur, W J (2004) MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20:2320-1
Yeganova, L; Smith, L; Wilbur, W J (2004) Identification of related gene/protein names based on an HMM of name variations. Comput Biol Chem 28:97-107
Tanabe, Lorraine; Wilbur, W John (2004) Generation of a large gene/protein lexicon by morphological pattern analysis. J Bioinform Comput Biol 1:611-26
Rzhetsky, Andrey; Iossifov, Ivan; Koike, Tomohiro et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform 37:43-53
Smith, L; Yeganova, L; Wilbur, W J (2003) Hidden Markov models and optimized sequence alignments. Comput Biol Chem 27:77-84
Yu, Hong; Hatzivassiloglou, Vasileios; Friedman, Carol et al. (2002) Automatic extraction of gene and protein synonyms from MEDLINE and journal articles. Proc AMIA Symp :919-23
Tanabe, Lorraine; Wilbur, W John (2002) Tagging gene and protein names in biomedical text. Bioinformatics 18:1124-32