Named entity recognition is an important problem for the semantic processing of natural language text. It appears to be inherently more difficult in biology than it proved to be for business applications or news story analysis, as in the MUC conferences. Our interest stems from its potential importance for indexing and retrieving information about a particular gene or protein. However, truly high-quality named entity recognition in biology would have many further applications as a starting point for semantic analysis.

In past work on this problem we developed ABGene, a tagger for gene/protein name recognition in text, and subsequently a database of 20,000 sentences annotated for the occurrence of gene/protein names. The first 15,000 of these sentences formed the basis of the gene/protein mention recognition task for the BioCreative I Workshop held in 2004. After the BioCreative I Workshop the whole 20,000-sentence corpus was revised by 1) removing tokenization and instead providing the text of the original sentence; 2) changing the annotations to be character based instead of token based; 3) revising the annotation guidelines to deal with some of the problems that had become apparent in the Workshop; 4) correcting some erroneous annotations that had come to our attention. The resulting data has become known as the GENETAG corpus. It has at least one unique property: many of the annotated entities have alternative annotations associated with them, so that more than one answer is correct for a particular entity. We believe this is important because many entities can be annotated in more than one way, and for quite a number there is no clear single correct answer.

In 2005 we were invited to be co-organizers of BioCreative II and to be responsible for the gene mention recognition task. For this purpose we gave out the first 15,000 sentences of GENETAG as practice and training data and used the last 5,000 sentences for testing. Whereas 14 teams participated in BioCreative I, 21 teams participated in BioCreative II. The top F score obtained on the gene/protein mention task in BioCreative I was 83.2%, while the top score in BioCreative II was 87.2%. Because there were some changes in the annotation guidelines and some corrections to the data, one cannot say definitively how much progress this represents, but it does suggest progress. Conditional random fields were much more commonly used in BioCreative II, and new approaches to the use of unannotated data also appeared. We analyzed the annotations submitted by all the participants and applied a conditional random fields approach to learn how to combine all of their predictions into an improved prediction. Using 200-fold cross validation, we were able to achieve a balanced F score of 90.7%. Since this combined score exceeds the best individual score of 87.2%, there is still room for improvement in how individual systems perform on gene/protein mention detection. (with Larry Smith and Lorrie Tanabe).

We have become convinced that more information about the different types of entities that can occur in MEDLINE sentences can be used to improve name recognition. This has led us to design a set of semantic categories and to attempt to fill these categories with actual names harvested from databases and web sites. We call the result SEMCAT. It currently recognizes seventy-five categories and contains about four million name strings distributed over those categories. We have experimented with probabilistic context free grammars (PCFGs) and Markov models of text strings in an attempt to learn how to recognize the entities in different categories. However, the best approach we have found for distinguishing gene/protein names from names in all other categories is a new algorithm we term a priority model. Every token that occurs in any name in SEMCAT has two probabilities associated with it. The first is the probability that the token indicates a gene/protein name, and the second measures how reliable the token is as an indicator. Given a phrase, the model combines these values to estimate the probability that the phrase is a gene/protein name. With the priority model we achieve an F score of 96%, compared with 95% for our best PCFG approach. (with Lorrie Tanabe).

The top performance for gene mention recognition in BioCreative II was by Rie Ando from IBM, who introduced a technique called alternating structure optimization. The approach sets up many auxiliary labeling problems that resemble named entity tagging but only try to predict the occurrence of names or tokens from the surrounding textual context. Once the SVM solution weight vectors for these auxiliary problems have been learned, one performs a singular value decomposition and subtracts from each vector its first h components in the decomposition. This subtraction is used only to decrease the penalty in the regularization term of the cost function. The weight vectors are then relearned, and the process is repeated until convergence. The final result is a set of h components of the decomposition of the many weight vectors, which are used to enhance learning on the actual named entity recognition task. This procedure is somewhat complicated and difficult to use. We are studying how to use a similar approach, but with a simpler method of applying the auxiliary learning to improve named entity recognition. One open problem is how to combine such auxiliary learning with the SEMCAT data.
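The F scores reported above are computed over character-based mentions for which a gold entity may have several acceptable annotations. The following is a minimal sketch of how such scoring could work, not the official BioCreative evaluation script. The names `Span`, `count_matches` and `balanced_f`, the exact-match rule, and the toy offsets are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # hypothetical (start, end) character offsets in a sentence


def count_matches(predicted: Set[Span], gold: List[Set[Span]]) -> Tuple[int, int, int]:
    """Count true positives, false positives and false negatives for one sentence.

    Each gold entity is a set of acceptable spans (primary plus alternatives).
    A prediction is credited once if it equals any acceptable span of an
    entity that has not been credited yet (assumed exact-match rule).
    """
    matched = set()  # indices of gold entities already credited
    tp = 0
    for span in sorted(predicted):
        for i, acceptable in enumerate(gold):
            if i not in matched and span in acceptable:
                matched.add(i)
                tp += 1
                break
    return tp, len(predicted) - tp, len(gold) - tp


def balanced_f(tp: int, fp: int, fn: int) -> float:
    """Balanced F score: the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy sentence: the first entity may be annotated as either of two spans.
gold = [{(0, 4), (0, 12)}, {(20, 25)}]
predicted = {(0, 12), (30, 35)}
tp, fp, fn = count_matches(predicted, gold)
print(tp, fp, fn, balanced_f(tp, fp, fn))
```

In this toy case the prediction `(0, 12)` is credited because it equals one of the alternatives for the first entity, while `(30, 35)` is a false positive and the second entity is missed. In a corpus-level evaluation the three counts would be summed over all sentences before computing the F score.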

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000093-08
Application #: 7735078
Support Year: 8
Fiscal Year: 2008
Total Cost: $224,159

Name: National Library of Medicine
Country: United States
Yu, Hong; Kim, Won; Hatzivassiloglou, Vasileios et al. (2007) Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. J Biomed Inform 40:150-9
Tanabe, Lorraine; Thom, Lynne H; Matten, Wayne et al. (2006) SemCat: semantically categorized entities for genomics. AMIA Annu Symp Proc :754-8
Tanabe, Lorraine; Xie, Natalie; Thom, Lynne H et al. (2005) GENETAG: a tagged corpus for gene/protein named entity recognition. BMC Bioinformatics 6 Suppl 1:S3
Yeganova, L; Smith, L; Wilbur, W J (2004) Identification of related gene/protein names based on an HMM of name variations. Comput Biol Chem 28:97-107
Tanabe, Lorraine; Wilbur, W John (2004) Generation of a large gene/protein lexicon by morphological pattern analysis. J Bioinform Comput Biol 1:611-26
Rzhetsky, Andrey; Iossifov, Ivan; Koike, Tomohiro et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform 37:43-53
Smith, L; Rindflesch, T; Wilbur, W J (2004) MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20:2320-1
Smith, L; Yeganova, L; Wilbur, W J (2003) Hidden Markov models and optimized sequence alignments. Comput Biol Chem 27:77-84
Yu, Hong; Hatzivassiloglou, Vasileios; Friedman, Carol et al. (2002) Automatic extraction of gene and protein synonyms from MEDLINE and journal articles. Proc AMIA Symp :919-23
Tanabe, Lorraine; Wilbur, W John (2002) Tagging gene and protein names in biomedical text. Bioinformatics 18:1124-32