Locating important phrases in natural language text is useful for indexing and for placing hyperlinks in text; in either case the goal is to improve access to the textual material. In the past, the most common method for locating phrases has been part-of-speech tagging. We have developed a new approach that uses a number of scoring algorithms to rank phrases by how useful they are likely to be. Eight different methods have been developed and tested. They have proved effective at ranking known phrases from the Unified Medical Language System (UMLS), developed by the National Library of Medicine, high among all the phrases obtained from subsets of the MEDLINE document collection. Six of the methods have been combined to produce optimal scoring methods, which have proven useful in extracting material of quality similar to that already in the UMLS. They also appear promising as a way to mark text with hyperlinks for navigation purposes. Two papers are being published on this topic, and the methods are being applied to the electronic textbook project at NCBI.
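The abstract does not specify the eight scoring methods, so the following is only an illustrative sketch of the general idea: score candidate phrases from a corpus, rank them, and check where known phrases fall in the ranking. Pointwise mutual information over adjacent word pairs is used here purely as a stand-in score and is not claimed to be one of the project's methods. The function names, the `min_count` threshold, and the toy tokenizer are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (toy tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def score_bigrams(docs, min_count=2):
    """Score two-word candidate phrases by pointwise mutual information (PMI).

    PMI is a stand-in score chosen for illustration only; it is an
    assumption, not one of the methods described in the abstract.
    """
    unigrams = Counter()
    bigrams = Counter()
    for doc in docs:
        tokens = tokenize(doc)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimates are unreliable
        p_pair = count / n_bi
        p_first = unigrams[first] / n_uni
        p_second = unigrams[second] / n_uni
        scores[f"{first} {second}"] = math.log(p_pair / (p_first * p_second))
    return scores


def rank_phrases(scores):
    """Return phrases ordered from highest to lowest score (ties by name)."""
    return sorted(scores, key=lambda phrase: (-scores[phrase], phrase))


def mean_rank_fraction(ranked, known_phrases):
    """Average position of known phrases as a fraction of the list length.

    Values near 0 mean the known phrases sit near the top of the ranking.
    Returns None when no known phrase appears in the ranking.
    """
    positions = [i for i, phrase in enumerate(ranked) if phrase in known_phrases]
    if not positions:
        return None
    return sum(positions) / (len(positions) * len(ranked))


if __name__ == "__main__":
    # Hypothetical toy corpus and reference vocabulary, for demonstration only.
    corpus = [
        "myocardial infarction is treated in the hospital",
        "the patient had a myocardial infarction in the past",
        "the hospital is in the city",
    ]
    ranking = rank_phrases(score_bigrams(corpus))
    print(ranking)
    print(mean_rank_fraction(ranking, {"myocardial infarction"}))
```

A reference vocabulary such as the UMLS would play the role of `known_phrases` in an evaluation of this kind; a scoring method is doing its job when the known phrases cluster near the top of the ranked list.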

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000090-02
Application #: 6432766
Study Section: (CBB)
Project Start:
Project End:
Budget Start:
Budget End:
Support Year: 2
Fiscal Year: 2000
Total Cost:
Indirect Cost:
Name: National Library of Medicine
Department:
Type:
DUNS #:
City:
State:
Country: United States
Zip Code:
Yu, Hong; Kim, Won; Hatzivassiloglou, Vasileios et al. (2007) Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. J Biomed Inform 40:150-9
Wilbur, W John; Kim, Won; Xie, Natalie (2006) Spelling correction in the PubMed search engine. Inf Retr Boston 9:543-564
Wilbur, W John; Rzhetsky, Andrey; Shatkay, Hagit (2006) New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics 7:356
Kim, Won; Wilbur, W John (2005) A strategy for assigning new concepts in the MEDLINE database. AMIA Annu Symp Proc :395-9
Smith, L; Rindflesch, T; Wilbur, W J (2004) MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20:2320-1
Smith, L; Wilbur, W J (2004) Retrieving definitional content for ontology development. Comput Biol Chem 28:387-91
Yeganova, L; Smith, L; Wilbur, W J (2004) Identification of related gene/protein names based on an HMM of name variations. Comput Biol Chem 28:97-107
Smith, L; Yeganova, L; Wilbur, W J (2003) Hidden Markov models and optimized sequence alignments. Comput Biol Chem 27:77-84
Kim, W; Wilbur, W J (2000) Corpus-based statistical screening for phrase identification. J Am Med Inform Assoc 7:499-511
Aronson, A R; Bodenreider, O; Chang, H F et al. (2000) The NLM Indexing Initiative. Proc AMIA Symp :17-21

Showing the most recent 10 out of 11 publications