Statistical Phrase Extraction Techniques In Natural Lang

Wilbur, Willy

Abstract

The ability to locate important phrases in natural language text is useful for indexing and for placing hyperlinks in text; in either case the goal is to improve access to the textual material. In the past, the most common method of locating phrases has been a part-of-speech tagger. We have developed a new approach that uses scoring algorithms to rank phrases by how useful they are likely to be. A number of different methods have been developed and tested. These are being combined with methods of stemming and of finding inflectional variants of phrases that are synonymous for retrieval purposes. The UMLS system is also being used to find synonymous phrases for indexing. These methods are being applied to find useful phrases in NCBI's electronic textbook project, which is currently online but still under development. The methods are also beginning to be applied to the PubMed Central database of journal articles in biology and medicine, and to the indexing of OCR material from the scanning of back issues of journals for this database. Current work is focused on adding a sophisticated abbreviation detection capability to this system.
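
The abstract does not say which scoring algorithms are used. As a rough illustration of the general idea of ranking candidate phrases statistically rather than with a part-of-speech tagger, the sketch below scores adjacent word pairs by pointwise mutual information (PMI). PMI is a common stand-in chosen here as an assumption, not the project's method, and the function name `score_bigrams` and the `min_count` parameter are hypothetical.

```python
import math
import re
from collections import Counter


def score_bigrams(text, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Illustrative only: PMI is an assumed scoring function, not the one
    used in the project described above.
    """
    # Crude tokenizer: lowercase alphabetic runs; sentence boundaries are ignored.
    tokens = re.findall(r"[a-z]+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring candidate phrases first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sample = (
        "heat shock protein binds the receptor. "
        "the heat shock protein is the target. "
        "the receptor is the target."
    )
    for pair, score in score_bigrams(sample):
        print(" ".join(pair), round(score, 2))
```

On a toy input like the one above, pairs built from content words (such as "heat shock") should outrank pairs that include a frequent function word (such as "the receptor"), because the frequent word's high unigram probability lowers the PMI. A real system would also need to handle longer phrases, stemming and inflectional variants, as the abstract notes.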

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000090-08
Application #: 7316265
Study Section: (CBB)

Support Year: 8
Fiscal Year: 2006

Institution

Name: National Library of Medicine

Country: United States

Related projects


NIH 2008 Z01 LM	Natural Language Processing Techniques To Enhance Information Access. Wilbur, Willy John / National Library of Medicine	$224,159
NIH 2007 Z01 LM	Statistical Phrase Extraction Techniques In Natural Language Databases. Wilbur, Willy John / National Library of Medicine	$229,501
NIH 2006 Z01 LM	Statistical Phrase Extraction Techniques In Natural Lang Wilbur, Willy John / National Library of Medicine
NIH 2005 Z01 LM	Statistical Phrase Extraction Techniques In Natural Lang Wilbur, Willy John / National Library of Medicine
NIH 2004 Z01 LM	Statistical Phrase Extraction Techniques In Natural Lang Wilbur, Willy John / National Library of Medicine
NIH 2003 Z01 LM	Statistical Phrase Extraction Techniques In Natural Lang Wilbur, Willy John / National Library of Medicine
NIH 2002 Z01 LM	Statistical Phrase Extraction Techniques In Natural Lang Wilbur, Willy John / National Library of Medicine
NIH 2001 Z01 LM	Statistical Phrase Extraction Techniques In Databases Wilbur, Willy John / National Library of Medicine
NIH 2000 Z01 LM	Statistical phrase extraction techniques in natural language databases. Wilbur, Willy John / National Library of Medicine
NIH 1999 Z01 LM	Statistical phrase extraction techniques in natural language databases. Wilbur, Willy John / National Library of Medicine

Publications

Yu, Hong; Kim, Won; Hatzivassiloglou, Vasileios et al. (2007) Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. J Biomed Inform 40:150-9

Wilbur, W John; Kim, Won; Xie, Natalie (2006) Spelling correction in the PubMed search engine. Inf Retr Boston 9:543-564

Wilbur, W John; Rzhetsky, Andrey; Shatkay, Hagit (2006) New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics 7:356

Kim, Won; Wilbur, W John (2005) A strategy for assigning new concepts in the MEDLINE database. AMIA Annu Symp Proc :395-9

Smith, L; Rindflesch, T; Wilbur, W J (2004) MedPost: a part-of-speech tagger for bioMedical text. Bioinformatics 20:2320-1

Smith, L; Wilbur, W J (2004) Retrieving definitional content for ontology development. Comput Biol Chem 28:387-91

Yeganova, L; Smith, L; Wilbur, W J (2004) Identification of related gene/protein names based on an HMM of name variations. Comput Biol Chem 28:97-107

Smith, L; Yeganova, L; Wilbur, W J (2003) Hidden Markov models and optimized sequence alignments. Comput Biol Chem 27:77-84

Kim, W; Wilbur, W J (2000) Corpus-based statistical screening for phrase identification. J Am Med Inform Assoc 7:499-511

Aronson, A R; Bodenreider, O; Chang, H F et al. (2000) The NLM Indexing Initiative. Proc AMIA Symp :17-21

Showing the most recent 10 out of 11 publications
