Rapidly increasing storage media capabilities and spreading interconnectivity have heralded the arrival of the information age. Unfortunately, accessing online information remains an inexact science. While valuable information can be found, many irrelevant documents are typically retrieved as well, and many relevant ones are missed. Terminology mismatches between the user's query and document contents are one cause of retrieval failures. Expanding a user's query with related words can improve search performance, but the problem of identifying related words remains. This research uses corpus linguistics techniques to automatically discover word similarities directly from the contents of an untagged textual database and to incorporate that information into an information retrieval system. These similarities are calculated based on the contexts in which the words appear. Using these similarities, user queries are automatically expanded, resulting in conceptual retrieval rather than requiring exact word matches between queries and documents. The effects of using different algorithms to calculate the similarities and the effects of expanding different sets of query words are evaluated. In addition, the search performance of the retrieval engine serves as a task-based method for comparing the quality of word-word similarities calculated using different corpus linguistics techniques.
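
As a rough illustration of the approach described in the abstract, the sketch below builds context vectors from raw text, scores word pairs by similarity, and expands a query with the closest words. It is a minimal sketch under stated assumptions, not the project's method: the abstract does not name the similarity algorithms it compares, so the fixed co-occurrence window, the cosine measure, the threshold, and the function names context_vectors, cosine, and expand_query are all hypothetical choices made here for illustration.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it.

    Assumption: whitespace tokenization and a symmetric fixed window.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand_query(query, vectors, top_k=1, threshold=0.5):
    """Append up to `top_k` similar words for each query word.

    Assumption: a word is added only if its similarity reaches `threshold`.
    """
    expanded = []
    for word in query.lower().split():
        if word not in expanded:
            expanded.append(word)
        if word not in vectors:
            continue  # unseen words cannot be expanded
        scored = [
            (cosine(vectors[word], vec), other)
            for other, vec in vectors.items()
            if other != word
        ]
        # Highest similarity first; ties broken alphabetically.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        for score, other in scored[:top_k]:
            if score >= threshold and other not in expanded:
                expanded.append(other)
    return expanded


# Toy corpus: "car" and "automobile" occur in identical contexts.
corpus = [
    "the car drove down the road",
    "the automobile drove down the road",
    "the cat slept on the sofa",
]
vectors = context_vectors(corpus)
print(expand_query("car", vectors))
```

On this toy corpus, "car" and "automobile" share the same neighbouring words, so a query for one is expanded with the other even though the two words never co-occur in a single document. Comparing retrieval results across different similarity functions, as the abstract proposes, would amount to swapping out cosine for another measure.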

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 9409263
Program Officer: C. Suzanne Iacono
Project Start:
Project End:
Budget Start: 1994-08-15
Budget End: 1998-08-31
Support Year:
Fiscal Year: 1994
Total Cost: $104,925
Indirect Cost:
Name: University of Kansas
Department:
Type:
DUNS #:
City: Lawrence
State: KS
Country: United States
Zip Code: 66045