The Center for Intelligent Information Retrieval (CIIR) is investigating the impact of statistically derived semantic word relationships on information retrieval. Exploiting these relationships, for example by identifying when different words express the same content, can lead to more effective ranking of retrieval results. Semantic relationships are not labeled explicitly in text and are too varied to be identified by hand alone. The CIIR is therefore mining massive corpora for direct and indirect word co-occurrence data, using both offline and retrieval-time computation. The work focuses on two families of techniques: those that create and use Web-based corpora of "comparable" sentences and text chunks to estimate word and phrase translation probabilities, and those that derive relationships from "context vectors" that represent word and phrase meanings. The quality of the discovered word relationships is being tested in large-scale retrieval experiments. In addition, the CIIR is addressing computational barriers to large-scale data mining by moving its new distributed computational framework, TupleFlow, to Hadoop. TupleFlow was developed for the indexing and analysis operations required by large-scale studies of relational structure in text. It is an extension of MapReduce, with advantages in flexibility, scalability, disk abstraction, and low abstraction penalties. This work is expected to have broad impact by improving the quality of search results.
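
As a rough illustration of the "context vector" idea mentioned in the abstract, the following minimal sketch builds co-occurrence vectors from a tiny toy corpus and compares words by cosine similarity. It is not the CIIR's method or part of TupleFlow: the function names, the window size, the toy sentences, and the choice of raw counts with cosine similarity are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it.

    Hypothetical helper: whitespace tokenization and raw counts are simplifying
    assumptions, not details taken from the project description.
    """
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (invented for this example).
corpus = [
    "the car drove down the road",
    "the automobile drove down the road",
    "the cat slept on the mat",
]

vectors = context_vectors(corpus)
sim_related = cosine(vectors["car"], vectors["automobile"])
sim_unrelated = cosine(vectors["car"], vectors["slept"])
print(sim_related, sim_unrelated)
```

In this toy corpus "car" and "automobile" occur in identical contexts, so their vectors score higher than the pair "car" and "slept". A corpus-scale version would need weighting, smoothing, and distributed counting, none of which is shown here.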

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0844226
Program Officer: Maria Zemankova
Project Start:
Project End:
Budget Start: 2009-02-01
Budget End: 2012-01-31
Support Year:
Fiscal Year: 2008
Total Cost: $450,000
Indirect Cost:
Name: University of Massachusetts Amherst
Department:
Type:
DUNS #:
City: Amherst
State: MA
Country: United States
Zip Code: 01003