Traditional models of information retrieval assume documents are independently relevant. However, users often want the retrieved documents to cover several subtopics, different possible interpretations of the query, or distinct nuggets of relevant information. In these cases, modeling documents as independently relevant does not provide the optimal user experience. This research project addresses the problem with new models of document interdependence and new evaluation measures. There are three threads running through this work: (1) the models of diversity, novelty, and redundancy needed to implement ranking algorithms; (2) measures of diversity, novelty, and redundancy in a ranking of documents; and (3) optimization of model structures and parameters against those measures.

The main challenges addressed in this project include: how to model relevance along with interdependencies among search results for different tasks and domains; how to predict the degree of diversity a query requires; and how to evaluate the new models. The new models include set-based and pattern-based retrieval models, and the new measures include set-based labeling and list-based preference judgments, among others. The models will be evaluated with existing measures as well as the new measures, and the new measures will themselves be evaluated with in-depth statistical analysis. The results of this project include a suite of models and measures that combine novelty and diversity with relevance for different domains, including biomedical search, legal search, and Web search.
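For concreteness, the sketch below shows a generic greedy diversification reranker in the spirit of maximal marginal relevance (MMR): it trades off a document's relevance against its redundancy with documents already selected. This is only an illustration of the general idea of interdependent relevance, not one of the project's models; the `relevance` scores and `similarity` function are hypothetical stand-ins for whatever scoring a real system would use.

```python
def mmr_rerank(candidates, relevance, similarity, k=10, lam=0.5):
    """Greedy diversified reranking in the spirit of MMR.

    candidates : list of document ids
    relevance  : dict mapping doc id -> relevance score for the query
    similarity : function (doc_a, doc_b) -> similarity in [0, 1]
    lam        : trade-off between relevance (lam=1) and novelty (lam=0)
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(d):
            # Penalize documents that resemble anything already selected.
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance[d] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy example (purely illustrative): d2 nearly duplicates d1, so the
# less relevant but novel d3 is ranked ahead of it.
docs = ["d1", "d2", "d3"]
rel = {"d1": 0.9, "d2": 0.85, "d3": 0.4}
sim = lambda a, b: 0.95 if {a, b} == {"d1", "d2"} else 0.1
print(mmr_rerank(docs, rel, sim, k=2))   # -> ['d1', 'd3']
```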

The far-reaching potential impact of this project is an improvement in search engine utility across the board. The project's specific applications to biomedical search, legal or patent search, and web search reflect major domains in which search engines are important and in which improved models for novelty and diversity would directly improve the user experience. Project results, including publications and evaluation datasets, will be disseminated via the project website (http://ir.cis.udel.edu/monads).

Project Report

This project was about building search engines that automatically find novel results (that is, results that tell the user something they did not already know) and diverse results (that is, results that are useful to the widest possible space of user needs). To do that, we focused on four questions:

1. How do we know that novelty and diversity actually matter to users of search engines?
2. How can a computer differentiate between the different aspects of information that a document conveys? In other words, how can we identify the different subtopics of a user query?
3. How can a computer use those identified subtopics to provide the best possible search results, including both novel and diverse information?
4. How do we know whether the computer has done a good job?

Intellectual merit: The intellectual outcomes of this project, corresponding to the four questions above, are:

1. We performed experiments with users showing that, when given a choice, they consistently prefer documents that provide relevant AND novel information over documents that provide only relevant information. We also showed by simulation that a more diverse set of results is more likely to benefit more users.
2. We developed methods to automatically discover subtopics of a user query from document collections as well as knowledge bases. The identified query subtopics are accurate and effective in helping users find documents that cover a wide range of diverse, relevant information.
3. We proposed a general framework for developing effective diversification methods that generate satisfying search results for a query with diversified information needs. These methods have performed well in international competitions.
4. We defined ways to evaluate search systems, based on the experiments described in #1, that reward systems for providing more novel and more diverse results, based specifically on how much the system's own users indicate they prefer novelty and diversity (a generic sketch of such a measure follows this report).

Broader impacts: This project provided financial support for six graduate students. Two of them completed PhDs on work toward this project: Wei Zheng (now a software engineer at Google) and Praveen Chandar (currently a postdoctoral researcher at Columbia University). The project also produced 20 publications and datasets that are available for anyone to download. Finally, work from the project was carried out in collaboration with industry partners and has been used in industry work on search.
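The project's own preference-based evaluation measures are defined in its publications and datasets; as a rough illustration of how a measure can reward novelty and diversity, the sketch below computes an unnormalized alpha-nDCG-style gain plus subtopic recall over a ranked list. It assumes each document is annotated with the set of query subtopics it covers (the `doc_subtopics` input is a hypothetical annotation scheme, not the project's data format).

```python
import math

def diversity_gain(ranking, doc_subtopics, alpha=0.5, depth=10):
    """Rank-discounted gain that also discounts repeated subtopics.

    ranking       : ranked list of doc ids
    doc_subtopics : dict mapping doc id -> set of subtopic labels it covers
    alpha         : penalty for returning a subtopic that was already seen
    Returns (gain, subtopic_recall) over the top `depth` documents.
    """
    seen = {}      # how many times each subtopic has appeared so far
    gain = 0.0
    for rank, doc in enumerate(ranking[:depth], start=1):
        doc_gain = 0.0
        for t in doc_subtopics.get(doc, set()):
            # Each repeated occurrence of a subtopic contributes less gain.
            doc_gain += (1 - alpha) ** seen.get(t, 0)
            seen[t] = seen.get(t, 0) + 1
        gain += doc_gain / math.log2(rank + 1)
    all_subtopics = set().union(*doc_subtopics.values()) if doc_subtopics else set()
    recall = len(seen) / len(all_subtopics) if all_subtopics else 0.0
    return gain, recall
```

Because each additional occurrence of a subtopic contributes geometrically less gain, a system that keeps repeating the same information scores lower than one that covers new subtopics at the same ranks.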

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1017026
Program Officer: Maria Zemankova
Budget Start: 2010-09-01
Budget End: 2014-07-31
Fiscal Year: 2010
Total Cost: $495,968
Name: University of Delaware
City: Newark
State: DE
Country: United States
Zip Code: 19716