Helping users find relevant information is a problem vital to the functioning of today's information-based societies, and it is no surprise that millions of people worldwide use search engines every day. Although existing search technologies work well, there is still considerable room for improvement. Search engine innovation is driven by the ability to rapidly and repeatedly measure the quality of the results produced by a given system. This type of measurement typically requires some form of human input: for example, a human expert may be hired to assess the relevance of search results, or the search engine may log user interactions, such as the queries entered and the results clicked. Once a sufficiently large amount of such data has been collected, it can be used to measure search engine quality accurately, and also to improve existing search engines via a process known as "tuning" or "training". However, gathering this information at scale requires substantial human effort or computational resources, so sustained innovation comes at a very steep cost.
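To make the role of human judgments concrete, the following is a minimal, illustrative sketch (not part of the project; the document identifiers are hypothetical) of how relevance assessments from a human expert are typically used to score one system's ranked results with average precision, a standard test-collection measure:

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked list, given the set of judged-relevant docs."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_doc_ids) if relevant_doc_ids else 0.0

# Hypothetical judgments from a human assessor for one query,
# and one system's ranked result list for that query.
judged_relevant = {"doc3", "doc7", "doc9"}
system_ranking = ["doc7", "doc1", "doc3", "doc4", "doc9"]
print(average_precision(system_ranking, judged_relevant))  # ~0.76
```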

The primary focus of this research is techniques for constructing large information retrieval test collections that require no human effort. Rather than relying on human-curated information, the project mines implicit relevance signals from the Web to automatically construct large, reusable test collections for a variety of search tasks, including Web search, news search, and enterprise search. The starting point is the observation that the Web contains a large number of implicit relevance signals. The simplest example is the hyperlink, which can be interpreted as a signal from the author of the source page that the target page is relevant. The research investigates the hypothesis that such implicit relevance signals can be mined and aggregated in a completely unsupervised manner to create test collections without any human effort. The automatically generated test collections are evaluated in two ways. First, they are evaluated on their ability to measure the quality of search systems as accurately as human-generated test collections. Second, the quality of search engines tuned using the automated test collections is compared against that of engines tuned using manual test collections.
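As an illustration of the kind of implicit signal described above, the sketch below treats hyperlink anchor text as a pseudo-query and the link target as a pseudo-relevant document, aggregating links from many pages into automatic relevance judgments. This is a simplified assumption about how such signals might be mined, not the project's actual method; the helper names and the voting threshold are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

class AnchorExtractor(HTMLParser):
    """Collects (anchor text, target URL) pairs from a single HTML page."""
    def __init__(self):
        super().__init__()
        self._href = None
        self._text = []
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            anchor = " ".join(self._text).strip().lower()
            if anchor:
                self.pairs.append((anchor, self._href))
            self._href = None

def build_pseudo_judgments(pages, min_links=2):
    """Aggregate anchor-text links into pseudo-query -> relevant-URL judgments.

    A target URL is kept as 'relevant' for a pseudo-query only if at least
    `min_links` distinct source pages link to it with that anchor text.
    """
    votes = defaultdict(lambda: defaultdict(int))
    for html in pages:
        parser = AnchorExtractor()
        parser.feed(html)
        for anchor, url in set(parser.pairs):  # one vote per source page
            votes[anchor][url] += 1
    return {
        query: {url for url, n in targets.items() if n >= min_links}
        for query, targets in votes.items()
    }

# Toy usage with two hypothetical pages linking to the same target.
pages = [
    '<a href="http://example.org/ir">information retrieval</a>',
    '<p>See <a href="http://example.org/ir">information retrieval</a> for details.</p>',
]
print(build_pseudo_judgments(pages))
# {'information retrieval': {'http://example.org/ir'}}
```

One common way to check whether such automatically derived judgments are useful, in line with the first evaluation above, is to rank a set of retrieval systems with both the automatic and a human-built collection and measure how closely the two orderings agree, for example with a rank correlation such as Kendall's tau.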

The broader impact of this project derives from the automatically constructed test collections, which are freely distributed to the research community. Increased availability of training data for systematically evaluating and tuning search engines, in both industrial and academic settings, is expected to advance search engine technologies. Additional impact is expected from the integration of research and education at the graduate and undergraduate levels, and from engaging women and underrepresented students through outreach programs.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1147810
Program Officer: Sylvia Spengler
Budget Start: 2011-09-01
Budget End: 2013-02-28
Fiscal Year: 2011
Total Cost: $150,000
Name: University of Southern California
City: Los Angeles
State: CA
Country: United States
Zip Code: 90089