Text search is undeniably vital to today's information-based societies, helping users locate relevant information in web pages, journal articles, news stories, blogs, emails, tweets, and a myriad of other sources. Naturally, users desire results that are not only good but also fast. Learning to rank, the dominant approach to ranking in information retrieval (IR) today, focuses almost exclusively on effectiveness and often neglects the runtime speed (i.e., efficiency) of the ranking functions. This project contributes to the emerging research area of learning to efficiently rank, which aims to let algorithm designers capture, model, and reason about tradeoffs between effectiveness and efficiency in a unified framework.

Specifically, this project explores a novel cascade model for retrieval, where ranking is broken into a finite number of distinct stages. Each stage considers successively richer and more complex features, but over successively smaller candidate document sets. The intuition is that although complex features are more time-consuming to compute, examining fewer documents offsets the additional overhead. In other words, the cascade model views retrieval as a multi-stage progressive refinement problem. To the best of the investigators' knowledge, based on a survey of the current state of the art, this is the first project to explore this approach to the ranking problem, marking a substantial departure from previous "monolithic" ranking functions. Although exploration in this uncharted area carries some risk, this research promises to open up a new frontier in IR research.
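To make the intuition concrete, here is a minimal sketch of a ranking cascade in Python. The scorers, cutoffs, and documents are hypothetical stand-ins for real feature computations, not the project's actual implementation: each stage re-ranks the surviving candidates with a richer (and costlier) feature, then prunes the candidate pool before the next stage.

```python
# Minimal sketch of a ranking cascade (hypothetical scorers and cutoffs,
# not the project's actual implementation).

def term_score(doc, query):
    # Cheap stage-1 feature: count of matching query terms.
    return sum(doc.count(t) for t in query)

def phrase_score(doc, query):
    # Costlier stage-2 feature: occurrences of the exact query phrase.
    return doc.count(" ".join(query))

def cascade_rank(docs, query, stages):
    """stages: list of (scorer, cutoff) pairs, cheapest first. Each stage
    re-scores the surviving candidates, then keeps only the top `cutoff`."""
    candidates = list(docs)
    for scorer, cutoff in stages:
        candidates.sort(key=lambda d: scorer(d, query), reverse=True)
        candidates = candidates[:cutoff]
    return candidates

docs = ["learning to rank documents", "rank and file", "learning to cook"]
query = ["learning", "to", "rank"]
# Broad but cheap first, narrow but expensive second.
print(cascade_rank(docs, query, [(term_score, 2), (phrase_score, 1)]))
```

Because the expensive scorer only ever sees the few documents that survive the cheap stages, total cost stays low even as per-document feature cost grows.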

This project aims to narrow the gap between academic and industrial IR research by bringing together theoretical IR research and practical considerations in "real-world" search. It is expected that the cascade model will be of interest to web search engine companies, thus providing a path from the exploratory research results to significant impact in production systems. Furthermore, this work dovetails with the emerging area of green computing: more efficient algorithms use less energy, and hence help reduce the environmental footprint of web-scale services. The project web site (www.umiacs.umd.edu/~jimmylin/projects/) includes more information about this project and will be used for the release of a prototype as part of the Ivory open-source retrieval toolkit.

Project Report

When it comes to search, users desire results that are not only good but also fast. These two desiderata, however, are often in tension. To obtain high-quality results, systems typically take advantage of machine learning techniques to rank documents based on a multitude of "features". For example: Does the document have matching query terms? Does the document have matching phrases? Does the document have high editorial quality? The more features an algorithm considers, the better the quality overall; however, considering more features takes more time, particularly for features that are computationally expensive. Thus, we often observe an inverse relationship between quality and speed: fast but poor quality, or slow but high quality.

The goal of this project is to explore search techniques that balance result quality with the time required to compute those results. Our intuition is to treat the problem as a "cascade" that proceeds in stages: the algorithm first considers "cheap" and "fast" features over a large set of documents, and then progressively considers more "costly" features, but over smaller sets of documents. The additional time required for the more computationally intensive features is offset by the fact that fewer documents need to be analyzed. In this way, cascade ranking can generate high-quality results quickly.

Building on this basic idea, we explored two main instantiations based on different classes of machine learning techniques: linear models and decision trees. In both cases, we showed that it is indeed possible to design algorithms that are both good and fast. For the decision-tree models, we additionally developed a number of optimizations that take advantage of modern computer architectures (a sketch illustrating the general idea appears at the end of this report); as a result, we were able to substantially increase the speed of algorithms based on decision trees.

In addition to designing ranking algorithms that better balance speed and quality, we considered the overall architecture of search systems. Traditionally, researchers have viewed search engines as monolithic entities; building on the intuition of ranking cascades, we took a different route and reconceived search in terms of multi-stage architectures, breaking the overall problem down into distinct stages that can be implemented by separate software components in a loosely coupled manner. Recently, a paper about Microsoft's Bing search engine described a very similar design, which provides validation for our ideas.

More generally, algorithmic efficiency is an important problem recognized by industry. Datacenters, which power the vast majority of the world's Internet services, consume a huge amount of energy and are a significant contributor of greenhouse gas emissions. More efficient algorithms mean fewer machines for the same quality of service, or an expansion of services to better benefit users. Our research achieves broader impact through its contributions to these challenges. Finally, our work has yielded open-source software that captures our findings, so that our results can be replicated and expanded upon by others.
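The report describes the decision-tree optimizations only at a high level. As one illustrative example of the general idea (a sketch under our own assumptions, not necessarily the exact technique the project used), a tree in the ensemble can be laid out as flat, parallel arrays so that evaluation becomes tight index arithmetic over contiguous memory rather than pointer chasing through node objects, which is friendlier to modern caches:

```python
# Sketch of one architecture-conscious optimization for tree ensembles:
# each tree is flattened into parallel arrays indexed by node id.
# (Illustrative only; the project's actual optimizations may differ.)

# Node i tests feature[i] against threshold[i]; leaves are marked with
# feature[i] == -1 and carry their score in value[i].
feature   = [0,   1,   -1,  -1,  -1]
threshold = [0.5, 0.3, 0.0, 0.0, 0.0]
left      = [1,   3,   -1,  -1,  -1]
right     = [2,   4,   -1,  -1,  -1]
value     = [0.0, 0.0, 1.0, -1.0, 0.5]

def eval_tree(x):
    """Evaluate one flattened tree on feature vector x."""
    i = 0
    while feature[i] != -1:  # descend until a leaf is reached
        i = left[i] if x[feature[i]] <= threshold[i] else right[i]
    return value[i]

print(eval_tree([0.2, 0.9]))  # root -> left child -> right leaf: prints 0.5
```

An ensemble score is then simply the sum of `eval_tree` over all flattened trees, a loop that compilers and hardware prefetchers handle far better than traversals of heap-allocated node objects.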

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1144034
Program Officer: Maria Zemankova
Budget Start: 2011-09-01
Budget End: 2014-08-31
Fiscal Year: 2011
Total Cost: $150,000
Institution: University of Maryland College Park
City: College Park
State: MD
Country: United States
Zip Code: 20742