This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).
Many information needs are most naturally expressed as longer, sentence-length queries, but current search engines handle such queries poorly, forcing people to devise the right combination of keywords to find relevant documents. This can be very difficult and often leads to search failures. The focus of this project is on developing retrieval algorithms and query processing techniques that will significantly improve the effectiveness of long queries. A specific emphasis is on techniques for transforming long queries into semantically equivalent queries that produce better search results. In contrast to purely linguistic approaches to paraphrasing, query transformation is done in the context of, and guided by, retrieval models. Query transformation steps such as stemming, segmentation, and expansion have been studied for many years, and we are both extending this work and integrating it into a common framework. The new query processing techniques for long queries are being developed and distributed using the NSF-funded Lemur toolkit from UMass/CMU, and are being evaluated using a variety of document and query collections drawn from the web, social media sites such as forums, and TREC. Graduate and undergraduate students are involved in the work. The project Web site (http://ciir.cs.umass.edu/research/longqueries) will be used to further disseminate results.
Given that search is one of the two most common activities on the web and that new modalities for search, such as voice interfaces and collaborative question answering, are increasing the importance of long queries, this research could have a very broad impact, both at home and in the office.