This project maintains and enhances existing community software infrastructure and creates new community data infrastructure, enabling the information retrieval research community and related research communities to conduct research at "web scale": datasets of a billion or more web pages together with large query logs. The software infrastructure is based on the Lemur Toolkit and the associated Indri search engine, which many information retrieval researchers use because they support multiple retrieval models, multiple forms of evidence, and a powerful probabilistic query language. The enhancements to Lemur include support for the popular MapReduce style of distributed processing and other efficiency improvements that make it practical to do research on large web datasets "out of the box" in common computer hardware environments.
The new data infrastructure consists of the maintenance and distribution of a newly created billion-page dataset, another new web dataset, and large, anonymized search logs that match the datasets. The combination of large datasets and corresponding large search logs enables a broad community to conduct research with more realistic data resources than were previously available. This research will lead to further advances in understanding the underlying issues in large-scale, personalized search, which will be an important part of the next generation of search engines.
For further information, see the project web site at www.lemurproject.org.