Many previous successes in information retrieval (IR) research have been based on open-source software and test collections supported by university groups. This project will extend the open-source search engine software developed by the Lemur Project to support major emerging research areas that are important for developing the next generation of search engines, which will run on mobile, speech-based platforms and be capable of learning and adapting to individual users. We will also add new capabilities that support common text mining tasks and the large-scale collection of material from social media. These extensions will also be an important part of training graduate students in computer science in these critical new research areas.
The impact of machine learning, natural language processing (NLP), and social media on IR research has been significant and continues to grow. Machine learning techniques, in particular, have been instrumental in helping to develop ranking functions that involve linear or non-linear combinations of many representation features of queries and documents. The enhancements we will provide in Lemur to support ongoing research in these areas include tools for training parameters, tools for defining and incorporating new features, learning-to-rank methods that are relatively easy to use and well integrated with the other components of the search engine, and integration with NLP and data mining toolkits. These enhancements, along with the architectural changes to support multi-pass searching and flexible query processing, are not currently available in any open-source search engine or IR toolkit.
For further information, see the project web site at www.lemurproject.org/.