Modern search engines, especially those designed for the World Wide Web, commonly analyze and combine hundreds of features extracted from the submitted query and underlying documents (e.g., web pages) in order to assess the relative relevance of a document to a given query and thus rank the underlying collection. The sheer size of this problem has led to the development of learning-to-rank algorithms that can automate the construction of such ranking functions: Given a training set of (feature vector, relevance) pairs, a machine learning procedure learns how to combine the query and document features so as to effectively assess the relevance of any document to any query and thus rank a collection in response to a user input. Much thought and research have been devoted to feature extraction and to the development of sophisticated learning-to-rank algorithms. However, relatively little research has been conducted on the choice of documents and queries for learning-to-rank data sets or on the effect of these choices on the ability of a learning-to-rank algorithm to "learn" effectively and efficiently.
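To make this training setup concrete, the following minimal sketch (in Python, using scikit-learn's gradient-boosted regression trees purely as a stand-in for a production learning-to-rank algorithm; all feature values and relevance labels are synthetic) trains a pointwise ranker on (feature vector, relevance) pairs and then ranks a set of candidate documents for a new query by predicted relevance.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    # Synthetic training data: each row is a query-document feature vector
    # (e.g., term-match scores, in-link counts, document length) and each
    # label is a graded relevance judgment (0 = non-relevant ... 4 = perfect).
    rng = np.random.default_rng(0)
    X_train = rng.random((1000, 8))        # 1000 query-document pairs, 8 features
    y_train = rng.integers(0, 5, 1000)     # graded relevance labels

    # A simple pointwise ranker: regress relevance on the feature vector.
    ranker = GradientBoostingRegressor(n_estimators=100)
    ranker.fit(X_train, y_train)

    # At query time, score every candidate document for the query and
    # sort the collection by predicted relevance, best first.
    X_candidates = rng.random((50, 8))     # feature vectors for 50 candidate documents
    scores = ranker.predict(X_candidates)
    ranking = np.argsort(-scores)
    print(ranking[:10])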
The proposed work investigates the effect of query, document, and feature selection on the ability of learning-to-rank algorithms to efficiently and effectively learn ranking functions. In preliminary results on document selection, a pilot study has already determined that training sets as small as 2 to 5% of the size of those typically used are just as effective for learning-to-rank purposes. Thus, one can train more efficiently over a much smaller (though effectively equivalent) data set, or, at an equal cost, one can train over a far "larger" and more representative data set. In addition to formally characterizing this phenomenon for document selection, the proposed work investigates this phenomenon for query and feature selection as well, with the end goals of (1) understanding the effect of document, query, and feature selection on learning-to-rank algorithms and (2) developing collection construction methodologies that are efficient and effective for learning-to-rank purposes.
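As an illustration of the kind of comparison the pilot study performs, the sketch below subsamples the judged documents for each training query down to roughly 5% and trains a ranker on the reduced set; the random per-query selection shown here is only a placeholder, since the choice of selection strategy is precisely what the proposed work investigates, and the data are again synthetic.

    from collections import defaultdict
    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def subsample_per_query(query_ids, X, y, fraction=0.05, seed=0):
        """Keep roughly a `fraction` of the judged documents for each training query."""
        rng = np.random.default_rng(seed)
        by_query = defaultdict(list)
        for i, q in enumerate(query_ids):
            by_query[q].append(i)
        keep = []
        for idx in by_query.values():
            k = max(1, int(round(fraction * len(idx))))
            keep.extend(rng.choice(idx, size=k, replace=False))
        keep = np.array(keep)
        return X[keep], y[keep]

    # Synthetic training data: 50 queries with 200 judged documents each.
    rng = np.random.default_rng(1)
    query_ids = np.repeat(np.arange(50), 200)
    X = rng.random((50 * 200, 8))
    y = rng.integers(0, 5, 50 * 200)

    # Train on roughly 5% of the judged documents per query.
    X_small, y_small = subsample_per_query(query_ids, X, y, fraction=0.05)
    ranker = GradientBoostingRegressor(n_estimators=100).fit(X_small, y_small)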
In addition to characterizing and developing collection construction methodologies, the project plan includes the development and release of new, efficient, and effective learning-to-rank data sets for use by academia and industry. In support of this effort, the project team has close ties with the National Institute of Standards and Technology (NIST) and Microsoft Research, two of the premier organizations that develop and release Information Retrieval data sets. All research results and data sets developed as part of this project will be made available at the project website (www.ccs.neu.edu/home/jaa/IIS-1017903/). The project also provides an educational and training experience for students.
Modern search engines, especially those designed for the World Wide Web, commonly analyze and combine hundreds of features extracted from the submitted query and underlying documents (e.g., web pages) in order to assess the relative relevance of a document to a given query and thus produce a ranked list of documents (or web pages) for the user. Example features include the presence or absence of query terms in the document, the number of web pages that point to the document, and the document's length, age, and so forth. The sheer size of this problem has led to the development of learning-to-rank algorithms that can automate the construction of such ranking functions: Given a training set of (query, document, relevance) triples, a machine learning procedure learns how to combine the query and document features so as to effectively predict the relevance of any document to any query and thus rank a collection in response to a user input. For modern search engines, much thought and research have been devoted to feature extraction and to the development of sophisticated learning-to-rank algorithms. However, relatively little research has been conducted on the choice of documents and queries for training the learning-to-rank algorithm or on the effect of these choices on the ability of a learning-to-rank algorithm to "learn" effectively and efficiently. Furthermore, relatively little research has been conducted on the effect of noisy training data in this process, particularly noisy or incorrect relevance assessments. Our work has demonstrated that noise and the careful choice of training data can have a profound effect on the ability of learning-to-rank algorithms to accurately "learn" to solve the search engine task. For example, we have shown that learning-to-rank algorithms are most effective when trained on data that is skewed, in the sense that the training data include both highly relevant and highly non-relevant documents for any given training query, whereas documents of middling relevance matter far less. Furthermore, we have shown that noisy relevance assessments in the training and testing data greatly affect both the real and the perceived quality of the trained ranker. Finally, we have shown that by incorporating a model of relevance-assessment noise into the learning-to-rank algorithm itself, one can significantly improve its ultimate performance.

Beyond these research results, the project provided research topics for three students who obtained their PhDs during the life of the project and for two further students who will obtain their PhDs in the upcoming year. The graduated students have gone on to pursue careers in academia and in industrial research and development.
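To illustrate the effect of noisy relevance assessments described above, the sketch below flips a fraction of the training labels at random and compares the NDCG of rankers trained on the clean versus the corrupted judgments; the uniform label-flipping noise model and the synthetic data are illustrative assumptions only, not the noise model developed in this project.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    def ndcg(relevance, scores, k=10):
        """NDCG@k for one query, given true graded judgments and predicted scores."""
        order = np.argsort(-scores)[:k]
        dcg = ((2.0 ** relevance[order] - 1) / np.log2(np.arange(2, len(order) + 2))).sum()
        ideal = np.sort(relevance)[::-1][:k]
        idcg = ((2.0 ** ideal - 1) / np.log2(np.arange(2, len(ideal) + 2))).sum()
        return dcg / idcg

    # Synthetic data in which relevance depends on the first feature,
    # so that there is a genuine signal for the ranker to learn.
    rng = np.random.default_rng(2)
    X_train = rng.random((2000, 8))
    y_train = np.round(4 * X_train[:, 0])

    # Simulate assessor error: replace a fraction of the training labels
    # with random grades.
    noise_rate = 0.3
    flip = rng.random(2000) < noise_rate
    y_noisy = y_train.copy()
    y_noisy[flip] = rng.integers(0, 5, int(flip.sum()))

    # A held-out "query" with 100 judged documents and clean labels.
    X_test = rng.random((100, 8))
    y_test = np.round(4 * X_test[:, 0])

    for name, labels in [("clean", y_train), ("noisy", y_noisy)]:
        ranker = GradientBoostingRegressor(n_estimators=100).fit(X_train, labels)
        print(name, round(ndcg(y_test, ranker.predict(X_test)), 3))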