This is a collaborative research project (0704689: Yiming Yang, Carnegie Mellon University; 0704628: Daqing He, University of Pittsburgh). Adaptive filtering (AF) is an open challenge in information retrieval, defined as the problem of incrementally learning about the topics of interest from user feedback (relevance judgments of the retrieved documents) over a chronologically processed stream of documents. The goal of this research project is to significantly improve adaptive filtering technologies. The approach consists of: (1) a new framework named the Enriched Vector Space Model (EVSM) that represents multi-type objects (including users, queries, topics, documents, named entities and sources of data), records the interactions among objects during the adaptive filtering process, and enables comparison among objects based on both content similarity and relationship similarity; and (2) a system that bridges adaptive filtering, collaborative filtering, personalized active learning and Generalized Hubs and Authorities for effective learning about the evolving interests of users. The experimental research is linked to educational benefits for graduate students through participation in the system implementation, data annotation, empirical evaluations and user studies in this project, as well as through the materials of courses that the Principal Investigators teach on the related topics and techniques. The results of this project will provide a significant contribution to the field of information search and to our understanding of how to learn effectively from multiple users and how to combine multi-aspect user information in a new unified framework. The results have broad applications in information retrieval (for example, web-based and enterprise search engines), to which they add a major adaptive and personalized dimension.
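As an illustration of comparing objects on both content similarity and relationship similarity, the following is a minimal sketch and not the EVSM itself. It assumes that each object carries two sparse vectors (one over content terms, one over the objects it has interacted with), that both are compared by cosine similarity, and that the two scores are blended with a hypothetical weight `alpha`. All names, keys and values here are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def combined_similarity(a, b, alpha=0.5):
    """Blend content and relationship similarity.

    alpha is a hypothetical mixing weight (an assumption of this sketch):
    1.0 uses content only, 0.0 uses relationships only.
    """
    content = cosine(a["content"], b["content"])
    relation = cosine(a["relations"], b["relations"])
    return alpha * content + (1.0 - alpha) * relation


# Hypothetical objects of different types: a user profile and a document.
user = {
    "content": {"filtering": 1.0, "retrieval": 1.0},
    "relations": {"doc:17": 1.0, "topic:ir": 1.0},
}
doc = {
    "content": {"filtering": 1.0, "ranking": 1.0},
    "relations": {"topic:ir": 1.0, "source:news": 1.0},
}

score = combined_similarity(user, doc, alpha=0.5)
print(f"combined similarity: {score:.3f}")
```

Keeping the two vectors separate lets objects of different types (a user and a document, say) be compared even when they share little text, since shared interactions still contribute to the score.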
The project Web sites (http://nyc.lti.cs.cmu.edu/UserCentricAFCF/ and http://amber.sis.pitt.edu/UserCentricAFCF) will be used to disseminate resulting publications, open-source code and annotated test data sets.
The new challenges for retrieval systems (search engines) are to provide relevant answers to users’ queries in a concise manner, i.e., to cover different aspects of a topic without being redundant. Different users have different tolerance for redundancy. While some users only want to see previously unseen information, others might expect a certain level of redundancy for reasons such as corroborating information or assessing the consensus of opinions on a topic or product across different news sources, reviews, blogs, etc. Under the auspices of the NSF grant titled User Centric Adaptive and Collaborative Filtering (III-COR 0704689 and 0704628), the CLAIR group at Carnegie Mellon University, led by Prof. Yiming Yang, has conducted cutting-edge research in addressing such complex user information needs. A new theoretical framework and a set of algorithms were developed for multi-session retrieval, filtering and recommendation, and for the evaluation of competing approaches. The system takes a semi-supervised approach to identifying informative "nuggets" in documents, and optimizes the ranked lists of documents so as to maximize the coverage of informative nuggets and minimize the redundancy in the ranked list. Users’ tendency to abandon search at varying points in the ranked lists of search results and their tolerance of redundancy are also modeled probabilistically in a stochastic process. By learning user-specific parameters in the model, the system can be personalized for each user with respect to the desired redundancy tolerance level as well as the user’s persistence in going through the ranked list of documents. Empirical evaluations show that the new novelty-driven and user-centric optimization significantly enhances the expected utility of the distillation system, compared with conventional approaches, which are relevance-driven only. Another major contribution of this project is the development of a novel framework and new algorithms for multi-task active learning.
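The nugget-based ranking idea described in the paragraph above can be made concrete with a small sketch. This is not the project's algorithm. It assumes that the nuggets in each document are already known, that each repeated sighting of a nugget is worth a fixed fraction `gamma` of the previous one (a stand-in for redundancy tolerance), and that the user continues to the next rank with a fixed probability `persistence`. Both parameters are hypothetical and hand-set here, whereas the report describes learning such parameters per user.

```python
def marginal_gain(nuggets, seen, gamma):
    """Gain of one document, discounting nuggets the user has already seen."""
    return sum(gamma ** seen.get(n, 0) for n in nuggets)


def expected_utility(ranking, doc_nuggets, gamma=0.25, persistence=0.5):
    """Expected nugget gain of a ranked list under a simple browsing model."""
    seen = {}
    utility = 0.0
    reach = 1.0  # probability that the user gets as far as the current rank
    for doc in ranking:
        utility += reach * marginal_gain(doc_nuggets[doc], seen, gamma)
        for n in doc_nuggets[doc]:
            seen[n] = seen.get(n, 0) + 1
        reach *= persistence
    return utility


def greedy_rank(doc_nuggets, gamma=0.25):
    """Greedily pick the document with the largest discounted gain next."""
    seen = {}
    remaining = set(doc_nuggets)
    ranking = []
    while remaining:
        # sorted() makes tie-breaking deterministic (alphabetical by doc id)
        best = max(sorted(remaining),
                   key=lambda d: marginal_gain(doc_nuggets[d], seen, gamma))
        ranking.append(best)
        remaining.remove(best)
        for n in doc_nuggets[best]:
            seen[n] = seen.get(n, 0) + 1
    return ranking


# Hypothetical collection: d1 and d2 repeat the same nuggets, d3 is novel.
doc_nuggets = {"d1": {"a", "b"}, "d2": {"a", "b"}, "d3": {"c"}}

novelty_order = greedy_rank(doc_nuggets)
relevance_order = ["d1", "d2", "d3"]  # ordered by raw nugget count only

print(novelty_order, expected_utility(novelty_order, doc_nuggets))
print(relevance_order, expected_utility(relevance_order, doc_nuggets))
```

In this toy collection the greedy ordering moves the novel document ahead of the near-duplicate, and its expected utility is higher than that of the relevance-only ordering. A larger `gamma` models a user who values repeated information more, and a larger `persistence` models a user who reads further down the list.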
The goal of multi-task active learning is to minimize the amount of training data required to learn multiple tasks, such as classification, regression, recommendation and filtering. For example, we study how to acquire classification data for one categorization problem so that it is beneficial for learning other categorization problems, and how to acquire minimal supervision from one user to improve the recommendations for other similar users. The CMU team developed a novel strategy called Benevolent Active Learning to explicitly estimate the impact of supervision across tasks, a hitherto unstudied problem. Empirical analysis on popular benchmark datasets demonstrates the effectiveness of this approach over the current state of the art, which takes a narrow, single-task view of active learning. Another important contribution on the active learning front is Personalized Active Learning (PAL), an effort to understand user interests rapidly with minimal user interaction. The interaction takes the form of questions such as "Do you like item X?", and PAL’s goal is to identify the minimal set of items that will reveal the user’s interests and therefore improve future retrieval performance. It is important that the user be able to answer such a question in the affirmative or the negative, rather than respond with "I do not know" or "What is item X?". The latter responses lead to a failed dialogue and simply increase the user interaction required to discover the user’s interests. The PAL algorithm significantly outperforms the existing state of the art in user interest elicitation by minimizing such failures. With the advent of social websites such as Facebook and Twitter and the increased online activity of users, it has become possible to tap into several sources of information for understanding user interests.
CLAIR has developed a novel multi-faceted personalized search (MPS) system that takes into account the topicality of a user’s search, the user’s search history, the authoritativeness of on-topic pages, and recommendations based on search results liked by users with similar interests. Evaluating such a complex system is not possible with the existing benchmark datasets available to academic researchers. CLAIR addressed this challenge by developing a multi-faceted search dataset called CiteData in collaboration with Prof. Daqing He’s team at the University of Pittsburgh. This new dataset consists of controlled, measurable tasks involving personalized retrieval of academic publications, citations, and ratings from thousands of users in the CiteULike dataset. It has been made publicly available for researchers and will significantly benefit future benchmark evaluations of systems that take multiple facets of information into account in personalized search, filtering and recommendation.
This Project Outcomes Report for the General Public is displayed verbatim as submitted by the Principal Investigator (PI) for this award. Any opinions, findings, and conclusions or recommendations expressed in this Report are those of the PI and do not necessarily reflect the views of the National Science Foundation; NSF has not approved or endorsed its content.
Partner Organizations: University of Pittsburgh: Prof. Daqing He and his team from the University of Pittsburgh collaborated closely with us on the test collection design, development, and analysis effort.