Ranked data arises when m raters order n items, by some mechanism, to express their preferences among the items. Such data can represent election voting, psychological and medical surveys, book and movie recommendations, and web-site ranking systems such as search engines. In this proposal the investigators develop the theory and methodology of statistical inference in the case where n and m tend to infinity and each rater provides increasingly censored, or partial, preference information. Under this scenario, they demonstrate how to obtain consistent non-parametric estimators and develop efficient computational procedures for their use. The proposal also examines how to visualize preference data by embedding it in a low-dimensional space and how to design appropriate surveys for preference data.
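As a rough illustration of what a non-parametric estimator over rankings can look like, the sketch below smooths a handful of observed rankings with a kernel based on Kendall's tau distance. It is not the investigators' estimator: the function names, the kernel form exp(-c * d), the bandwidth parameter c, and the restriction to complete rankings of a few items are all assumptions made for illustration. The censored, large-n setting that the proposal targets is not handled here.

```python
from itertools import combinations, permutations
from math import exp


def kendall_tau_distance(a, b):
    """Count the item pairs that rankings a and b order differently."""
    pos_a = {item: i for i, item in enumerate(a)}
    pos_b = {item: i for i, item in enumerate(b)}
    return sum(
        1
        for x, y in combinations(a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )


def kernel_estimate(observed, items, c=1.0):
    """Kernel-smoothed probability estimate over all rankings of `items`.

    `c` is a hypothetical bandwidth-like parameter: larger values put
    more weight on rankings close to the observed ones.
    """
    # Brute-force enumeration of all n! rankings; feasible only for small n.
    support = list(permutations(items))
    # Normalizer of the kernel exp(-c * d). It does not depend on the
    # kernel's center, so it is computed once from an arbitrary ranking.
    z = sum(exp(-c * kendall_tau_distance(support[0], s)) for s in support)
    m = len(observed)
    return {
        s: sum(exp(-c * kendall_tau_distance(s, o)) for o in observed) / (z * m)
        for s in support
    }


# Three hypothetical raters, each giving a full ranking of three items.
observed = [("A", "B", "C"), ("A", "B", "C"), ("B", "A", "C")]
estimate = kernel_estimate(observed, ("A", "B", "C"))
```

The estimate assigns positive probability to every ranking, including ones no rater submitted, which is the main practical difference from a raw frequency table.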

The methodology and theory developed in this proposal should help build superior recommendation systems, which are becoming increasingly popular in today's online businesses. Such systems build a customized list of recommended items based on the user's past preferences. The proposal also develops visualization techniques for such data, which should improve the ability of businesses to analyze customer survey data. In the past, such techniques have been either ad hoc and lacking statistical interpretation, or computationally prohibitive. This proposal aims to develop useful tools for preference data that are both statistically interpretable and computationally efficient in a realistic large-data setting.

Project Report

Preference or ranked data arises when users or raters order items, by some mechanism, to express their preferences among the items. Such data can represent election voting, psychological and medical surveys, book and movie recommendation systems, marketing and advertisement studies, and web-site ranking systems such as search engines. Recent technological advances have increased the availability of rating datasets and their importance to science and industry. However, existing statistical techniques for preference data have been largely inadequate: many rely on strong parametric assumptions that are unrealistic in most practical cases and require intractable computation. This project investigates statistical inference for preference data when there are a large number of raters and items. In such a case, data is censored in the sense that each rater reveals only a small portion of their preference relation. We proposed new nonparametric methodologies based on kernel smoothing estimators and tested them in the context of a movie recommendation system (Netflix data). While our proposed methods perform slightly worse than other widely used (state-of-the-art engineering) methods, they are much richer in probabilistic interpretation and can be incorporated in modern recommendation systems. An extension of our methods has led to new algorithms for interviewing users in cold-start recommendation systems.

Broader Impact. The tools developed in this project are applicable to a wide range of datasets that have substantial importance in the high-tech and business communities. They will benefit industry and strengthen the ties between statistical theory and industry. During the project period, the PI published several blog posts (smlv.cc.gatech.edu) on modern advances in information technology, in particular on search engine optimization and Google's instant search. The posts were followed and discussed by a local newspaper (Atlanta's AJC) and a local radio station. The PI is working on an online book on probability and R, which is expected to be useful for students studying Statistics or Machine Learning. The book is available for free at theanalysisofdata.com, and a video version is under development.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Application #
0907466
Program Officer
Gabor J. Szekely
Project Start
Project End
Budget Start
2009-08-01
Budget End
2012-07-31
Support Year
Fiscal Year
2009
Total Cost
$175,881
Indirect Cost
Name
Georgia Tech Research Corporation
Department
Type
DUNS #
City
Atlanta
State
GA
Country
United States
Zip Code
30332