This Small Business Innovation Research (SBIR) Phase I project addresses the problem of learning predictive models of individual choice behavior from sparse information on the behavior of any single individual. The intellectual merit of the project lies in developing a novel, parsimonious view of this problem by modeling choice behavior as a distribution over permutations of alternatives, and in making this view implementable at scale. A unit of data in this paradigm is a single comparison between two alternatives. Data of this sort can be derived in a variety of contexts, ranging from product reviews to transaction data. Although this modeling viewpoint is parsimonious, exact computation with such models, or even representing them, is intractable. The project will focus on developing approximate solutions that, in the spirit of recent advances in high-dimensional statistics, exploit the potential of sparse approximations to such models. Given the vast quantities of data available to build such models, it will be important for the algorithms developed to be amenable to parallelization in a manner reminiscent of the Map/Reduce computational paradigm. The algorithms developed will fit this paradigm, with key algorithmic steps decomposing across the data collected for a single individual. In summary, this project will develop a massively parallelizable approach to modeling individual choice behavior using unstructured data from a variety of sources.
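
One way to make this modeling view concrete, offered as an illustrative sketch and not as the project's own notation: let $\lambda$ be a probability distribution over the permutations $\sigma$ of $n$ alternatives, where $\sigma(a)$ is the rank position of alternative $a$ and a lower position means more preferred. The symbols $\lambda$, $\sigma$, and $n$ are assumptions introduced here. The probability that $a$ is preferred to $b$ in a single comparison is then

```latex
\[
  \Pr(a \succ b) \;=\; \sum_{\sigma \in S_n} \lambda(\sigma)\,
  \mathbf{1}\{\sigma(a) < \sigma(b)\}
\]
```

Because $S_n$ contains $n!$ permutations, storing $\lambda$ explicitly is infeasible for even moderate $n$. A sparse approximation would be a $\lambda$ that puts nonzero mass on only a small number of permutations.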

The broader impact/commercial potential of this project rests in enabling the emerging, all-pervasive transition from 'search' to 'discovery'. This transition can be witnessed in sectors ranging from e-commerce to offline retail to matching impressions to advertisers on demand-side platforms. The key stumbling block in this transition is the seeming requirement to build attribute-rich models for a given context, as opposed to taking a black-box approach. The approach taken in this project is of the latter variety. As a concrete example, the task of merchandising requires an offline retailer to decide on the right assortment of products to carry in segments ranging from toothpaste to clothing; the approach here will power such decision making in an entirely data-driven fashion. In a different direction, serving ads based on models that capture a surfer's preferences across the various silos of products and topics on the web can be enabled at scale and at fine granularity using the approach here. The level of granularity made possible by this approach cannot be achieved with 'parametric', attribute-driven approaches. In summary, the tools developed in this project have the potential to do for 'discovery' what the PageRank algorithm did for search.

Project Report

Summary. The primary outcome of this project is the development of a fundamentally new and powerful recommendation technology that can provide meaningful recommendations by using many kinds of data simultaneously. The project builds on the PIs' multiple-award-winning research, begun in early 2007, as well as on development done as part of this project.

Broader Impact. Recommendation systems are everywhere in modern information systems, in settings such as Netflix, Amazon, and Yelp. The goal of such a system is to understand the 'preferences', 'likes and dislikes', or more generally 'choices' of an individual, a community, or an entire population, so that it can provide meaningful suggestions, or an ordered list of options drawn from a potentially large pool, that are of interest to that individual, community, or population. Naturally, designing such a system requires access to the preferences or choices of individuals, communities, and entire populations. In recent times, the emergence of a variety of 'sensing platforms' such as mobile phones, web interfaces, and electronic recording systems has led to the availability of large amounts of such preference data in a variety of contexts. In principle, therefore, it seems feasible to develop such an ambitious recommendation system. In reality, however, state-of-the-art recommendation systems fall short of achieving this. The primary reason is that the available preference data, generated in the settings mentioned above, is very heterogeneous in its natural representation, while existing systems are primarily designed to work with one specific representation. For example, electronic transactions record which products were purchased, web or mobile phone app logs capture browsing behavior, and written reviews capture detailed preferences in textual form. However, popular recommendation systems (known as collaborative filtering) work with preference data in the form of 'star ratings', 'scores', or 'thumbs up/down'.

Intellectual merit. This project addressed precisely this fundamental challenge: developing a recommendation system that can operate with heterogeneous preference data and provide meaningful recommendations across a whole gamut of settings. Intellectually, the development of this technology has advanced the state of the art in machine learning and statistics, social choice and policy making, and large-scale data processing. The system is based on the following simple and powerful insight of the PIs: most, if not all, forms of heterogeneous preference data can be viewed as comparisons. This includes all the forms of preference data mentioned above: browsing logs, web or app clicks, transactions, reviews, scores or ratings, likes/dislikes, etc. For example, consider a user of an entertainment media portal like Netflix. If, while browsing its web interface, the user clicks on the movie 'Fargo' while 'Salt' and 'About a Boy' are also visible on the screen, it can be concluded that s/he prefers 'Fargo' over the other two movies. This can be thought of as two comparisons: 'Fargo' > 'Salt' and 'Fargo' > 'About a Boy'. Similarly, if the user in a physical (or online) store purchased a DVD of 'Fargo' while 'Salt' and 'About a Boy' were on the same shelf (or were browsed during the transaction), then the same conclusion (and comparisons) can be derived. The intensity of the comparison in the two settings could be (and most likely should be) different: a click on an item while browsing is not as strong a preference signal as a purchase. In summary, the technology problem solved by this project is the following: given a collection of options or choices (e.g. movies, books) and preference data in the form of a bag of comparisons produced by a collection of individuals, develop a system that provides a rank-ordered set of options for each individual, where this rank-ordered set is likely to change depending on the individual's history.
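
The conversion of clicks and purchases into comparisons described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the project's algorithm: the event tuple layout, the `EVENT_WEIGHT` values, and the net-weighted-wins scorer in `rank_for_user` are all hypothetical choices made here to show the data flow from raw events to a per-individual ranking.

```python
from collections import defaultdict

# Hypothetical signal strengths (assumed for illustration): a purchase is
# treated as a stronger preference signal than a click.
EVENT_WEIGHT = {"click": 1.0, "purchase": 2.0}


def events_to_comparisons(events):
    """Turn (user, event_type, chosen, shown) events into weighted comparisons.

    Each output tuple (user, winner, loser, weight) says that `user`
    preferred `winner` over `loser` with the given signal strength.
    """
    comparisons = []
    for user, event_type, chosen, shown in events:
        weight = EVENT_WEIGHT[event_type]
        for other in shown:
            if other != chosen:
                comparisons.append((user, chosen, other, weight))
    return comparisons


def rank_for_user(comparisons, user):
    """Rank items for one user by net weighted wins.

    This simple scorer is a stand-in, not the method developed in the project.
    It only touches the comparisons of a single user, so it could be run
    independently per individual.
    """
    score = defaultdict(float)
    for u, winner, loser, weight in comparisons:
        if u == user:
            score[winner] += weight
            score[loser] -= weight
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(score, key=lambda item: (-score[item], item))


events = [
    # The user clicked 'Fargo' while 'Salt' and 'About a Boy' were on screen.
    ("u1", "click", "Fargo", ["Fargo", "Salt", "About a Boy"]),
    # The user bought 'About a Boy' while 'Salt' was on the same shelf.
    ("u1", "purchase", "About a Boy", ["About a Boy", "Salt"]),
]
comparisons = events_to_comparisons(events)
ranking = rank_for_user(comparisons, "u1")
print(ranking)
```

With these made-up events, the click yields two comparisons of weight 1.0 and the purchase yields one of weight 2.0, so the same pair of items can contribute differently depending on how the preference was observed.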

Agency: National Science Foundation (NSF)
Institute: Division of Industrial Innovation and Partnerships (IIP)
Type: Standard Grant (Standard)
Application #: 1248473
Program Officer: Muralidharan Nair
Project Start:
Project End:
Budget Start: 2013-01-01
Budget End: 2013-06-30
Support Year:
Fiscal Year: 2012
Total Cost: $150,000
Indirect Cost:
Name: Celect, LLC
Department:
Type:
DUNS #:
City: Cambridge
State: MA
Country: United States
Zip Code: 02142