In large-scale prediction problems that arise in many application areas, data is plentiful, and it is computational resources that constrain the performance of prediction methods. The broad goal of this research project is the design and analysis of methods for large-scale prediction problems that make effective use of limited computational resources. The main aims are: to improve our understanding of the tradeoff between the accuracy of a prediction method and its computational requirements; to develop model selection methods that adaptively choose the model complexity that gives the best predictive accuracy for the available computational resources; to improve our understanding of the difficulty of solving large-scale prediction problems using distributed computational resources; to develop analysis techniques and methods for asynchronous online prediction, which exploit the flexibility to respond to queries out of order; and hence to develop effective methods for large-scale prediction problems.
As data acquisition and storage have become cheaper, enormous data sets have become available in many areas, including web information retrieval, the biological, medical, and physical sciences, manufacturing, finance, and retail. Consequently, for many statistical prediction problems, the amount of data available is so large that we can treat it as unlimited. For instance, in using image and caption data to train a prediction rule that automatically chooses appropriate labels for images, the web provides an effectively unlimited supply of training data. Similar situations arise in using click-stream data to predict the choices of visitors to a popular web site, or in using customers' ratings of movies to make useful recommendations. For these large-scale prediction problems, the bottleneck to performance is not the amount of data but the computational resources that are available.

Many modern prediction methods have been designed and analyzed from the perspective that data is precious: they aim for optimal predictive accuracy at a given sample size. But for large-scale problems, this is the wrong perspective; computation is the precious resource that must be used wisely. This shift in perspective introduces some novel tradeoffs. One of the most important arises in choosing the complexity of a prediction rule. Should we spend our computational resources optimizing over a very complex family of prediction rules, leaving little computation for gathering data? Or should we save computation by using simpler prediction rules, and instead spend it on gathering more data? This research project is aimed at improving our understanding of these tradeoffs, and hence at developing strategies for large-scale prediction problems that best exploit the available computational resources.
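The following sketch is purely illustrative and is not one of the project's methods: under an assumed (hypothetical) cost model in which a more complex family of prediction rules costs more computation per training example, a fixed computation budget buys fewer examples for the complex family than for the simple one. The synthetic data, the cost factors, and all function names are invented for illustration only.

```python
# Illustrative sketch of the complexity vs. sample size tradeoff under a
# fixed computation budget. All quantities below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, d=20):
    """Synthetic regression data with a mildly nonlinear target."""
    X = rng.normal(size=(n, d))
    y = X[:, 0] + 0.5 * np.sin(X[:, 1]) + 0.1 * rng.normal(size=n)
    return X, y

def fit_least_squares(X, y):
    """Ordinary least squares fit via numpy's lstsq."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def test_error(w, feature_map, n_test=10_000):
    """Mean squared error on a fresh synthetic test set."""
    X, y = make_data(n_test)
    return np.mean((feature_map(X) @ w - y) ** 2)

# Hypothetical cost model: the complex rule's feature map is 10x more
# expensive per example, so the same budget buys 10x fewer examples.
budget = 100_000                 # units of computation (hypothetical)
identity = lambda X: X                        # simple family: linear features
quadratic = lambda X: np.hstack([X, X ** 2])  # complex family: adds squared features

for name, cost_per_example, phi in [("simple", 1, identity),
                                    ("complex", 10, quadratic)]:
    n = budget // cost_per_example            # examples affordable under the budget
    X, y = make_data(n)
    w = fit_least_squares(phi(X), y)
    print(f"{name}: n={n}, test MSE={test_error(w, phi):.3f}")
```

Running the sketch prints the test error of each family at the sample size its (assumed) per-example cost allows; which family wins depends on how the extra expressiveness trades off against the reduction in sample size, which is exactly the tradeoff the project studies.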