The past decade has witnessed an explosion in the scale and richness of data sets that arise in both science and engineering. A wide variety of application areas have led to large-scale data sets. Examples include social networks such as Facebook, online recommender systems such as Amazon and Netflix, neuroscience data including fMRI, EEG, and brain-machine interfaces, and image and video processing, including face recognition, surveillance, and security. All of these areas require effective methods for statistical inference---that is, methods that lead to actionable conclusions from the data. The classical approach in statistics is to study inference algorithms without consideration of their computational and storage requirements; this approach leads to many methods that simply cannot be implemented for large-scale problems. The goal of this research is to develop a principled framework for characterizing the fundamental limits of statistical estimation under computational and storage constraints. This shift in perspective will lead to the development of new and computationally efficient methods for statistical estimation in resource-constrained environments.
While the notion of minimax risk characterizes the fundamental limits of statistical estimation, it is based on taking an infimum over all measurable functions of the data, thereby allowing for estimators that have exponential computational complexity, require prohibitive amounts of storage, and/or reveal sensitive data. The goal of this proposal is to study various constrained forms of statistical minimax based on limiting the class of possible estimators. The proposed work is interdisciplinary in nature, combining ideas from mathematical statistics, information theory, optimization theory, and computational complexity. The first research thrust concerns the tradeoffs between computational costs and statistical accuracy. The main goal is to understand when there are gaps between the classical minimax risk and the lowest risk achievable by algorithms that run in polynomial time. Specific model classes of interest include high-dimensional forms of sparse regression, sparse principal component analysis, and classification problems in neural networks. The second research thrust focuses on estimation in distributed settings. Many data sets are so large that they cannot be stored at a single central location, but instead must be split into many pieces and stored on separate machines that can communicate only relatively small amounts of information. Thus, an important problem is to characterize the minimal amount of communication needed for a distributed implementation to match the performance of the centralized estimator.
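For concreteness, one standard way to formalize this contrast is sketched below; the notation is illustrative rather than taken from the proposal, with $\mathcal{P}$ a family of distributions, $\theta(P)$ the parameter of interest, $\rho$ a loss such as squared error, and $\mathcal{C}$ a constrained class of estimators:
\[
\mathfrak{M}\bigl(\theta(\mathcal{P})\bigr) \;=\; \inf_{\widehat{\theta}} \, \sup_{P \in \mathcal{P}} \, \mathbb{E}_P\bigl[\rho\bigl(\widehat{\theta}, \theta(P)\bigr)\bigr]
\qquad \text{versus} \qquad
\mathfrak{M}_{\mathcal{C}}\bigl(\theta(\mathcal{P})\bigr) \;=\; \inf_{\widehat{\theta} \in \mathcal{C}} \, \sup_{P \in \mathcal{P}} \, \mathbb{E}_P\bigl[\rho\bigl(\widehat{\theta}, \theta(P)\bigr)\bigr],
\]
where $\mathcal{C}$ might be the set of estimators computable in polynomial time, or the set of distributed protocols exchanging at most $B$ bits of communication. Since the second infimum ranges over a smaller class of estimators, $\mathfrak{M}_{\mathcal{C}} \geq \mathfrak{M}$ always holds; the questions above amount to characterizing when, and by how much, this inequality is strict.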