As a result of rapid advances in information technology, massive datasets are being generated in all fields of science, engineering, social science, business, and government. Useful information is often extracted from these data through statistical model fitting, e.g., through regression models. These models are useful for describing relationships between predictor variables and a response variable. Given a set of n data elements and p predictors, p and/or n can be large in much modern massive data set applications. In these cases, conventional algorithms often face severe computational challenges. Subsampling of rows and/or columns of a data matrix have traditionally been employed as a heuristic to reduce the size of large data sets, thus enabling computations to run more quickly. Recently, however, an innovative sampling methodology that uses the empirical statistical leverage scores of the data matrix as a nonuniform importance sampling distribution has been proposed. This has been applied to the ordinary least squares (OLS) problem and other related problems, and this leverage-based nonuniform sampling procedure gives a very good approximation to the OLS based on full data (when p is small and n is large) more rapidly than traditional methods, both in worst-case theory and in high-quality numerical implementations. As of yet, however, the statistical properties of these algorithms are unexplored. Understanding these properties is of interest for both fundamental and very practical reasons; and the investigators' work addresses these problems. The investigators consider both statistical theory as well as the evaluation of that theory with high-quality numerical implementations on large real-world data.
This research proposal consists of two related research thrusts, both of which center around the common goal of an integrated treatment of statistical and computational issues. The first research thrust focuses on studying the statistical properties of the subsampling estimation using the statistical leverage scores in linear regression. The second research thrust generalizes the theory and methods to nonlinear regression and dimension reduction models. The proposed theory and methods serve as an inspiration for new ideas to push statistical methodology development forward. The research provides new insight into the existing algorithms, produces innovative methodologies for analyzing large-scale data, inspires new lines of quantitative investigations in interdisciplinary research and offers a unique educational experience.
The research addressed theoretical and practical aspects of using the notion of statistical leverage for the development of improved approximation algorithms in large-scale machine learning and data analysis. Statistical leverage has been used historically in regression diagnostics to identify outliers and errors in the data, and it has been used more recently in the area of Randomized Numerical Linear Algebra to develop improved algorithms for ubiquitous matrix problems. In particular, the research developed novel statistical interpretations and connections for random sampling and random projection algorithms that explicitly or implicitly use statistical leverage and algorithmic leveraging ideas. It provided an improved understanding of results that have been obtained in genetics and astronomical applications, and it provided several directions for the development of improved algorithms in those areas. The extension of these ideas from least-squares regression to least absolute deviation regression was also considered, and an improved understanding was obtained of how low-precision and high-precision algorithms for these problems could be derived. This has implications for the usefulness of the methodology in large-scale data analysis and in conjunction with more-established methods such as stochastic gradient descent. The research was widely disseminated through usual publication venues, it was incorporated into tutorials and classes that the lead researcher developed, and it was included in major meetings on large-scale data analysis.