This research proposal consists of two related research thrusts, both of which center around the common goal of an integrated treatment of statistical and computational issues. The first research thrust focuses on studying the statistical properties of the subsampling estimation using the statistical leverage scores in linear regression. The second research thrust generalizes the theory and methods to nonlinear regression and dimension reduction models. The proposed theory and methods serve as an inspiration for new ideas to push statistical methodology development forward. The research provides new insight into the existing algorithms, produces innovative methodologies for analyzing large-scale data, inspires new lines of quantitative investigations in interdisciplinary research and offers a unique educational experience.
As a result of rapid advances in information technology, massive datasets are being generated in all fields of science, engineering, social science, business, and government. Useful information is often extracted from these data through statistical model fitting, e.g., through regression models. These models are useful for describing relationships between predictor variables and a response variable. Given a set of n data elements and p predictors, p and/or n can be large in much modern massive data set applications. In these cases, conventional algorithms often face severe computational challenges. Subsampling of rows and/or columns of a data matrix have traditionally been employed as a heuristic to reduce the size of large data sets, thus enabling computations to run more quickly. Recently, however, an innovative sampling methodology that uses the empirical statistical leverage scores of the data matrix as a nonuniform importance sampling distribution has been proposed. This has been applied to the ordinary least squares (OLS) problem and other related problems, and this leverage-based nonuniform sampling procedure gives a very good approximation to the OLS based on full data (when p is small and n is large) more rapidly than traditional methods, both in worst-case theory and in high-quality numerical implementations. As of yet, however, the statistical properties of these algorithms are unexplored. Understanding these properties is of interest for both fundamental and very practical reasons; and the investigators' work addresses these problems. The investigators consider both statistical theory as well as the evaluation of that theory with high-quality numerical implementations on large real-world data.