Recent technological advances have provided researchers with a wealth of data. Unlocking the potential of these vast data sources often involves linking a response variable (e.g., the presence of a disease, or a positive response to a treatment) to a large number of potential features of interest (e.g., gene expression levels or activities of brain regions). It is increasingly common to measure a large number of features (thousands or more) on a small number of subjects (or patients). Generally, only a small number of these features are related to the response variable. To identify those relevant features, statistical and machine-learning algorithms are used to automatically select a small subset of features that are most predictive of the response. Many of these algorithms enforce strong restrictions on the models they allow, e.g., linearity. In scientific domains where only a small number of subjects are measured, these restrictions can be useful: building more complex models requires more subjects. However, they are sometimes overly restrictive. This project will develop a framework for estimating less restrictive (additive) models that employ variable selection in high-dimensional problems. The framework will help overcome a number of computational challenges. It will additionally lay a theoretical foundation for analyzing the statistical behavior of such high-dimensional estimators, including the impact of finite computational resources. A publicly available software implementation for flexible high-dimensional modeling will also be developed.
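For concreteness, the sketch below illustrates the kind of sparse additive modeling described above; it is an illustrative stand-in under simplifying assumptions, not the project's actual method or software. Each feature is expanded in a small polynomial basis (a spline basis would typically be used in practice), and a group-lasso penalty, minimized by proximal gradient descent, zeroes out entire features at once. All function names and parameter values are hypothetical.

```python
import numpy as np

def basis(x, degree=3):
    # Centered, normalized polynomial basis for one feature
    # (a simple stand-in for a spline basis).
    B = np.column_stack([x**d for d in range(1, degree + 1)])
    B -= B.mean(axis=0)
    B /= np.linalg.norm(B, axis=0)
    return B

def fit_sparse_additive(X, y, lam=0.05, degree=3, n_iter=500):
    # Minimize (1/2n)||y - B beta||^2 + lam * sum_j ||beta_j||_2
    # by proximal gradient descent (ISTA) with a group soft-threshold.
    n, p = X.shape
    y = y - y.mean()                           # no intercept in the model
    B = np.hstack([basis(X[:, j], degree) for j in range(p)])
    groups = np.split(np.arange(p * degree), p)
    L = np.linalg.eigvalsh(B.T @ B / n).max()  # Lipschitz constant of the gradient
    beta = np.zeros(p * degree)
    for _ in range(n_iter):
        grad = B.T @ (B @ beta - y) / n
        z = beta - grad / L
        for g in groups:
            # Group soft-thresholding: an entire feature's coefficient
            # block is set to zero or shrunk toward zero.
            norm = np.linalg.norm(z[g])
            z[g] = 0.0 if norm <= lam / L else (1 - lam / (L * norm)) * z[g]
        beta = z
    selected = [j for j, g in enumerate(groups) if np.linalg.norm(beta[g]) > 0]
    return beta, selected

# Toy data: only features 0 and 1 affect the response, nonlinearly.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 50))
y = np.sin(3 * X[:, 0]) + X[:, 1]**2 + 0.1 * rng.standard_normal(200)
beta, selected = fit_sparse_additive(X, y)
print("selected features:", selected)
```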
This project engages seminal questions in nonparametric estimation and penalized regression. Generally, the computational and theoretical challenges of sparse nonparametric regression in high dimensions are studied separately: iterative algorithms are constructed that eventually get within a prespecified tolerance of the minimum, while statistical properties (e.g., convergence rates) are studied for the exact minimizer. In addition, for nonparametric problems, existing theoretical studies have often focused on statistical properties when the structure implied by the objective (e.g., sparsity) holds exactly. This project aims to merge the study of computational and statistical optimality in the setting of high-dimensional additive, and more general nonparametric, models. More specifically, it aims to analyze the statistical properties of approximate, rather than exact, minimizers and, from there, to characterize the number of descent iterations needed to obtain estimators with optimality guarantees. In addition, the project aims to extend these ideas to settings where the structure or smoothness may be misspecified. To address these challenges, the project brings together ideas from convex optimization, empirical process theory, penalized regression, and approximation theory, and will serve as a template for engaging those bodies of knowledge together.
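In generic notation (assumed here for illustration, not taken from the project description), the objects of study can be sketched as a penalized least-squares objective over additive functions, together with the approximate minimizer obtained by stopping a descent method early:

```latex
% Schematic sparse additive objective (generic notation, assumed):
% y_i is the response, x_{ij} the j-th feature of subject i; each f_j lies in a
% smoothness class with seminorm J(.); lambda and mu are tuning parameters.
Q(f_1,\dots,f_p) \;=\; \frac{1}{2n}\sum_{i=1}^{n}\Bigl(y_i-\sum_{j=1}^{p} f_j(x_{ij})\Bigr)^{2}
  \;+\; \lambda\sum_{j=1}^{p}\lVert f_j\rVert_{n}
  \;+\; \mu\sum_{j=1}^{p} J(f_j),
\qquad
\lVert f_j\rVert_{n}^{2} \;=\; \frac{1}{n}\sum_{i=1}^{n} f_j(x_{ij})^{2}.

% Rather than the exact minimizer, the analysis targets the iterate f^{(T)} of a
% descent method satisfying Q(f^{(T)}) - \min Q \le \varepsilon, and asks how large
% T must be for f^{(T)} to retain the statistical guarantees of the exact minimizer.
```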
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.