In this big-data era, massive data sets are generated routinely, and there is a growing need for powerful, reliable, and interpretable statistical learning tools to help understand these data. The main ideas and approaches in this project focus on developing effective statistical learning tools for complex and heterogeneous structures in high dimensions, such as those changing in time or varying among different groups of individuals. The activities will have a significant impact on high-dimensional Bayesian analysis and the modeling of nonlinear relationships. While most current efforts in high-dimensional Bayesian analysis have focused on linear models, this project generalizes standard linear models in two ways to meet certain practical challenges. The first is a generalized form of mixture modeling, termed individualized variable selection, which enables each individual observation to have its own set of dependent variables through the employment of neuronized priors. The second is Bayesian inference for index models that form a mixture structure. The project will lead to useful tools (or customized software) for discovering interpretable nonlinear and interactive patterns among a large number of potential variables. Various aspects of the statistical modeling, design, and learning strategies integrated in our algorithms are broadly applicable to problems involving signal discovery in complex systems and high-dimensional data. The project will also provide both educational and interdisciplinary research opportunities for graduate students, and will result in software useful to biomedical researchers, economists, social scientists, and many other practitioners.

In a vast number of regression problems, especially in high-dimensional settings, the structure of the association between the covariates at hand and the target quantity of interest may be heterogeneous across observations, which calls for effective methods to detect such non-trivial structures. Standard procedures, including traditional variable selection, commonly overlook the interplay of these heterogeneous factors. This research project aims to develop statistical procedures that identify the complicated relationship between a response Y and a set of covariates X in flexible and computationally efficient ways. Project 1 focuses on Bayesian individualized variable selection (BIVS), which generalizes standard linear regression models to quantify heterogeneous effects among individual observations that differ in their dependent variables and in effect magnitudes. The PIs will investigate its theoretical properties, including model selection consistency and robustness when the model assumption is violated. Project 2 is devoted to the development of an efficient Bayesian method to infer the semi-parametric relationship between the response and the covariates through general index models. The PIs will explore its computational feasibility and theoretical properties, such as the posterior contraction rate for estimating the sufficient dimension reduction space. Project 3 focuses on a fast tuning parameter selection procedure that employs a generative process via neural networks. With this procedure, cross-validation can be implemented efficiently for general models, including BIVS and Bayesian index models, regularized variable selection, and nonparametric function estimation.
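
For readers unfamiliar with the term, a general index model of the kind referred to in Project 2 is commonly written in the form below. This is the standard textbook formulation, given here only as an illustrative assumption: the symbols f, β_k, d, p, and ε do not appear in the abstract, and the exact model studied by the PIs may differ.

```latex
% Illustrative (assumed) form of a general index model.
% f is an unknown link function, \varepsilon is noise independent of X,
% and the number of indices d is typically much smaller than p.
Y = f\left(\beta_1^{\top} X, \ldots, \beta_d^{\top} X, \varepsilon\right),
\qquad X \in \mathbb{R}^{p}, \quad d \ll p .
% The sufficient dimension reduction space is then
\mathcal{S} = \operatorname{span}\{\beta_1, \ldots, \beta_d\} .
```

Under this formulation, Y depends on X only through the d linear combinations, so estimating the sufficient dimension reduction space amounts to recovering the span of the index vectors rather than each vector individually.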

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 2015411
Program Officer: Pena Edsel
Project Start:
Project End:
Budget Start: 2020-08-15
Budget End: 2023-07-31
Support Year:
Fiscal Year: 2020
Total Cost: $120,000
Indirect Cost:
Name: Harvard University
Department:
Type:
DUNS #:
City: Cambridge
State: MA
Country: United States
Zip Code: 02138