Regression models are widely used to investigate associations between a set of predictor variables, the so-called covariates, and an outcome variable. Estimates of regression coefficients and their confidence intervals provide useful information, for example, about the importance of certain genetic variants to lung cancer, or about brain regions associated with memory loss in an aging population. With the advent of the big data era, regression models with many covariates have been widely used to tackle important scientific problems in areas such as genomics, neuroimaging, business, engineering, information technology, and other biomedical studies; sometimes the number of covariates (e.g., genetic variants) even exceeds the sample size (e.g., the number of study participants). Making statistical inference (i.e., constructing confidence intervals for regression coefficients) on a large number of covariates becomes challenging because conventional methods such as maximum likelihood estimation may either fail to exist or yield biased estimates. It has been shown in recent years that the regression coefficients can be estimated by regularized methods, e.g., the lasso approach. However, it is also well known that regularized methods yield biased estimates and thus cannot be used directly for statistical inference, in particular for constructing confidence intervals. Researchers have shown that proper statistical inference can be made in linear regression models after implementing a clever de-biasing procedure; however, the de-biased method has been found not to work beyond linear models. Without imposing restrictive assumptions, this project will develop theory and methods for generalized linear models and the Cox regression model with a large number of covariates, as well as for functional regression models with applications in brain imaging studies.
Proper distributional theory and confidence intervals will be provided, which will lead to more reliable results in scientific research.

The existing de-biased methods do not successfully correct the bias in nonlinear models, e.g., generalized linear models or the Cox model, leading to poor statistical inference. The main causes of the problem are the unrealistic sparsity assumption imposed on the inverse expected Hessian matrix and the fact that the "negligible" terms in existing de-biased methods are in fact not negligible. This project will consider two methods that further de-bias the lasso estimators without relying on the assumption of a sparse inverse expected Hessian matrix: (i) directly inverting the Hessian matrix when the number of regression parameters is less than the sample size; and (ii) eliminating the major bias term without inverting the Hessian matrix, via a quadratic programming approach, which can potentially handle more regression parameters than observations. Additional challenges arise in Cox regression with high-dimensional covariates, where the partial-likelihood-based loss functions across observations are not i.i.d. and each loss function is not Lipschitz. The proposed method will approximate the loss function to yield i.i.d. losses, and it will be extended to handle multivariate and clustered survival data with even more complicated loss functions. For brain imaging data, a functional regression model using a Haar wavelet basis will be investigated. The major added challenge is to characterize the impact of the Haar-wavelet approximation error on the asymptotic distribution of the refined de-biased functional estimator.
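To make the de-biasing idea behind direction (i) concrete, here is a minimal sketch in the simplest linear-model setting, where the Hessian of the least-squares loss is the sample covariance X'X/n and can be inverted directly when the number of parameters is smaller than the sample size. The coordinate-descent lasso solver, the tuning constant, and the simulated data are illustrative assumptions, not the project's actual implementation:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso (illustrative, not optimized)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # partial residual with coordinate j added back
            r_j = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r_j / n
            # soft-thresholding update
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
    return beta

def debias_direct_inverse(X, y, beta_lasso):
    """One-step de-biasing: invert the sample Hessian X'X/n directly (p < n),
    with no sparsity assumption on its inverse."""
    n = X.shape[0]
    hessian = X.T @ X / n
    theta = np.linalg.inv(hessian)  # direct inversion, method (i)
    correction = theta @ (X.T @ (y - X @ beta_lasso)) / n
    return beta_lasso + correction

# Small simulation: sparse truth, p < n
rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.standard_normal(n)

b_lasso = lasso_cd(X, y, lam=0.2)            # biased toward zero
b_debiased = debias_direct_inverse(X, y, b_lasso)
```

The de-biased estimate adds back a correction built from the score of the loss, removing the shrinkage bias of the lasso so that its coordinates become approximately normal, which is what makes confidence intervals possible. The project's contribution lies in extending this logic to non-i.i.d., non-Lipschitz losses where the naive version breaks down.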

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1915711
Program Officer: Pena Edsel
Project Start:
Project End:
Budget Start: 2019-10-01
Budget End: 2022-09-30
Support Year:
Fiscal Year: 2019
Total Cost: $199,994
Indirect Cost:
Name: University of California Irvine
Department:
Type:
DUNS #:
City: Irvine
State: CA
Country: United States
Zip Code: 92697