The dramatic improvements in data collection and acquisition technologies over the last decades have enabled scientists to collect massive amounts of high-dimensional data that allow for monitoring and studying of complex systems. Due to their intrinsic nature, many of the high-dimensional datasets, such as omics and genome-wide association study (GWAS) data, have a much smaller sample size compared to the dimension (referred to as the small-n-large-P problem). Current research on statistical modeling of small-n-large-P data focuses on linear and generalized linear models. However, these approaches are often not adequate for modeling complex systems, and estimation of the model parameters is challenging. This project addresses two fundamental problems, statistical modeling and parameter estimation, toward a valid statistical analysis of high-dimensional data. Successful completion of this project will generate hands-on tools for statistical inference of high-dimensional complex systems, which can benefit researchers in many areas of science and technology. In particular, the proposed applications to biomedical studies will lead to accurate tools for detecting biomarkers associated with disease processes and tailoring optimal therapy for individual patients with complex diseases. The research results will be disseminated to the statistical and biomedical communities, via collaboration, conference presentations, books, and articles to be published in academic journals. The project will also have significant impact on education through the involvement of graduate students in the project, and incorporation of results into undergraduate and graduate courses. In addition, the R package developed under this project will provide a valuable tool for statistical analysis of high-dimensional data.

The current approach to modeling small-n-large-P data focuses on linear and generalized linear models, and casts the problem as variable selection by imposing a sparsity constraint on parameter values. Although these models have many advantages, such as simplicity and computational efficiency, estimation of the parameters is still a challenging problem. While regularization is often used in these situations, it can perform poorly when the sample size is small and the variables are highly correlated. Two new methods are proposed to address these concerns, namely, Bayesian neural network (BNN) and blockwise coordinate consistency (BCC). The BNN method works by first fitting the data with a feed-forward neural network, conducting variable selection through network structure selection under a Bayesian framework, and resolving the associated computational difficulty via parallel computing. Compared to existing methods, BNN can lead to much more precise selection of relevant variables and outcome prediction for high-dimensional nonlinear systems. The BCC method works by maximizing a new objective function, the expectation of the log-likelihood function, using a cyclic algorithm and iteratively finding consistent estimates for each block of parameters conditional on the current estimates of the other parameters. The BCC method reduces the high-dimensional parameter estimation problem to a series of low-dimensional parameter estimation problems. The preliminary results indicate that BCC can provide a drastic improvement in both parameter estimation and variable selection over regularization methods. The validity of the proposed methods will be rigorously studied and applied to biomarker discovery, precision medicine, and joint estimation of the regression coefficients and precision matrix for high-dimensional multivariate regression.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Type
Standard Grant (Standard)
Application #
1818674
Program Officer
Gabor Szekely
Project Start
Project End
Budget Start
2017-09-06
Budget End
2019-08-31
Support Year
Fiscal Year
2018
Total Cost
$120,658
Indirect Cost
Name
Purdue University
Department
Type
DUNS #
City
West Lafayette
State
IN
Country
United States
Zip Code
47907