The challenge of high dimensionality characterizes many contemporary statistical problems arising at the frontiers of scientific research and technological development. In high-dimensional statistical research, low-dimensional structures, which entail sparsity under a suitable parametrization, must be explored in order to circumvent the accumulation of noise as dimensionality grows. This proposal intends to confront a number of important high-dimensional statistical problems from genomics, machine learning, health studies, economics, and finance. These include various emerging issues in the analysis of microarray data, such as normalization, significance analysis, and disease classification; variable selection and feature extraction in high-dimensional statistical learning; sparse classification and clustering in high-dimensional feature spaces; high-dimensional covariance matrix estimation for asset allocation and portfolio management; and sparse covariance estimation for spatial and temporal studies and genetic networks. Each of these problems has distinct characteristics arising from the context of its application, but all share similar challenges of high dimensionality and admit features of sparsity. These emerging problems of high societal impact will be confronted by developing new statistical methods that address the features and challenges associated with high dimensionality, from statistical computation and feature selection to noise reduction. At the same time, the PI also intends to provide a fundamental understanding of these problems and their associated methodologies, via asymptotic analysis and simulation studies, in a way that pushes theory, methods, and computation forward.

Thanks to technological innovation, large-scale and complex data are now widely available in many contemporary scientific problems. High-dimensional statistical models are required to address these scientific endeavors. The challenges of high dimensionality arise in diverse fields of the sciences and the humanities, ranging from genomics and health sciences to economics and finance. In these fields, variable selection, feature extraction, and the exploration of sparsity are crucial for knowledge discovery. In this proposal, we propose to develop cutting-edge statistical theory and methods to address these problems in genomic studies, machine learning, health science, economics, and finance. The proposed techniques and results will not only help researchers solve emerging problems in their disciplines, but will also have a strong impact on statistical thinking, methodological development, and theoretical studies.

Project Report

, which are critical to many contemporary statistical problems arising at the frontiers of scientific research and technological development, such as genomics, genetics, and big data analytics. Motivated by the need to select genes, SNPs, or gene-gene interactions that are relevant to a disease or a biological process, we have systematically developed the framework of sure independent screening and introduced two-scale screening and selection techniques for ultrahigh-dimensional variable selection. These techniques have been further extended to binary data, survival data, and nonparametric models. We have also developed theory and methods for high-dimensional variable selection and inference, establishing strong optimality properties, dealing with outliers and heavy-tailed errors, and improving the statistical and computational efficiency of high-dimensional statistical procedures. In addition, we have invented various techniques for estimating large covariance matrices and their inverses. These are applied to extracting latent factors in finance and genomics, and to building graphical models for understanding genomic networks and biological processes. Moreover, we have developed a new framework for false discovery control in large-scale multiple testing problems with correlated tests, further leveraging our work on large covariance modeling. Furthermore, we have developed various new high-dimensional classification techniques and theory, which are applied to disease classification and to understanding biological processes. The outcomes of this project also include several useful software packages that are now publicly available. The package SIS is implemented in R. It provides an iterative large-scale screening and moderate-scale selection technique that permits one to analyze high-throughput data with binary, discrete, or continuous responses. The aim of the software is to effectively select useful genes or proteins that are associated with clinical and biological outcomes.
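The large-scale screening step described above can be illustrated with a minimal sketch. It is written in plain Python rather than R, it is not the SIS package's code or interface, and it covers only a single screening pass, leaving out the iterative refinement and the moderate-scale selection step. The function names, the simulated data, the coefficient value 3.0, and the cutoff d = n / log(n) (a commonly cited choice for the number of retained variables) are all assumptions made for illustration.

```python
import math
import random


def marginal_correlations(X, y):
    """Absolute Pearson correlation between each column of X and the response y."""
    n = len(y)
    p = len(X[0])
    y_mean = sum(y) / n
    y_centered = [v - y_mean for v in y]
    y_norm = math.sqrt(sum(v * v for v in y_centered))
    scores = []
    for j in range(p):
        col = [row[j] for row in X]
        col_mean = sum(col) / n
        col_centered = [v - col_mean for v in col]
        col_norm = math.sqrt(sum(v * v for v in col_centered))
        if col_norm == 0.0 or y_norm == 0.0:
            scores.append(0.0)  # a constant column carries no marginal signal
        else:
            dot = sum(a * b for a, b in zip(col_centered, y_centered))
            scores.append(abs(dot) / (col_norm * y_norm))
    return scores


def sis_screen(X, y, d):
    """Keep the indices of the d variables with the largest marginal correlation."""
    scores = marginal_correlations(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:d])


def simulate(n=100, p=500, active=(0, 1, 2), seed=0):
    """Hypothetical sparse linear model: only the `active` columns affect y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [sum(3.0 * row[j] for j in active) + rng.gauss(0, 1) for row in X]
    return X, y


X, y = simulate()
d = int(len(y) / math.log(len(y)))  # assumed cutoff: n / log(n)
kept = sis_screen(X, y, d)
print(f"kept {len(kept)} of {len(X[0])} variables:", kept)
```

In this sketch the screening reduces 500 candidate variables to a set smaller than the sample size, after which a moderate-scale selection method (for example, a penalized regression) could be run on the retained columns.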
The R package PFA allows one to control the false discovery proportion in the large-scale correlated tests that are frequently used to select useful genes and proteins in genomics, as well as SNPs in genetics. The R package POET is designed for estimating large covariance matrices in approximate factor models by thresholding principal orthogonal complements. Its output can be used to extract latent factors, to build network graphs, and to carry out further statistical inference. This research project has yielded fruitful outcomes. All of the proposed research objectives have been achieved. The project resulted in 45 research articles published in top scientific journals, 5 invited discussions and commentaries, and 2 manuscripts that have been submitted for publication. Twelve postdoctoral fellows were trained with the benefit of this research. Ten of them took tenure-track assistant professor positions, and two went to industry. Fifteen Ph.D. students were trained as part of their pursuit of the Ph.D. degree, and all of them have successfully defended their Ph.D. theses. Among those, nine obtained tenure-track assistant professor positions in the United States, and six went to industry. In addition, 24 senior theses have been supervised with the benefit of this NSF project.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Application #
0704337
Program Officer
Gabor J. Szekely
Project Start
Project End
Budget Start
2007-06-15
Budget End
2014-05-31
Support Year
Fiscal Year
2007
Total Cost
$920,001
Indirect Cost
Name
Princeton University
Department
Type
DUNS #
City
Princeton
State
NJ
Country
United States
Zip Code
08540