Big Data is an area of intense current interest in statistical research and practice due to the rapid development of information technologies and their applications to modern scientific experiments. High-dimensional statistical methods typically provide crucial elements and ideas in engineering solutions for complex Big Data problems. Important fields with an abundance of such problems include bioinformatics, signal processing, neural imaging, communications and social networks, text mining and more. In many such applications, the nominal complexity of the problem is much greater than the number of sample points or the information content of the data. This complexity is typically measured by the dimension of the data, such as genetic components in bioinformatics, brain regions or voxels in neural imaging, or computers and routers in the Internet. The research project will identify and characterize high-dimensional statistical models and problems in which efficient statistical inference is feasible, and will develop new methodologies and algorithms to carry out such inference with high-dimensional data. The proposed research is motivated by and will be directly applicable to real-life problems in the aforementioned areas where modern information technologies prosper. Furthermore, the proposed research will have significant educational impact.

A longstanding challenge in high-dimensional data is to identify problems where regular statistical inference is feasible without relying on model selection consistency theory. Consistent model selection allows reduction of the nominal complexity of the problem to a manageable level by identifying all relevant features. However, model selection consistency typically requires uniformly strong signals to separate relevant features from irrelevant ones. Unfortunately, such a uniform signal strength assumption is seldom supported by either the data or the underlying science, especially in biological, medical and sociological applications. The PI has proposed a semi-low-dimensional approach to statistical inference and successfully applied it to construct regular p-values and confidence intervals in high-dimensional regression and graphical models. This approach corrects the bias of model selectors just as the semiparametric approach corrects the bias of nonparametric estimators. The proposed research will further develop this approach in high-dimensional data analysis and tackle new problems in ways not visible just a few years ago. It will focus on efficient statistical inference with semisupervised data and problems involving many high-dimensional or complex components, including confidence regions and significance tests for composite and multivariate features with high-dimensional data. The project will develop practical methods, efficient algorithms, statistical software, and solid theory directly relevant to common applications involving many high-dimensional or complex components.

National Science Foundation (NSF)
Division of Mathematical Sciences (DMS)
Program Officer: Gabor J. Szekely
Rutgers University
United States