The proposed research is motivated by the discrimination problem for high dimension, low sample size data. The investigator studies the intrinsic difficulties of the discrimination problem by exploring the asymptotic geometric structure of such data. Three main activities are proposed: a) a study of the asymptotic inconsistency of leave-one-out cross-validation, which is expected to explain why it fails when the number of variables greatly exceeds the number of observations; b) a study of how the relationship between the dimensionality and the sample size affects the difficulty of the discrimination task; and c) a study of a discriminant direction vector that exists only for data with high dimension and low sample size. The data points collapse onto this direction vector and are also maximally separated by group label. The investigator plans to study the theoretical and empirical properties of this direction, such as its optimality, uniqueness, and asymptotic performance.
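
The collapsing behavior described in item c) can be illustrated numerically. The following is a minimal sketch, not necessarily the investigator's procedure: it assumes that one direction with this property can be obtained as the minimum-norm least-squares solution that maps the centered data to the centered group labels. The function name `piling_direction`, the simulated Gaussian data, and the mean shift are all hypothetical choices made for illustration.

```python
import numpy as np


def piling_direction(X, y):
    """Return a unit vector w such that the projections X @ w take
    (numerically) one value per group.

    X : (n, d) data matrix with d > n (high dimension, low sample size).
    y : (n,) group labels coded as -1.0 and +1.0.

    Assumption: the rows of X are in general position, so the centered
    system Xc @ w = yc has an exact solution.
    """
    Xc = X - X.mean(axis=0)   # center each variable
    yc = y - y.mean()         # center the labels
    # Minimum-norm solution of the underdetermined system Xc @ w = yc.
    w, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    return w / np.linalg.norm(w)


# Hypothetical simulated data: 20 observations, 500 variables.
rng = np.random.default_rng(0)
n, d = 20, 500
y = np.repeat([-1.0, 1.0], n // 2)
X = rng.standard_normal((n, d))
X[y > 0, 0] += 1.0            # small mean shift in the first variable

w = piling_direction(X, y)
proj = X @ w

# Range of the projections within each group, and the gap between groups.
spread_neg = np.ptp(proj[y < 0])
spread_pos = np.ptp(proj[y > 0])
gap = proj[y > 0].mean() - proj[y < 0].mean()
print(spread_neg, spread_pos, gap)
```

In this sketch the within-group spread of the projections is zero up to floating-point error while the gap between the groups stays positive, so the training error along this direction is zero. Such an exact solution is available only because the number of variables exceeds the number of observations.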

The overall goal is to investigate the nontraditional and unique challenges in high dimension, low sample size discrimination. The proposed approach may be regarded as atypical, but it is more relevant to the problem itself. Applications of the proposed research include text document classification such as spam email filtering, medical imaging such as functional magnetic resonance imaging, and bioinformatics such as microarray gene expression and proteomics.

Project Report

This project contributes to the statistics community by increasing awareness of various challenges in the analysis of data sets with more variables than observations. Unlike the traditional low-dimensional multivariate data sets found in common textbooks, such data sets can have counter-intuitive characteristics. In classification, for example, the discriminant direction vector with zero training error shows that something seemingly artificial can work surprisingly well in high dimensional settings. This project also aims to contribute to the field by developing data analysis methods suitable for high dimension, low sample size data, where traditional approaches no longer work. The new classification methods that control the variance of the projected data are a good example.

We have also focused on some important high dimensional data sets, such as astronomy data and functional magnetic resonance imaging data, since these types of data are special even among high dimensional data analysis problems. In astronomy, the study of variable stars, i.e., stars characterized by significant variation in their brightness over time, has made crucial contributions to our understanding of many fields, from stellar birth and evolution to the calibration of the extragalactic distance scale. We performed a time series analysis of the periods between maximum brightness of a group of 378 long period variable stars. Data from functional magnetic resonance imaging (fMRI) have many (around 60,000) spatially correlated dimensions (voxels) and typically very small sample sizes, often fewer than 20, while the observations are measured over time. For these reasons, the analysis of fMRI data is one of the most important problems in high dimensional data analysis. We conducted an investigation of the null hypothesis distribution for fMRI data using multi-scale analysis.

One of the most important lessons we have learned during the project is that high dimensionality does not always pose challenges. The surplus of dimensions (variables) can be translated into more leverage in terms of the modes of regularization or model selection. Another important lesson is that relevant non-classical asymptotics for increasing dimensionality are crucial to justify the methods.

National Science Foundation (NSF)
Division of Mathematical Sciences (DMS)
Program Officer
Gabor J. Szekely
University of Georgia
United States