The proposed research is motivated by the discrimination task for high dimension, low sample size data. The investigator studies the intrinsic difficulties of the discrimination problem by exploring the asymptotic geometric structure of such data. Three main activities are proposed: a) the asymptotic inconsistency of leave-one-out cross-validation, where the study is expected to explain why it fails when the number of variables greatly exceeds the number of observations; b) the effect of the relationship between the dimensionality and the sample size on the difficulty of the discrimination task; c) a discriminant direction vector that exists only for data with high dimension and low sample size. Projected onto this direction vector, the data points collapse within each group while the groups are most separated. The investigator explores its theoretical and empirical properties, such as its optimality, uniqueness, and asymptotic performance.
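
The collapse described in item c) can be illustrated with a small numerical sketch. It assumes the direction is the minimum-norm solution of a centered least-squares problem, a construction often called maximal data piling; whether this matches the investigator's definition exactly is an assumption. The function name `piling_direction`, the sample sizes, and the pure-noise data are hypothetical choices made for illustration only.

```python
import numpy as np

def piling_direction(X, y):
    """Return a unit vector w on which each class projects to a single point.

    Assumes X has shape (n, d) with d >= n and rows in general position,
    and y holds labels coded as -1.0 / +1.0 (hypothetical conventions).
    """
    Xc = X - X.mean(axis=0)   # center the predictors
    yc = y - y.mean()         # center the labels
    # Minimum-norm solution of Xc @ w = yc; an exact solution exists
    # because the number of variables is at least the number of observations.
    w, *_ = np.linalg.lstsq(Xc, yc, rcond=1e-10)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(0)
n, d = 20, 500                        # low sample size, high dimension
y = np.repeat([-1.0, 1.0], n // 2)
X = rng.standard_normal((n, d))       # pure noise: no true group difference

w = piling_direction(X, y)
proj = X @ w

# Within-group range of the projections and the gap between group means.
spread_neg = np.ptp(proj[y < 0])
spread_pos = np.ptp(proj[y > 0])
gap = proj[y > 0].mean() - proj[y < 0].mean()
print(spread_neg, spread_pos, gap)
```

In this sketch the within-group spreads are zero up to rounding error while the gap is positive, even though the data carry no real group difference. This is one way to see why apparent separation of the training data can be misleading when the number of variables greatly exceeds the number of observations.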

Even though these topics are only loosely related to one another in their technical aspects, their goals are essentially the same: exploring the nontraditional and unique challenges of high dimension, low sample size discrimination. Although this has been an actively researched area in recent years, the understanding of the fundamental challenges of high dimension, low sample size problems is not yet satisfactory. This research approaches the problem in a way that may be regarded as atypical in a traditional sense but is more relevant to the problem itself. Applications of the proposed research include text document classification such as spam email filtering, medical imaging such as functional magnetic resonance imaging, and bioinformatics such as microarray gene expression and proteomics.

Project Report

The central finding of the first part of the project is the discovery of a surprising connection between fractal behavior of predictors and rates of statistical learning. This discovery has important implications for understanding statistical learning in a wide variety of complex adaptive systems with fractal behavior, for example: patterns of gene expression related to disease incidence, sensitive periods of development in life course epidemiology, biological neural networks, and social networks with small-world dynamics. Specifically, it was shown that the rates of learning in such systems, modeled using a new type of functional regression, are determined by the Hurst parameter, an exponent of self-similarity scaling. In addition, it was shown that a certain type of bootstrap resampling naturally adapts to the full range of this fractal behavior, without requiring prior knowledge of the Hurst parameter. The framework for developing these results involved a flexible new class of functional regression models that provide inference for sensitive locations, such as the location of a gene related to disease incidence. The major innovation is that features describing sensitive locations are treated as the main targets of statistical inference, in contrast to earlier approaches to functional data analysis in which features involving the point impact of a predictor were ignored.

The second part of the project has contributed to the development of statistical methods in physical oceanography, namely methods for the estimation of steady-state ocean circulation from tracer data. Deep ocean circulation is difficult to measure directly, but it is possible to infer steady-state flow indirectly from measurements of tracers such as oxygen, salinity, and silica. The statistical problem is to estimate the circulation in a given layer of the ocean based on noisy tracer measurements at a sparse set of locations in that layer. A Bayesian approach to ill-posed inverse problems was adopted, with regularization imposed by specifying a prior distribution on the water flow across the layer. Simulation techniques were developed to extract information about the resulting posterior distribution. The main contribution is that important convergence properties of these simulation methods, crucial for validating the accuracy of the reconstructed flow field, have been established for the first time.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Application #
0806088
Program Officer
Gabor J. Szekely
Project Start
Project End
Budget Start
2008-07-01
Budget End
2012-06-30
Support Year
Fiscal Year
2008
Total Cost
$190,592
Indirect Cost
Name
Columbia University
Department
Type
DUNS #
City
New York
State
NY
Country
United States
Zip Code
10027