The proposed research develops new practical methodology and algorithms, as well as theoretical results, for high-dimensional data classification. The central difficulty is that when the number of measured variables far exceeds the number of observations, the full covariance matrix cannot be estimated accurately. The investigator has previously shown that in this situation completely ignoring the dependence among variables is a better option than attempting to estimate the full dependence structure, but doing so discards substantial information. This problem will be addressed by developing new sparse covariance estimators that retain only the important dependence information, and by studying their behavior theoretically in the context of discriminant analysis. New practical, computationally efficient algorithms for computing these estimators from data will be developed on the basis of clustering and graph partitioning methods and compared with existing regularization techniques. On the theoretical side, the resulting classifiers are expected to achieve asymptotically optimal or near-optimal classification performance.

Another way to reduce the size of the problem and uncover the underlying structure is to reduce the data dimension using recently developed nonlinear manifold projection methods, which seek a nonlinear low-dimensional embedding that preserves most of the information contained in the data. Applying these methods requires estimating the intrinsic dimension of the data and the neighborhood scale on the manifold; new rigorous estimators for both quantities are proposed, along with an analysis of their statistical properties. Careful estimation of these two parameters will improve on the largely heuristic methods currently used in machine learning and broaden the applicability of manifold projection methods to high-dimensional data classification.
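To make the sparse-estimation idea concrete, the following is a minimal sketch of one simple member of this family: a hard-thresholding covariance estimator combined with a Fisher-type discriminant. The thresholding rule, the small ridge term added for invertibility, and all function names are illustrative assumptions, not the proposal's actual estimators.

```python
import numpy as np

def threshold_covariance(X, t):
    """Hard-threshold the sample covariance of X (n samples x p variables):
    off-diagonal entries below t in absolute value are zeroed out, so only
    the strongest dependence information is retained (illustrative rule)."""
    S = np.cov(X, rowvar=False)
    S_t = np.where(np.abs(S) >= t, S, 0.0)
    np.fill_diagonal(S_t, np.diag(S))  # variances are always kept
    return S_t

def lda_direction(X0, X1, t, ridge=1e-3):
    """Fisher-type discriminant direction w = Sigma^{-1}(mu1 - mu0), with
    Sigma replaced by the thresholded pooled covariance; a small ridge
    keeps the linear system solvable when p is large relative to n."""
    pooled = np.vstack([X0 - X0.mean(axis=0), X1 - X1.mean(axis=0)])
    S_t = threshold_covariance(pooled, t)
    S_t += ridge * np.eye(S_t.shape[0])
    return np.linalg.solve(S_t, X1.mean(axis=0) - X0.mean(axis=0))
```

Setting t = 0 recovers ordinary (ridge-stabilized) linear discriminant analysis, while a very large t zeroes all off-diagonal entries and recovers the independence rule that ignores dependence completely; the sparse estimators interpolate between these extremes.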
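The intrinsic dimension of data lying near a manifold can be estimated from nearest-neighbor distances; below is a minimal sketch of one maximum-likelihood-style estimator of this kind, in which the neighborhood size k is exactly the sort of scale parameter whose principled choice the proposal targets. The estimator form and all names here are illustrative assumptions.

```python
import numpy as np

def intrinsic_dimension_mle(X, k=10):
    """Estimate the intrinsic dimension of X (n points x p coordinates):
    locally, the number of neighbors within radius r grows like r^d, and
    d is estimated by averaging inverse mean log distance ratios over the
    k nearest neighbors of each point (illustrative sketch)."""
    n = X.shape[0]
    # full pairwise Euclidean distance matrix (fine for moderate n)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)  # a point is not its own neighbor
    local = np.empty(n)
    for i in range(n):
        d = np.sort(D[i])[:k]  # distances to the k nearest neighbors
        local[i] = (k - 1) / np.sum(np.log(d[-1] / d[:-1]))
    return float(np.mean(local))
```

On data sampled from a d-dimensional set embedded in a higher-dimensional space, estimates of this type concentrate near d as the sample size grows; their bias and variance as functions of the neighborhood size are the kind of statistical properties the proposed analysis would quantify.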
This proposal addresses the new challenges posed by the massive amounts of data collected in the modern world by developing new theoretical and practical tools for high-dimensional data, particularly the situation in which the number of measurements taken for each observation is large relative to the number of observations. New sparse estimators of the dependence structure in such data are developed, retaining only the information relevant for data classification. These estimators can also be used in any problem where large covariance matrices must be estimated from a limited amount of data, and hence will have an impact on a wide range of modern applications, including classification and analysis of gene expression data, analysis of complex chemical and physical experiments, remote sensing, and medical imaging, among others.