This research will leverage ideas from algebraic and differential geometry to address core problems in modern data science for high-dimensional and massive data. The project will develop statistical methods and numerical tools, grounded in solid mathematical, statistical, and computational foundations, to extract low-dimensional geometry from massive data, with applications in clustering, data summarization, prediction, dimension reduction, and visualization. The solutions developed as part of this project can result in fundamental advances in practical applications across fields as diverse as biology, medicine, social sciences, communication networks, and engineering. In addition to internal validation via statistical and mathematical theory and simulation studies, the methods developed in the project will be validated externally through interdisciplinary applications. These applications include: (1) inference of population structure from genomic data; (2) document analysis via topic models; and (3) inference of subsets of putative gene networks relevant to drug resistance in melanoma.

The research is motivated by the central premise that, even though the amount of data may be massive, a compact model can represent these data. Specifically, high-dimensional and/or massive data can be reasonably approximated by a mixture of subspaces, for which sparse representations exist. A mixture of subspaces of potentially different dimensions is a flexible, rich representation of data with favorable mathematical properties, and it can scale to large data. There are several fundamental challenges in modeling mixtures of subspaces that will be addressed in this research: (1) the subspaces will be of different dimensions; (2) both the subspace parameters and the mixing parameters need to be inferred; and (3) efficient algorithms for inference are required for both high-dimensional and massive data. The central foundational impediment in all of these challenges is that the model space is a stratified space (a union of manifolds), and therefore has singularities. The key insight in this research is that there exist embeddings and representations of the model space that mitigate these singularities. These ideas will be implemented as concrete Bayesian, frequentist, and numerical algorithms and models to address the real-world applications listed above.
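
To make the mixture-of-subspaces idea concrete, the following is a minimal NumPy sketch, not the project's method. It draws noise-free points from two linear subspaces of different dimensions (a line and a 3-dimensional subspace in a 10-dimensional ambient space), assigns each point to the subspace with the smallest projection residual, and refits each basis by SVD. The function names, the dimensions, and the use of projection matrices `U @ U.T` to compare subspaces of different dimensions are all illustrative assumptions; the abstract does not say which embedding or inference algorithm the project uses, and the Bayesian mixing weights are omitted entirely.

```python
import numpy as np

def orthonormal_basis(A):
    """Orthonormal basis (as columns) for the column space of A, via reduced QR."""
    Q, _ = np.linalg.qr(A)
    return Q

def residuals(X, bases):
    """Distance from each row of X to each subspace; shape (n_points, n_subspaces)."""
    out = np.empty((X.shape[0], len(bases)))
    for k, U in enumerate(bases):
        proj = X @ U @ U.T  # orthogonal projection of every row onto span(U)
        out[:, k] = np.linalg.norm(X - proj, axis=1)
    return out

def assign(X, bases):
    """Hard assignment: index of the nearest subspace for each point."""
    return np.argmin(residuals(X, bases), axis=1)

def fit_basis(X, dim):
    """Best-fit dim-dimensional subspace through the origin (top right singular vectors)."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:dim].T

# Hypothetical setup: ambient dimension 10, subspaces of dimension 1 and 3.
rng = np.random.default_rng(0)
ambient, dims, n_per = 10, (1, 3), 200
true_bases = [orthonormal_basis(rng.standard_normal((ambient, d))) for d in dims]

# Noise-free samples: random coefficients times each basis.
X = np.vstack([rng.standard_normal((n_per, d)) @ U.T for d, U in zip(dims, true_bases)])
labels = np.repeat(np.arange(len(dims)), n_per)

pred = assign(X, true_bases)
refit = [fit_basis(X[pred == k], d) for k, d in enumerate(dims)]

# Bases are only defined up to rotation, so compare projection matrices instead.
same_subspace = [np.allclose(R @ R.T, U @ U.T) for R, U in zip(refit, true_bases)]
```

Alternating the `assign` and `fit_basis` steps from a random start gives a K-subspaces style heuristic. That heuristic still needs the dimensions to be supplied and can stall in local optima, which is one way to see why inference for subspaces of unknown, differing dimensions is the hard part described above.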

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1546132
Program Officer: Nandini Kannan
Budget Start: 2015-12-01
Budget End: 2018-11-30
Fiscal Year: 2015
Total Cost: $322,242

Name: Duke University
City: Durham
State: NC
Country: United States
Zip Code: 27705