The era of big data has introduced unprecedented computational and mathematical challenges. Traditional machine learning algorithms often scale poorly with data size, while modern approaches lack solid mathematical foundations. Moreover, high data dimensionality creates challenges for traditional methods of data analysis. The principal investigators (PIs) propose to combine classic dimension reduction methods with data-driven distances, so that both the distance and the embedding procedure are data dependent. This novel approach allows for greater flexibility in balancing the density-based and geometric features of the data, achieves a density-based simplification of geometry, and represents the data informatively in a small number of dimensions. In contrast to black-box methods such as deep learning, the developed methodology can be rigorously analyzed to derive strong theoretical guarantees for several statistical and machine learning tasks. This research will contribute computational tools for cancer immunogenomics, and the investigators will consult with the Rogel Cancer Center at the University of Michigan on scientific questions related to tumor immunology and T-cell biology. In addition, new data analysis tools will be made publicly available in an open-source software package.

The investigators' approach is driven by the analysis of a family of data-dependent path metrics. These metrics are both density-sensitive and geometry-preserving, with the balance governed by the choice of a single parameter p. By utilizing the space of paths through the data, the PIs will obtain density-based metrics and embeddings while avoiding the explicit computation of a density estimator, which may be unreliable in high dimensions. The PIs will propose a simple yet highly flexible data model that does not assume the data is sampled from a manifold or collection of manifolds, and will investigate the continuous limit of these metrics and of an associated graph Laplacian operator. By continuously varying the parameter p, the PIs propose to create data videos that represent the data from multiple perspectives. The PIs will investigate both multidimensional scaling and graph Laplacian embeddings as mechanisms for obtaining path-based low-dimensional representations, and will explore fast algorithms with scalable computational complexity for approximating these metrics. The PIs will contextualize path metrics in the larger framework of data-driven metrics and focus specifically on the analysis of biological data.
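
The abstract does not state a formula for the path metrics, so the following is only an illustrative sketch under an assumption: it uses one common power-weighted shortest-path form, in which the distance between two points is the minimum over paths through the data of the sum of Euclidean edge lengths raised to the power p, with the result raised to the power 1/p. The function name `path_metric` and the toy data are hypothetical, and the brute-force all-pairs search stands in for the fast approximation algorithms the project proposes to explore.

```python
import math

def path_metric(points, p):
    """All-pairs path distance, assuming the power-weighted form:
    (min over paths of the sum of edge_length ** p) ** (1 / p)."""
    n = len(points)
    # Edge cost on the complete graph: Euclidean length raised to the power p.
    cost = [[math.dist(a, b) ** p for b in points] for a in points]
    # Floyd-Warshall all-pairs shortest paths: O(n^3), suitable for toy data only.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via = cost[i][k] + cost[k][j]
                if via < cost[i][j]:
                    cost[i][j] = via
    return [[c ** (1.0 / p) for c in row] for row in cost]

# Five collinear points: a dense run of four, then one isolated point.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (6.0,)]
d1 = path_metric(points, 1)  # p = 1 reduces to the Euclidean distance
d2 = path_metric(points, 2)  # p = 2 rewards paths made of many short hops
```

In this toy example the pairs (0, 3) and (3, 6) are equally far apart in the Euclidean sense, but with p = 2 the pair connected through the dense run of points becomes closer than the pair separated by the gap. No density estimator is computed; the density sensitivity comes entirely from the paths available through the data.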

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1912906
Program Officer: Yuliya Gorb
Budget Start: 2019-07-15
Budget End: 2021-06-30
Fiscal Year: 2019
Total Cost: $150,000
Name: Michigan State University
City: East Lansing
State: MI
Country: United States
Zip Code: 48824