Current approaches to big data, and the computational methods that accompany them, have left behind critical applications in which the data consist not of individual points but of whole geometric objects. Such applications include medical imaging, LiDAR for self-driving cars, and single-cell RNA sequencing, to name a few. Transferring the overwhelming success of simpler data processing and statistical techniques to this regime requires not only large datasets, but also suitable models and algorithms for the analysis of this more general type of data. The theory of optimal transport has proven valuable in addressing these limitations, thanks to recent advances on the computational front. Yet the understanding of optimal transport as a statistical tool is still in its infancy. This project aims to develop a "geometric data analysis" toolbox based on optimal transport to tackle these new datasets. This proposal will help create a common language for interaction and collaboration across disciplines. Much of this research will be integrated into the curriculum and made available through MIT OpenCourseWare. This proposal will also enable rich interdisciplinary training of PhD and undergraduate students.
The proposed methods are built around the rich mathematical theory of optimal transport (OT). This theory provides a framework for developing new methods for geometric data analysis, as well as for their rigorous statistical and computational analysis. The nascent theory of computational optimal transport is still largely dissociated from statistics, and many methods do not properly account for sampling and measurement noise. To avoid the pitfalls of overfitting, this proposal systematically takes a statistical approach to geometric data analysis. By building an understanding of the theoretical advantages and drawbacks of OT for statistical modeling, it will lead to scalable OT algorithms with strong statistical guarantees. A tangible outcome of this proposal is a cohesive toolbox that extends averaging, regression, classification, clustering, and other notions from classical statistics in a way that captures global geometric features of data. It will have a direct impact on applications ranging from the analysis of medical images to point clouds gathered by LiDAR for self-driving cars, sequences of gene expressions produced by single-cell RNA sequencing, and other diverse, large-scale sources of data. These datasets contain millions of entities but resist the application of standard statistical procedures; current state-of-the-art techniques for their analysis are ad hoc, not generalizable, and fail to reach the quality achieved by "big data" tools in other domains. Educational impact will come from incorporating this work into new degree programs in statistics at MIT.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.