Unsupervised learning can be described as the task of finding structure from unlabeled observations. This is a fundamental task in scientific fields like biology, material science, economics, and astronomy, where in a typical scenario researchers possess a large number of unlabeled data, but only a few (or none) expert labeled data points, as these are typically very expensive to obtain. This project focuses on the mathematical analysis of data science algorithms for unsupervised learning. Broadly speaking, the aim of the investigation is to contribute to the development of mathematical foundations for data science and artificial intelligence by providing rigorous mathematical theory supporting methodologies used in practice, as well as by proposing principled novel ones. The investigator will not only study individual algorithms, but find common structure between them and highlight their otherwise unclear unity.
Among the many methodologies for unsupervised learning found in the literature, graph-based methods occupy a prominent role, and they will be the focus of this investigation. These are algorithms that rely on having access to similarity graphs, which endow data sets with a geometric structure. Different graphs capture different features of a given data set and hence are useful for different tasks. The project will focus on the following concrete directions: 1) the proposal and analysis of new algorithms for unsupervised learning inspired by variational methods for evolution on metric spaces and in particular on the space of probability measures over discrete and semi-discrete structures. In this direction the main goal is to introduce a mathematical formalism for the design of new learning methodologies, and to use this formalism in order to build connections between seemingly unrelated statistical procedures like those inspired by mode-seeking techniques as the ones based on spectral methods. Some of the applications in mind go beyond unsupervised learning, and also include the systematic testing of deep neural network architectures. The second direction concerns the study of statistical properties of graph-based unsupervised learning algorithms in settings where graphs capturing the geometry of a data set are very sparse (below standard connectivity), or where data generating distributions are degenerate and violate standard geometric assumptions found in the literature. In this direction, the goal is to identify the limits of sparsity that allow for meaningful learning, as well as to develop quantitative statistical error analysis for solutions to non-convex optimization problems on graphs built on data sets, for graphs that are connected to a ground truth geometric model.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.