The clustering procedure is ubiquitous in science and engineering, especially in analysis of massive data sets and complex networks. The purpose of clustering is to divide a given data set into groups of similar items, called clusters. Although many heuristics for clustering exist (and are widely used), the theoretical properties of this learning task are less well-understood. Relatively few theoretical analyses have been performed establishing conditions under which we may expect to successfully cluster data. The goal of this project is to establish realistic theoretical guarantees for clustering, both in terms of data amenable to clustering as well as in the development of effective, computationally efficient algorithms. The project will facilitate interdisciplinary research via applications in image processing, astronomy and physics, mathematical biology, social network analysis, and high-dimensional statistics. The project will also provide educational opportunities for graduate and undergraduate students through the development of new courses focusing on the interaction of machine learning and large-scale optimization, interdisciplinary undergraduate research programs, and K-12 outreach programs.
The project focuses on two main research thrusts. The first concerns the generalization of average case analyses and theoretical guarantees for perfect recovery for convex relaxations of model problems for clustering, such as the densest submatrix localization and graph partition problems. The goal of these analyses is to extend the existing state of the art to probabilistic models for data that are more representative of data observed in practical applications. Extensive average planted case analyses will be performed to establish computational and information-theoretic bounds for perfect recovery under generalizations of the stochastic block model. The second thrust will focus on the design, analysis, and implementation of numerical methods for large-scale semidefinite and nonlinear optimization, with specific focus on the algorithms for model problems for clustering and classification. Theoretical convergence analysis and numerical simulation will be performed to illustrate the efficacy of the methods. This project is jointly funded by the Computational Mathematics program and the Established Program to Stimulate Competitive Research (EPSCoR).
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.