When Can We Cluster Data? Improved Conditions for Perfect Recovery and Numerical Methods

Ames, Brendan

Abstract

The clustering procedure is ubiquitous in science and engineering, especially in analysis of massive data sets and complex networks. The purpose of clustering is to divide a given data set into groups of similar items, called clusters. Although many heuristics for clustering exist (and are widely used), the theoretical properties of this learning task are less well-understood. Relatively few theoretical analyses have been performed establishing conditions under which we may expect to successfully cluster data. The goal of this project is to establish realistic theoretical guarantees for clustering, both in terms of data amenable to clustering as well as in the development of effective, computationally efficient algorithms. The project will facilitate interdisciplinary research via applications in image processing, astronomy and physics, mathematical biology, social network analysis, and high-dimensional statistics. The project will also provide educational opportunities for graduate and undergraduate students through the development of new courses focusing on the interaction of machine learning and large-scale optimization, interdisciplinary undergraduate research programs, and K-12 outreach programs.

The project focuses on two main research thrusts. The first concerns the generalization of average case analyses and theoretical guarantees for perfect recovery for convex relaxations of model problems for clustering, such as the densest submatrix localization and graph partition problems. The goal of these analyses is to extend the existing state of the art to probabilistic models for data that are more representative of data observed in practical applications. Extensive average planted case analyses will be performed to establish computational and information-theoretic bounds for perfect recovery under generalizations of the stochastic block model. The second thrust will focus on the design, analysis, and implementation of numerical methods for large-scale semidefinite and nonlinear optimization, with specific focus on the algorithms for model problems for clustering and classification. Theoretical convergence analysis and numerical simulation will be performed to illustrate the efficacy of the methods. This project is jointly funded by the Computational Mathematics program and the Established Program to Stimulate Competitive Research (EPSCoR).

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 2012554
Program Officer: Malgorzata Peszynska

Project Start
Project End
Budget Start: 2020-08-15
Budget End: 2022-07-31
Support Year
Fiscal Year: 2020
Total Cost: $129,374
Indirect Cost

When Can We Cluster Data? Improved Conditions for Perfect Recovery and Numerical Methods
Ames, Brendan
University of Alabama Tuscaloosa, Tuscaloosa, AL, United States

Abstract

Funding Agency

Institution

Comments

Recent in Grantomics:

Recently viewed grants:

Recently added grants:

Abstract

Funding Agency

Institution

Comments