This project uses tools from optimization and statistics to put in the hands of practitioners a suite of methods to validate the results of clustering. Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups. The methods to be developed will recognize the cases when the data is well clustered and, in these cases, will provide a "certificate of stability" guaranteeing this. This research will develop validation methods for several widely used clustering paradigms (such as K-means, Spectral Clustering, finite Mixture Models) in realistic data scenarios. These methods will be integrated and disseminated with the big data open source platform megaman, developed and maintained by the group of PI Meila.

The proposed approach is based on the notions of good and stable clustering. ``Good'' means that clustering C fits the data well, according to the current clustering paradigm. ``Stable'' means that the only clusterings that fit well are small perturbations of C. This can only happen when C captures structure present in the data. While goodness can be easily checked in practice, stability is not a property that can be verified directly. The core of this project is to find conditions on the data and C that guarantee stability AND are practically verifyable. Such results are known as stability results. The project outlines two novel approaches to this goal. The first approach is based on using convex relaxations to the original clustering problem. The second approach is based on ``recycling'' existing theoretical results proved under assumptions about the data generating model. Often, the proof of such a result contains the elements of a model free stability proof. Thus, this project will be using existing statistical theory in a novel way. Among convex relaxations for clustering, the relaxations to a Semi-Definite Program (SDP) are especially promising. Therefore, ways to accelerate the validation algorithms by exploiting the special structure of the SDPs will be explored.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency
National Science Foundation (NSF)
Institute
Division of Mathematical Sciences (DMS)
Application #
1810975
Program Officer
Gabor Szekely
Project Start
Project End
Budget Start
2018-08-01
Budget End
2021-07-31
Support Year
Fiscal Year
2018
Total Cost
$275,000
Indirect Cost
Name
University of Washington
Department
Type
DUNS #
City
Seattle
State
WA
Country
United States
Zip Code
98195