Clustering, or partitioning data-points based on pairwise similarities, is one of the most important problems of unsupervised learning and data mining. Algorithms for clustering often suffer from low accuracy and high time complexity due to processing massive amounts of data that are noisy, and for lack of a benchmark to validate the clustering output. This project studies new directions for clustering algorithms and their fundamental limits by incorporating several new technical tools from information theory. Scalable algorithms for clustering with theoretical guarantees and practical validation should have high impact in all areas of data science.

This project aims to make use of interactive algorithms that adaptively gather labeled data to improve accuracy of clustering and related machine learning problems. In recent years, wide-spread use of crowdsourcing has facilitated the development of such algorithms. The algorithms that will be developed, as well as the lower bounding methods, extend to a large class of machine learning problems, and help understand trade-off between information theoretic optimality and computational efficiency. A generalized setup taking into account erroneous interactions, use of similarity measures, overlapping clusters and parallel interactions in design of algorithms is being pursued. Connections of this setup with well-known random graph community models such as the stochastic block model will also be explored in this project. New statistical models, such as the geometric block model, will be studied as an alternative to the popular stochastic block model. Algorithms for community recovery for the geometric block model and information theoretic lower bounds will be developed. This project will also develop statistical models that faithfully capture the properties of real networks to provide a benchmark for testing clustering algorithms. These include statistical models for community detection and interactive clustering.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start
Project End
Budget Start
2019-10-01
Budget End
2022-09-30
Support Year
Fiscal Year
2019
Total Cost
$499,980
Indirect Cost
Name
University of Massachusetts Amherst
Department
Type
DUNS #
City
Hadley
State
MA
Country
United States
Zip Code
01035