Modern machine learning algorithms often encounter a trade off between scalability and modeling power. For the problem of data clustering, Bayesian approaches enjoy numerous modeling advantages over classical methods, but hard clustering methods such as k-means are often preferred in practice due to their simplicity and scalability.

This project explores bridging the gap between classical hard clustering methods and clustering models based on Bayesian nonparametrics. The first step is an asymptotic result connecting the Dirichlet process Gaussian mixture model with a k-means-like algorithm that does not fix the number of clusters in advance. Using this key result, the PI and his team will explore four related research directions which collectively demonstrate the utility of this asymptotic approach: (1) extensions of the analysis to hierarchical Bayesian models, leading to scalable hard clustering methods over multiple data sets; (2) connections to spectral methods and graph clustering, leading to novel and flexible graph clustering methods; (3) extensions beyond the Gaussian setting, leading to new approaches to topic modeling and other discrete-data clustering problems; and (4) extensive experiments in both the computer vision and text domains.

Given that k-means is truly a workhorse of machine learning, these four directions have the potential to impact a wide array of large-scale applications including computer vision, bioinformatics, social network analysis, and many other domains. Furthermore, the research will benefit the broader community through released software and integration into coursework at Ohio State University.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
1217433
Program Officer
Weng-keen Wong
Project Start
Project End
Budget Start
2012-07-01
Budget End
2016-06-30
Support Year
Fiscal Year
2012
Total Cost
$439,689
Indirect Cost
Name
Ohio State University
Department
Type
DUNS #
City
Columbus
State
OH
Country
United States
Zip Code
43210