The past two decades have witnessed an explosion in the scale and complexity of datasets that arise in science and engineering. Broadly, clustering methods, which discover latent structure in data, are our primary tool for navigating, exploring and visualizing massive datasets. These methods have been widely and successfully applied in phylogeny, medicine, psychiatry, archaeology and anthropology, phytosociology, economics and several other fields. Despite their ubiquity, the widespread scientific adoption of clustering methods has been hindered by the lack of flexible clustering methods for high-dimensional datasets and by the dearth of meaningful inferential guarantees in clustering problems. Accordingly, the goal of this research is to develop new and effective methods for clustering complex datasets, and to further develop an inferential grounding for these methods, which will in turn lead to actionable conclusions. This research will lead to the development of new clustering methods, as well as to a deeper understanding of the fundamental limitations of methods aimed at uncovering latent structure in data.

The research component of this project consists of four aims designed to address related aspects of this high-level goal: (a) analyze and develop new clustering methods for high-dimensional datasets, with a particular focus on practically useful methods like mixture-model-based clustering and minimum-volume clustering; (b) develop novel methods for inference in the context of clustering, motivated by scientific applications where it is important not only to cluster the data but also to clearly characterize the sampling variability of the discovered clusters; (c) develop fundamental lower bounds for high-dimensional clustering; and (d) develop novel methods for clustering functional data with inferential guarantees. These research components are closely coupled with concrete educational initiatives, including the development and broad dissemination of publicly available software for high-dimensional clustering; tutorials and workshops at machine learning conferences; and fostering further interactions between the Departments of Statistics and Machine Learning at Carnegie Mellon.

National Science Foundation (NSF)
Division of Mathematical Sciences (DMS)
Program Officer
Nandini Kannan
Carnegie-Mellon University
United States