The past two decades have witnessed an explosion in the scale and complexity of datasets that arise in science and engineering. Clustering methods, which discover latent structure in data, are our primary tool for navigating, exploring, and visualizing massive datasets. These methods have been widely and successfully applied in phylogeny, medicine, psychiatry, archaeology and anthropology, phytosociology, economics, and several other fields. Despite their ubiquity, the scientific adoption of clustering methods has been hindered by the lack of flexible clustering methods for high-dimensional datasets and by the dearth of meaningful inferential guarantees in clustering problems. Accordingly, the goal of this research is to develop new and effective methods for clustering complex datasets, and to give these methods an inferential grounding that will in turn lead to actionable conclusions. This research will lead to the development of new clustering methods, as well as to a deeper understanding of the fundamental limitations of methods aimed at uncovering latent structure in data.

The research component of this project consists of four aims designed to address related aspects of this high-level goal: (a) analyze and develop new clustering methods for high-dimensional datasets, with a particular focus on practically useful methods such as mixture-model-based clustering and minimum-volume clustering; (b) develop novel methods for inference in the context of clustering, motivated by scientific applications where it is important not only to cluster the data but also to clearly characterize the sampling variability of the discovered clusters; (c) develop fundamental lower bounds for high-dimensional clustering; and (d) develop novel methods for clustering functional data with inferential guarantees. These research components are closely coupled with concrete educational initiatives: developing and broadly disseminating publicly available software for high-dimensional clustering; organizing tutorials and workshops at Machine Learning conferences; and fostering further interactions between the Departments of Statistics and Machine Learning at Carnegie Mellon.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1713003
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2017-07-01
Budget End: 2021-06-30
Support Year:
Fiscal Year: 2017
Total Cost: $380,000
Indirect Cost:
Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213