In this age of big data, clustering, or the task of grouping objects based on similarity, is a fundamentally important problem with growing applications in science, engineering, and medicine. For instance, in biomedicine clustering is frequently used to discover possible co-regulation between genes or identify subtypes of cancer to enhance diagnosis and treatment. Currently, state-of-the-art methods are of limited utility because (1) they are based on a vague notion of "similarity" that may induce incorrect assumptions, (2) they make use of ad hoc optimization criteria that may not reflect clustering performance, or (3) they are sensitive to outliers. Furthermore, validation methods to assess clustering performance are not reliable. In short, there is great potential to advance science and medicine by developing a firm foundation for clustering.
This research resolves these issues by transforming clustering from a subjective activity to an objective operation. The investigators develop a Bayes optimal decision theory for clustering that is analogous to Bayes decision theory in classification. This theory sheds light on optimal clustering algorithms, their ability to make predictions, and fundamental limits of performance under known random point process models. The investigators also develop robust Bayesian clusterers to address model uncertainty and examine methods to optimally train clustering operators from examples of correctly clustered point sets. This work parallels recent results in intrinsically robust Bayesian filtering and optimal Bayesian classification theory. Understanding robustness and learning is crucial if one is to apply clustering algorithms to real-world problems with uncertain models. This research paves the way to resolving a number of fundamental issues that could not be addressed before, including cluster operator convergence, error estimation consistency, and error estimation accuracy.
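To make the notion of model-based, Bayes-style clustering concrete, the sketch below illustrates one simple instance: when the random point process generating the data is fully known, a clustering operator can be chosen by scoring candidate partitions under the model rather than by an ad hoc similarity criterion. The two-component Gaussian model, its parameters, and the maximum-a-posteriori partition rule used here are assumptions made purely for illustration; they are not the investigators' actual error measure or algorithm, which the abstract does not specify.

```python
import itertools

import numpy as np
from scipy.stats import norm

# Hypothetical known model, assumed only for this illustration:
# each point gets a latent label k with probability PRIOR[k],
# then is drawn from a Gaussian with mean MEANS[k] and std SIGMA.
MEANS = (0.0, 3.0)
SIGMA = 1.0
PRIOR = (0.5, 0.5)


def label_likelihood(points, labels):
    """Joint density of the points under one specific label assignment."""
    p = 1.0
    for x, k in zip(points, labels):
        p *= PRIOR[k] * norm.pdf(x, loc=MEANS[k], scale=SIGMA)
    return p


def map_partition(points):
    """Brute-force model-based clusterer: return the two-block partition
    of the point set with the highest posterior probability under the
    assumed model. Label assignments inducing the same partition are
    pooled, since a clustering is unlabeled."""
    n = len(points)
    scores = {}
    for labels in itertools.product((0, 1), repeat=n):
        # A partition is a label vector up to swapping the two labels.
        key = min(labels, tuple(1 - k for k in labels))
        scores[key] = scores.get(key, 0.0) + label_likelihood(points, labels)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels_true = rng.integers(0, 2, size=8)
    sample = rng.normal(loc=[MEANS[k] for k in labels_true], scale=SIGMA)
    print("true labels:   ", labels_true)
    print("MAP partition: ", map_partition(sample))
```

The brute-force enumeration is only feasible for very small point sets; its purpose here is to show how, once the generating model is known, the choice of clustering becomes a well-posed decision problem whose performance limits can in principle be analyzed, which is the flavor of question the proposed theory addresses.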