This project develops general, principled methods for efficiently incorporating domain knowledge, expressed as constraints, into clustering algorithms. Doing so not only improves clustering quality and algorithm performance but also helps uncover insights that are novel and useful with respect to existing domain expertise. For example, phylogenetic trees built using hierarchical clustering should be consistent with existing domain knowledge, such as the knowledge that several species could not have evolved from one another.
Existing work on clustering under constraints has focused on non-hierarchical clustering with conjunctions of must-link and cannot-link constraints, which assert that two objects must or must not be in the same cluster. This work can be interpreted as expressing knowledge in a limited logic composed of instances as objects, two binary relations (must-link and cannot-link), and a single connective (and).
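As an illustration of this limited logic, the following minimal Python sketch checks whether a conjunction of must-link and cannot-link constraints is logically consistent. It assumes that no bound is placed on the number of clusters, and the function name `is_consistent` and the union-find approach are illustrative choices, not the project's own software.

```python
from typing import Dict, Hashable, Iterable, Tuple

Pair = Tuple[Hashable, Hashable]


def is_consistent(must_link: Iterable[Pair], cannot_link: Iterable[Pair]) -> bool:
    """Return True if some clustering satisfies every constraint.

    Assumption: the number of clusters is unbounded, so the only possible
    conflict is a cannot-link pair that must-link constraints force together.
    """
    parent: Dict[Hashable, Hashable] = {}

    def find(x: Hashable) -> Hashable:
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Must-link is transitive: merge the two groups of each pair.
    for a, b in must_link:
        parent[find(a)] = find(b)

    # A cannot-link pair inside one merged group is a contradiction.
    return all(find(a) != find(b) for a, b in cannot_link)
```

For instance, must-link(a, b) and must-link(b, c) together with cannot-link(a, c) is inconsistent, because the must-link constraints place a and c in the same cluster.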
This project will make three primary contributions. First, it will examine a more complete logic for representing knowledge by adding new relations, a complete set of connectives (not, and, or, implication), universal and existential quantifiers, and new objects. This logic can express a wide variety of knowledge, such as minimum or maximum cluster separation, cluster width, and even forced distributions of certain objects across clusters. Second, it will investigate extending constraints beyond non-hierarchical clustering to algorithms for hierarchical agglomerative clustering, graph and social network clustering, and feature selection for clustering. Third, it will explore the computational challenges of using constraints by identifying easy-to-satisfy sets of constraints and developing a framework to explain why some constraint sets are more useful than others.
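To make the richer logic concrete, the sketch below checks two universally quantified constraints against a given clustering: every pair of objects in the same cluster lies within a maximum width, and every pair in different clusters is at least a minimum separation apart. Reading "cluster width" as the largest pairwise distance within a cluster, the use of Euclidean distance, and the name `satisfies_constraints` are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def satisfies_constraints(points, labels, min_separation, max_width):
    """Check two quantified constraints over all pairs of objects.

    For all x, y: same_cluster(x, y) implies d(x, y) <= max_width
    For all x, y: not same_cluster(x, y) implies d(x, y) >= min_separation
    """
    for (p, label_p), (q, label_q) in combinations(zip(points, labels), 2):
        d = dist(p, q)
        if label_p == label_q and d > max_width:
            return False  # width constraint violated
        if label_p != label_q and d < min_separation:
            return False  # separation constraint violated
    return True


# Hypothetical data: two tight groups of points, five units apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = satisfies_constraints(points, [0, 0, 1, 1], min_separation=3.0, max_width=2.0)
bad = satisfies_constraints(points, [0, 1, 0, 1], min_separation=3.0, max_width=2.0)
```

Here the first labeling satisfies both constraints, while the second violates the width constraint by grouping points that are five units apart.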
This project will demonstrate and validate its technical contributions in two core application domains: image mining, and the analysis of pandemic micro-simulation results to aid in disaster preparation. The long-term vision is to incorporate knowledge efficiently and in a principled manner into other data mining tasks such as classification, anomaly detection, and association rule mining.
Project outreach for high school and undergraduate students will take the form of hands-on discovery learning courses with an emphasis on the two core application domains. For graduate students and researchers, the tutorial slides, papers, datasets, and software generated by the project will be freely available.
Further information on this project may be found at the URLs www.constrained-clustering.org and www.cs.albany.edu/~davidson.