This project addresses the development of general, principled methods to efficiently incorporate domain knowledge, expressed as constraints, into clustering algorithms. This not only improves clustering quality and algorithm performance but also enables the discovery of insights that are novel and useful with respect to existing domain expertise. For example, phylogenetic trees built using hierarchical clustering should be consistent with existing domain knowledge, such as the knowledge that certain species could not have evolved from one another.
Existing work on clustering under constraints has focused on non-hierarchical clustering with conjunctions of must-link and cannot-link constraints, which assert that two objects must or must not be in the same cluster. This work can be interpreted as expressing knowledge in a limited logic consisting of instances as objects, two binary relations (must-link and cannot-link) and a single connective (and).
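The limited logic described above can be made concrete with a short sketch. The snippet below (illustrative only, not the project's software; all names are hypothetical) checks whether a given cluster assignment satisfies a conjunction of must-link and cannot-link constraints over pairs of objects.

```python
# Minimal illustrative sketch: testing a cluster assignment against a
# conjunction of must-link and cannot-link constraints. `assignment` maps
# each object to its cluster label; the constraint lists hold object pairs.

def satisfies(assignment, must_link, cannot_link):
    """Return True iff every must-link pair shares a cluster and
    every cannot-link pair is split across clusters."""
    for a, b in must_link:
        if assignment[a] != assignment[b]:
            return False
    for a, b in cannot_link:
        if assignment[a] == assignment[b]:
            return False
    return True

assignment = {"x1": 0, "x2": 0, "x3": 1}
print(satisfies(assignment,
                must_link=[("x1", "x2")],
                cannot_link=[("x2", "x3")]))  # True: x1, x2 together; x2, x3 apart
```

Because the constraints are a pure conjunction, a single violated pair makes the whole constraint set unsatisfied, which is what makes this limited logic easy to check but restrictive to express knowledge in.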
This project will make three primary contributions. Firstly, it will examine a more complete logic for representing knowledge by adding new relations, a complete set of connectives (not, and, or, implication), universal and existential quantifiers, and new objects. This logic can express a wide variety of knowledge, such as minimum and maximum cluster separation, cluster width, and even forced distributions of certain objects across clusters. Secondly, it will investigate incorporating constraints beyond non-hierarchical clustering algorithms, into algorithms for hierarchical agglomerative clustering, graph and social network clustering, and feature selection for clustering. Lastly, it will explore the computational challenges of using constraints by identifying easy-to-satisfy sets of constraints and by developing a framework to explain why some constraint sets are more useful than others.
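Two of the cluster-level properties mentioned above, cluster separation and cluster width, can be sketched directly. The snippet below (a minimal sketch under the assumption of Euclidean points; the function names are hypothetical) computes the quantities that such constraints would bound: the smallest distance between points in different clusters and the largest distance between points in the same cluster.

```python
# Illustrative sketch (assumes Euclidean points): the quantities bounded by
# cluster-separation and cluster-width constraints. A minimum-separation
# constraint requires min_separation(clusters) to exceed some threshold;
# a maximum-width constraint bounds max_width(clusters) from above.
import math
from itertools import combinations

def min_separation(clusters):
    """Smallest distance between two points lying in different clusters."""
    return min(math.dist(p, q)
               for c1, c2 in combinations(clusters, 2)
               for p in c1 for q in c2)

def max_width(clusters):
    """Largest distance between two points lying in the same cluster."""
    return max((math.dist(p, q)
                for c in clusters for p, q in combinations(c, 2)),
               default=0.0)

clusters = [[(0, 0), (0, 1)], [(5, 0), (5, 2)]]
print(min_separation(clusters))  # 5.0
print(max_width(clusters))       # 2.0
```

Unlike instance-level must-link and cannot-link constraints, these are properties of whole clusters, which is one reason a richer logic with quantifiers is needed to state them.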
This project will demonstrate and validate its technical contributions in two core application domains: the analysis of pandemic micro-simulation results to aid disaster preparation, and image mining. The long-term vision is to incorporate knowledge efficiently and in a principled manner into other data mining tasks such as classification, anomaly detection and association rule mining.
Project outreach for high school and undergraduate students will take the form of hands-on discovery learning courses emphasizing the two core application domains. For graduate students and researchers, the tutorial slides, papers, datasets and software generated by the project will be freely available.
Further information on this project may be found at the URLs www.constrained-clustering.org and www.cs.albany.edu/~davidson.
Clustering is a core part of machine learning and data mining. The purpose of clustering is to take a collection of objects (for instance, images, documents or videos) and divide it into multiple clusters, each containing very similar objects, thereby adding structure to large collections of data. Clustering is used extensively in business (companies cluster their customer databases to generate customer types/profiles), medicine (medical imaging scans are clustered to find regions of interest) and science (climatic data is clustered to find dipoles). However, clustering is inherently undirected, and there is no principled way for the practitioner to add guidance and knowledge to the problem. Guidance and knowledge here take the form of constraints on what the resultant clusters should look like. For example, businesses may require that high-yield and low-yield customers be in different clusters, and neurologists may require that certain anatomical regions be in the same cluster. The intellectual merit of the work supported by this grant was a multi-faceted exploration of constraints in clustering. We investigated the computational complexity of adding constraints to clustering [1][2]. We explored constraints in a variety of algorithms, such as building hierarchies [3], partitioning graphs [4] and constrained tensor analysis [5]. We also explored related problems, such as polling a domain expert for constraints in active formulations [6]. The broader impact of the work is demonstrated by the resultant software's use, by ourselves and others, in a variety of domains such as medical imaging (fMRI data) [7], documents in multiple languages [8] and personal information [9]. The research was supplemented by a Google Grant on the topic of hierarchies and constraints.

References

[1] Ian Davidson, S. S. Ravi: The complexity of non-hierarchical clustering with instance and cluster level constraints. Data Min. Knowl. Discov. 14(1): 25-61 (2007)
[2] Ian Davidson, S. S. Ravi: Using instance-level constraints in agglomerative hierarchical clustering: theoretical and empirical results. Data Min. Knowl. Discov. 18(2): 257-282 (2009)
[3] Sean Gilpin, Ian N. Davidson: Incorporating SAT solvers into hierarchical clustering algorithms: an efficient and flexible approach. KDD 2011: 1136-1144
[4] Xiang Wang, Ian Davidson: Flexible constrained spectral clustering. KDD 2010: 563-572
[5] Ian N. Davidson, Sean Gilpin, Owen T. Carmichael, Peter B. Walker: Network discovery via constrained tensor analysis of fMRI data. KDD 2013: 194-202
[6] Xiang Wang, Ian Davidson: Active Spectral Clustering. ICDM 2010: 561-568
[7] Henry L. Phillips, Peter B. Walker, Carrie H. Kennedy, Owen T. Carmichael, Ian N. Davidson: Guided Learning Algorithms: An Application of Constrained Spectral Partitioning to Functional Magnetic Resonance Imaging (fMRI). HCI (24) 2013: 709-716
[8] Xiang Wang, Buyue Qian, Ian Davidson: Improving document clustering using automated machine translation. CIKM 2012: 645-653
[9] Weifeng Zhi, Xiang Wang, Buyue Qian, Patrick Butler, Naren Ramakrishnan, Ian Davidson: Clustering with Complex Constraints - Algorithms and Applications. AAAI 2013