Clustering is a fundamental tool for data science, hypothesis discovery, pattern discovery, and information integration. Given a collection of objects, clustering is the task of automatically grouping the objects so that objects within a group (called a cluster) are more similar to each other than to objects in other clusters. Clustering is widely used in medicine, engineering, science, and commerce. Most modern clustering methods scale well to a large number of objects, but not to a large number of clusters. Furthermore, currently widely used clustering methods assign objects to clusters greedily, suffer in accuracy, and do not represent uncertainty in the clustering. All of these weaknesses limit analysis capabilities in many scientific, engineering, and other high-impact applications. This project is developing new machine learning methods and algorithms for large-scale clustering that scale to both a massive number of objects and a massive number of clusters. The project builds on recent preliminary success with a family of algorithms that build hierarchical clusterings, support efficient re-assignment of data to new clusters, and naturally represent uncertainty. The new research aims to further increase accuracy and scalability. The project team will demonstrate its new research in multiple domains relevant to national priorities, including clustering chemical compounds for materials science discovery, clustering single-cell genome data, and entity resolution on scientific metadata (such as paper authors, patent authors, papers, and institutions), creating tools that advance scientific discovery, collaboration, and scientific peer review. All of the software developed as part of this project will be released as open source in order to facilitate experimentation with and adoption of the project's methods in research and practice.
The project team will develop a tutorial at the intersection of machine learning and algorithms, and will additionally teach a course on efficient clustering methods to researchers beyond computer science.
This project will develop new machine learning methods and algorithms for hierarchical clustering that scale to both a massive number of input objects, N, and a massive number of clusters, K---a problem setting termed "extreme clustering," named after its similarly-motivated supervised cousin, "extreme classification." The project builds on the success of recent preliminary work on PERCH, a family of algorithms for large-scale, incremental, non-greedy, hierarchical clustering that has achieved new state-of-the-art results. The method efficiently routes new data points to the leaves of an incrementally built tree. Motivated by the desire for both accuracy and speed, the approach performs tree rotations both to enhance subtree purity and to encourage balanced trees. Experiments demonstrate that PERCH constructs more accurate trees than other tree-building clustering algorithms and scales well with both N and K, achieving a higher-quality clustering than its strongest competitor in nearly half the time. The project will perform new research (a) improving flexibility through alternative clustering cost functions and data representations, (b) further improving scalability and accuracy through new tree-routing functions, (c) developing new tree-cut methods for determining the best clusterings and distributions over clusterings, and (d) inventing new methods for joint clustering of multiple inter-related data instance types. Evaluation and application of the research will be conducted on multiple broad-impact, large-data domains, including biomedicine, materials science, image analysis, and scientific information integration.
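The core incremental operation described above (routing a new point to a leaf of a growing cluster tree and splitting that leaf) can be sketched roughly as follows. This is a simplified illustration only, not the published PERCH algorithm: PERCH uses bounding-box approximations for routing and performs the purity- and balance-improving rotations mentioned above, which this sketch omits. All names here (`Node`, `insert`, `leaves`) are hypothetical, and routing is done by naive centroid distance.

```python
import math

class Node:
    """A node in an incrementally built cluster tree (simplified sketch)."""
    def __init__(self, point=None):
        self.point = point                                   # data point (leaves only)
        self.children = []                                   # two children (internal nodes only)
        self.centroid = list(point) if point is not None else None
        self.count = 1 if point is not None else 0           # points beneath this node

    def is_leaf(self):
        return self.point is not None

def _dist(a, b):
    """Euclidean distance between two points given as lists of floats."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def insert(root, x):
    """Greedily route x toward the nearest centroid, then split the reached leaf."""
    if root is None:
        return Node(x)
    node, path = root, []
    while not node.is_leaf():
        path.append(node)
        node = min(node.children, key=lambda c: _dist(c.centroid, x))
    # Replace the reached leaf with an internal node over {old leaf, new leaf}.
    split = Node()
    split.children = [Node(node.point), Node(x)]
    split.count = 2
    split.centroid = [(a + b) / 2 for a, b in zip(node.point, x)]
    if path:
        parent = path[-1]
        parent.children[parent.children.index(node)] = split
    else:
        root = split
    # Update centroids and counts along the routing path.
    for n in path:
        n.centroid = [(c * n.count + xi) / (n.count + 1)
                      for c, xi in zip(n.centroid, x)]
        n.count += 1
    return root

def leaves(node):
    """Return all data points stored beneath `node`."""
    if node.is_leaf():
        return [node.point]
    return [p for c in node.children for p in leaves(c)]
```

A tree built this way supports insertion in time proportional to its depth rather than to K, which is the property that makes the incremental approach attractive at extreme scale; without rotations, however, a greedy routing mistake is permanent, which is precisely the failure mode the rotation operations are designed to repair.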
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.