Clustering or organization of data into groups is a fundamental problem that forms the basis of exploratory data analysis and aids in data management. However, there is often a significant resource and computational cost associated with obtaining and analyzing large-scale datasets that routinely arise in modern systems, such as the Internet, biological and social networks. The ability to discover meaningful clusters in high-dimensional data that is plagued with high noise, outliers and missing observations, will have a significant impact on understanding these systems.
This project aims to develop robust clustering methods that can identify clusters very efficiently by selectively querying for the most informative data measurements. Spectral clustering is a popular technique that identifies clusters by analyzing the eigenvectors of a matrix of similarity values between the data points. This project investigates the effect of missing and erroneous data on the eigenvector structure, and leverages this understanding to develop active methods that intelligently guide subsequent data queries.
Robust and efficient clustering methods are crucial for identifying groups of proteins and drugs that interact with each other, paving the way for transformative health technologies. These methods are also important for learning and maintaining the organization of computer and social networks, thus promoting seamless exchange of ideas and technology. This PI is involved in disseminating the research through collaborations with the CMU Lane Center for Computational Biology, publishing results and software online (www.cs.cmu.edu/~aarti/research_projects), developing and teaching inter-disciplinary courses, as well as the Opportunities for undergraduate women research in Computer Science (OurCS) program at Carnegie Mellon University.
Identifying groups of similar objects, a task known as clustering, is important in many scientific fields. Despite its ubiquity, the effect of noise, outliers, and missing data (characteristics that plague any modern dataset) on clustering algorithms is not well understood. In this project, we 1) investigated noise and outlier tolerance properties of spectral clustering algorithms and 2) developed new clustering methods that use judicious choice of which data to collect, store and compute to enable learning clusters from partial observations. Towards the first goal, we developed a precise characterization of the noise tolerance of spectral clustering, a popular clustering algorithm. Our results indicate how the noise robustness depends on number of objects to be clustered, amount of noise and outliers, variations in similarity between and within clusters, and the resolution of the finest cluster of interest. We also developed a similar characterization for the biclustering setting, where the goal is to cluster pairs of objects such as drugs and genes. Towards the second goal, we modified existing spectral methods and developed new methods to learn clusters from partial observations. Specifically, we investigated ``active" methods that employ judicious choice of which similarities to measure or compute to enable learning of clusters of a desired resolution from few selective measurements. A natural issue that comes up is that clustering methods that can learn from partial observations are less robust to noise. To address this, we developed adaptive methods that can tradeoff noise tolerance vs. number of measurements in a principled way, as per the resource constraints and needs of the application. Additionally, our work on noise tolerant active clustering also provided insights to solutions for other related problems such as adaptive noisy matrix completion and recovering latent trees from selective end-to-end measurements as in network tomography. Broad impact: Ability to cluster data into homogeneous groups is a critical first step in myriad applications, including biomedical and healthcare data analysis, neuroscience, computer and social networks. In all these applications, the data is corrupt and often missing due to resource constraints. The algorithms we have developed allow practitioners to tradeoff competing requirements such as robustness, runtime and data collection budget in a principled way, unleashing the potential of modern datasets. Specifically, we applied our methods to cluster asthma patients according to their phenotypes and cluster brain fibers, in collaboration with domain experts. The project trained undergraduate and graduate students in cutting-edge tools at the intersection of machine learning, statistics, and signal processing. Components of the research were also incorporated in existing and new inter-disciplinary courses taught by the PI. Additionally, the research results were used for demonstration and inspiration in high-school STEM events, as well as encouraging women students to pursue careers in computer science.