Rapid technological advances in devices and computer systems continue to expand our capacity to collect and store data. Clustering is often the first-stage analysis performed to discover patterns, gain insights, and extract knowledge from the massive amounts of data routinely encountered in science, engineering, and commercial domains. For instance, in biomedical studies, clustering is used to reveal pathological subgroups and help researchers form new hypotheses for in-depth investigation. It is thus imperative to develop new clustering methods to meet the ever-increasing challenges of data that are highly complex, huge in volume, and drawn from distributed sources. In this project, novel statistical and optimization-based approaches and software packages will be developed to address these challenges. Graduate students will be trained to conduct research at the forefront of machine learning. The research results will be used to enrich courses and outreach educational materials in data science.

A prominent statistical paradigm for clustering is based on mixture models. This paradigm is objective, parsimonious, and not biased toward known clusters, and it has a probabilistic framework that can be extended and interpreted in standard ways. For high-dimensional, large-scale data, however, existing mixture-model-based methods have fundamental limitations. Furthermore, a big data environment can require the integration of clustering results obtained at distributed sites, a problem called multi-source clustering. This research will advance cluster analysis on multiple fronts. First, a hidden Markov model on variable blocks (HMM-VB), a special Gaussian mixture model (GMM), is developed to tackle high dimensionality. The estimation of HMM-VB will be enhanced by computationally efficient methods to identify the latent variable block structure and by mixture factor analyzers. Second, leveraging the latent states of HMM-VB, a new variable selection approach will be developed for clustering high-dimensional data. Third, the emerging topic of multi-source clustering will be studied. New methods based on optimal transport and Wasserstein barycenters will be developed for aggregating clustering results from multiple sources. Applications in biomedical areas will be pursued.
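To make the multi-source clustering problem concrete, the sketch below illustrates why results from different sites cannot be combined directly: each site numbers its clusters arbitrarily, so the labels must be matched before they can be aggregated. This is not the project's method. It is a deliberately simplified stand-in that assumes both sites clustered the same points into the same number of clusters, and it matches labels by brute-force search over one-to-one assignments, a hard-matching special case of the optimal transport formulation. The function name `align_labels` and the toy label vectors are hypothetical.

```python
from itertools import permutations


def align_labels(reference, other, k):
    """Relabel `other` so that it agrees with `reference` as much as possible.

    Assumes both label lists cover the same points, in the same order,
    with labels in range(k). Illustrative only; not the project's method.
    """
    # overlap[i][j] = number of points labeled i in `reference` and j in `other`
    overlap = [[0] * k for _ in range(k)]
    for r, o in zip(reference, other):
        overlap[r][o] += 1

    # Brute-force search over one-to-one matchings (practical for small k only).
    # p[j] is the reference label assigned to label j of `other`.
    best = max(
        permutations(range(k)),
        key=lambda p: sum(overlap[p[j]][j] for j in range(k)),
    )
    return [best[o] for o in other]


# Two hypothetical sites clustered the same six points with different numbering.
site_a = [0, 0, 1, 1, 2, 2]
site_b = [2, 2, 0, 0, 1, 0]
aligned = align_labels(site_a, site_b, k=3)
print(aligned)  # site_b expressed in site_a's label numbering
```

After alignment the two sites disagree on only one point, which is the kind of residual disagreement an aggregation method then has to resolve. A transport-based approach would generalize this by allowing soft, many-to-many matching between clusters and differing numbers of clusters per site.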

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 2013905
Program Officer: Yong Zeng
Budget Start: 2020-08-01
Budget End: 2023-07-31
Fiscal Year: 2020
Total Cost: $225,000
Name: Pennsylvania State University
City: University Park
State: PA
Country: United States
Zip Code: 16802