Cluster Analysis for High-Dimensional and Multi-Source Data

Li, Jia; Lin, Lynn

Abstract

Rapid technological advances in devices and computer systems continue to expand our capacity to collect and store data. Clustering is often the first analysis performed to discover patterns, gain insights, and extract knowledge from the massive amounts of data routinely encountered in science, engineering, and commercial domains. For instance, in biomedical studies, clustering is used to reveal pathological subgroups and to help researchers form new hypotheses for in-depth investigation. It is thus imperative to develop new clustering methods that meet the ever-increasing challenges posed by data that are highly complex, huge in volume, and drawn from distributed sources. In this project, novel statistical and optimization-based approaches and software packages will be developed to address these challenges. Graduate students will be trained to conduct research at the forefront of machine learning. The research results will be used to enrich courses and outreach educational materials in data science.

A prominent statistical paradigm for clustering is based on mixture models, which are objective, parsimonious, not biased toward known clusters, and equipped with a probabilistic framework that can be extended and interpreted in standard ways. For high-dimensional, large-scale data, however, existing mixture-model-based methods have fundamental limitations. Furthermore, a big data environment can require the integration of clustering results obtained at distributed sites, a problem called multi-source clustering. This research will advance cluster analysis in several respects. First, the hidden Markov model on variable blocks (HMM-VB), a special Gaussian mixture model (GMM), is developed to tackle high dimensionality. The estimation of HMM-VB will be enhanced by computationally efficient methods to identify the latent variable block structure and by mixture factor analyzers. Second, leveraging the latent states of HMM-VB, a new variable selection approach will be developed for clustering high-dimensional data. Third, the emerging topic of multi-source clustering will be studied. New methods based on optimal transport and the Wasserstein barycenter will be developed for aggregating clustering results from multiple sources. Applications in biomedical areas will be pursued.
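
As a toy illustration of the Wasserstein barycenter concept mentioned above, the sketch below handles only the simplest case: equally weighted one-dimensional samples of equal size, where the 2-Wasserstein barycenter is obtained by averaging order statistics. It is not the project's method for aggregating clusterings. The function names `wasserstein_1d` and `barycenter_1d` and the sample values `site_a` and `site_b` are hypothetical and chosen only for this example.

```python
def wasserstein_1d(xs, ys, p=2):
    """p-Wasserstein distance between two equal-size, equally weighted 1-D samples.

    In one dimension the optimal transport plan pairs the sorted values,
    so no linear program is needed.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    a, b = sorted(xs), sorted(ys)
    return (sum(abs(u - v) ** p for u, v in zip(a, b)) / len(a)) ** (1.0 / p)


def barycenter_1d(samples, weights=None):
    """2-Wasserstein barycenter of equal-size, equally weighted 1-D samples.

    Returns the weighted average of the order statistics, i.e. the
    quantile-averaging construction that is valid in one dimension.
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("this sketch assumes equal sample sizes")
    if weights is None:
        weights = [1.0 / len(samples)] * len(samples)
    ordered = [sorted(s) for s in samples]
    return [sum(w * s[i] for w, s in zip(weights, ordered)) for i in range(n)]


# Hypothetical one-dimensional summaries from two data sources.
site_a = [0.0, 1.0, 2.0]
site_b = [4.0, 6.0, 5.0]
center = barycenter_1d([site_a, site_b])
```

With equal weights, `center` lies halfway between the two samples in Wasserstein distance. Aggregating clusterings in higher dimensions requires solving a general optimal transport problem, which this sketch does not attempt.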

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 2013905
Program Officer: Yong Zeng

Budget Start: 2020-08-01
Budget End: 2023-07-31
Fiscal Year: 2020
Total Cost: $225,000

Institution

Pennsylvania State University, University Park, PA, United States