Rapid advances in modern science and technology are generating data sets of unprecedented size and complexity. A common source of complexity in data sets is the presence of subpopulations. For example, a disease may have several subtypes, and customers may be attracted by different features of the same product. Cluster analysis is a popular tool for identifying subpopulations, which allows a refined investigation of each of them. This project develops novel clustering methods to reveal the increasingly complex patterns within contemporary data sets. In addition to allocating subjects to clusters, the clustering methods in this research identify the defining features of each subpopulation. The research team will apply these methods to various real-world problems, with the potential to affect multiple fields that rely on such data sets. Open-source, user-friendly software will also be provided. Moreover, this project will be integrated with educational and outreach activities, including new courses, interdisciplinary training, and mentoring of underrepresented student groups in the mathematical and statistical sciences.

Classical clustering methods tend to be inefficient and/or inaccurate when data are highly correlated, are heavy-tailed, or take the form of higher-order tensors. To address these challenges in high-dimensional unsupervised learning problems, the investigators pursue new probabilistic models and statistical methods for clustering large and complex data. The investigators promote parsimony in the models by combining the sparsity principle, through variable selection, with the dimension reduction principle, through linear projections. The resulting probabilistic frameworks enable simultaneous variable selection/dimension reduction, parameter estimation, and prediction. Separating and excluding the noise in the data set greatly enhances efficiency in estimation and prediction. At the same time, parsimony in the models leads to scalable algorithms and new statistical insights.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Florida State University
United States