Bayesian nonparametrics is a statistical modeling framework that combines the flexibility of classical nonparametric statistical methods with principled assessment of uncertainty under the Bayesian paradigm. Traditional Bayesian nonparametric methods, however, have largely focused on models for a single data set, whereas many modern statistical scenarios involve multiple data sets of a similar nature collected under related or comparative conditions. This project develops a suite of new modeling and computational strategies tailored for effective joint modeling of multiple data sets in ways that (i) capture cross-sample variation in modern complex data, and (ii) are computationally efficient enough to allow application to massive data. The developed methodology will have impact in a range of fields, including biology, economics, education, astrophysics, political science, and climate science, where the task is to properly characterize variation across data sets. The project provides excellent research training opportunities for graduate students.

Novel models, methods, and algorithms will be developed in the context of two classes of widely used Bayesian nonparametric models: (i) mixture models with discrete random measure (DRM) mixing distributions (e.g., Dirichlet process mixtures) and (ii) tree-structured random measure (TSRM) models (e.g., Polya tree-type models). These two model classes differ in nature, each with its own advantages and limitations in modeling multiple data sets, and so the strategies for advancing them are distinct. A key limitation of DRM mixtures in modeling multiple samples is their lack of flexibility in characterizing cross-sample variation, and thus a new latent variable modeling strategy will be developed to substantially enhance their capacity in this regard. Also to be investigated are the theoretical and empirical properties of the resulting dispersion mixture models, as well as generalizations of this strategy to a broader range of hierarchical models in which incorporating flexible cross-sample variation in observed and latent quantities is important. TSRM models, by contrast, are capable of characterizing complex cross-sample variation, so the focus for them is on addressing two weaknesses: their lack of scalability, both computational and statistical, with respect to increasing dimensionality, and their sensitivity to the underlying tree structure, a critical component in building such models. The developments for these two model classes will form a powerful and general toolbox that can be applied to a variety of scientific and engineering problems involving the analysis of multiple related data sets.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 2013930
Program Officer: Pena Edsel
Project Start:
Project End:
Budget Start: 2020-07-01
Budget End: 2023-06-30
Support Year:
Fiscal Year: 2020
Total Cost: $75,828
Indirect Cost:
Name: Duke University
Department:
Type:
DUNS #:
City: Durham
State: NC
Country: United States
Zip Code: 27705