Identifying variation across data sets is one of the most common tasks in statistical inference, and it lies at the heart of numerous applications in fields ranging from astrophysics and biology to economics and political science. The recent explosion of "big data" has raised several critical challenges in detecting cross-sample variation; these challenges render existing methods inadequate and create an urgent need for new methodologies. The most notable and prevalent challenges include complex distributional structures, the highly local nature of variation, various extraneous sources of variation, data sparsity, and massive computational demand. The overarching aim of this research project is to address these challenges by developing a general framework, including theory, methods, algorithms, and software, for effectively identifying variation in modern big data sets.

Specific inference problems to be addressed in this research project include: (i) identifying differences, especially highly local variations, across multiple data sets; (ii) separating intrinsic (i.e., scientifically interesting) cross-sample variation from extraneous variation; (iii) decomposing cross-sample variation into contributions from multiple sources; and (iv) identifying cross-sample variation and variance components in general random objects, including a variety of processes and functional observations. The use of multi-scale inference and Bayesian nonparametric modeling has led to the development of a general probabilistic model-based framework for detecting cross-sample variation that integrates two powerful inference tactics -- multi-resolution scanning and graphical modeling. Multi-resolution scanning is the strategy of scanning through the sample space using windows of various sizes and carrying out testing or estimation for the structure of interest -- the cross-sample variation -- on each window. A class of graphical models is then designed to incorporate various dependency structures across scanning windows and data samples, so that strength can be borrowed across windows and related samples to achieve high statistical efficiency in identifying cross-sample variation. The project aims to construct a suite of computationally efficient and theoretically justified inferential methods and algorithms and to investigate their statistical properties.
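The scanning half of this strategy can be illustrated with a minimal sketch. It is not the project's method: it assumes one-dimensional data on [0, 1), uses dyadic windows, and substitutes a simple two-proportion z statistic for the Bayesian model-based test, with no graphical model linking the windows. The function name `scan_windows` and the parameter `max_depth` are hypothetical.

```python
import math


def scan_windows(x, y, max_depth=4):
    """Scan dyadic windows of [0, 1) and compare two samples on each window.

    Illustrative stand-in only: on each window, a two-proportion z statistic
    compares the share of each sample's points that fall in the left half.
    Returns a list of (depth, lo, hi, z) tuples.
    """
    results = []
    for depth in range(max_depth):
        n_windows = 2 ** depth          # windows halve in width at each level
        width = 1.0 / n_windows
        for k in range(n_windows):
            lo, hi = k * width, (k + 1) * width
            mid = (lo + hi) / 2
            x_in = [v for v in x if lo <= v < hi]
            y_in = [v for v in y if lo <= v < hi]
            if not x_in or not y_in:
                continue                # no comparison possible on this window
            x_left = sum(v < mid for v in x_in)
            y_left = sum(v < mid for v in y_in)
            n_x, n_y = len(x_in), len(y_in)
            pooled = (x_left + y_left) / (n_x + n_y)
            se = math.sqrt(pooled * (1 - pooled) * (1 / n_x + 1 / n_y))
            z = 0.0 if se == 0 else (x_left / n_x - y_left / n_y) / se
            results.append((depth, lo, hi, z))
    return results
```

A large absolute z on a coarse window suggests a global difference, while a large value only on a fine window suggests local variation. Because this sketch treats each window independently, it does none of the borrowing of strength across windows that the graphical models are meant to provide.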

National Science Foundation (NSF)
Division of Mathematical Sciences (DMS)
Program Officer: Gabor Szekely
Duke University
United States