This research concerns one of the most important and pervasive classes of problems in modern data analysis: inference on the structure of probability distributions. The specific inference problems to be addressed fall into two broad categories. The first involves inference on the structure of a single probability distribution, including estimation of joint and conditional densities, variable selection in linear regression, and testing of independence and conditional independence among variables. The second involves inference on the relationship across multiple distributions. This includes testing whether two (or more) data samples have the same underlying distribution and learning the structure of their difference, with particular interest in finding local structures (differences that lie in small subsets) in large high-dimensional spaces. To address these problems, the investigator puts forward a novel framework for constructing Bayesian priors on multivariate distributions through recursive partitioning. Inference using this framework is flexible and adaptive. Moreover, the generative nature of these priors facilitates the modeling of dependence structure across multiple distributions, which leads to powerful methods for comparing distributions. To address the computational challenges in high-dimensional problems, the investigator lays out a set of computational strategies and proposes to develop several algorithms that can drastically improve the efficiency of Bayesian posterior inference. These strategies exploit the recursive nature of the proposed framework to efficiently explore the global landscape of the corresponding posterior distributions.

Inference on the structure of probability distributions lies at the heart of many scientific inquiries, and new statistical theory and methods are urgently needed to accommodate the ever-increasing dimensionality of data sets in modern scientific investigations. Two specific applications motivate this project: the analysis of high-dimensional flow cytometry data in systems biology to unravel the functional relationships among proteins, and the mapping of human genes to various qualitative and quantitative traits, in particular those of common diseases such as cancer and diabetes. The concepts, theory, methodology, and algorithms developed in this project will be directly applicable to these problems, as well as to the analysis of data sets arising in a wide variety of other fields, ranging from environmental science to economics.

National Science Foundation (NSF)
Division of Mathematical Sciences (DMS)
Program Officer
Gabor Szekely
Duke University
United States