Many challenging questions in data science can be characterized in terms of inference for random subsets of patients, customers, proteins, symptoms, or other experimental units. Examples include the search for a subpopulation of patients who benefit most from a given treatment, the identification of subsets of mutations that characterize different tumor cell subpopulations and could then serve as possible treatment targets, and the discovery of latent disease patterns in electronic health records that could be used to propose an improved allocation of resources. In all three examples, the inference target is a random subset, which gives rise to challenging data analysis problems. In contrast, most traditional methods are designed for inference targets that are a single number, such as a treatment effect, a level of differential protein expression, or a mean response. This project aims to address this gap in currently available methodology by developing and applying new methods to solve several specific inference problems related to random subsets.

This project develops novel statistical inference methods for random subsets, approaching such inference problems by explicitly introducing parsimony and interpretability as criteria for the reported inference. Related methods are developed for random partitions, feature allocations, and extensions of such structures. Besides the development of models and inference paradigms, a second major thrust of the proposed work is the development of computationally feasible implementations for large data sets. Model-based Bayesian inference for random subsets quickly becomes prohibitively computationally intensive when simulation-exact posterior Monte Carlo methods are used. While several big-data posterior simulation methods for global parameters have been developed in the recent literature, there are few such methods for random subsets, i.e., local parameters. The project will explore several approaches to developing such methods.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 1953340
Program Officer: Yong Zeng
Budget Start: 2020-05-01
Budget End: 2023-04-30
Fiscal Year: 2019
Total Cost: $49,746
Name: University of Chicago
City: Chicago
State: IL
Country: United States
Zip Code: 60637