The analysis of large-scale and complex data plays an increasingly central role in society, and innovations in machine learning are yielding ever more powerful predictive technologies. However, when we use such data to guide decision making, it is important to recognize that most datasets in these domains are observational rather than randomized in nature, and so require careful analysis in order to draw correct conclusions about the causal effect of deploying a potential policy. This research aims to develop new methods for data-driven decision making that harness the power and expressiveness of machine learning while rigorously building on best practices for causal inference from non-randomized data.
This project is centered on the following three statistical tasks: (1) Examine the problem of heterogeneous treatment effect estimation in observational studies, and develop a flexible framework that can be used with, e.g., boosting or neural networks. The accuracy of the proposed method depends only on the complexity of the causal signal that we can intervene on, not on other, merely associational, signals. (2) Consider welfare-maximizing structured policy learning, and study an approach whose regret decays as the inverse square root of the sample size in a non-parametric setting. (3) Consider the problem of learning optimal stopping rules from sequentially randomized data, and propose a new robust yet computationally feasible approach to policy learning in this setting. A unifying theme underlying all these results is that they highlight how classical ideas from semiparametric statistics can be used to rigorously leverage accurate machine learning predictors in decision-making problems.
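To make task (1) concrete, the sketch below illustrates one possible instantiation of such a framework: a residual-on-residual (R-learner-style) construction in which flexible machine learning base learners (here, gradient boosting) estimate the nuisance components, and the remaining regression targets only the heterogeneous causal effect. This is an illustrative example under stated assumptions, not the project's specific method; all data are simulated, and the estimator and variable names are chosen for exposition.

```python
# Minimal sketch of an R-learner-style heterogeneous treatment effect estimator.
# Flexible ML learners fit the nuisance components, and a weighted residual-on-
# residual regression targets the causal signal. All data are simulated.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, p = 2000, 5
X = rng.normal(size=(n, p))
e = 1.0 / (1.0 + np.exp(-X[:, 0]))           # true propensity score
W = rng.binomial(1, e)                        # observed treatment assignment
tau = np.maximum(X[:, 1], 0.0)                # true heterogeneous treatment effect
Y = X[:, 0] + tau * W + rng.normal(size=n)    # observed outcome

# Step 1: cross-fitted nuisance estimates m(x) = E[Y | X = x] and e(x) = P(W = 1 | X = x).
m_hat = cross_val_predict(GradientBoostingRegressor(), X, Y, cv=5)
e_hat = cross_val_predict(GradientBoostingClassifier(), X, W, cv=5,
                          method="predict_proba")[:, 1]
e_hat = np.clip(e_hat, 0.01, 0.99)            # guard against extreme propensities

# Step 2: weighted residual-on-residual regression. Minimizing
# sum_i (Y_i - m_hat_i - tau(X_i) * (W_i - e_hat_i))^2 over tau is equivalent to
# a weighted regression of the pseudo-outcome on X with weights (W_i - e_hat_i)^2.
Y_res = Y - m_hat
W_res = W - e_hat
pseudo_outcome = Y_res / W_res
weights = W_res ** 2
tau_model = GradientBoostingRegressor().fit(X, pseudo_outcome, sample_weight=weights)

tau_hat = tau_model.predict(X)
print("RMSE of estimated heterogeneous effects:", np.sqrt(np.mean((tau_hat - tau) ** 2)))
```

The key design choice, in line with the abstract's emphasis, is that the second-stage regression sees only residualized quantities, so its difficulty is governed by the complexity of the treatment effect function rather than by the (possibly much more complex) baseline associational structure of the outcome.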
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.