The field of statistics has seen great success over many decades drawing scientific insights from simple, easy-to-understand models. But progress in computing is now allowing researchers to measure and store huge amounts of data at once, opening the door to understanding and manipulating much more complex systems than ever before. Indeed, the field of machine learning has been very successful at fitting predictive models to such data sets, but the black-box nature of these methods makes it hard to draw scientific insights from them using the usual statistical approach. In fact, not only do classical statistical methods fail in this modern big data setting, but the questions they answer no longer even make sense because the models they are based on do not hold even approximately. In this research, the PI will first come up with new statistical ways of posing scientific questions that make sense in complex data but still have interpretable answers. Second, the PI will find new statistical methods to answer those questions in a rigorous way, and study the mathematical and computational properties of these methods so that they can be used as effectively as possible. And finally, the PI will work with experts in the areas of genomics, the microbiome, and political science to use these methods to gain new scientific insights in these fields. Throughout the project, the PI will also run a free statistical consulting service to help the broader research community, develop new curricula with high school teachers, and provide enriching educational and research experiences for undergraduate and graduate students.
To understand the importance of a covariate in a high-dimensional regression, it has become increasingly popular to perform a hypothesis test for conditional independence with the response. The appeal of such a test is that it provides statistically rigorous insight that is well-defined and scientifically interpretable no matter how the response depends on the covariates, including the case when their relationship is highly nonlinear and includes interactions, possibly of high order. However, as a model-free inferential target, conditional independence provides only one type of scientific insight, which can be of little use in some applications. The PI will extend conditional independence to new model-free targets for two types of data on which it does not provide a useful inferential target, namely, data with highly locally dependent covariates and data with compositional covariates. Then, the PI will move past conditional independence entirely to propose novel model-free targets that, instead of merely identifying relationships in the data (as in variable selection), numerically quantify them, such as the relationship between a covariate and a response or the interaction between two covariates in the response's conditional distribution. Along with each new target, the PI will develop entirely new methods for powerful and provably valid inference. The novelty of the proposed targets and associated methods will provide new connections with other fields of statistics, including Bayesian computation, measure theory, statistical physics, experimental design, causal inference, and graphical model estimation. Ultimately, this research aims to provide a suite of new tools for researchers to move beyond the constraints of parametric targets and instead leverage state-of-the-art machine learning tools to answer novel and important questions about their data in a statistically principled way.
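As a concrete illustration of the model-free conditional independence test that the paragraph above starts from, the following is a minimal sketch of one well-known approach, the conditional randomization test. The abstract does not name this method, so it is a stand-in for illustration and not the PI's proposed methodology. The function name `crt_pvalue`, the test statistic, and the simulated data are all hypothetical, and the sketch assumes the covariates are independent standard normal, so that the conditional distribution of one covariate given the others is simply N(0, 1).

```python
import numpy as np


def crt_pvalue(X, y, j, n_draws=500, seed=0):
    """P-value for H0: y is independent of X[:, j] given the other columns.

    Illustrative assumption: rows of X are i.i.d. N(0, I), so the
    conditional law of X[:, j] given the other columns is N(0, 1).
    No model for how y depends on X is assumed.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    y_centered = y - y.mean()

    def statistic(xj):
        # Any statistic keeps the test valid; this one targets linear signal.
        return abs(np.dot(xj, y_centered))

    observed = statistic(X[:, j])
    exceed = 0
    for _ in range(n_draws):
        # Resample column j from its (assumed known) conditional distribution.
        xj_resampled = rng.standard_normal(n)
        if statistic(xj_resampled) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic among the draws.
    return (1 + exceed) / (1 + n_draws)


# Simulated example: only column 0 influences the response.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y = 2.0 * X[:, 0] + rng.standard_normal(200)

p_signal = crt_pvalue(X, y, j=0)  # covariate that matters
p_null = crt_pvalue(X, y, j=1)    # covariate that does not
print(p_signal, p_null)
```

The validity of the p-value in this sketch rests on knowing the covariate distribution, not on any model for the response; only the power depends on the choice of statistic. A test of this kind reports whether a covariate matters but not how much, which is the limitation the quantitative targets described above are meant to address.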
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.