Exciting empirical breakthroughs have emerged in data science and engineering through the combination of large-scale datasets, increasingly complex statistical models, and advanced computational power. These successes also promise new directions in statistics and econometrics, among other scientific disciplines. Nevertheless, the empirical phenomena exhibited by modern Machine Learning (ML) challenge core mathematical concepts in statistics and computation: (a) Why can complex over-parametrized models achieve excellent statistical performance even while interpolating the training examples? (b) Why can seemingly simple stochastic optimization methods optimize such complex models effectively? (c) What kinds of structures or representations of data are responsible for modern ML models' efficacy over classical statistical models when the dimension becomes moderately large? This project aims to develop new statistical and computational paradigms that bridge the gap between theory and practice for learning from data. The project will also significantly impact undergraduate and graduate students' training in data science research through synergistic educational and research activities hosted under a new initiative that integrates and enhances resources across the fields of statistics and economics.
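To make question (a) concrete, the phenomenon can be seen already in over-parametrized linear regression: when there are more parameters than samples, infinitely many weight vectors fit the data exactly, and the pseudoinverse picks the one with minimum Euclidean norm. A minimal sketch (all dimensions and noise levels below are illustrative choices, not taken from the project):

```python
import numpy as np

# Minimum-norm interpolation in over-parametrized linear regression
# (d features > n samples). Among all interpolating solutions,
# w = pinv(X) @ y has the smallest l2 norm.
rng = np.random.default_rng(0)
n, d = 20, 100                       # more parameters than data points
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:5] = 1.0                     # sparse ground truth (illustrative)
y = X @ w_true + 0.1 * rng.standard_normal(n)

w_min_norm = np.linalg.pinv(X) @ y   # minimum-l2-norm interpolant

# The fit interpolates: training residuals are numerically zero,
# yet such estimators can still generalize well -- the puzzle in (a).
train_err = np.max(np.abs(X @ w_min_norm - y))
print(train_err < 1e-8)              # True: exact interpolation
```

Classical intuition says a model with zero training error must overfit; the statistical behavior of exactly this kind of interpolant is what the project studies.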

The project will investigate the role of regularization, statistical performance, and optimization algorithms in modern ML models, including kernel machines, boosting, random forests, and neural networks. In particular, the PI will focus on the following three modules. (a) Learning functions in the interpolation/overfitting regime: The PI will study the statistical performance of minimum-norm interpolated solutions, which fall outside the scope of classical empirical risk minimization analysis. The PI also plans to develop a rigorous mathematical framework to quantify the adaptive representation aspects of specific ML models. (b) Learning distributions with generative models and simulation-based inference: The PI will investigate the statistical foundations of generative models for learning implicit probability distributions and study new simulation-based inference procedures. (c) Optimization algorithms motivated by stochastic approximation and online learning: The PI will study the interplay between the optimization and statistical performance of gradient-based stochastic approximation methods for learning complex ML models with non-convex landscapes. The research intends to challenge conventional wisdom in statistics and computation, modernize nonparametric statistics and learning theory education, and shed light on devising the next generation of nonparametric models with algorithms and computation in mind.
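Module (c) concerns gradient-based stochastic approximation, of which plain stochastic gradient descent (SGD) is the canonical instance: at each step, the gradient of the loss on a single randomly sampled example stands in for the full gradient. A minimal sketch on least squares (step size, iteration count, and problem sizes are illustrative assumptions, not the project's methods):

```python
import numpy as np

# Stochastic approximation via SGD on least squares: each update uses
# the gradient of the squared loss at one randomly drawn example.
rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.01 * rng.standard_normal(n)

w = np.zeros(d)
lr = 0.01                                 # constant step size (illustrative)
for t in range(20000):
    i = rng.integers(n)                   # sample one training example
    grad = (X[i] @ w - y[i]) * X[i]       # stochastic gradient of 0.5*(x'w - y)^2
    w -= lr * grad

# The iterate lands in a small neighborhood of the ground truth.
print(np.linalg.norm(w - w_star) < 0.1)   # True
```

Here the objective is convex; the project's interest is precisely in why such simple updates remain effective on the non-convex landscapes of modern ML models, where this classical analysis no longer applies.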

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Application #: 2042473
Program Officer: Gabor Szekely
Budget Start: 2021-07-01
Budget End: 2026-06-30
Fiscal Year: 2020
Total Cost: $80,000
Name: University of Chicago
City: Chicago
State: IL
Country: United States
Zip Code: 60637