Machine learning forms the backbone of many human-facing applications, from autonomous vehicles to medical diagnostics. Despite extensive empirical progress, current theory falls short of providing meaningful performance guarantees that address the safety ramifications of such technologies. The gap is especially pronounced in systems that operate on real-world, high-dimensional data, which is where theoretical guarantees are most needed. This project will develop a novel framework for high-dimensional inference that gives rise to a scalable statistical analysis of modern machine learning methods. This innovation will make it possible to couple empirical validation with principled performance evaluation techniques and provable accuracy assurances. Ultimately, this project will promote the wide deployment of machine learning technologies with invaluable societal benefits, from better healthcare to safer roads and improved crisis management. In conjunction, the educational component will nurture the next generation of scientists in theoretical STEM disciplines, while increasing the participation of women and girls, who remain largely underrepresented in these fields.
This project introduces smooth statistical distances---a new class of discrepancy measures between probability distributions adapted to high-dimensional spaces. Smooth distances level out local irregularities in the measured distributions (via convolution with a chosen kernel) in a way that preserves inference capability but alleviates the curse of dimensionality when estimating these distances from data. Since measuring or optimizing distances between distributions is central to basic inference setups and advanced machine learning tasks, the research agenda comprises three chronological phases: (1) develop fundamentals of smooth distances, encompassing geometric, topological, and functional properties; (2) conduct a high-dimensional statistical study, from empirical approximation questions to basic inference; and (3) devise a refined generalization and sample complexity theory for machine learning tasks like generative modeling, barycenter computation, and information flow analysis, drawing on knowledge gained during the first two phases. The smooth statistical distances paradigm has the potential to bridge central theoretical gaps and provide increased reliability in machine learning interfaces at scale.
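To make the smoothing mechanism concrete, the following is a minimal illustrative sketch (not taken from the award): convolving a distribution with a Gaussian kernel N(0, sigma^2) is equivalent to adding independent Gaussian noise to its samples, so a smooth distance can be estimated by perturbing both sample sets and computing an ordinary empirical distance between them. The function name `smooth_w1_1d` and the choice of the one-dimensional 1-Wasserstein distance are hypothetical conveniences for illustration only.

```python
import numpy as np

def smooth_w1_1d(x, y, sigma=1.0, seed=0):
    """Estimate a Gaussian-smoothed 1-Wasserstein distance between two
    equal-size one-dimensional samples (illustrative sketch).

    Smoothing: P * N(0, sigma^2) is sampled by adding Gaussian noise.
    In 1D, W1 between equal-size empirical distributions equals the
    mean absolute difference of the sorted samples.
    """
    rng = np.random.default_rng(seed)
    x_s = x + sigma * rng.standard_normal(len(x))  # samples from P * N(0, sigma^2)
    y_s = y + sigma * rng.standard_normal(len(y))  # samples from Q * N(0, sigma^2)
    return float(np.mean(np.abs(np.sort(x_s) - np.sort(y_s))))

rng = np.random.default_rng(42)
p = rng.standard_normal(5000)        # samples from N(0, 1)
q = rng.standard_normal(5000) + 2.0  # samples from N(2, 1)

d_close = smooth_w1_1d(p, p, sigma=1.0)  # P vs. itself: close to 0
d_far = smooth_w1_1d(p, q, sigma=1.0)    # shifted distributions: close to 2
```

The kernel bandwidth `sigma` governs the trade-off sketched in the abstract: larger smoothing makes the distance easier to estimate from finite samples, while small smoothing keeps the measure closer to the unsmoothed distance.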
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.