This research aims to address pressing challenges in learning and inference from high-dimensional data. Contemporary sensing and data acquisition technologies produce data at an unprecedented rate. A ubiquitous challenge in modern data applications is thus to efficiently and reliably extract relevant information and insights from a deluge of data. This challenge is exacerbated by the unprecedented growth in the number of relevant features one needs to reason about, which often outpaces the growth of data samples. Classical statistical inference paradigms, which either require an enormous number of data samples or ignore the computational cost of estimators altogether, become insufficient, or even unreliable, for many emerging applications of machine learning and big-data analytics.

To address these pressing issues in high dimensions, novel theoretical tools must be brought into the picture to provide a comprehensive understanding of the performance limits of various algorithms and tasks. The goal of this project is four-fold: first, to develop a modern theory that precisely characterizes the performance of classical statistical algorithms in high dimensions; second, to suggest proper corrections of classical statistical inference procedures to accommodate the sample-starved regime; third, to develop computationally efficient algorithms that provably attain the fundamental statistical limits whenever possible; and fourth, to identify potential computational barriers when the fundamental statistical limits cannot be met. The transformative potential of the proposed research program lies in the development of foundational statistical data-analytics theory through a novel combination of statistics, approximation theory, statistical physics, mathematical optimization, and information theory, yielding scalable statistical inference and learning algorithms. The theory and algorithms developed within this project will have direct impact on engineering and science applications such as large-scale machine learning, DNA sequencing, genetic disease analysis, and natural language processing. This collaborative program provides cross-university opportunities for student training, and we are committed to engaging and supporting underrepresented and women students in STEM through long-term mentorships and outreach activities.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start:
Project End:
Budget Start: 2019-08-01
Budget End: 2023-07-31
Support Year:
Fiscal Year: 2019
Total Cost: $385,000
Indirect Cost:
Name: Princeton University
Department:
Type:
DUNS #:
City: Princeton
State: NJ
Country: United States
Zip Code: 08544