Machine learning and data mining are among the most influential contributions of computer science in the last decade. Given sufficiently large datasets and computational power, one can discover patterns and make reasonably accurate predictions. While there has been tremendous progress in designing efficient algorithms for analyzing massive datasets, there has been less progress in providing rigorous measures of statistical significance or robustness of the analysis. As we analyze large and noisy datasets to model complex relationships in data, it is critical to develop formally proven methods with clear performance guarantees. This project advocates a responsible approach to data analysis, based on well-founded mathematical and statistical concepts. Such an approach enhances the effectiveness and reliability of evidence-based decision making in medicine, policy, and other social applications of big data analysis. Capacity-building activities of this project include: (1) creation and dissemination of algorithms and software that implement rigorous, interpretable, and usable computational and statistical approaches to big data analysis; and (2) educational initiatives at the graduate and undergraduate levels to build a larger and more diverse workforce of data scientists with the foundational skills both to apply analytical tools to existing datasets and to develop new approaches for future datasets.

The goal of this project is to develop practical data analysis algorithms and applications based on the theoretical machine learning concept of Rademacher complexity. The project is motivated by preliminary results showing that the analytical properties of the Rademacher complexity, combined with its efficient sampling properties, provide a unique opportunity to develop general tools that begin bridging the gap between theory and practice in large-scale data analysis. In particular, the project is focused on the following aims: improve the efficiency of rigorous data analysis algorithms through better sample complexity bounds; improve multiple-comparison and overfitting control through Rademacher generalization bounds; develop theory and practical applications of Cartesian and Chaos Rademacher complexities; develop efficient algorithms for estimating the empirical Rademacher complexity; and explore new rigorous data analysis algorithms through the application of Rademacher theory.
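For background only (this is not the project's own algorithm), the empirical Rademacher complexity of a function class on a fixed sample measures how well members of the class can correlate with random ±1 noise, and a standard way to estimate it is to average the supremum correlation over sampled sign vectors. The sketch below illustrates that Monte Carlo estimate for a finite hypothesis class; the function name, parameters, and toy threshold classifiers are illustrative assumptions, not part of the award.

```python
import numpy as np

def empirical_rademacher(predictions, n_trials=1000, rng=None):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    predictions: array of shape (num_hypotheses, n) holding each
        hypothesis's real-valued outputs on a fixed sample of size n.
    Returns the average, over random sign vectors sigma in {-1,+1}^n,
    of max_h (1/n) * sum_i sigma_i * h(x_i).
    """
    rng = np.random.default_rng() if rng is None else rng
    preds = np.asarray(predictions, dtype=float)
    _, n = preds.shape
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher signs
        total += np.max(preds @ sigma) / n        # sup over the finite class
    return total / n_trials

# Toy usage: a small finite class of threshold classifiers on 1-D data.
if __name__ == "__main__":
    x = np.linspace(-1, 1, 50)
    thresholds = np.linspace(-1, 1, 20)
    # h_t(x) = +1 if x >= t, else -1; one row per threshold t.
    preds = np.where(x[None, :] >= thresholds[:, None], 1.0, -1.0)
    print(empirical_rademacher(preds, n_trials=2000))
```

The accuracy of such an estimate improves with the number of sampled sign vectors; the efficient-estimation aim above concerns doing this rigorously and at scale for richer function classes.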

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1813444
Program Officer: Rebecca Hwa
Budget Start: 2018-09-01
Budget End: 2021-08-31
Fiscal Year: 2018
Total Cost: $466,000
Name: Brown University
City: Providence
State: RI
Country: United States
Zip Code: 02912