Classical data-analysis methods were based on mathematical, physical, or statistical models of the data-generation process, developed under the assumption that the data were relatively clean and collected for a specific task. Over the past few decades, advances in data acquisition have produced massive, noisy, high-dimensional datasets that were not necessarily collected for a specific task. This has led to the emergence of data-driven methods, such as deep learning, that use massive amounts of labeled data to learn 'black-box' models, which do not provide an explicit description of the process being modeled. Such data-driven methods have led to dramatic improvements in the performance of pattern-recognition systems for applications in computer vision and speech recognition, for which massive amounts of labeled data can be generated. However, existing models are not very interpretable, and their predictions are not robust to adversarial perturbations. Moreover, in many applications in science and engineering, data labeling is extremely costly, and the ability to interpret model predictions and produce estimates of uncertainty is essential.

To address these challenges, a TRIPODS Institute on the Theoretical Foundations of Data Science will be created at Johns Hopkins University. The goals of the institute will be to (1) develop the foundations for the next generation of data-analysis methods, which will integrate model-based and data-driven approaches, (2) foster interactions among data scientists through a monthly seminar series, semester-long research themes, an annual research symposium, and a summer research school and workshop on the foundations of data science, and (3) create new undergraduate and graduate curricula on the foundations of data science.
The institute brings together a multidisciplinary team of mathematicians, statisticians, theoretical computer scientists, and electrical engineers with expertise in the foundations of machine learning, deep learning, statistical learning and inference on graphs, optimization, approximation theory, signal processing, dynamical systems, and control, to develop the foundations for the next generation of data-analysis methods, which will integrate model-based and data-driven approaches. In particular, the institute will focus on the foundations of deep neural models (e.g., feedforward networks, recurrent networks, generative adversarial networks) and generative models of structured data (e.g., graphical models, random graphs, dynamical systems), with the ultimate goal of arriving at integrated models that are more interpretable, robust to perturbations, and learnable with minimal supervision. The goals of the Phase I Institute will be to (1) study the generalization, optimization, and approximation properties of feedforward networks, (2) develop the foundations of statistical inference and learning on and of graphs, and (3) study the integration of deep networks and graphs for learning maps between structured datasets.
This project is part of the National Science Foundation's Harnessing the Data Revolution (HDR) Big Idea activity.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.