Machine learning is increasingly deployed in large-scale, mission-critical problems to make decisions that affect vast numbers of individuals' employment, savings, health, and safety. The potential for machine learning to dramatically change people's lives necessitates that machine learning methods be robust, explainable, and understandable---rather than black-box. This research develops new techniques for robust machine learning at scale that are both computationally motivated and theoretically sound. The work is situated in the context of three modern classes of applications. (1) Economists are interested in analyzing the efficacy of microcredit: small loans to individuals in impoverished areas, issued with the goal of eliminating poverty. (2) Biologists are interested in using single-cell RNA sequencing data to understand cells' relationships and developmental trajectories. (3) The Internet of Things (IoT) is poised to generate a wealth of complex data from energy readings in buildings, transportation infrastructure, vehicles on the road, and many other sensor sources. The PI is working directly with area experts to have immediate, broad impact across application domains. In the educational component of the project, the PI is a core contributor to a new graduate curriculum and degree in statistics, data science, and statistical machine learning at MIT. The methods and applications in this proposal feature in a new course on modern machine learning methods. The PI is also developing a high-school-level introduction to machine learning as part of the established Women's Technology Program (WTP).
The issues of robustness and explainability arise particularly in domains with nontrivial spatial and temporal dependencies, where the amount of data is often massive and where practitioners typically have expert knowledge about the domain before engaging with a particular dataset. These are precisely the domains where existing machine learning methodologies are less well developed. The need to bring structural knowledge to bear on the problem suggests the use of Bayesian methods, which can incorporate this knowledge via prior and modeling assumptions. To live up to the promise of these methods, though, practical approaches must be robust to assumptions as well as to noisy or adversarial data, lest such data change important decisions in ways the practitioner does not understand. This research incorporates advances from statistical physics to assess the sensitivity of a data analysis to its assumptions and data values. To realize the advantages of the proposed robust and understandable machine learning framework, practitioners must also confront extreme scalability challenges, from both a computational and a modeling perspective. On the computational side, this research builds on recent advances from computational geometry to scale to modern data-set sizes. On the modeling side, while small-scale problems exhibit dense spatio-temporal dependencies, large-scale problems tend to be sparser, and practical approaches must reflect this sparsity to be reliable at scale. This work incorporates advances in probability theory to model sparse IoT networks. The proposal is highly interdisciplinary, bringing together ideas from machine learning, statistics, physics, theoretical computer science, probability theory, and systems, and applying these ideas to microcredit, single-cell RNA sequencing, sensor networks, international trade, and industrial applications including customer service at scale.
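The idea of assessing a data analysis's sensitivity to individual data values can be made concrete with a toy example. The sketch below is purely illustrative and is not the proposal's actual methodology: it uses a conjugate normal model (with hypothetical helper names `posterior_mean` and `max_leave_one_out_shift`) to measure how much a Bayesian posterior mean can move when any single observation is deleted, showing how one adversarial point can drive a conclusion.

```python
import numpy as np

def posterior_mean(y, prior_mean=0.0, prior_var=10.0, noise_var=1.0):
    """Posterior mean of mu in the conjugate model
    y_i ~ N(mu, noise_var), mu ~ N(prior_mean, prior_var)."""
    n = len(y)
    precision = 1.0 / prior_var + n / noise_var
    return (prior_mean / prior_var + np.sum(y) / noise_var) / precision

def max_leave_one_out_shift(y):
    """Largest change in the posterior mean caused by deleting
    any single observation from the dataset."""
    full = posterior_mean(y)
    shifts = [abs(posterior_mean(np.delete(y, i)) - full) for i in range(len(y))]
    return max(shifts)

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=100)     # well-behaved data
y_outlier = np.append(y, 50.0)         # one noisy/adversarial point

print(max_leave_one_out_shift(y))          # small shift: conclusion is robust
print(max_leave_one_out_shift(y_outlier))  # large shift: one point dominates
```

The brute-force leave-one-out loop shown here is only feasible at toy scale; the point of fast sensitivity approximations is to obtain the same diagnostic without refitting the model once per data point.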
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.