Deep learning models are breaking new ground in data science tasks including image recognition, automatic translation, and autonomous driving. This is achieved by neural networks that can be hundreds of layers deep and involve hundreds of millions of parameters. Training such large models requires distributed computation and expensive hardware and incurs very long training times. This project studies coding-theoretic techniques that can accelerate distributed machine learning and allow training with cheaper commodity hardware. Beyond the development of theoretical foundations, this project develops new algorithms for providing fault tolerance over unreliable cloud infrastructure that can significantly reduce the cost of large-scale machine learning. The research outcomes of the project will be broadly disseminated and integrated into education.
The specific focus of this research program is on mitigating the bottlenecks of distributed machine learning. Currently, scaling benefits are limited for two reasons: first, communication is typically the bottleneck, and second, straggler effects limit performance. Both problems can be mitigated using coding-theoretic methods. This work proposes "coded computing," a transformative framework that combines coding theory with distributed computing to inject computational redundancy in a novel coded form. This framework is then used to develop three research thrusts: (a) Coding for Linear Algebraic Computations, (b) Coding for Iterative Computations, and (c) Coding for General Distributed Computations. Each thrust operates on a different layer of the machine learning pipeline, but all rely on coding-theoretic tools and distributed information processing.
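To make the straggler-mitigation idea concrete, the following minimal sketch (in Python/NumPy, not drawn from the project itself) illustrates coded computing for a linear algebraic task: the product y = Ax is distributed over three workers using a systematic (3, 2) MDS code, so the result can be decoded from any two worker outputs and a single straggler never delays the job. The block sizes, the (3, 2) code, and all variable names are illustrative assumptions, not the project's actual algorithms.

```python
import numpy as np

# Toy straggler-tolerant coded matrix-vector multiplication.
# Split A row-wise into A1, A2 and give a third worker the parity
# block A1 + A2; any 2 of the 3 block products recover y = A @ x.
# (The (3, 2) code and worker setup are illustrative assumptions.)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
x = rng.standard_normal(4)

A1, A2 = np.split(A, 2, axis=0)      # systematic blocks
blocks = [A1, A2, A1 + A2]           # third block is the parity

# Each worker computes its block-vector product; pretend worker 1 straggles.
results = {i: B @ x for i, B in enumerate(blocks) if i != 1}

# Decode y from any two available results.
if 0 in results and 1 in results:
    y = np.concatenate([results[0], results[1]])
elif 0 in results:                   # recover A2 @ x from the parity block
    y = np.concatenate([results[0], results[2] - results[0]])
else:                                # recover A1 @ x from the parity block
    y = np.concatenate([results[2] - results[1], results[1]])

assert np.allclose(y, A @ x)
```

Larger MDS codes trade additional redundant computation for tolerance of more stragglers, which is the kind of redundancy-versus-latency trade-off studied in the linear algebraic thrust above.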
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.