Stochastic gradient descent (SGD) is at the core of state-of-the-art supervised learning, which is revolutionizing inference and decision-making in many diverse applications such as self-driving cars, robotics, personalized search and recommendations, and medical diagnosis. Thus, improving the speed of SGD is a timely and important research problem. Due to the massive scale of neural network models and training data sets used today, it has become advantageous to parallelize SGD across multiple computing nodes. Although parallelizing SGD boosts the amount of data processed per iteration, it exposes the algorithm to unpredictable node slowdowns and communication delays stemming from variability in the computing infrastructure. The goal of this project is to design provably fast SGD algorithms that easily lend themselves to distributed implementations, and are robust to fluctuations in computation and network delays as well as unpredictable node failures. This project can help make machine learning universally accessible, without requiring access to expensive high-performance computing infrastructure. An open-source implementation of the resulting adaptive distributed SGD algorithms will be released. The research outcomes will also be incorporated into two new machine learning classes at Carnegie Mellon University, and into curriculum development and research sampler workshops for K-12 teachers and students.

The speed of single-node SGD is typically measured in terms of the convergence of training error with respect to the number of iterations. In distributed SGD, the runtime per iteration also depends on system-level factors such as the computation delays at worker nodes and the gradient aggregation mechanism. Thus, there is a critical need to understand error convergence with respect to wall-clock time rather than the number of iterations. This project will improve the true convergence of distributed SGD with respect to wall-clock time by jointly optimizing the runtime per iteration and the error-versus-iterations convergence rate. It will consider two popular distributed SGD frameworks: the parameter-server model and communication-efficient SGD. The research is expected to provide novel runtime and error analyses of distributed SGD in these frameworks and to design the first adaptive distributed SGD algorithms that strike the best error-runtime trade-off.
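To make the error-runtime trade-off concrete, the sketch below simulates communication-efficient (local) SGD, one of the two frameworks named above: K workers each take tau local SGD steps between parameter-averaging rounds, so increasing tau reduces communication (faster wall-clock time per step) at the cost of slower error convergence per iteration. All specifics here (the least-squares objective, the values of K and tau) are illustrative assumptions, not the project's actual algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 1000                      # model dimension, samples per worker
K, tau, rounds, lr = 4, 8, 50, 0.01  # workers, local steps, comm. rounds, step size

# Synthetic least-squares problem: each worker k minimizes (1/2n)*||X_k w - y_k||^2.
w_true = rng.normal(size=d)
data = []
for _ in range(K):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    data.append((X, y))

w = np.zeros(d)                      # shared model after each averaging round
for r in range(rounds):
    local = [w.copy() for _ in range(K)]
    for k, (X, y) in enumerate(data):
        for _ in range(tau):         # tau local steps with no communication
            i = rng.integers(n)      # one sampled data point => stochastic gradient
            g = X[i] * (X[i] @ local[k] - y[i])
            local[k] -= lr * g
    w = np.mean(local, axis=0)       # one communication round per tau local steps

print("final error ||w - w*|| =", np.linalg.norm(w - w_true))
```

In this toy setting, setting tau = 1 recovers fully synchronous SGD (best error per iteration, most communication), while larger tau trades some per-iteration progress for fewer communication rounds; adaptively tuning such knobs over the course of training is the kind of error-runtime optimization the abstract describes.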

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start:
Project End:
Budget Start: 2019-03-01
Budget End: 2022-02-28
Support Year:
Fiscal Year: 2018
Total Cost: $175,000
Indirect Cost:
Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213