Stochastic gradient descent (SGD) is at the core of state-of-the-art supervised learning, which is revolutionizing inference and decision-making in many diverse applications such as self-driving cars, robotics, personalized search and recommendations, and medical diagnosis. Thus, improving the speed of SGD is a timely and important research problem. Due to the massive scale of neural network models and training data sets used today, it has become advantageous to parallelize SGD across multiple computing nodes. Although parallelizing SGD boosts the amount of data processed per iteration, it exposes the algorithm to unpredictable node slowdowns and communication delays stemming from variability in the computing infrastructure. The goal of this project is to design provably fast SGD algorithms that easily lend themselves to distributed implementations, and are robust to fluctuations in computation and network delays as well as unpredictable node failures. This project can help make machine learning universally accessible, without requiring access to expensive high-performance computing infrastructure. An open-source implementation of the resulting adaptive distributed SGD algorithms will be released. The research outcomes will also be incorporated into two new machine learning classes at Carnegie Mellon University, and into curriculum development and research sampler workshops for K-12 teachers and students.

The speed of single-node SGD is typically measured in terms of the convergence of training error with respect to the number of iterations. In distributed SGD, the runtime per iteration also depends on system-level factors such as the computation delays at worker nodes and the gradient aggregation mechanism. Thus, there is a critical need to understand error convergence with respect to wall-clock time rather than the number of iterations. This project will improve the true convergence of distributed SGD with respect to wall-clock time by jointly optimizing the runtime per iteration and the error-versus-iterations convergence rate. It will consider two popular distributed SGD frameworks: the parameter-server model and communication-efficient SGD. The research is expected to provide novel runtime and error analyses of distributed SGD in these frameworks and to design the first adaptive distributed SGD algorithms that strike the best error-runtime trade-off.
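To make the error-runtime trade-off concrete, the sketch below simulates communication-efficient (local) SGD, one of the two frameworks named above: K workers each take tau local SGD steps between parameter-averaging rounds, so increasing tau reduces communication (faster wall-clock time per step) at the cost of slower error convergence per iteration. All specifics here (the least-squares objective, the values of K and tau) are illustrative assumptions, not the project's actual algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 1000                      # model dimension, samples per worker
K, tau, rounds, lr = 4, 8, 50, 0.01  # workers, local steps, comm. rounds, step size

# Synthetic least-squares problem: each worker k minimizes (1/2n)*||X_k w - y_k||^2.
w_true = rng.normal(size=d)
data = []
for _ in range(K):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)
    data.append((X, y))

w = np.zeros(d)                      # shared model after each averaging round
for r in range(rounds):
    local = [w.copy() for _ in range(K)]
    for k, (X, y) in enumerate(data):
        for _ in range(tau):         # tau local steps with no communication
            i = rng.integers(n)      # one sampled data point => stochastic gradient
            g = X[i] * (X[i] @ local[k] - y[i])
            local[k] -= lr * g
    w = np.mean(local, axis=0)       # one communication round per tau local steps

print("final error ||w - w*|| =", np.linalg.norm(w - w_true))
```

In this toy setting, setting tau = 1 recovers fully synchronous SGD (best error per iteration, most communication), while larger tau trades some per-iteration progress for fewer communication rounds; adaptively tuning such knobs over the course of training is the kind of error-runtime optimization the abstract describes.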

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start:
Project End:
Budget Start: 2019-03-01
Budget End: 2022-02-28
Support Year:
Fiscal Year: 2018
Total Cost: $175,000
Indirect Cost:
Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213