CNS Core: Small: Optimizing Distributed Machine Learning for Transient Resources using Loose Synchronization

Irwin, David; Gao, Lixin; Shenoy, Prashant

Abstract

The availability of large-scale data sets in many domains has driven the growth of large-scale distributed machine learning (ML) workloads on cloud platforms to derive insights from this data. To reduce the cost of executing these workloads, cloud platforms have begun to offer transient servers at a highly discounted price. Unfortunately, cloud platforms may revoke transient servers at any time, which can decrease distributed ML performance and eliminate any cost benefit. High revocation rates are especially problematic for distributed ML workloads that use synchronous processing, since a revoked server blocks the others from continuing past predefined synchronization barriers until a replacement server reaches the barrier. While asynchronous processing eliminates this blocking and improves performance, it does not maintain the algorithmic properties of synchronous algorithms, resulting in slower convergence or possibly preventing convergence altogether. To maintain performance on low-cost transient servers, this project proposes re-designing traditional distributed ML algorithms to use looser forms of synchrony. Such loose synchronization bridges the gap between synchronous and asynchronous processing by maintaining the algorithmic convergence properties of synchronous processing, while enabling some asynchronous processing to avoid blocking. The project combines this loose synchronization approach with adaptive policies for selecting transient servers based on their performance, cost, and volatility to significantly reduce the cost of executing large-scale distributed ML workloads on cloud platforms.
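The abstract does not specify how the project's loose synchronization works. As a rough illustration of the general idea only, the sketch below uses a bounded-staleness rule, a well-known scheme chosen here as an assumption and not the project's actual design. A worker may start its next iteration only while it is at most `staleness` iterations ahead of the slowest worker. Setting `staleness` to 0 behaves like a synchronization barrier, and setting it to infinity behaves like fully asynchronous processing. All names (`can_advance`, `simulate`, `stalled_worker`, `stall_ticks`) are hypothetical, and a revocation is modeled simply as one worker making no progress for a few ticks.

```python
import math


def can_advance(clocks, worker, staleness):
    """Return True if `worker` may start its next iteration.

    Assumed rule (bounded staleness): a worker may run at most
    `staleness` iterations ahead of the slowest worker.
    """
    return clocks[worker] - min(clocks.values()) <= staleness


def simulate(num_workers, ticks, staleness, stalled_worker, stall_ticks):
    """Simulate iteration counts when one worker is temporarily revoked.

    Each tick, every running worker that is allowed to advance completes
    one iteration. `stalled_worker` makes no progress during `stall_ticks`,
    which stands in for a revoked server awaiting its replacement.
    """
    clocks = {w: 0 for w in range(num_workers)}
    for tick in range(ticks):
        snapshot = dict(clocks)  # all workers decide from the same view
        for w in clocks:
            if w == stalled_worker and tick in stall_ticks:
                continue  # revoked: no progress this tick
            if can_advance(snapshot, w, staleness):
                clocks[w] += 1
    return clocks


if __name__ == "__main__":
    for s in (0, 2, math.inf):
        print(s, simulate(3, 6, s, stalled_worker=2, stall_ticks=range(1, 4)))
```

In this toy run of six ticks, with worker 2 stalled during ticks 1 to 3, the barrier setting leaves every worker at 3 iterations, a staleness bound of 2 lets the two healthy workers reach 5, and the unbounded setting lets them reach 6. The sketch captures only the blocking behavior; it says nothing about convergence, which is the property the abstract says loose synchronization aims to preserve.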

Distributed machine learning (ML) workloads that derive insights from large-scale data sets have become the foundation for numerous advances across multiple industry sectors. This project has the potential to accelerate these advances by significantly decreasing the cost and improving the efficiency of executing distributed ML workloads on cloud platforms using transient servers. To benefit the broader community, the project will publicly release its software artifacts as open source. The project will incorporate topics on transient servers and distributed ML into graduate and undergraduate courses on distributed and operating systems. The project will also involve undergraduates in research through related summer research experience projects and undergraduate theses.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1908536
Program Officer: Marilyn McClure

Budget Start: 2019-10-01
Budget End: 2022-09-30
Fiscal Year: 2019
Total Cost: $500,000

Institution: University of Massachusetts Amherst, Hadley, MA, United States
