In the era of large-scale deep learning (DL) and massive data, existing hardware systems struggle to accommodate heavy, complex computing workloads because of the difficulty of scheduling highly dynamic, heterogeneous, and competing tasks from many users across many machines in a cluster or data-center environment. This project aims to develop a "1-click", demand-aware, and responsive software system capable of simultaneously training a wide spectrum of DL tasks, built on a new resource-management architecture that automatically and adaptively chooses the most effective distributed training/serving techniques and their hyperparameters to achieve the best overall efficiency across multiple tasks in such environments.

This interdisciplinary project innovates in distributed systems design, DL algorithm design, and related industrial applications and theoretical analyses, organized into three thrusts. Thrust 1: Develop a framework for "ML-aware" resource management and scheduling of multiple simultaneously running training tasks. Thrust 2: Develop principled resource-management and scheduling strategies for serving, streaming, and heterogeneous-task settings. Thrust 3: Optimize memory resources for training large-parameter models by developing holistic approaches that maximize computation throughput subject to device memory bounds. A limited-scope but rigorous and practical theoretical analysis of some of the proposed architectures will also be performed.

This project addresses the needs of both the academic and industrial communities and will have a broad impact on each. It will provide easy-to-use tools that reduce setup time and facilitate large-scale experimentation while lowering costs, whether measured in cluster-access quotas or dollars spent on cloud services. The impact on commercial practitioners will be even greater, improving their productivity by an order of magnitude or more, since they must contend with heterogeneous computing and network resources shared among many users as well as the need to run many jobs on a regular basis.

The team will release and/or open-source the code at http://sailing-lab.wixsite.com/sailing-pmls to benefit researchers and practitioners, to share lessons learned and encourage more research on machine learning (ML) systems problems, and to democratize high-performance ML systems by making them accessible to software developers without ML training and to society at large, in domains such as industrial manufacturing, healthcare, biology, social science, and finance, where the results may have a catalytic impact. The team will publish results at a variety of top-tier conferences in machine learning (NeurIPS, ICML), systems (OSDI, SOSP, USENIX), and data mining (KDD, WWW).

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 2008248
Program Officer: Erik Brunvand
Project Start:
Project End:
Budget Start: 2020-10-01
Budget End: 2023-09-30
Support Year:
Fiscal Year: 2020
Total Cost: $499,910
Indirect Cost:
Name: Carnegie-Mellon University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213