Distributed machine learning (ML) is becoming an important way to let multiple learning agents train on separate slices of the same dataset simultaneously and periodically exchange what they have learned over a network. Because of the significant bandwidth gap between networks and processing units, the network is likely to become the bottleneck in such systems. To mitigate this issue, this project is co-designing a distributed ML algorithm and a network system so that training algorithms make better use of network resources. First, a programmable communication subsystem is proposed to accelerate training synchronization. Specifically, a comprehensive study of the impact of network congestion on distributed ML models will be conducted to provide insights. The project is also enhancing existing frameworks by integrating in-network control and exploring synchronization schemes that dynamically adjust learning hyper-parameters based on network signals. Next, a scheduler that optimizes the utilization of heterogeneous computing resources is proposed. To that end, both deterministic and learning-based scheduling algorithms are being explored, and a framework that enables operation-level scheduling for finer-grained control is being developed.

The proposed research investigates in-network control to mitigate network congestion, which remains the biggest challenge for High Performance Computing (HPC) processors. This work will significantly improve the training efficiency of existing distributed training frameworks. In addition, the comprehensive and systematic studies will provide insights into algorithm and system co-design solutions. The developed framework will also help students and researchers in their big data research projects. New courses will be developed based on the outcomes of the proposed work, and a new curriculum and training sessions on networking and distributed ML will be offered in High School Tech Camps during the summer. Source code, raw data, and simulation results generated in the project will be stored in standard formats and published in the public domain. All data will be archived on the departmental servers at Case Western Reserve University (CWRU) for increased availability and reliability.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 2008468
Program Officer: Erik Brunvand
Project Start:
Project End:
Budget Start: 2020-10-01
Budget End: 2023-09-30
Support Year:
Fiscal Year: 2020
Total Cost: $496,595
Indirect Cost:
Name: Case Western Reserve University
Department:
Type:
DUNS #:
City: Cleveland
State: OH
Country: United States
Zip Code: 44106