CNS Core: Small: Mitigating Network Bottlenecks via Programmability for Distributed Machine Learning Systems

Wang, An

Abstract

Distributed machine learning (ML) is becoming an important way to let multiple learning agents train on separate slices of the same dataset simultaneously and periodically exchange what they have learned over a network. Because of the significant bandwidth gap between the network and processing units, the network is likely to become the bottleneck in these systems. To mitigate this issue, this project is co-designing distributed ML algorithms and the network system so that training algorithms adapt to make better use of network resources. First, a programmable communication subsystem is proposed to accelerate training synchronization. Specifically, a comprehensive study of the impact of network congestion on distributed ML models will be conducted to provide insights. The project is also enhancing existing frameworks by integrating in-network control and exploring synchronization schemes that dynamically adjust learning hyper-parameters based on network signals. Next, a scheduler that optimizes the utilization of heterogeneous computing resources is proposed. To that end, both deterministic and learning-based scheduling algorithms are being explored, and a framework that enables operation-level scheduling for finer-grained control is being developed.
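As an illustration only, the idea of a synchronization scheme that reacts to network signals can be sketched in a few lines of Python. This is not the project's algorithm: the doubling/halving rule, the round-trip-time (RTT) signal, the thresholds, and every function name below are assumptions made for the example. Workers take local gradient steps on a toy quadratic objective and average their parameters every `interval` steps; the interval grows when the measured RTT suggests congestion and shrinks when the network is quiet.

```python
from typing import List, Sequence, Tuple


def adapt_sync_interval(interval: int, rtt_ms: float, baseline_rtt_ms: float,
                        min_interval: int = 1, max_interval: int = 64) -> int:
    """Hypothetical rule: synchronize less often when the network looks congested.

    The 2.0x and 1.2x thresholds are illustrative assumptions, not measured values.
    """
    if rtt_ms > 2.0 * baseline_rtt_ms:
        interval *= 2          # congested: take more local steps between syncs
    elif rtt_ms < 1.2 * baseline_rtt_ms:
        interval //= 2         # quiet: synchronize more often
    return max(min_interval, min(max_interval, interval))


def average_parameters(worker_params: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of all workers' parameter vectors."""
    n = len(worker_params)
    return [sum(column) / n for column in zip(*worker_params)]


def local_sgd_step(w: Sequence[float], target: Sequence[float], lr: float) -> List[float]:
    """One gradient step on the toy loss 0.5 * ||w - target||^2."""
    return [wi - lr * (wi - ti) for wi, ti in zip(w, target)]


def train(targets: Sequence[Sequence[float]], rtt_trace_ms: Sequence[float],
          baseline_rtt_ms: float = 1.0, lr: float = 0.1) -> Tuple[List[float], int]:
    """Run one local step per RTT sample; return (averaged parameters, sync count)."""
    dim = len(targets[0])
    params = [[0.0] * dim for _ in targets]   # one parameter vector per worker
    interval, steps_since_sync, sync_count = 1, 0, 0
    for rtt in rtt_trace_ms:
        # Each worker trains on its own data slice (here: its own target).
        params = [local_sgd_step(w, t, lr) for w, t in zip(params, targets)]
        steps_since_sync += 1
        if steps_since_sync >= interval:
            avg = average_parameters(params)
            params = [list(avg) for _ in params]   # every worker adopts the average
            sync_count += 1
            steps_since_sync = 0
            interval = adapt_sync_interval(interval, rtt, baseline_rtt_ms)
    return average_parameters(params), sync_count


if __name__ == "__main__":
    targets = [[1.0], [3.0]]                       # two workers, different data
    print(train(targets, [1.0] * 200))             # quiet network
    print(train(targets, [5.0] * 200))             # congested network
```

In this toy setting the congested run synchronizes far fewer times than the quiet run. Both runs reach the same averaged parameters only because the toy objective is quadratic, so averaging commutes with the local updates; real models would generally trade some accuracy or convergence speed for the reduced communication.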

The proposed research investigates in-network control to mitigate network congestion, which remains the biggest challenge for High Performance Computing (HPC) processors. It will significantly improve the training efficiency of existing distributed training frameworks. In addition, the comprehensive and systematic studies will provide insights into algorithm and system co-design solutions. The developed framework will also help students and researchers in their big data research projects. New courses will be developed based on the outcomes of the proposed work, and new curricula and training sessions on networking and distributed ML will be developed for the summer High School Tech Camps. Source code, raw data, and simulation results generated in the project will be stored in standard formats and published in the public domain. All data will be archived on the departmental servers at Case Western Reserve University (CWRU) for increased availability and reliability.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 2008468
Program Officer: Erik Brunvand

Project Start
Project End
Budget Start: 2020-10-01
Budget End: 2023-09-30
Support Year
Fiscal Year: 2020
Total Cost: $496,595
Indirect Cost

Institution: Case Western Reserve University, Cleveland, OH, United States
