OAC Core: SMALL: DeepJIMU: Model-Parallelism Infrastructure for Large-scale Deep Learning by Gradient-Free Optimization

Zhao, Liang; Cheng, Yue

Abstract

In recent years, the use of deep neural networks (DNNs) has been increasing to obtain useful insights for scientific explorations, business management, security, and healthcare. The constant improvement of DNN model performance has been accompanied by an increase in their complexity and size, which indicate a clear trend toward larger and deeper models. Such a trend is especially the case for numerous important application domains, such as remote sensing where super-high-resolution geospatial image processing is required. Such applications lead to a huge challenge for the training of very large models to fit on a single computing device (e.g., a graphics processing unit, GPU), and hence raises urgent demands for partitioning such models across multiple computing devices and parallelizing the training process (i.e., model parallelism). However, until now model parallelism for DNNs has been poorly explored and is very difficult due to the inherent bottleneck from the backpropagation algorithm, where the training of one layer closely depends on input from all the previous layers. To overcome these challenges, this project aims a radically new pathway toward model parallelism infrastructure for large-scale DNNs based on optimization methods that do not rely on backpropagation for training. This project plans to address the challenges of training very large and very deep neural network models that require huge amounts of high-dimensional data. The project will develop new optimization techniques and distributed DNN training software infrastructure to enable wider applications and deployment of model parallel deep learning training. The project includes educational and engagement activities that will greatly increase the community's understanding of distributed machine learning algorithms and systems. Those activities include teaching and training students and peers, providing graduate and undergraduate students with new courses, and research and internship opportunities, as well as broadening participation of underrepresented groups and students at local high schools.

This project brings together researchers in machine learning algorithms, distributed computing systems, remote sensing, and spatial data science, to boost the performance and scalability of deep learning applications enhanced by model parallelism. Specifically, this project focuses on proposing and developing a suite of new model parallelism optimization algorithms and system infrastructure for training large-scale DNNs, especially for image processing of massive datasets for geospatial scientific research. To enable model parallelism in the training, new gradient-free optimization methods are proposed to break down the whole problem of DNN optimization into subproblems, which can then be solved separately in parallel (by many workers) with high efficiency. The products of this project include new theories and algorithms for model parallelism, along with an efficient gradient-free DNN training framework with new scheduling and work balancing techniques. Specifically, this project has the following research thrusts: 1) Develop new gradient-free methods for training various types of DNNs; 2) Designing an algorithmic and theoretical framework of model parallelization based on gradient-free optimization; and 3) Building a scalable and efficient distributed training framework for a broad range of model parallel DNN training applications, such as deep learning for large graphs and very deep convolutional neural networks for image processing. This project also involves both theoretical and experimental comparison between the new techniques and current state-of-the-art methods, including those using gradient-based optimizations and pipeline parallelism.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 2007976
Program Officer: Alan Sussman

Project Start
Project End
Budget Start: 2020-10-01
Budget End: 2020-12-31
Support Year
Fiscal Year: 2020
Total Cost: $498,609
Indirect Cost

OAC Core: SMALL: DeepJIMU: Model-Parallelism Infrastructure for Large-scale Deep Learning by Gradient-Free Optimization
Zhao, Liang Cheng, Yue
George Mason University, Fairfax, VA, United States

Abstract

Funding Agency

Institution

Comments

Recent in Grantomics:

Recently viewed grants:

Recently added grants:

Abstract

Funding Agency

Institution

Comments