Machine learning systems are critical drivers of new technologies such as near-perfect automatic speech recognition, autonomous vehicles, computer vision, and natural language understanding. The underlying inference engine for many of these systems is based on neural networks. Before a neural network can be used for these inference tasks, it must be trained using a data corpus of known input-output pairs. This training process is very computationally intensive, with current systems requiring weeks to months of time on graphics processing units (GPUs) or central processing units in the cloud. As more data becomes available, this problem of long training time is further exacerbated because larger, more effective network models become desirable. The theoretical understanding of neural networks is limited, so experimentation and empirical optimization remain the primary tools for understanding deep neural networks and innovating in the field. However, the ability to conduct larger-scale experiments is becoming concentrated in a few large entities with the necessary financial and computational resources. Even for those with such resources, the painfully long experimental cycle for training neural networks means that large-scale searches and optimizations over the neural network model structure are not performed. The ultimate goal of this research project is to democratize and distribute the ability to conduct large-scale neural network training and model optimization at high speed, using hardware accelerators. Reducing the training time from weeks to hours will allow researchers to run many more experiments, gaining insight into the fundamental inner workings of deep learning systems. The hardware accelerators are also much more energy efficient than the existing GPU-based training paradigm, so advances made in this project can significantly reduce the energy consumption required for neural network training tasks.

This project comprises an interdisciplinary research plan that spans theory, hardware architecture and design, software control, and system integration. A new class of neural networks with pre-defined sparsity is being explored. These sparse neural networks are co-designed with a very flexible, high-speed, energy-efficient hardware architecture that maximizes circuit speed for any model size in a given Field-Programmable Gate Array (FPGA) chip. This algorithm-hardware co-design is a key research theme that differentiates this approach from previous research, which enforces sparsity during the training process in a manner incompatible with parallel hardware acceleration. In particular, the proposed architecture operates on all network layers simultaneously, executing forward- and back-propagation in parallel and fully pipelined across layers. With high-precision arithmetic, a speed-up of about 5X relative to GPUs is expected. Using log-domain arithmetic, these gains are expected to increase to 100X or more. Software and algorithms are being developed to manage multiple FPGA boards, simplifying and automating the model search and training process. These algorithms exploit the ability to reconfigure the FPGAs to trade speed for accuracy, a capability lacking in GPUs. These software tools will also serve as a bridge to popular Python libraries used by the machine learning community.
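To make the pre-defined sparsity idea concrete, the sketch below is a minimal illustration, not the project's actual implementation: a binary connectivity mask is fixed before training and applied to both the weights and the weight gradient, so the sparse structure is known in advance and never changes during training. The layer sizes, density level, and function names are illustrative assumptions.

```python
# Minimal sketch of a layer with pre-defined sparsity (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def make_sparse_layer(n_in, n_out, density=0.2):
    """Create a weight matrix and a fixed connectivity mask chosen before training."""
    mask = (rng.random((n_in, n_out)) < density).astype(np.float32)
    weights = rng.standard_normal((n_in, n_out)).astype(np.float32) * mask
    return weights, mask

def forward(x, weights):
    """Forward pass: only the pre-defined connections contribute."""
    return x @ weights

def backward(x, grad_out, weights, mask, lr=0.01):
    """Backward pass: mask the weight gradient so the fixed sparsity is preserved."""
    grad_w = (x.T @ grad_out) * mask   # gradients of pruned connections are discarded
    grad_x = grad_out @ weights.T      # gradient with respect to the layer input
    weights -= lr * grad_w             # SGD update on the surviving connections only
    return grad_x

# Toy usage: one layer, one gradient step.
W, M = make_sparse_layer(8, 4, density=0.25)
x = rng.standard_normal((3, 8)).astype(np.float32)
y = forward(x, W)
_ = backward(x, np.ones_like(y), W, M)
```

Because the connectivity pattern is fixed up front, a hardware implementation can hard-wire only the surviving multiply-accumulate paths, which is what allows the sparse structure to map onto a fully pipelined FPGA datapath rather than relying on dynamic pruning during training.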

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Computing and Communication Foundations (CCF)
Application #: 1763747
Program Officer: Almadena Chtchelkanova
Budget Start: 2018-07-01
Budget End: 2021-06-30
Fiscal Year: 2017
Total Cost: $1,199,849
Name: University of Southern California
City: Los Angeles
State: CA
Country: United States
Zip Code: 90089