Successful cloud deployment of machine learning services such as language translation, image search, and home assistants requires a high-performance serving system that can process hundreds of thousands of requests per second. It is particularly crucial for the serving system to ensure low latency: even a few tens of milliseconds of added delay can annoy users of a service like a home assistant. Among widely used deep learning models, recurrent neural networks (RNNs) are an important class that incurs high latency when processed by existing serving systems. This project aims to develop a new serving system that handles a variety of Artificial Intelligence (AI) tasks using RNN-based deep learning models with significantly improved latency.

To achieve good throughput on modern hardware, one must perform batched computation. This project develops a new, dynamic approach to batching called Cellular Batching, which performs batching and execution at the granularity of a "cell" (i.e., a subgraph with embedded model weights) instead of the entire dataflow graph, as existing systems do. Under Cellular Batching, a new request can immediately join the execution of ongoing requests, minimizing queuing delay and increasing effective batch size. The project will complete research tasks that make Cellular Batching practical (by developing an efficient scheduler and supporting zero-downtime model upgrades) and that generalize it to other models such as search-guided RNNs.
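The core idea can be sketched in a few lines of Python. This is a minimal, hypothetical simulation, not code from the actual system: the class and function names (`Request`, `run_cell`, `cellular_batching`) are illustrative assumptions. Each RNN request needs one cell invocation per remaining token, and the scheduler re-forms the batch at every cell invocation, so a request that arrives mid-stream joins in-flight requests at the very next step rather than waiting for the current graph-level batch to finish.

```python
from collections import deque

class Request:
    """An RNN inference request needing one cell invocation per token."""
    def __init__(self, rid, num_tokens):
        self.rid = rid
        self.remaining = num_tokens  # cell invocations still needed

def run_cell(batch):
    """Stand-in for one batched RNN-cell execution on an accelerator."""
    for req in batch:
        req.remaining -= 1

def cellular_batching(arrivals, max_batch=4):
    """Simulate cell-granularity batching.

    arrivals: dict mapping time step -> list of Requests arriving then.
    Returns a trace of (step, [request ids]) showing each batch formed.
    """
    active, trace, step = deque(), [], 0
    while active or any(s >= step for s in arrivals):
        # New arrivals join the pool of in-flight requests immediately.
        active.extend(arrivals.get(step, []))
        batch = list(active)[:max_batch]
        if batch:
            run_cell(batch)
            trace.append((step, [r.rid for r in batch]))
            for r in batch:
                if r.remaining == 0:
                    active.remove(r)
        step += 1
    return trace
```

For example, if request A (3 tokens) arrives at step 0 and request B (2 tokens) at step 1, B is batched together with A starting at step 1, while a graph-level batcher would have made B wait until A's entire sequence finished. The real scheduler must additionally weigh batch size against latency and handle hardware constraints, which this sketch ignores.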

Deep learning models based on RNNs are widely used to accomplish AI tasks ranging from speech recognition and language translation to question answering. As such, there is a pressing demand for a high-throughput, low-latency serving system to improve end-user experience and reduce the cost of deployment. By demonstrating significant latency and throughput benefits, Cellular Batching has high potential for wide adoption. This project will also develop a new course component on high-performance machine learning systems as part of the graduate-level distributed systems course.

This project will produce data in the form of source code, serving benchmarks, and experimental results. The source code and all benchmarks used in the experiments will be distributed via GitHub. A local copy of the source code and the publications produced by the project will also be made available at http://batchmaker.news.cs.nyu.edu for at least three years beyond the award period.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1816717
Program Officer: Erik Brunvand
Project Start:
Project End:
Budget Start: 2018-09-01
Budget End: 2021-08-31
Support Year:
Fiscal Year: 2018
Total Cost: $411,325
Indirect Cost:
Name: New York University
Department:
Type:
DUNS #:
City: New York
State: NY
Country: United States
Zip Code: 10012