Over the last decade, distributed computing and big data analytics have enabled unprecedented advancements in human life, including in medicine and health, education, business, and in stimulating new careers. And, it is fundamental to the computing industry, a significant economic engine for the US. However, traditional approaches to distributed computing are developed as ad hoc solutions to individual applications. In the classical paradigm, the system designer specifies a simple model of the network, along with a few low-level design goals, such as high utilization and low job completion time, and then develops a fixed algorithm to distribute the computation across workers. Although this paradigm has resulted in heuristics that work in practice, networks and applications continuously grow in complexity and heterogeneity, hence, the critical task of designing workload distribution algorithms that work well across a variety of conditions has become exceedingly difficult. This proposal addresses that challenge by developing a general framework that can be used as applications and networks grow. Ultimately, it will make distributed computing more explainable and better tailored to the needs of applications.

Workload distribution has a long and rich history. However, the existing literature lacks design principles for reasoning about compute versus communication tradeoffs in large-scale networks. This proposal seeks to develop a principled framework for workload distribution techniques. It aims to provide the mathematical foundations behind function computation in distributed networks, where a function is an abstraction of a computation task, such as training a neural network, indexing the web, query processing, etc. Hence, the operator does not have to rely on heuristics or simplified models to decide on workload distribution. Instead, the proposed framework offers the trade-off space between cost and performance for the best use of available resources. This proposal aims to address the fundamental challenge of parallel function computation in distributed networks and how to enable rigorous mathematical analysis of deployed approaches by (i) developing a series of core principles for workload distribution systems through analyzing a variety of applications, including datacenter job scheduling, decentralized Stochastic Gradient Descent training, and erasure coding for inference jobs, and (ii) devising a novel scheduling framework for distributing computation tasks in distributed networks. The proposed framework leverages Little’s Law to minimize both communication and computation times when designing practical, robust, and high-performance workload distribution algorithms. The PIs will evaluate the proposed scheduler against state-of-the-art heuristic algorithms and pin-point the constraints and features that makes each heuristic a special use case of the generic framework.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Network Systems (CNS)
Type
Standard Grant (Standard)
Application #
2008624
Program Officer
Ann Von Lehmen
Project Start
Project End
Budget Start
2020-10-01
Budget End
2023-09-30
Support Year
Fiscal Year
2020
Total Cost
$333,500
Indirect Cost
Name
Massachusetts Institute of Technology
Department
Type
DUNS #
City
Cambridge
State
MA
Country
United States
Zip Code
02139