CRAM: A Congestion-Aware Resource and Allocation Manager for Data-Intensive High-Performance Computing

Burns, Randal; Griffin, John

Abstract

This project will develop a job scheduling and resource allocation system for data-intensive high-performance computing (HPC) based on congestion pricing of a system's heterogeneous resources. This extends the concept of resource management beyond processing: it allocates memory, disk I/O, and the network among jobs. The research will overcome the critical shortcomings of processor-centric resource management, which wastes huge portions of cluster and supercomputer resources for data-intensive workloads. For example, I/O bandwidth governs the performance of many modern HPC applications but, at present, it is neither allocated nor managed. The research will develop techniques that (1) reconfigure the degree of parallelism of HPC jobs to avoid congestion and wastage, (2) support lower-priority, allocation-elastic jobs that can be scheduled on arbitrary numbers of nodes to consume unallocated resource fragments, and (3) co-schedule batch-processing workloads that use system resources left unoccupied by asymmetric utilization and temporal shifts in the foreground jobs. These techniques will be implemented and supported for free public use as extensions to an open-source resource-management framework. If used broadly, the software has the potential to provide much better utilization of the national investment in HPC facilities.
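The idea of pricing heterogeneous resources by congestion can be illustrated with a minimal sketch. It is not the project's actual pricing model: the price curve `base / (1 - utilization)`, the resource names, and the functions `congestion_price` and `job_cost` are all assumptions chosen for illustration. The point is only that a job's cost depends on which resources it stresses, not just on how many processors it asks for.

```python
# Hypothetical illustration of congestion pricing; not the CRAM implementation.

def congestion_price(utilization, base=1.0):
    """Return a per-unit price that grows as a resource nears saturation.

    The curve base / (1 - utilization) is an assumed example; any function
    that rises steeply near full utilization would serve the same purpose.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return base / (1.0 - utilization)


def job_cost(demand, utilization):
    """Sum the priced demand of a job over every resource it requests."""
    return sum(amount * congestion_price(utilization[name])
               for name, amount in demand.items())


# Assumed cluster state: processors half busy, disk I/O three-quarters busy.
utilization = {"cpu": 0.5, "disk_io": 0.75}

cpu_bound_job = {"cpu": 4, "disk_io": 1}
io_bound_job = {"cpu": 1, "disk_io": 4}

print(job_cost(cpu_bound_job, utilization))  # 12.0
print(job_cost(io_bound_job, utilization))   # 18.0
```

Both jobs request five resource units in total, but the I/O-bound job costs more because it leans on the more congested resource. A processor-centric allocator would treat the two requests as equivalent.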

Project Report

The goal of this project was to create techniques for computing with big data that fully utilize the capabilities of modern hardware. Prior techniques for allocating resources are processor-centric; they distribute compute cycles to parallel jobs and do not account for memory and disk bottlenecks. We developed a suite of job scheduling and storage management tools that are data-centric and provide huge performance gains for big data computing.
High IOPS Storage Systems: We built high IOPS (I/O operations per second) storage systems that overcome the write bottlenecks for random workloads. This work scales single-system I/O to the extreme, building engines that fully utilize the capabilities of shared-memory hardware, specifically massive non-uniform memory architectures and arrays of solid-state storage devices (SSDs). To realize more than 1 million IOPS, we overcame obstacles to system scalability such as remote memory performance, processor/device affinities, and operating system resource contention.
Data-Driven Batch Scheduling: We built a data-driven scheduling framework for high-performance computing (HPC) applications that have overlapping data requirements. It embodies two principles: schedule the execution of a workload in the order that produces the most efficient I/O schedule, and identify I/O shared among different jobs so that it is performed once to meet the requirements of all jobs. This framework turns scheduling upside down. The HPC tradition schedules execution order and derives I/O requests from that order; our techniques schedule I/O and derive a processing order from the preferred I/O schedule. Using data-driven scheduling, we compute queries to the Johns Hopkins Turbulence Database (http://turbulence.pha.jhu.edu) at the aggregate streaming I/O rate of the disk array, improving performance by a factor of two to eight.
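The two principles of data-driven batch scheduling can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the framework built in the project: the function `data_driven_schedule`, the job names, and the use of integer block ids whose ascending order stands in for an efficient sequential I/O order are all hypothetical.

```python
# Hypothetical sketch of data-driven batch scheduling; names are illustrative.
from collections import defaultdict


def data_driven_schedule(jobs):
    """Derive a processing order from an I/O order.

    jobs maps a job id to the data blocks that job needs. The result is a
    list of (block, job_ids) pairs in ascending block order, so each block
    is read once and handed to every job that needs it.
    """
    readers = defaultdict(list)
    for job_id, blocks in jobs.items():
        for block in set(blocks):  # ignore repeats within one job
            readers[block].append(job_id)
    return [(block, sorted(readers[block])) for block in sorted(readers)]


# Three assumed queries with overlapping data requirements.
jobs = {"q1": [3, 1, 2], "q2": [2, 3, 4], "q3": [1, 4]}

schedule = data_driven_schedule(jobs)
for block, job_ids in schedule:
    print(block, job_ids)

# Reads if every job fetched its own blocks, versus reads with sharing.
separate_reads = sum(len(set(blocks)) for blocks in jobs.values())
shared_reads = len(schedule)
print(separate_reads, shared_reads)  # 8 4
```

In this toy workload the shared schedule issues four sequential reads instead of eight job-ordered ones. The execution order of the queries falls out of the block order rather than the other way around.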
Classroom Education: This grant also funded the development of two new undergraduate and graduate computer science courses that focus on big data. Parallel Programming has taught more than 300 students to abandon the comfort of serial algorithmic thinking and to harness the power of supercomputers, clouds, GPUs, and multi-core processors. Data-Intensive Computing is an experiential education course that uses 10 hours of classroom contact and team programming to build data systems and algorithms on the Amazon cloud.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computing and Communication Foundations (CCF)
Application #: 0937810
Program Officer: Almadena Y. Chtchelkanova

Budget Start: 2009-09-01
Budget End: 2013-08-31
Fiscal Year: 2009
Total Cost: $495,000

Institution: Johns Hopkins University, Baltimore, MD, United States
