Trends in the design of high-performance computing systems are making them increasingly difficult to use effectively. This EAGER award focuses on the problem of managing massive parallelism while exploiting data locality. Its goal is to develop breakthrough techniques for parallel runtime systems that will support libraries, applications, and other software infrastructure on the next generation of high-performance systems.
The PI proposes to develop the Parallel Runtime Scheduling and Execution Controller (PaRSEC), which will serve as a prototype for novel ideas about parallel runtime systems. The PI's approach is based on scalable directed acyclic graph (DAG) scheduling techniques that track data dependencies between tasks. The scheduling structures will be designed to handle billions of tasks running on millions of computational nodes, and the scheduling framework will also handle task migration and load balancing to maximize parallelism while preserving data locality. The prototypes developed as part of this EAGER will provide the foundation for future research on parallel linear algebra routines for extreme-scale computing systems.
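The scheduling idea at the heart of this approach, counting each task's unsatisfied data dependencies and releasing a task to run once the count reaches zero, can be sketched in a few lines of C. The code below is a minimal, sequential illustration of this dependency-counting scheme; the task structure, field names, and the small diamond-shaped example graph are invented for the illustration and are not PaRSEC's actual interface.

/*
 * Minimal, illustrative sketch of dependency-counting DAG execution.
 * Each task records how many of its inputs are still unsatisfied; a task
 * becomes ready when that count reaches zero, and completing it releases
 * its successors.  (Hypothetical structures, not the PaRSEC API.)
 */
#include <stdio.h>

#define NTASKS 4

typedef struct {
    const char *name;
    int deps_remaining;        /* inputs not yet produced                */
    int successors[NTASKS];    /* indices of tasks consuming this output */
    int nsucc;
} task_t;

static void run_dag(task_t *tasks, int n)
{
    int ready[NTASKS], head = 0, tail = 0;

    /* Seed the ready queue with tasks that have no inputs. */
    for (int i = 0; i < n; i++)
        if (tasks[i].deps_remaining == 0)
            ready[tail++] = i;

    while (head < tail) {
        task_t *t = &tasks[ready[head++]];
        printf("executing %s\n", t->name);      /* do the task's work */
        for (int s = 0; s < t->nsucc; s++)      /* release successors */
            if (--tasks[t->successors[s]].deps_remaining == 0)
                ready[tail++] = t->successors[s];
    }
}

int main(void)
{
    /* Diamond-shaped DAG: A feeds B and C, which both feed D. */
    task_t tasks[NTASKS] = {
        { "A", 0, { 1, 2 }, 2 },
        { "B", 1, { 3 },    1 },
        { "C", 1, { 3 },    1 },
        { "D", 2, { 0 },    0 },
    };
    run_dag(tasks, NTASKS);
    return 0;
}

In an actual runtime the ready queue would be served concurrently by worker threads bound to CPU cores and accelerators, with the dependency counts updated atomically; the sketch shows only the bookkeeping that makes data-dependency-driven scheduling possible.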
The purpose of the Parallel Runtime Scheduling and Execution Controller (PaRSEC) system is to provide a programming environment that facilitates the development of scientific and engineering computing software packages for the largest supercomputers. Such supercomputers are built of hundreds of cabinets, each cabinet containing tens of circuit boards (server blades), each board containing multiple processor (CPU) sockets, and each socket housing a chip with multiple CPU cores; the total number of cores in the largest systems reaches millions. The boards often also contain more exotic chips, such as Graphics Processing Units (GPUs), which serve as hardware accelerators to speed up critical portions of the computation. All of these computing components are tied together by a high-performance network.

PaRSEC's solution to programming such large and complex systems is twofold: 1) PaRSEC defines a programming model that shields the software developer from the complexities of modern supercomputers, and 2) PaRSEC provides the software layer that executes users' code on such a system using dataflow principles, i.e., it schedules work to cores and moves data as necessary through the complex hierarchy of the memory system. PaRSEC accomplishes its objectives by representing an algorithm as a collection of tasks, i.e., atomic operations executed by a single CPU core or a single GPU processing unit. The tasks are connected by edges that represent the flow of data between them, and the resulting structure is referred to as the task graph or, more formally, a Directed Acyclic Graph (DAG). DAG scheduling is the main principle of PaRSEC's operation. One of PaRSEC's main application areas is dense linear algebra, where the DAGs that arise are so large that they cannot be built in their entirety without exceeding the memory capacity of the hardware. To address this problem, PaRSEC relies on a symbolic representation of the DAG, called a Parametrized Task Graph (PTG), which describes large DAGs in a compact manner (a schematic sketch of this idea appears at the end of this summary).

By the end of this project's funding cycle, the PaRSEC system had been successfully used to automatically translate an important subset of the PLASMA numerical library for shared-memory systems into the core of the DPLASMA library for distributed-memory systems with multicore processors and accelerators. The subset includes crucial numerical operations, such as the solution of linear systems of equations and least squares problems. This objective was accomplished by providing the serial code from PLASMA (loop nests) to the front-end/compiler layer of PaRSEC, which automatically translates it into the PTG form. The PTG representation of the DAG is then executed by the runtime component of PaRSEC, which takes care of dynamically scheduling work to CPU cores and GPU accelerators and moving the data around as necessary. The performance delivered by PaRSEC for these routines is superior to that of the ScaLAPACK software, which, at the time of this writing, is the only viable alternative.

The impact of PaRSEC has both cyberinfrastructure and educational aspects. In terms of cyberinfrastructure, PaRSEC represents the type of software that most future software libraries and applications will require in order to achieve high performance on tomorrow's large-scale and highly hybrid supercomputers.
The educational aspect of PaRSEC lies in automating code development for large-scale machines through a transparent methodology aimed at broadening the user's insight into their computational problem and its performance, parallelism, and scalability characteristics.
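As a concrete illustration of the Parametrized Task Graph concept described above, the following sketch shows in plain C how a family of tasks and its dependencies can be described by parameter ranges and small functions of those parameters, rather than by an explicitly stored graph. The task class T(k, n), its chain-style dependencies, and all names in the code are invented for the illustration; this is not PaRSEC's JDF notation or API.

/*
 * Illustrative sketch of the Parametrized Task Graph (PTG) idea: the DAG
 * is never enumerated or stored.  A task class T(k, n) is described by the
 * ranges of its parameters and by functions that derive its dependencies
 * from those parameters, so an arbitrarily large graph is captured in a
 * few lines.  (Hypothetical example, not PaRSEC's actual input format.)
 */
#include <stdio.h>

#define K 3   /* steps per chain    */
#define N 4   /* independent chains */

typedef struct { int k, n; } task_id;

/* Dependencies are algebraic functions of the parameters:
 * T(k, n) consumes the output of T(k - 1, n) when k > 0. */
static int num_inputs(task_id t)    { return t.k > 0 ? 1 : 0; }
static task_id successor(task_id t) { return (task_id){ t.k + 1, t.n }; }

static void execute(task_id t)
{
    printf("T(%d,%d) with %d input(s)\n", t.k, t.n, num_inputs(t));
}

int main(void)
{
    /* Each chain starts from its parameterized root T(0, n); following the
     * symbolic successor relation unrolls only the tasks being executed. */
    for (int n = 0; n < N; n++)
        for (task_id t = { 0, n }; t.k < K; t = successor(t))
            execute(t);
    return 0;
}

In PaRSEC, the front-end/compiler layer produces this kind of symbolic representation from the serial loop nests, and the runtime component evaluates it to discover, schedule, and communicate tasks and data without ever materializing the full graph.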