More than seven years after traditional processor designs hit the edge of their power envelope, the path of extreme-scale computational science to a 100 petaflop (Pflop/s) system, which researchers had hoped to be using by the middle of the coming decade, has never looked steeper. On current high-performance computing (HPC) systems, the 'application-architecture performance gap,' i.e., the gap between theoretical peak performance and the performance realized by full applications, is already substantial. But with clock frequencies now capped below 4 GHz and trending downward, latencies in key areas (e.g. memory access, bus, system interconnect) expected to remain relatively stagnant, and software parallelism required to increase by at least three orders of magnitude to make effective use of the tens of thousands of processors and millions of cores that 100 Pflop/s systems are projected to contain, it is now reasonable to worry that a widening application-architecture performance gap will make such systems unproductive to use and therefore irrational to build.

The proposed research aims to provide the kind of coordinated math and computer science research effort needed to solve the interrelated cluster of software problems that threaten to cripple application performance on future large-scale systems. Under the PULSAR project, the PIs use a variety of both classic and novel dense linear algebra algorithms to explore the potential of well-known, but now little-used, systolic array design principles to exploit all the power that future multi-core and heterogeneous systems, built to extreme scales, will provide. If a software platform can be created that virtualizes the classic systolic array architecture while allowing suitable flexibility in the granularity of its application, then libraries and applications can be built on this data-driven execution model to achieve outstanding performance and scalability on future massively parallel and data-starved HPC systems.

Project Report

The hypothesis of the PULSAR project was that systolic arrays could serve as inspiration for a programming model capable of scaling to the full size of the largest supercomputers. This means targeting machines that include hundreds of thousands of CPU cores and tens of thousands of GPU accelerators. Beyond their sheer size, these machines also have very complex architectures, with multiple levels of memory and network interconnection. The motivation for seeking a new programming model is that no single unified programming paradigm for such machines exists; instead, at least three different ones are mixed together, a common case being a combination of the MPI, OpenMP, and CUDA programming systems.

What PULSAR offers as an alternative is a programming model where the user defines the computation in the form of a Virtual Systolic Array (VSA), a set of Virtual Data Processors (VDPs) connected by data channels. Each VDP is assigned a function, which defines its operation. Within that function, the VDP has access to a set of global parameters, its private persistent local storage, and its channels. The runtime invokes that function when there are packets in all of the VDP's input channels; this is called firing. When the VDP fires, it can fetch packets from its input channels, call computational kernels, and push packets to its output channels, and these operations may be invoked in any order. The VDP fires a prescribed number of times; when its counter reaches zero, the VDP is destroyed. The VSA contains all VDPs and their channel connections, and stores the information about the mapping of VDPs to CPUs and GPUs. The VSA needs to be created before it is launched: first an empty VSA is created, then VDPs are inserted into it, then the VSA is executed by the PULSAR runtime system, and finally it is destroyed. Figure 1 illustrates the basic concept.

This programming model is accessible to the user through a very small and simple Application Programming Interface (API). All the complexity of executing the workload on a large-scale system is hidden in the runtime implementation. While the user invokes simple push and pull channel operations, the runtime takes the appropriate action depending on the boundary crossed by the channel: it uses shared memory for VDPs residing in the same node, message passing for VDPs residing in different nodes, and Direct Memory Access (DMA) transfers between CPU VDPs and GPU VDPs. The runtime operates a dedicated management thread, called the proxy, which handles communication within each node. The proxy thread manages multiple message queues to handle both intra-node (DMA) and inter-node (MPI) communications. The left image in Figure 2 shows the communication system for nodes with CPUs only, while the right image shows the communication system for nodes with CPUs and GPUs. In the most complex case, where two VDPs reside on GPUs located in two different nodes, the runtime invokes a sequence of DMA and MPI transfers to move the data from one VDP to the other, all while the user only invokes simple push and pull operations.

We used the PULSAR system to implement systolic algorithms for the LU factorization and the QR factorization. Our published results include a 2D systolic algorithm for the QR factorization of square matrices, and a 3D systolic algorithm for the QR factorization of tall-and-skinny matrices.
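To make the firing semantics above concrete, the following is a minimal, single-threaded C mock-up of the VSA/VDP execution model: channels are FIFOs of packets, each VDP carries a function, private persistent local storage, and a firing counter, and a toy scheduler fires a VDP only when its input channel holds a packet, retiring it when its counter reaches zero. All names (Channel, Vdp, Vsa, vsa_run, and the three VDP functions) are illustrative assumptions rather than PULSAR's actual API, and the scheduler merely stands in for the distributed runtime, its proxy thread, and the shared-memory, MPI, and DMA transports it selects per channel.

/*
 * Minimal mock-up of the VSA/VDP execution model described above.
 * Hypothetical names, not PULSAR's API; single-threaded for clarity.
 */
#include <stdio.h>

#define QCAP 64                               /* fixed channel capacity */

typedef struct {                              /* a data channel: a FIFO of packets */
    double buf[QCAP];
    int head, count;
} Channel;

static int chan_empty(const Channel *c) { return c->count == 0; }

static void chan_push(Channel *c, double v) {
    c->buf[(c->head + c->count) % QCAP] = v;
    c->count++;
}

static double chan_pull(Channel *c) {
    double v = c->buf[c->head];
    c->head = (c->head + 1) % QCAP;
    c->count--;
    return v;
}

typedef struct Vdp Vdp;
typedef void (*VdpFunc)(Vdp *self);           /* operation invoked on each firing */

struct Vdp {
    VdpFunc  func;
    Channel *in, *out;                        /* one in/out channel here; real VDPs may have many */
    double   local;                           /* private persistent local storage */
    int      counter;                         /* remaining firings; destroyed at zero */
};

/* Three VDP roles: produce packets, apply a compute kernel, consume packets. */
static void source_fire(Vdp *self) { chan_push(self->out, self->local); self->local += 1.0; }
static void worker_fire(Vdp *self) { chan_push(self->out, 2.0 * chan_pull(self->in) + 1.0); }
static void sink_fire  (Vdp *self) { self->local += chan_pull(self->in); }

typedef struct { Vdp *vdp; int n; } Vsa;      /* the VSA: all VDPs plus their wiring */

/* Fire any VDP whose input channel holds a packet, until every counter reaches zero. */
static void vsa_run(Vsa *vsa) {
    int active = vsa->n;
    while (active > 0)
        for (int i = 0; i < vsa->n; i++) {
            Vdp *v = &vsa->vdp[i];
            if (v->counter == 0) continue;               /* already destroyed */
            if (v->in && chan_empty(v->in)) continue;    /* not ready to fire */
            v->func(v);                                  /* firing            */
            if (--v->counter == 0) active--;             /* VDP retires       */
        }
}

int main(void) {
    Channel a = {0}, b = {0};                 /* channels: source -> worker -> sink */
    Vdp vdps[] = {
        { source_fire, NULL, &a, 0.0, 8 },    /* emit 8 packets                    */
        { worker_fire, &a,   &b, 0.0, 8 },    /* transform each packet             */
        { sink_fire,   &b, NULL, 0.0, 8 },    /* accumulate into local storage     */
    };
    Vsa vsa = { vdps, 3 };                    /* create the VSA, insert the VDPs...*/
    vsa_run(&vsa);                            /* ...then launch it                 */
    printf("sink accumulated %.1f\n", vdps[2].local);    /* prints 64.0           */
    return 0;
}

In a real PULSAR program the same three roles would be written as VDP functions inserted into a VSA spanning many nodes, and each push or pull would be turned by the runtime into a shared-memory handoff, an MPI message, or a DMA transfer, depending on where the two endpoints of the channel reside.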
We ran the 2D algorithm on up to 23,000 CPU cores and the 3D algorithm on up to 15,000 cores, and showed that they consistently outperform existing numerical software. We envision the PULSAR system remaining useful far beyond its funding period. While we proved the benefits of PULSAR for implementing dense linear algebra algorithms, we clearly see its applicability to a much larger range of problems, including stencil computations and neural networks. We expect PULSAR to have significant impact on future cyberinfrastructure, as it represents the type of system that most future software libraries and applications will require in order to achieve high performance on tomorrow's large-scale and highly hybrid supercomputers. In our view, PULSAR also provides tremendous educational value, thanks to its simple model for mapping algorithms to very large and complex computing systems, its open-source nature, and a robust object-oriented implementation accompanied by thorough documentation.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 1117062
Program Officer: Almadena Chtchelkanova
Budget Start: 2011-08-01
Budget End: 2014-07-31
Fiscal Year: 2011
Total Cost: $499,996
Name: University of Tennessee Knoxville
City: Knoxville
State: TN
Country: United States
Zip Code: 37916