Many high-end scientific applications perform stencil computations in their inner loops. A stencil defines the value of a grid point in a d-dimensional spatial grid at time t as a function of neighboring grid points at recent times before t. Stencil computations are conceptually simple to implement using nested loops, but looping implementations suffer from poor cache performance on multicore processors. Cache-oblivious divide-and-conquer stencil codes can achieve an order of magnitude improvement in cache efficiency over looping implementations, but most programmers find it difficult to write cache-oblivious stencil codes. Moreover, open problems remain in adapting these algorithms to realistic applications that lack the perfect regularity of simple examples. This project's investigation of cache-oblivious stencil compilation enables ordinary programmers of stencil computations to enjoy the benefits of multicore technology without requiring them to write code any more complex than naive nested loops.

The research project is developing a language embedded in C++ that can express stencil computations concisely and can be compiled automatically into highly efficient algorithmic code for multicore processors and other platforms. The Pochoir stencil compiler compiles stencil computations that exhibit complex boundary conditions, such as periodic, constant, Dirichlet, Neumann, mirrored, and phase factors; irregularities, including macroscopic and microscopic inhomogeneities, as well as irregular shapes; general complex dependencies, such as push dependencies, horizontal dependencies, and dynamic dependencies. To achieve these goals, the researchers are developing provably good algorithms for complex stencil computations; exploring how domain-specific compiler technology can achieve speedups from efficient cache management, processor-pipeline scheduling, and parallel computation; investigating how to run stencils efficiently on a wide variety of architectures such as multicore, distributed-memory clusters, graphical processing units, FPGA's, and future exascale machines; demonstrating the effectiveness of their research by developing a production-quality stencil compiler; developing a benchmark suite and benchmarking system for evaluating Pochoir.

This research enables scientific researchers and others to easily produce highly efficient codes for complex stencil computations. The codes make good use of the memory hierarchy and processor pipelines endemic to multicore processors and run fast on a diverse set of hardware platforms. This research eases the development and maintenance of a wide variety of stencil-based applications ? ranging across physics, biology, chemistry, energy, climate, mechanical and electrical engineering, finance, and other areas ? benefiting these application areas, as well as society at large.

Project Report

This research investigated the problem of generating high-efficiency cache-oblivious code for real-world stencil computations that make good use of the memory hierarchy and processor pipelines, starting with simple-to-write linguistic specifications. The goal was to make a wide variety of stencil-based applications --- ranging across physics, biology, chemistry, energy, climate, mechanical and electrical engineering, finance, and other areas --- easier to develop and maintain, benefiting these application areas, as well as society at large. The outcome was a robust stencil compiler that handles homogeneous stencil rules on rectangular shapes, simple boundary conditions, and simple dependency rules. Approaches to handle complex boundary conditions, inhomogeneities, irregular shapes, and complicated dependencies were also investigated. Results were also obtained on performing stencil computations on unstructured graphs, achieving pipeline parallelism on the fly, and delivering optimal cache performance without trading off parallelism. The researchers developed PRISM, a chromatic-scheduling algorithm for executing dynamic data-graph computations. Data-graph computations are an extension of stencil computations from simple grids to general unstructured graphs. PRISM uses a vertex-coloring of the graph to coordinate updates performed in a round, precluding the need for mutual-exclusion locks or other nondeterministic data synchronization. PRISM has provably good bounds on work and span, and these theoretical guarantees are matched by good empirical performance. The researchers also developed PRISM-R, a variation of PRISM that executes dynamic data-graph computations deterministically even when updates modify global variables with associative operations. PRISM-R satisfies the same theoretical bounds as PRISM, but its implementation is more involved. The researchers investigated how a concurrency platform for fork-join parallelism can be enhanced to support pipeline parallelism in a provably efficient manner. Indeed, one can imagine performing a stencil computation as a pipeline. The researchers proposed simple linguistics for specifying pipeline parallelism, designed a provably efficient scheduling algorithm that integrates pipeline parallelism into a work-stealing scheduler for fork-join parallelism, and analyzed the algorithm. The proposed linguistics allow the structure of the pipeline to be specified "on the fly," as opposed to the "build-and-run" approach taken by existing concurrency platforms, thereby allowing a more flexible structure of the pipeline. The proposed mechanism has been incorporated into Cilk-M, a Cilk-based work-stealing runtime system. Benchmark results indicate that the prototype implementation has low serial overhead and good scalability. The researchers invented parallel "Cache-Oblivious Wavefront" (COW) algorithms. Unlike most state-of-the-art cache-oblivious parallel algorithms for stencils and dynamic programming (DP) problems that trade off parallelism for better cache performance, COW algorithms aim to achieve the same without sacrificing parallelism. COW algorithms perform divide-and-conquer (DAC) of the stencil/DP grid in exactly the same way as standard cache-optimal recursive DAC-based algorithms, but consider a recursive subtask ready for execution as soon as all its real dependency constraints are satisfied. COW algorithms inherit the cache-obliviousness and (serial) cache-optimality properties of the standard recursive DAC-based algorithms. But by scheduling a recursive subtask ready for execution based only on its real dependency constraints, COW algorithms lead to potentially improved parallelism. Experimental results on stencil and DP computations support these claims. The researchers investigated high-performance parallel implementations of dynamic programs with more complicated dependency patterns than those encountered in stencil computations. Traditional looping codes implementing dynamic programs suffer from poor temporal cache locality that leads to palpably poor performance on multicore machines. Also often the loops cannot be suitably reordered in order to optimize for better spatial locality, parallelization and/or vectorization. The researchers considered recursive divide-and-conquer algorithms for DP, and showed that in addition to offering excellent (and often optimal) temporal locality, for many DPs with nonlocal dependencies the recursive decomposition reduces the problem into iterative kernels that are highly optimizable. Benchmark results indicate that recursive implementations achieve up to 30x speedup over their standard parallel loop based DP counterparts on multicore machines. Good results were also obtained on the hybrid CPU+MIC platform and on clusters of multicore machines. The researchers solved the "Range 1 Query" (R1Q) problem that arises during the generation of optimized kernels in Pochoir. Given a d-dimensional (d >= 1) input bit matrix A, the R1Q problem asks one to preprocess A so that for any given region R of A, one can efficiently answer queries asking if R contains a 1 or not. The researchers considered both orthogonal and non-orthogonal shapes for R including rectangles, axis-parallel right-triangles, certain types of polygons, axis-parallel right simplices and spheres, and provided space-efficient deterministic and randomized algorithms with constant query times (in constant dimensions) for solving the problem in the word RAM model. This two-year research project produced 8 publications, one of which won a Best Paper award at a highly competitive conference. It educated two Ph.D. students and two female postdocs.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Communication Foundations (CCF)
Application #
1162196
Program Officer
Almadena Chtchelkanova
Project Start
Project End
Budget Start
2012-04-01
Budget End
2014-09-30
Support Year
Fiscal Year
2011
Total Cost
$173,584
Indirect Cost
Name
State University New York Stony Brook
Department
Type
DUNS #
City
Stony Brook
State
NY
Country
United States
Zip Code
11794