Computing systems and associated software have long had to make trade-offs in terms of performance due to an imbalance between fast processors and their slower memory hierarchies. Newly released technologies for 3D stacked memories provide an opportunity to reduce this imbalance by providing higher memory bandwidths and novel ways for accessing memory. One of the most promising techniques for using 3D stacked memory involves "memory-centric" computation that moves computation as close as possible to main memory. However, there is little understanding of how to best use these new memory technologies in libraries and applications, even as this hardware is slated to be integrated into near-term exascale supercomputing systems. The goal of this project is to understand the techniques and approaches that are needed to fully utilize 3D stacked memories and to demonstrate a useful set of computational primitives that can serve as a template for accelerating large scientific codes with these new memory components.
This research considers the specific problem of using "memory-centric" processors effectively to implement sparse primitives as part of a library that supports a number of key scientific applications including radiation transport, fluid flow, and fusion simulations. This library, SuperLU_DIST, is a sparse direct solver library designed for distributed memory multicore systems that has previously been accelerated on both NVIDIA?s graphics co-processors and Intel?s Xeon Phi co-processor. While this prior work has thus far yielded promising speedups, it has also revealed critical and fundamental algorithmic performance bottlenecks related to memory data transfers. This research project will investigate whether these bottlenecks may be mitigated by using emerging memory-centric co-processors. Such co-processors, which include Micron?s Hybrid Memory Cube (HMC) and High-Bandwidth Memory (HBM), combine 3-D stacked memories and FPGAs to provide lower latency, higher bandwidth data transfer, and support for near-memory data processing. The project will use high-level languages like OpenCL to take advantage of such technologies and will utilize a mix of algorithmic advances and software library development to improve application performance. Additionally, this work will lead to a new, open-source release of SuperLU called Super Stacked, Accelerated LU (SuperSTARLU), which will be made available to application developers and will be demonstrated on one of the next-generation systems with memory-centric co-processors, such as NERSC?s Cori.