As the High Performance Computing (HPC) field moves toward ever more powerful (Exascale) systems, it faces two key challenges: resilience and power efficiency. Increasing computing power without significantly increasing the power budget may require new architectural designs, such as those that operate at low power and/or with low margins. Although future systems are expected to experience more faults (because of the increasing number of cores and decreasing feature sizes), energy-efficient designs will likely suffer even more errors due to tighter margins and other design compromises. Of particular concern are errors that escape hardware detection; such errors are said to cause Silent Data Corruption (SDC). Maintaining the correctness of a numerical simulation in the presence of SDCs is a very challenging problem. The intellectual merit of this project lies in combining ideas from numerical methods, programming model design, and architecture design to address the correctness of numerical simulations. The broader significance and importance of the project include its impact on scientific and high performance computing, system software for parallel computing, and architectural design and research. The project will also contribute to education, human resource development, and increasing diversity through activities such as teaching parallel computing (and programming) to diverse audiences, mentoring doctoral students, including those from underrepresented groups, and an interdisciplinary training program in Mathematical Biology for undergraduates.
Technically, the project addresses the challenge of developing and executing scientific applications on energy-efficient, low-power and low-margin architectures that experience occasional faults, while maintaining programmer productivity and the accuracy of results. The project develops a synergistic research program combining advances in HPC programming models, runtime systems, Near Threshold Voltage (NTV) architectures, and numerical methods (algorithms). Specifically, it involves close collaboration among researchers from three areas: (a) parallel programming models, applications, and runtime systems; (b) architecture; and (c) finite difference and finite volume numerical models.