High-performance computing (HPC) enables breakthroughs in many scientific domains that improve the national economy, health, welfare, and defense. Unfortunately, the ability of HPC systems to deliver productive science is increasingly hampered by hardware-related errors and failures. Consequently, computational applications of national importance will need to spend a large fraction of their execution time in resilience mechanisms to make forward progress in the presence of failures. Even so, substantial resources and time will be wasted on future HPC systems because failures will frequently interrupt application execution.
To address these challenges, this project, called REYAZ, explores new territory in HPC job scheduling: maximizing the amount of useful work done on reliability-constrained HPC systems by jointly exploiting the dynamic reliability state of system components and the resilience characteristics of applications. REYAZ will enable two novel capabilities: (1) a reliability-aware job scheduling approach that optimizes useful work done per unit time on unreliable large-scale computing systems while guaranteeing "fair" performance to individual applications; and (2) a family of techniques to reduce the input/output (I/O) overhead that is a side effect of widely used resilience mechanisms such as checkpoint-restart, while retaining the performance improvements obtained via reliability-aware scheduling.
Maximizing the useful work per unit time on future reliability-constrained HPC systems will directly translate into more productive science, leading to faster advances across scientific fields and greater societal impact. Capabilities developed in this project will also help reduce energy waste on large-scale systems, resulting in economic benefits for society. This project will integrate its research tasks and outcomes into educational activities to train the next generation of engineers, who will face the challenges of operating unreliable large-scale systems. Undergraduate students from underrepresented groups will be engaged and trained in the field of large-scale fault-tolerant parallel computing.
The project website (https://github.com/GoodwillComputingLab/REYAZ) will host all documentation of research findings and software artifacts developed as part of the project, including system software, runtime systems, analytical tools, modeling methodologies, experimental data, and traces. The website will be actively maintained for at least five years beyond the project end date.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.