As the scale of high performance computing continues to grow, application robustness becomes increasingly important. Checkpointing is the conventional method for fault tolerance. However, it is purely reactive: it deals with failures only after they occur, through rollback. If a single process fails, all processes, including non-faulty ones, have to be restarted from the most recently saved state. Thus, significant performance loss can be incurred from both the lost work and the recovery itself. Proactive approaches take preventive actions (e.g. preemptive process migration) before failures occur, thereby avoiding failures at low cost. Nevertheless, their effectiveness relies on perfect fault prediction, which is hardly achievable in practice.
This project investigates a new approach called adaptive fault management, which integrates proactive and reactive robustness techniques so that applications can avoid anticipated faults where possible and, in the case of unforeseeable faults, tolerate them in a way that keeps their impact to a minimum. The project consists of three major components: (1) cooperative anomaly diagnosis (CAD) to improve fault prediction in large-scale systems by developing meta-learning methods; (2) adaptive control manager (ACM) to allow runtime decision making in response to imperfect fault prediction; and (3) integrated runtime support (IRS) to enable cost-effective coordination of fault handling techniques at runtime. The resulting framework will enhance the robustness of high performance computing applications by improving their performance in the presence of failures. This project also enhances the systems-area curriculum at Illinois Institute of Technology and helps train the next generation of the scientific computing workforce.
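To make the idea of runtime decision making under imperfect fault prediction concrete, the following is a minimal sketch in Python. It is not the project's ACM design. It assumes a deliberately simple cost model: a predictor issues a failure warning whose precision (the probability that the predicted failure actually occurs) is known, migration always succeeds, and all costs are measured in seconds. The names `Costs`, `expected_costs`, and `choose_action` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    """Hypothetical per-action costs, all in seconds (illustrative only)."""
    migrate: float       # time to migrate the process off the suspect node
    checkpoint: float    # time to write a checkpoint right now
    restart: float       # time to restart from the latest checkpoint
    work_at_risk: float  # work done since the last checkpoint


def expected_costs(precision: float, c: Costs) -> dict:
    """Expected cost of each action, given the precision of the warning.

    Assumed model: doing nothing risks the unsaved work plus a restart;
    checkpointing now pays the checkpoint cost and risks only a restart;
    migrating pays the migration cost and (by assumption) avoids the failure.
    """
    return {
        "none": precision * (c.work_at_risk + c.restart),
        "checkpoint": c.checkpoint + precision * c.restart,
        "migrate": c.migrate,
    }


def choose_action(precision: float, c: Costs) -> str:
    """Pick the action with the lowest expected cost."""
    costs = expected_costs(precision, c)
    return min(costs, key=costs.get)


if __name__ == "__main__":
    cheap_migration = Costs(migrate=60, checkpoint=30, restart=120, work_at_risk=600)
    costly_migration = Costs(migrate=200, checkpoint=30, restart=120, work_at_risk=600)
    print(choose_action(0.90, cheap_migration))   # trusted warning, cheap migration
    print(choose_action(0.02, cheap_migration))   # warning is almost surely a false alarm
    print(choose_action(0.30, costly_migration))  # uncertain warning, costly migration
```

Under these assumptions, a trusted warning favors the proactive action (migration), an unreliable one favors doing nothing, and an intermediate case favors the reactive safeguard (an early checkpoint). This is the kind of trade-off an adaptive scheme has to weigh when predictions are imperfect.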