High Performance Computing is strategically important to national competitiveness. Advances in computational capabilities involve the use of unprecedented levels of parallelism: programming methods that involve billions of concurrent activities. Computational Frameworks allow these parallel programs to be organized in a modular fashion, achieving higher reliability, scalability, and better resource management capabilities.
The project develops formally based reliability enhancement mechanisms when Computational Frameworks are developed or optimized in response to the arrival of newer hardware/software technologies. These mechanisms include a formal specification of the expected behavior of these frameworks, and ways to verify them before deployment and during live operation. Correctness issues addressed in this proactive manner will considerably reduce the time it takes to proceed from idea to science. This project will enable a scientific understanding of which formal methods are likely to work at the scale of millions of cores, and which formal methods are best recommended to capture intended behavior versus those that are best suitable for run-time use. The project will also result in broad impact in terms of: the incorporation of our verification tools and techniques within popular tool-integration frameworks; achieving large-scale case studies on the use of formal methods within a computational framework that is under development; and training of undergraduate and graduate students on advanced correctness verification methods. It will also help build talent pool vital to continued progress in high performance computing with applications to science and engineering, energy/sustainability, and homeland security.
High performance computing is of strategic importance to every nation,by providing the basic ability to conduct scientific research on anever-increasing scale. In order to accelerate the rate of discoveriesin science, medicine, and engineering, researchers are making use ofthe largest computers today and are banking on continued advances inour HPC capabilities. The project investigated the use of rigoroustechniques to enhance an engineer's ability to debug these HPCsystems, and localize errors when they occur. This enables thescientist performing simulation based studies to quickly recover fromsoftware errors -- especially in a large-scale concurrent systeminvolving thousands of processors -- and continue along withscientific research. INTELLECTUAL MERIT: The main result of this research was the development of formal methodsthat exploit the structured software architecture of computationalframeworks as well as high level program control flow to localizebugs, and hence enhance debugging efforts. The project also lead tothe discovery of scalable solutions that are deployable in real HPCframeworks to enhance debugging. The project directly contributes toerror localization in future monitoring-based error detectionframeworks. BROADER IMPACT: The project contributed to tools for localizing errors in HPCframeworks, as well as released these tools open source. This willdirectly lead to enhanced amounts of science per invested HPC resourcedollar. The project had a significant training component, helping totrain undergraduate students in concurrent programming, formalmethods, and high performance computing. It also trained apost-doctoral researcher in research and development in highperformance computing software architectures and helping to advancethese infrastructures. By enhancing this nation's HPC capabilities and helping maintain thesoftware that underlies it, the project directly impacts science andengineering practices, and directly addresses this nation's energy andhomeland security needs.