Extreme scale computing introduces many new challenges for parallel program design: a single computation may involve hundreds of thousands of processes with multiple levels of parallelism. Debugging such large-scale parallel programs is very difficult, so scalable, lightweight correctness tools are critical to meeting this challenge.
This research seeks to design innovative algorithms and develop a scalable toolkit that efficiently and effectively analyzes parallel programs and detects potential errors on emerging heterogeneous and extreme scale computing platforms. Specifically, the objectives of the research are to: (1) develop instrumentation tools and optimized monitoring systems that support building error-detection tools, (2) design optimization strategies and techniques to improve scalability and reduce overhead, (3) integrate static and dynamic program analyses to improve reporting accuracy and code coverage, (4) design more accurate and efficient detection techniques for large-scale parallel systems, and (5) investigate domain-specific techniques for error detection and optimization.
This research will greatly aid the development of extreme scale parallel programs for scientific computing and help discover hard-to-find errors at an early stage. It will significantly reduce the burden of tedious debugging, so researchers can focus on scientific problems. The toolkit targets general computing platforms, from local clusters to extreme scale supercomputers. In the education thrust, the research results will facilitate the development of new courses and enhance existing ones. High-school, undergraduate, and graduate students will have opportunities to get involved in the research.