Debugging at large scale is a constant challenge in the operation of leadership-class systems. Although parallel debugging has received substantial attention in both the research and commercial arenas, most parallel debuggers rely on a visual motif for conveying information about program execution and communication. These approaches are invaluable for understanding and debugging program function at tens of tasks, but they are not suitable for finding elusive problems that occur at tens of thousands of tasks. Even if the interfaces were usable at that scale (and licenses were available), current iterations of many commercial debuggers fail to function properly beyond 2,048 cores. The challenge is exacerbated by the normal mode of operation of very large systems, which tend to run 24x7 in a batch mode intended to maximize system utilization and throughput. Negotiating the large blocks of time needed to operate a substantial fraction of the system interactively is difficult at best and cost-prohibitive at worst.
In this project, the investigators will lay out the design for the GDBase debugging framework. GDBase is designed to attack three key problems of debugging on the largest systems. (1) Lack of scalability: most commercial debuggers are too heavyweight to launch, or to provide a usable interface, at many thousands of cores. GDBase has been tested to 12k cores; in our testing, the commercial DDT debugger fails after about 2k cores. (2) High cost of use: GDBase runs jobs through the production batch systems, so there is no need to tie up large fractions of the machine for long interactive debugging runs, and no need to lower utilization with fixed time reservations that prevent other jobs from running. (3) Information overload: rather than relying on a graphical interface that fails to scale, or on gigabytes of printf text, GDBase stores all debugging information in a relational database that can be mined later by a number of analysis tools.
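To illustrate why a relational store helps with information overload, the following is a minimal sketch, not the GDBase schema or API. The table `last_location`, its columns, and the functions `build_demo_db`, `location_histogram`, and `outlier_ranks` are hypothetical names chosen for illustration, and the per-task records are synthetic. It assumes each task has recorded the source location where it last stopped, and shows how a single query can reduce a thousand records to the handful of tasks that differ from the majority.

```python
import sqlite3


def build_demo_db(num_ranks=1000, stuck_ranks=(17, 512)):
    """Create an in-memory database with one synthetic record per task.

    Hypothetical schema: each row is the last source location a task reached.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE last_location ("
        " run_id INTEGER, mpi_rank INTEGER, function TEXT, line INTEGER)"
    )
    rows = []
    for rank in range(num_ranks):
        if rank in stuck_ranks:
            # Synthetic outliers: tasks that never reached the barrier.
            rows.append((1, rank, "exchange_halo", 88))
        else:
            rows.append((1, rank, "MPI_Barrier", 214))
    conn.executemany("INSERT INTO last_location VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def location_histogram(conn, run_id):
    """Count how many tasks stopped at each (function, line), most common first."""
    cur = conn.execute(
        "SELECT function, line, COUNT(*) AS n FROM last_location"
        " WHERE run_id = ? GROUP BY function, line"
        " ORDER BY n DESC, function",
        (run_id,),
    )
    return cur.fetchall()


def outlier_ranks(conn, run_id):
    """Return the tasks whose last location differs from the most common one."""
    hist = location_histogram(conn, run_id)
    if not hist:
        return []
    function, line, _count = hist[0]
    cur = conn.execute(
        "SELECT mpi_rank FROM last_location"
        " WHERE run_id = ? AND NOT (function = ? AND line = ?)"
        " ORDER BY mpi_rank",
        (run_id, function, line),
    )
    return [row[0] for row in cur.fetchall()]


if __name__ == "__main__":
    conn = build_demo_db()
    print(location_histogram(conn, 1))
    print(outlier_ranks(conn, 1))
```

Because the records sit in a database rather than in a display or a log file, the same kind of query could be run after the batch job completes, and could compare runs by varying `run_id`.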
The intellectual merit of this proposal lies in its novel approach to scalable debugging. GDBase combines a lightweight, offline architecture for data collection inspired by performance analysis tools, a database repository that holds vast quantities of debugging information and supports cross-run analysis, and an API for creating new agents that provide a variety of interface and analysis tools. The broader impacts of this proposal lie in its potential to raise productivity across all domains that develop applications for large-scale cyberinfrastructure. GDBase will also make large-scale debugging more widely available and will be incorporated into a range of training and education activities.