Modern distributed systems are extremely complex, due in large part to individual node complexity, node unreliability and asynchrony, and unpredictable network message delays and orderings. Development is further complicated both by the presence of multiple, potentially incompatible versions of these systems and by the need to build systems that are not only correct but also high-performing. Prior testing and simulation frameworks are characterized either by extensive manual effort or by automated search for violations of a binary decision problem: the presence or absence of a bug.
We are developing automated and interactive techniques for helping developers understand the behavior of distributed systems implementations. By leveraging these existing frameworks, and by instrumenting implementations in structured, straightforward ways, we are building development tools focused on understanding system behavior rather than merely identifying correctness errors. This shift in focus will enable more general tools that improve development productivity as well as testing productivity.
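As a hedged illustration of what "structured, straightforward" instrumentation might look like (this is a sketch, not the actual instrumentation API from this work; all names here are hypothetical), an implementation could emit one machine-parseable record per distributed-system event, so that later tools can mine behavior across repeated executions:

```python
import json
import time


def log_event(node_id, kind, payload, out):
    # Hypothetical structured instrumentation hook: one JSON record per
    # event of interest (message send/receive, timer fire, state change),
    # appended to `out` for later offline analysis.
    record = {"node": node_id, "kind": kind, "payload": payload, "ts": time.time()}
    out.append(json.dumps(record, sort_keys=True))
    return record


trace = []
log_event("n1", "send", {"msg": "prepare", "to": "n2"}, trace)
log_event("n2", "recv", {"msg": "prepare", "from": "n1"}, trace)
assert json.loads(trace[0])["kind"] == "send"
```

Because each record is self-describing JSON, the same trace can feed both human inspection and automated mining without format-specific parsers.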
This research is proceeding on three fronts: 1) developing automated tools that use data mining over repeated executions to extract execution behaviors and performance characteristics, 2) developing flexible execution descriptions suitable both for use-case documentation and for automated processing, allowing more intuitive interaction between developers and their tools, and 3) integrating testing tools with revision control systems, enabling multi-version analysis and long-term progress tracking. When complete, this research will reduce the developer effort needed to design, update, and debug distributed systems, and may inspire a new class of systems debuggers that analyze not just correctness, but also performance and complexity.
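The first front, mining repeated executions, could in spirit resemble the following sketch (a toy illustration under assumed inputs, not the research tools themselves): summarize each execution trace as event-type frequencies, then flag executions whose profile diverges from the pooled average on many event types.

```python
from collections import Counter


def behavior_profile(events):
    # Summarize one execution as frequencies of each event type.
    return Counter(events)


def outlier_runs(runs, threshold=0.6):
    # Pool profiles across all executions to form an average profile,
    # then flag any run that differs from the average on more than
    # `threshold` of the observed event types. Threshold is a toy knob.
    pooled = Counter()
    for events in runs:
        pooled.update(behavior_profile(events))
    avg = {k: v / len(runs) for k, v in pooled.items()}
    flagged = []
    for i, events in enumerate(runs):
        prof = behavior_profile(events)
        diffs = sum(
            1 for k in avg if abs(prof.get(k, 0) - avg[k]) > avg[k] * threshold
        )
        if diffs > len(avg) * threshold:
            flagged.append(i)
    return flagged


runs = [
    ["send", "recv", "commit"],
    ["send", "recv", "commit"],
    ["send", "timeout", "retry", "timeout", "retry"],  # anomalous execution
]
print(outlier_runs(runs))  # the third run (index 2) is flagged
```

Even this crude frequency comparison illustrates the shift in focus: the tool characterizes how runs behave relative to one another, rather than checking a single run against a binary correctness predicate.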