This work builds on recent instrumentation approaches, such as Stardust and Dapper, that provide low-overhead end-to-end tracing and can capture the flow (i.e., the path and timing) of individual requests within and across the components of a distributed system.
In addition to tools for profiling and examining system behavior, the PIs are creating tools that compare request flows between two executions to guide understanding of performance changes (as contrasted with determining why a system has always been slow). Comparing request flows can help diagnose a large class of performance problems, such as degradations resulting from software changes or upgrades, or from usage over time (e.g., due to resource leakage or workload changes).
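To make the comparison concrete, the following is a minimal sketch of request-flow comparison, not the PIs' actual tool: traces from a "before" and an "after" execution are grouped by the path each request took, and paths that are new or whose latency distribution has grown are flagged. The flat (path, latency) trace format, the grouping key, and the slowdown threshold are illustrative assumptions rather than the Stardust or Dapper trace schema.

```python
from collections import defaultdict
from statistics import median

# Each trace record: (request_path, latency_ms). This flat format and the
# threshold below are illustrative assumptions, not an actual trace schema.
TraceSet = list[tuple[str, float]]


def group_by_path(traces: TraceSet) -> dict[str, list[float]]:
    """Group observed latencies by the path a request took through the system."""
    groups: dict[str, list[float]] = defaultdict(list)
    for path, latency in traces:
        groups[path].append(latency)
    return groups


def compare_flows(before: TraceSet, after: TraceSet,
                  slowdown_threshold: float = 1.5) -> list[str]:
    """Flag paths that appear only in the 'after' execution (structural changes)
    or whose median latency grew by more than `slowdown_threshold`x."""
    before_groups = group_by_path(before)
    after_groups = group_by_path(after)
    findings = []
    for path, latencies in after_groups.items():
        if path not in before_groups:
            findings.append(f"new/mutated path: {path}")
            continue
        old, new = median(before_groups[path]), median(latencies)
        if new > slowdown_threshold * old:
            findings.append(f"slowdown on {path}: {old:.1f} ms -> {new:.1f} ms")
    return findings


if __name__ == "__main__":
    # Hypothetical traces from two executions of the same workload.
    before = [("frontend->cache->disk", 12.0), ("frontend->cache", 2.1)]
    after = [("frontend->cache->disk", 30.5), ("frontend->cache", 2.0),
             ("frontend->db", 45.0)]
    for finding in compare_flows(before, after):
        print(finding)
```

Even this simplified version captures the two kinds of differences described above: structural changes in how requests flow (e.g., a request that newly visits the database) and latency changes along paths that exist in both executions.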
This research is evaluating the efficacy of this approach, addressing scalability challenges in instrumentation data collection and analysis, and exploring the effects of virtualization on performance variability.