This project is developing scalable mechanisms to debug, monitor, and assess the quality of the complex distributed systems that form the backbone of modern software infrastructure. These methods are necessarily highly automated: they reason about the operation of distributed systems while treating the components of such systems as black boxes. As a result, the methods require no source code, programmer annotation, or developer input to troubleshoot a distributed system. Instead, they rely on detailed information gleaned from the pre-existing log messages that are nearly ubiquitous in large-scale distributed systems and on data extracted via binary analysis of components as they run.
These new methods, termed telescopic analysis, combine the ability to collect extremely detailed, low-level information about systems executing large numbers of requests with "big data" analysis that mines insights and creates models of system operation from the corpus of detailed observations. Telescopic analysis uses targeted, sample-based logging and/or binary analysis to generate substantial quantities of high-precision data about specific runs of the system under observation. It then combines these observations into models that capture the aggregate behavior of the system. Comparing the general model with the detailed observations of each run reveals how that run conforms to or deviates from the common operation of the system. The project is also developing tools and query languages for interpreting the results of such comparisons, both in aggregate and for specific runs, to support performance analysis, debugging of data-quality failures, understanding of outlier behavior, and "what-if" analysis.
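The core workflow described above (combine many detailed per-run observations into an aggregate model, then compare each run against that model to surface deviations) can be sketched in miniature. This is only an illustrative toy, not the project's actual tooling: the run data, the per-operation latency model (mean and standard deviation), and the deviation threshold are all hypothetical stand-ins for the sampled-log and binary-analysis data the abstract describes.

```python
import statistics

# Hypothetical per-run observations: each run maps an operation name to a
# measured latency in milliseconds. In the real setting these would be
# gleaned from sampled log messages or binary analysis of running components.
runs = [
    {"read": 12.0, "write": 30.0},
    {"read": 11.5, "write": 29.0},
    {"read": 12.3, "write": 31.0},
    {"read": 12.1, "write": 30.5},
    {"read": 11.8, "write": 29.5},
    {"read": 55.0, "write": 30.2},  # a run that deviates on "read"
]

def build_model(runs):
    """Aggregate per-run observations into a per-operation (mean, stdev) model."""
    samples = {}
    for run in runs:
        for op, latency in run.items():
            samples.setdefault(op, []).append(latency)
    return {op: (statistics.mean(v), statistics.stdev(v))
            for op, v in samples.items()}

def deviations(run, model, threshold=1.8):
    """Return operations in one run whose latency sits more than
    `threshold` standard deviations from the aggregate mean.
    The threshold is an illustrative choice, not a project parameter."""
    flagged = {}
    for op, latency in run.items():
        mean, stdev = model[op]
        if stdev > 0 and abs(latency - mean) / stdev > threshold:
            flagged[op] = latency
    return flagged

model = build_model(runs)
outlier_runs = [i for i, run in enumerate(runs) if deviations(run, model)]
```

Here the "telescope" is collapsed to a single metric; the project's methods would build far richer models (e.g., of request structure and timing) from the same observe-aggregate-compare pattern.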