Distributed applications running in data centers are critical to society (e.g., for shopping, banking). Engineers must diagnose and fix problems observed in data centers quickly; however, doing so is extremely challenging. A major hurdle is that engineers must spend significant time and effort exploring what instrumentation (e.g., log messages about specific application behaviors) is needed to provide visibility into a new problem. To assist on this front, this project will develop an instrumentation framework that, in response to a new problem, will automatically search the space of possible instrumentation choices and enable the instrumentation needed to provide insight into that problem.
This project addresses fundamental challenges associated with creating an automatic instrumentation framework: (a) What algorithms and heuristics are suited for automatically and efficiently exploring the instrumentation search space? (b) What architectural support is needed within the framework to enable automatic exploration? (c) How can the search space be explored without significantly impacting application performance? The project will evaluate the utility of algorithms based on operator knowledge, statistics, and machine learning for exploring the search space. It will build on end-to-end tracing, as this will enable the framework to work for problems that affect different sets of requests.
This project will inform the architecture of next-generation instrumentation frameworks, which are needed to keep pace with the ever-increasing complexity of distributed applications. The critical issues identified in popular open-source distributed applications while evaluating the framework will improve their robustness. Researchers will be able to use the software artifacts released by this project to create novel distributed-application-management tools that leverage the framework's unique capabilities. They will also be able to deploy the framework in research clouds to obtain valuable workload traces from those clouds. The project will generate course modules on diagnosis practices for distributed applications.
The artifacts produced by this project, including framework source code, workload traces, instrumented applications, and research results, will be freely disseminated online at https://massopen.cloud and www.rajasambasivan.com. All software artifacts will also be stored on GitHub. All artifacts will be available for a minimum of seven years from the start of the project.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.