As a critical backend for many of today's applications and services, large-scale distributed systems must be highly reliable. In the last couple of years, the field has witnessed a phenomenal scale of deployment: Google is known to run clusters with thousands of machines each, Apple deploys over 100,000 database machines, and Netflix runs tens of database clusters with 500 nodes each. This new era of cloud-scale distributed systems has given birth to a new class of faults, scalability faults, whose symptoms surface in large-scale deployments but not necessarily in small- or medium-scale deployments. The CP2 project is proposed to solve the problem of correctness checkability and performance predictability of systems at extreme scale. Specifically, the project will analyze over 500 real-world scalability faults in over a dozen large-scale systems, develop a single-machine scale-checking framework that allows developers to test large distributed code on one or a few machines, and provide groundwork for compute- and I/O-performance predictability of large-scale jobs on both existing and future architectures. These tasks will advance debugging, testing, learning, and prediction methods on both traditional and emerging hardware platforms and ultimately lead to correct-by-construction development methods. The CP2 project will have impact in multiple disciplines, including systems (cloud/datacenter systems reliability), programming languages/compilers (new static/dynamic analysis techniques), architecture (compute/storage prediction for heterogeneous hardware), algorithms (the use of learning methods), and high-performance computing (benchmarking of HPC systems/applications).
In terms of societal benefits, the CP2 project addresses paramount issues mentioned in the NSF Strategic Plan for 2018-2022. More specifically, society increasingly depends on complicated systems that are products of human ingenuity, including ecosystems of large and complex software with millions of lines of code running on thousands of machines. CP2 will address the challenges of understanding and predicting the behavior of such systems. Furthermore, as society's reliance on complex systems grows, learning about their robustness and understanding how to strengthen them are of increasing importance. In terms of education, the CP2 project offers unique hands-on research and education experience with cutting-edge systems technology, in which students will be trained to operate software on a large number of machines and analyze its performance and correctness. The results of the CP2 project will be released through the classic medium of publication, through the development of numerous software artifacts that will be open-sourced, and finally through collaboration with various industry partners to help shape the next generation of large-scale systems.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.