Collections of heterogeneous, network-connected computational devices have emerged as a new computational paradigm. This network-based paradigm simplifies the construction of flexible and scalable hardware infrastructure. Unfortunately, making such systems robust to failures can be difficult. Many of these distributed software systems contain a large number of devices and therefore have a significant chance of experiencing a hardware failure. Developers must manually write code that allows the software system to recover from such hardware failures. As a result, developing robust distributed software systems is typically more difficult than developing robust centralized ones.
This work builds upon the Principal Investigator's prior work on the Bristlecone language for developing robust software systems. The key insight behind the Bristlecone language is that most errors propagate through software systems and cause further damage in one of two ways: by corrupting data structures, or through the control-flow-induced coupling between conceptual operations. Bristlecone programs are architected as a set of decoupled tasks linked through task specifications, which describe how the tasks interact and what consistent data structures look like. Bristlecone then uses these specifications to adapt the program's execution in response to failures.
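The task-and-specification architecture described above can be illustrated with a minimal sketch. This is not Bristlecone syntax or its runtime: it is a Python analogy under stated assumptions, and the names `Item`, `Task`, `run`, and the flag sets `requires`, `clears`, and `sets` are hypothetical. Each task declares the abstract state an item must be in before the task may run and the state it leaves the item in afterward. The scheduler uses only those declarations to decide what runs next, so a failing task can be rolled back and set aside without disturbing the others.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Item:
    """A piece of data whose abstract state is a set of flags (hypothetical)."""
    payload: str
    flags: Set[str] = field(default_factory=set)


@dataclass
class Task:
    """A decoupled task plus a tiny stand-in for its specification."""
    name: str
    requires: Set[str]             # flags an item needs for the task to run
    clears: Set[str]               # flags removed when the task succeeds
    sets: Set[str]                 # flags added when the task succeeds
    body: Callable[[Item], None]   # the task's actual work


def run(tasks: List[Task], items: List[Item]) -> List[str]:
    """Run every enabled task until none applies, isolating failures.

    Assumes each task clears at least one flag it requires, so the loop ends.
    """
    log: List[str] = []
    progress = True
    while progress:
        progress = False
        for item in items:
            for task in tasks:
                if not task.requires <= item.flags:
                    continue  # the specification says this task does not apply
                snapshot = item.payload
                try:
                    task.body(item)
                except Exception:
                    # Undo the partial update and park the item so that
                    # no later task sees inconsistent data.
                    item.payload = snapshot
                    item.flags = {"failed"}
                    log.append(f"{task.name} failed")
                else:
                    item.flags = (item.flags - task.clears) | task.sets
                    log.append(f"{task.name} ok")
                progress = True
    return log


def parse(item: Item) -> None:
    item.payload = item.payload.strip()      # mutates first ...
    if not item.payload:
        raise ValueError("empty request")    # ... then may fail mid-update


stored: List[str] = []


def store(item: Item) -> None:
    stored.append(item.payload)


tasks = [
    Task("parse", {"received"}, {"received"}, {"parsed"}, parse),
    Task("store", {"parsed"}, {"parsed"}, {"stored"}, store),
]
items = [Item(" a ", {"received"}), Item("   ", {"received"})]
log = run(tasks, items)
```

In this sketch the second item fails in `parse`, is restored to its original payload, and is marked `failed`, while the first item proceeds through both tasks unaffected. Neither task calls the other directly; they interact only through the flags.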
This project extends the previous work to support distributed software systems. It develops static analyses that help developers understand how failures in the underlying hardware will affect the software system, and that manage data and tasks so that hardware failures have a minimal effect on the computation.