Computer failures or an unexpected ordering of messages can lead to subtle bugs in distributed programs. This project is investigating algorithmic and implementation issues in monitoring and controlling multithreaded distributed computations. The techniques developed in this project are useful for testing distributed Java programs and for providing fault tolerance during their execution. The project is investigating techniques in four areas: slicing, dependency tracking, global predicate detection, and controlling a computation. Computation slicing is useful for reducing the size of the computation that needs to be analyzed. The project is developing online and distributed algorithms for slicing. Dependency tracking is required for online monitoring of global predicates and is currently done using vector clocks whose dimension equals the number of processes and threads in the system. The project is investigating a technique called chain clocks that can track dependencies in a scalable way even for a large-scale system. Global predicate detection is required to detect bugs during testing or at runtime. The project is investigating detection of temporal logic predicates interpreted over the lattice of global states of a computation. Controlling a computation is useful during the testing phase to steer the computation toward software bugs and during the operation phase to steer it away from any existing software bugs.
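To make the dependency-tracking baseline concrete, the following is a minimal sketch of a conventional vector clock in Java, with one component per process. It illustrates the standard technique that the abstract describes as the current practice, not the chain clocks under investigation and not the project's own framework. The class name VectorClock and its methods tick, receive, happenedBefore and concurrent are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

/** Illustrative vector clock for a fixed set of n processes (hypothetical names). */
final class VectorClock {
    private final int[] clock; // one component per process
    private final int self;    // index of the owning process

    VectorClock(int n, int self) {
        this.clock = new int[n];
        this.self = self;
    }

    /** Local or send event: advance our own component and return a timestamp copy. */
    int[] tick() {
        clock[self]++;
        return clock.clone();
    }

    /** Receive event: take the component-wise maximum with the message timestamp, then tick. */
    int[] receive(int[] messageTimestamp) {
        for (int i = 0; i < clock.length; i++) {
            clock[i] = Math.max(clock[i], messageTimestamp[i]);
        }
        return tick();
    }

    /** True if timestamp a is less than or equal to b everywhere and strictly less somewhere. */
    static boolean happenedBefore(int[] a, int[] b) {
        boolean strictlyLess = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlyLess = true;
        }
        return strictlyLess;
    }

    /** True if the two timestamps differ and neither happened before the other. */
    static boolean concurrent(int[] a, int[] b) {
        return !Arrays.equals(a, b) && !happenedBefore(a, b) && !happenedBefore(b, a);
    }

    public static void main(String[] args) {
        VectorClock p0 = new VectorClock(2, 0);
        VectorClock p1 = new VectorClock(2, 1);
        int[] send = p0.tick();        // p0 sends a message
        int[] local = p1.tick();       // independent local event on p1
        int[] recv = p1.receive(send); // p1 receives the message from p0
        System.out.println("send -> recv: " + happenedBefore(send, recv));
        System.out.println("send || local: " + concurrent(send, local));
    }
}
```

Each timestamp in this sketch carries one integer per process, so its size grows with the number of processes and threads. That growth is the scalability cost that motivates looking for a more compact dependency-tracking scheme.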
The project is implementing a framework in Java and is expected to yield theoretical and practical advances in the monitoring and testing of concurrent programs. The project is also expected to significantly improve the quality and fault tolerance of distributed software.