Software systems are growing more complex, and the resulting reliability problems cause millions of dollars in economic loss. Beyond local software, distributed cloud software infrastructures (i.e., cloud systems) have emerged as a dominant backbone for many modern applications. Users expect high reliability from these systems, but guaranteeing it is challenging. Cloud systems run on hundreds or thousands of machines, execute complicated distributed protocols, and face a variety of hardware faults. This combination makes cloud systems prone to distributed concurrency bugs, which can cause catastrophic failures such as data loss, downtime, and data inconsistencies. The Distributed Concurrency Bugs Annihilation (DCBA) project will address this important issue and bring many direct benefits to society: users from many areas (science, healthcare, business, education, military, and government) increasingly use cloud computing services and demand high availability and predictability. Combating distributed concurrency bugs is an important ingredient in meeting that demand.

Distributed concurrency bugs are caused by the non-deterministic ordering of distributed events such as message arrivals, faults, and reboots. The DCBA project will find, remove, and prevent buggy interleavings of concurrent distributed events by developing four approaches: (1) full, automated, and deep distributed system model checkers, (2) fast inference, detection, testing, and fixing of order violations, (3) runtime statistical debugging, prevention, and recovery, and (4) design advancements that reduce the opportunities for distributed concurrency bugs to appear. The project will advance the state of cloud dependability research. Existing literature on distributed systems reliability focuses on monitoring, post-mortem debugging, deterministic record and replay, and verifiable programming language frameworks. The DCBA project will instead advance approaches related to model checking, bug detection, bug fixing, runtime debugging, prevention, and recovery.
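
To make the notion of a buggy interleaving concrete, the following is a minimal sketch, not the project's tooling. It assumes a hypothetical single-node store that receives three events (`create`, `write`, `reboot`) in any order, and it enumerates every ordering to find those that break an invariant, which is the basic idea behind exhaustive model checking. The event names, the toy state, and the invariant are all illustrative assumptions.

```python
from itertools import permutations

# Hypothetical events for a toy node; these names are illustrative only.
EVENTS = ("create", "write", "reboot")


def run(order):
    """Apply events in the given order to a toy node and return its final state."""
    state = {"exists": False, "value": None}
    for event in order:
        if event == "create":
            state["exists"] = True
        elif event == "write":
            # Toy order violation: a write that arrives before create is dropped.
            if state["exists"]:
                state["value"] = "v1"
        elif event == "reboot":
            # Toy fault: a reboot wipes the in-memory value.
            state["value"] = None
        else:
            raise ValueError(f"unknown event: {event}")
    return state


def invariant(state):
    """The expected outcome: the object exists and holds the written value."""
    return state["exists"] and state["value"] == "v1"


def find_buggy_interleavings(events):
    """Exhaustively explore all event orderings and return the violating ones."""
    return [order for order in permutations(events) if not invariant(run(order))]


if __name__ == "__main__":
    for order in find_buggy_interleavings(EVENTS):
        print(" -> ".join(order))
```

Exhaustive enumeration grows factorially with the number of events, so a sketch like this only works for tiny scenarios; scaling the exploration to real systems is exactly the kind of problem a deep model checker has to address.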

As more organizations build distributed systems on farms of machines and services in the cloud era, it is time for the dependability community to address distributed concurrency bugs in a systematic and comprehensive manner. The DCBA initiative will have a profound impact on future distributed cloud systems.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Network Systems (CNS)
Application #
1563956
Program Officer
Matt Mutka
Project Start
Project End
Budget Start
2016-08-01
Budget End
2021-07-31
Support Year
Fiscal Year
2015
Total Cost
$799,977
Indirect Cost
Name
University of Chicago
Department
Type
DUNS #
City
Chicago
State
IL
Country
United States
Zip Code
60637