Configuration errors (i.e., misconfiguration) are a major cause of system failures, according to several studies. For example, misconfiguration has caused serious crashes and center-wide outages in a number of data centers and commercial cloud infrastructures, affecting millions of customers. In addition to system downtime, misconfiguration wastes engineers' and administrators' time on troubleshooting and correction, leading to significant maintenance and support costs.
Although recent work on detecting misconfiguration has improved the situation to some degree, the root cause needs to be better addressed. The intellectual merit of this project lies in taking a more fundamental approach that addresses misconfiguration problems at their root cause in a proactive, anticipatory way, building on insights from the PIs' recent empirical study of 546 real-world configuration errors in commercial and open-source systems. This work has three objectives: (1) to improve configuration design so that configurations are less error-prone; (2) to harden software systems to better tolerate and gracefully react to users' configuration errors; and (3) to detect hard-to-check configuration issues such as compatibility problems and cross-component parameter inconsistencies.
The broader impacts include significantly reducing system downtime in data centers and decreasing vendors' customer support costs for troubleshooting configuration issues, along with planned educational, outreach, and broadening-participation activities.