This project seeks to quantify the frequency, duration, causes, and impact of faults across a variety of network classes. Unlike previous efforts that have relied on substantial special-purpose instrumentation and monitoring infrastructure, the PIs are conducting their analysis using only commonly available data sources, such as device logs and configuration records maintained by network operations staff. In this way, they hope not only to provide concrete data regarding the particular networks being evaluated, but also to define a repeatable methodology that can be employed by other researchers and even commercial operators to assess the reliability of other networks. Through partnerships with an academic backbone network (CENIC), an enterprise network services company (Hewlett-Packard), and a large-scale Web services provider (Microsoft), the PIs have obtained access to device logs, operational maintenance records, and configuration information for a significant number of real networks.
Concretely, this project is working to deliver a fault analysis methodology based on readily available data such as device logs, configuration information, and operator records (e.g., email lists and trouble tickets); comparative studies of the differing failure characteristics of wide-area, enterprise, and data center networks; and a generative model of network faults that can be used to evaluate the suitability and efficacy of different applications and protocols for various network designs.
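As a rough illustration of what such a generative model could look like, the sketch below synthesizes a fault timeline by drawing failure inter-arrival and repair times from fitted distributions. The distribution families and parameters shown (Weibull inter-arrivals, lognormal repair times) are hypothetical stand-ins, not the fits produced by the project.

```python
# Illustrative sketch of a generative network-fault model.
# Distribution choices and parameters are hypothetical placeholders.
import random

def fault_timeline(horizon_hours, shape=0.8, scale=200.0,
                   repair_mu=0.0, repair_sigma=1.2):
    """Yield (failure_start_hour, repair_hours) pairs up to horizon_hours."""
    t = 0.0
    while True:
        t += random.weibullvariate(scale, shape)  # hours to next failure
        if t >= horizon_hours:
            return
        yield t, random.lognormvariate(repair_mu, repair_sigma)

random.seed(1)
for start, repair in fault_timeline(8760):  # one simulated year
    print(f"failure at t={start:7.1f}h, repaired after {repair:5.2f}h")
```

A trace like this can then be replayed against a simulated application or protocol to evaluate how it behaves under a given network design's failure profile.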
Broader Impact: In addition to these concrete technical contributions, the broader impacts of this project include the potential to cause network operators to reconsider how current networks are designed, making them less likely to fail, or to fail in more straightforward and easily manageable ways. Moreover, the research effort is imparting to the next generation of computer scientists the skills necessary to assess, analyze, and model the performance characteristics of operational networks. This project supports two graduate students who assist the PIs in conducting the described work and receive significant exposure to commercial network environments through industrial internships.
The project both advanced the state of the art in understanding network failure and contributed to the education and development of two graduate students who were funded under this effort. In particular, Danny Turner completed his PhD dissertation based upon the work described below. The second PhD student, Feng Lu, also successfully graduated just after the completion of this project, although his dissertation focused on a different topic. Technically, the project had the following outcomes:

Wide Area: We established the utility of using low-fidelity logging information to monitor network failures. Our first-of-its-kind comparison of the failure patterns reported by Syslog-based analysis to those extracted through direct IGP monitoring extends what was known regarding the methods available to analyze failure in the wide area. We find significant disagreement between the two sources: roughly one quarter of all events reported by one data source do not appear in the other. IS-IS monitoring is likely to be more accurate, as traffic fate-shares with the routing protocol. That said, our analysis indicates that Syslog's omissions are heavily biased toward short failures, and that the larger statistical properties of the network obtained by analyzing Syslog, e.g., annualized downtime, number of failures, and time to repair, are reasonably accurate. Indeed, the overall character of the distributions is similar; the mean, median, and even 95th percentile are frequently within a factor of two for these basic properties, and one would draw many of the same basic conclusions from both data sources. As such, we conclude that Syslog-based analysis is quite useful for capturing the aggregate failure characteristics of a network where IGP data is not available, but less well suited to situations requiring precise failure-for-failure accounting.
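To make the Syslog-based methodology concrete, the following is a minimal sketch, in Python, of the event-pairing step such an analysis relies on: link-down and link-up messages are matched per interface to produce failure intervals, from which aggregate statistics such as annualized downtime follow. The message format and field names here are hypothetical placeholders, not the formats used in the study.

```python
# Minimal sketch of Syslog-based failure extraction (illustrative only).
# The message format "<ISO timestamp> <ifdown|ifup> <device:interface>" is a
# hypothetical placeholder; real Syslog link-state messages vary by vendor.
from collections import defaultdict
from datetime import datetime

def extract_failures(log_lines):
    """Pair link-down/link-up events per interface into failure intervals."""
    open_failures = {}            # interface -> time the link went down
    failures = defaultdict(list)  # interface -> list of (start, seconds down)
    for line in log_lines:
        stamp, event, interface = line.split()
        t = datetime.fromisoformat(stamp)
        if event == "ifdown":
            open_failures.setdefault(interface, t)   # keep earliest down
        elif event == "ifup" and interface in open_failures:
            start = open_failures.pop(interface)
            failures[interface].append((start, (t - start).total_seconds()))
    return failures

log = ["2011-03-01T12:00:00 ifdown core1:ge-0/0/1",
       "2011-03-01T12:04:30 ifup core1:ge-0/0/1"]
for iface, events in extract_failures(log).items():
    print(f"{iface}: {len(events)} failures, "
          f"{sum(d for _, d in events):.0f}s total downtime")
```

In practice, events from the two data sources would also need to be correlated within a matching window to perform the failure-for-failure comparison described above.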
Enterprise: Our findings here are somewhat less definitive. Conducted in collaboration with researchers at HP Labs, our characterization of failures as seen in managed enterprise networks represents the first publicly available data in this area. Among our findings, we show that low-level network event data is unable to capture the full range of problems handled by enterprise service providers (ESPs), that high-severity errors are dominated by connectivity problems with third-party ISPs, and that trouble tickets are dominated by lower-severity problems. Finally, we document significant variation in the prevalence of different problem types and severities across customers. Hence, we conclude that a study focusing on only a few enterprise customer sites has a high chance of mischaracterizing the breadth of customer problem types.

Datacenter: In collaboration with David Maltz and others at Microsoft, Danny Turner helped develop NetPilot, a tool for automating datacenter network failure mitigation. Recent research efforts have focused on automatic failure localization in datacenters, yet resolving failures still requires significant human intervention, resulting in prolonged failure recovery times. Unlike previous work, NetPilot aims to quickly mitigate rather than resolve failures. NetPilot mitigates failures in much the same way operators do: by deactivating or restarting suspected offending components. It circumvents the need to know the exact root cause of a failure by taking an intelligent trial-and-error approach. The core of NetPilot comprises an Impact Estimator that helps guard against overly disruptive mitigation actions and a failure-specific mitigation planner that minimizes the number of trials. Turner, Maltz, and colleagues have demonstrated that NetPilot can effectively mitigate several types of critical failures commonly encountered in production datacenter networks.
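The trial-and-error loop at the heart of this approach can be sketched as below. This is a simplified illustration under assumed interfaces; every function referenced (estimate_impact, deactivate, reactivate, restart, failure_persists) is a hypothetical stand-in, not NetPilot's actual API, and the suspect ordering is taken as given from the planner.

```python
# Simplified sketch of NetPilot-style trial-and-error mitigation.
# All callables below are hypothetical stand-ins for illustration.

IMPACT_THRESHOLD = 0.05  # assumed cap on tolerable network impact

def mitigate(suspects, estimate_impact, deactivate, reactivate,
             restart, failure_persists):
    """Try low-impact actions on ranked suspects until the failure clears."""
    for component in suspects:  # ordered by the mitigation planner
        for action, undo in ((deactivate, reactivate), (restart, None)):
            # Impact Estimator: skip actions predicted to be too disruptive.
            if estimate_impact(component, action) > IMPACT_THRESHOLD:
                continue
            action(component)
            if not failure_persists():
                return component, action  # mitigated; root cause still unknown
            if undo is not None:
                undo(component)  # roll back an unsuccessful deactivation
    return None  # no safe mitigation found; escalate to operators
```

The key design point this sketch captures is that mitigation need not identify the root cause: the system only needs a safe way to test candidate actions and observe whether the symptom clears.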