Bugs that disrupt the network data plane can bring down critical network applications. Unfortunately, no comprehensive solution exists today for detecting such bugs. Consequently, network operators have few useful leads when investigating bug-triggered problems. To bridge this gap, this project develops novel measurement-based solutions for automatic bug detection that are accurate and scalable. The results from this project will significantly improve the reliability of current and future generation networks. The expected results include network measurement techniques, data analysis methods, network-wide monitoring strategies, data structures and algorithms for high-performance bug detection, and bug localization techniques. These solutions will (1) save IT personnel resources, because network operators no longer need to troubleshoot an entire complex network but can instead focus on finding the best solution for a detected bug; (2) reduce the negative impact of bugs, because the operator can immediately take proper defensive actions; (3) improve assurance after software or hardware updates, since incorrect data plane behaviors resulting from updates can be caught automatically; and (4) improve feedback to vendors, since vendors can receive valuable information correlating detected bugs with configurations, configuration changes, updates, etc., which helps them diagnose a problem's root cause and develop solutions. The results from this project will be broadly disseminated through scientific workshops, conferences, and journals, as well as through a project web site hosted at Rice University. Software and tools produced by this project will be released to the public under open source licenses.
The outcomes of this project have substantially improved networked systems' ability to detect and survive disruptions caused by network bugs. We have developed novel network monitoring techniques that improve the efficiency of failure detection by monitoring groups of routers in aggregate, while keeping the benefits of fine-grained monitoring. One important use of these techniques is to detect errors in traffic trajectories (i.e., packet forwarding paths) which, if left undetected, may cause applications to fail and create security loopholes for network intruders to exploit. We have developed network control protocols to mitigate the ill effects of design flaws in existing protocols. Our new protocol reduces network correctness verification time by several orders of magnitude and can significantly reduce traffic delays. We have analyzed the reasons behind the poor performance of large-scale computing frameworks under failures. These frameworks are crucial to many applications, ranging from web indexing and image and document processing to high-performance scientific computing. In light of these findings, we have developed significantly more efficient strategies, based on job recomputation, for coping with the impact of failures. This project has provided many exciting opportunities for graduate student training in cutting-edge networking technologies. The project has also supported students from under-represented minority groups. Specifically, this project has provided training opportunities to two female graduate students. Four graduate students partially supported by this project have received Ph.D. degrees.