Reliable Internet performance and availability are essential for many existing and future network applications. While the Internet works well enough most of the time for most people, nearly everyone has experienced outages and service degradation that make the network unusable, and we are far from the five nines (99.999%) of reliability that critical services require. Improving Internet connectivity requires action against all sources of unavailability and poor performance.

The research community has made substantial progress toward understanding and developing technologies to address short-term outages due to BGP (Border Gateway Protocol) routing convergence. However, much less progress has been made in reducing the impact of long-term outages and route misconfiguration. Although rare, these events have a large impact on overall network availability because repairs happen on a human timescale. Additionally, many users suffer from the use of sub-optimal (high-latency or lossy) paths to network services due to misconfigurations and ineffective route selection. Operators at an affected ISP or service often encounter stumbling blocks at each step: identifying that a problem exists, localizing the root cause of the problem, and effecting a repair.

The researchers on this project will develop a system to transform this largely manual troubleshooting process into a fully automated one. The goal of the research is for persistent outages and performance problems to be identified in real time, rather than in the matter of hours it takes today. While automated diagnosis and identification of root cause is fundamentally hard, the project will benefit from dramatic recent progress in Internet measurement technologies, specifically reverse path measurement, which provides a much more complete picture of the Internet topology than ever before.

Intellectual Merit: The goal of the research project is to change the paradigm of network diagnosis on the Internet -- from blind to informed. The state of the art in network troubleshooting is to use ad hoc techniques. For instance, it is a common occurrence on the NANOG (North American Network Operators' Group) mailing list for operators to post requests asking other operators to manually issue traceroutes and report the results in order to identify network anomalies. The network could thus benefit from a continuously operated service that can not only detect network problems in real time but also identify misbehaving network elements at the granularity of routers. There are a number of challenges to deploying a functional diagnosis system, and the researchers will address them using the following key components. First, the project will produce a scalable measurement system that will synthesize measurements from different techniques to provide snapshots of routing behavior in real time. Second, the research will focus on developing a general theory of Internet path changes that will help model the propagation of routing events and identify the candidate set of responsible ASes (autonomous systems). Third, the researchers will develop inference techniques that will operate on measured data and identify the origin of failures and path changes in the wide area even when the measurement data is incomplete or subject to transient dynamics.
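
As a rough illustration of what identifying a candidate set of responsible ASes could look like, the sketch below applies a simple set-based heuristic: an AS is a suspect if it lies on every failed path and on no working path. This is an illustrative assumption, not the project's inference technique; the function name `candidate_ases` and the sample paths are hypothetical, and the AS numbers are taken from the range reserved for documentation.

```python
from typing import Iterable, Sequence, Set


def candidate_ases(
    failed_paths: Iterable[Sequence[int]],
    working_paths: Iterable[Sequence[int]],
) -> Set[int]:
    """Return ASes that appear on every failed path and on no working path.

    Hypothetical heuristic for illustration only: it assumes a single
    faulty AS and complete, stable path measurements.
    """
    failed = [set(path) for path in failed_paths]
    if not failed:
        return set()
    # An AS responsible for all failures must lie on every failed path.
    suspects = set.intersection(*failed)
    # An AS seen on a working path is assumed to be forwarding correctly.
    healthy = set().union(*(set(path) for path in working_paths))
    return suspects - healthy


# Hypothetical AS-level paths (documentation-range AS numbers).
failed = [
    [64500, 64501, 64505, 64506],
    [64502, 64505, 64506],
]
working = [
    [64500, 64501, 64503],
    [64504, 64506],
]
print(candidate_ases(failed, working))
```

With these sample paths, AS 64505 and AS 64506 are common to both failed paths, but AS 64506 also appears on a working path, so only AS 64505 remains a suspect. A real system would have to relax the assumptions stated in the docstring, since the abstract notes that measurement data may be incomplete or subject to transient dynamics.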

Broader Impact: Our society increasingly relies on the Internet for critical telecommunications services such as home health monitoring, e-911, and smart grids. It is no longer simply an inconvenience when the Internet is unavailable or inefficient. If this project is successful, it will help operators address the major sources of unavailability and misconfiguration in the Internet, benefiting all of its users. In addition, because of a lack of automated tools, operators currently spend huge amounts of time chasing down individual outages and performance misconfigurations; this raises the barrier to entry for small ISPs, ultimately raising the cost of Internet service for everyone.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1318396
Program Officer: Darleen L. Fisher
Budget Start: 2013-09-01
Budget End: 2017-08-31
Fiscal Year: 2013
Total Cost: $499,351
Name: University of Washington
City: Seattle
State: WA
Country: United States
Zip Code: 98195