Routing is arguably the most fundamental aspect of networking, since it answers the basic question: how do you place the appropriate state in routers or switches so that packets can travel from source to destination? There is a huge literature on routing, covering many topics (intradomain and interdomain, wireline and wireless, convergence and policy oscillations, etc.), and the router vendors have been honing their routing implementations for many years. After all this academic and commercial work, one might expect that there would be little new of fundamental importance to say about routing, and that all current work would involve small, incremental improvements to algorithms and implementations.
However, there are two conflicting trends that are changing the context in which routing is being used, and which necessitate a new round of routing research:
Reliability requirements: Because networks are being increasingly used for critical services (hospitals, financial institutions, etc.), the reliability expectations for networks are becoming more stringent i.e., 'five nines' of reliability). Routing is responsible for directing traffic around failures (i.e., failure recovery) and avoiding hotspots (i.e., load distribution), so the required increases in reliability must come from improving the failure recovery and load distribution mechanisms embedded in routing protocols.
Network size: Networks are growing at a rapid pace, and a new class of networks - datacenters, which can have hundreds of thousands of hosts and millions of VMs - are pushing the scaling limits as never before. In all routing algorithms, because they are essentially distributed consistency algorithms, the convergence times (for responding to failures and hotspots) and/or the routing overhead (in terms of the number and size of routing messages) increase with size.
The upshot of these two developments is that routing algorithms are being asked to do a better job (in terms of reliability) on a harder task (because of the increases in network size and complexity). As a result, both the commercial world and the academic community have embarked on a new round of routing research. These efforts first produced several ad hoc rerouting methods (such as MPLS Fast Reroute and ECMP) and then concentrated on developing multipath routing methods (such as Path Splicing and a variety of other approaches). However, all of these developments are retrofitted on top of the traditional approach to routing, which builds a single path from the source to the destination. These mechanisms significantly improve the reliability of networking, but they do not tell us how to incorporate more effective failure recovery and load distribution into the core foundation of routing algorithms.
More recently, we (along with others) have proposed a new routing paradigm, one that changes the basic output of routing from a path to a directed acyclic graph (DAG). This new approach, which here will be called Routing Along DAGs (RAD), automatically provides multiple paths for local failure recovery and load distribution. This allows RAD to, without any global route recomputations in response to failures or hotspots, guarantee connectivity (as long as the graph is connected) and provide optimal load distribution (in a simple single-destination traffic model). This project is investigating the RAD approach from many angles: design, simulation, implementation, and theory. The goal is to have a put this new routing paradigm on a firm scientific footing.
Broader Impacts: There is a pressing commercial and governmental need for increased reliability and easy-to-manage routing algorithms. This proposed project will produce prototypes of new routing and traffic engineering approaches built on commercial routing hardware (using the OpenFlow interface) that could be used to test these ideas in commercial settings. This could have a significant impact on how datacenter networking is done, and more generally improve network reliability. Promising preliminary discussions have already been held with router vendors in this regard.
This project investigated new directions for routing and traffic engineering in large-scale computer networks. One might think that routing, which is the most basic of networking tasks, would be fully understood by now. However, as in most engineered systems, the current routing solutions were designed to meet the then-current needs. In recent years, as our requirements have changed — to include greater scales (up to hundreds of thousands of nodes), more stringent reliability (requiring faster responses to failures), and complete programmability (to support SDN) — current routing solutions have been found wanting. Our investigations explored several radically new approaches to routing, including: An entirely new modularity for routing, where connectivity is the responsibility of the data plane and optimal routing is the responsibility of the control plane. A recursive style of routing where a single code-base can be used to compute routes at every level of the hierarchy. The first rigorous analysis of the power of failover routing, which has led to an intriguing conjecture about whether failover routing can withstand up to the min-cut number of failures. Opportunities for using SDN at Interconnection points to provide more flexible interdomain routing. These developments are not just of academic interest. They have the potential to make our networks much more reliable, scalable, and flexible, which would lead to better Internet service for all of us. Our investigations have opened up these and other avenues for future exploration, and suggest that there is much more to be done in the routing arena.