This project addresses the challenge of achieving extreme low latency in datacenter networks for Online Big Data (OLBD) applications which are critical workloads in datacenter computing. Remote Direct Memory Access (RDMA), which is a promising alternative to traditional TCP, significantly reduces datacenter network latencies by about an order-of-magnitude. However, RDMA adoption poses two major challenges as RDMA suffers from performance fragility under congestion, and RDMA incurs either wasted memory or significant programmer burden for typical OLBD traffic. This project develops two novel networking technologies -- Blitz and RIMA -- which enable scalable datacenter networks that achieve the low latency benefits of RDMA while avoiding its drawbacks (performance fragility, programmer burden, and wasted memory).
Blitz addresses performance fragility by decoupling edge-congestion and in-network congestion. Blitz handles edge-congestion using receiver-directed congestion control (RDCC) unlike prior approaches where senders have to infer sending rates indirectly from round-trip-times and/or dropped packets. RDCC enables accurate and fast (within-one-round-trip-time) convergence, which leads to lower latency and higher throughput. Blitz handles transient in-network congestion by deflecting packets along longer yet less-congested paths. Remote Indirect Memory Access (RIMA) addresses the second challenge by enabling reactive, on-demand memory allocation as opposed to RDMAs proactive memory allocation for the worst case, which minimizes the memory footprint without programmer effort. Together, Blitz and RIMA enable extreme low datacenter network latency for OLBD applications. The project will extensively involve Ph.D, Masters, and undergraduate students in cross-layer research activities. Results from the projects will be broadly disseminated via publication in scientific conferences.