Technology and economic trends in computer hardware have led to the widespread adoption of extremely large clusters of multicore servers as today's supercomputers. As applications are modified to exploit the increased parallelism of these additional nodes and cores, their network messages typically become both smaller and more frequent. Although the bandwidth of the interconnect networks in current supercomputer systems has nearly kept pace with increases in compute performance, there has been little improvement in the overhead of sending messages, and correspondingly little improvement in network throughput for short messages. As a consequence of these two trends, many applications scale poorly on existing supercomputing systems, and many more are expected to scale poorly as the industry moves forward with even larger supercomputer systems.

In this study, TACC is investigating a possible solution to these issues: a new network switch and network interface architecture that can sustain full network bandwidth using very short messages. The project is examining the programming models required to use this network efficiently and evaluating the performance of the new interconnect in direct comparison with a modern quad-data-rate InfiniBand interconnect of the kind widely used in current supercomputers.

The evaluation of the new system is being conducted through a number of case studies. Implementations of algorithms known to exhibit poor parallel scaling on standard systems due to network performance limitations are being ported to the new system and instrumented with timers and hardware performance counters to document their detailed performance characteristics. This study will provide an initial evaluation of the technical viability of the current implementation of the new architecture for these algorithms, and is expected to yield recommendations for future enhancements to the architecture.

Project Report

For the past decade, most supercomputers used in high-performance scientific computing have not been custom-designed systems, but have been assembled as "clusters" of hundreds (initially) to thousands (currently) of general-purpose computers (referred to as "compute nodes"), often linked together by a specialized high-performance network. The most popular high-performance networks in recent systems are designed to deliver messages from one compute node to another in exactly the same order in which they were sent. This has the advantage of making software easier to write, but the disadvantage that messages are more likely to interfere with each other, causing subsequent messages to be delayed, and therefore slowing down the programs that are waiting on those messages. On clusters made up of thousands of compute nodes, it is not unusual for these "in-order" networks to deliver only 1/10 to 1/2 of the expected data transfer rates, and it is not generally possible to predict what level of performance will be provided by a specific combination of programs running on the system. Both the loss of performance and the inability to predict that loss reduce the productivity of supercomputers using these networks, and discourage users from running applications that spend a large fraction of their time transferring data across the network.

For this project, we evaluated the issues involved in using a network that does not guarantee that messages are delivered in order, and that uses what amounts to random routing of messages to minimize their ability to interfere with each other. This enables much higher utilization of the network, but means that extra work is required if a program needs to receive messages in the same order in which they were sent. Most programs don't require all messages to arrive in order, but almost all programs require that some specific sequences of messages arrive in the order in which they were sent. Rather than forcing all messages to arrive in order, we considered what additional hardware and software mechanisms were required to deliver particular messages in order, but only when "in-order delivery" was requested.

From previous work we knew that we could force messages to be delivered in order by sending the first message, waiting for an acknowledgment from the recipient, and then sending the second message. We also knew that these extra delays while waiting for acknowledgments could (in some cases) cause a greater loss in performance than using a network with guaranteed "in-order" delivery. What we discovered during this project was that certain mathematical proofs about the theoretical properties of message-passing systems could be used to point us to a much more efficient way to provide in-order delivery when it is needed. Instead of waiting for round-trip acknowledgments, most of the time we are able to combine "sequence numbers" added to the messages with multiple receive buffers to eliminate the round-trip synchronizations. The "sequence numbers" are set by the sending compute node and are used by either hardware or software on the network interface of the receiving compute node to specify the order in which the messages should be delivered to the processor(s).
We combine this with the use of multiple buffers to allow a compute node to continue working after sending a message, and then to send additional messages to the receiving compute node without waiting to find out whether the receiving node has finished reading the first set of messages. Although round trips are still required in some special cases (such as coordinating the startup of a program running on multiple compute nodes), our analysis shows that the overhead required to enable in-order message delivery should almost always be much smaller than the benefit we expect from using an "out-of-order" network that operates more efficiently (because of reduced interference between messages and reduced delays when that interference occurs).
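To make this mechanism concrete, the sketch below shows one way a receiving node (in hardware or software on its network interface) could combine sender-assigned sequence numbers with a ring of receive buffers to provide in-order delivery over an out-of-order network. This is a minimal illustration under simplified assumptions, not the project's actual implementation: the message format, the window size, and all of the names (Message, ReorderBuffer, rb_receive, WINDOW) are invented for the example.

    #include <stdio.h>
    #include <stdbool.h>
    #include <string.h>
    #include <stdint.h>

    #define WINDOW 8            /* number of receive buffers (per sender) */
    #define PAYLOAD 64

    /* A message as it arrives off the (possibly out-of-order) network. */
    typedef struct {
        uint32_t seq;           /* sequence number set by the sending node */
        char     payload[PAYLOAD];
    } Message;

    /* Per-sender reorder state: a ring of WINDOW buffers indexed by seq. */
    typedef struct {
        Message  slot[WINDOW];
        bool     full[WINDOW];
        uint32_t next_seq;      /* next sequence number owed to the app */
    } ReorderBuffer;

    static void rb_init(ReorderBuffer *rb) {
        memset(rb, 0, sizeof *rb);
    }

    /* Called for each arriving message; hands any contiguous in-order
       run of messages to the application, with no round trip needed. */
    static void rb_receive(ReorderBuffer *rb, const Message *m,
                           void (*deliver)(const Message *)) {
        /* Duplicate or out-of-window message: a real transport would
           throttle or NAK the sender; this sketch simply drops it. */
        if (m->seq < rb->next_seq || m->seq >= rb->next_seq + WINDOW)
            return;
        unsigned idx = m->seq % WINDOW;
        rb->slot[idx] = *m;
        rb->full[idx] = true;
        /* Drain the longest contiguous run starting at next_seq. */
        while (rb->full[rb->next_seq % WINDOW] &&
               rb->slot[rb->next_seq % WINDOW].seq == rb->next_seq) {
            unsigned i = rb->next_seq % WINDOW;
            deliver(&rb->slot[i]);
            rb->full[i] = false;
            rb->next_seq++;
        }
    }

    static void print_msg(const Message *m) {
        printf("delivered seq %u: %s\n", m->seq, m->payload);
    }

    int main(void) {
        ReorderBuffer rb;
        rb_init(&rb);
        /* Simulate out-of-order arrival: seq 1 before seq 0. */
        Message a = { .seq = 1 }; strcpy(a.payload, "second");
        Message b = { .seq = 0 }; strcpy(b.payload, "first");
        Message c = { .seq = 2 }; strcpy(c.payload, "third");
        rb_receive(&rb, &a, print_msg);  /* buffered, nothing delivered */
        rb_receive(&rb, &b, print_msg);  /* delivers seq 0, then seq 1 */
        rb_receive(&rb, &c, print_msg);  /* delivers seq 2 */
        return 0;
    }

Note that the sender in this sketch may keep transmitting up to WINDOW messages without waiting for any reply, which is exactly how the multiple buffers eliminate the round-trip synchronizations; the out-of-window case where a round trip is still required corresponds to the special cases discussed above.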

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 1240652
Program Officer: Almadena Chtchelkanova
Budget Start: 2012-10-01
Budget End: 2013-09-30
Fiscal Year: 2012
Total Cost: $149,888
Name: University of Texas Austin
City: Austin
State: TX
Country: United States
Zip Code: 78759