This project aims to improve the performance of large scientific simulations on next-generation high-performance computing (HPC) systems by developing new strategies for task mapping, the assignment of specific parts of an application to each of the many processing nodes in an HPC system. A key determinant of application performance on HPC systems is the speed of message delivery between related parts of the application. This in turn depends upon the network connecting the processing nodes. Task mapping has the potential to improve network performance by arranging communicating parts of the application in a way that distributes messages more evenly through the network, preventing any part of it from becoming overloaded. The PI has previously shown that task mapping can reduce application running time on current HPC systems by up to 30%. New algorithms are needed for larger next-generation systems, which must use novel network topologies for the internode connections due to power limitations.

The specific network topology studied is Dragonfly, two variations of which are used in commercial systems. Dragonfly organizes network switches into groups, each of which acts as a high-radix "virtual switch". This allows a direct connection between every pair of groups. Together with high connectivity between the switches within a group, this guarantees that every pair of nodes is connected by a short path. The drawback is that each pair of groups shares only a single direct connection, which makes that connection a potential bottleneck. The project will develop task mapping algorithms that balance two competing goals: localizing related tasks to exploit intra-group connectivity, and spreading the job across the system so that it can use many inter-group links simultaneously. The project will also develop node allocation algorithms to support jobs of varying size, providing each job with nodes that are well-connected while minimizing contention between jobs.
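The trade-off between localizing tasks and spreading them can be illustrated with a minimal sketch. It is a toy model rather than the project's algorithm, and it rests on assumptions not stated in the abstract: exactly one direct link per pair of groups, minimal routing (an inter-group message always takes the direct link between its two groups), and an all-to-all communication pattern with one message per task pair. The function name, the constants, and both mappings are hypothetical.

```python
from collections import Counter
from itertools import combinations


def global_link_loads(mapping, messages, nodes_per_group):
    """Count messages crossing each inter-group link.

    Assumes minimal routing: a message between two groups uses the single
    direct link joining them; intra-group messages use no global link.
    """
    loads = Counter()
    for src, dst in messages:
        g_src = mapping[src] // nodes_per_group
        g_dst = mapping[dst] // nodes_per_group
        if g_src != g_dst:
            loads[frozenset((g_src, g_dst))] += 1
    return loads


# Hypothetical toy system: 4 groups of 8 nodes, one 16-task job.
NODES_PER_GROUP = 8
NUM_GROUPS = 4
NUM_TASKS = 16

# All-to-all pattern: one message per unordered pair of tasks.
messages = list(combinations(range(NUM_TASKS), 2))

# Packed: tasks fill nodes in order, so the job occupies groups 0 and 1 only.
packed = {t: t for t in range(NUM_TASKS)}

# Spread: tasks are dealt round-robin across all four groups.
spread = {
    t: (t % NUM_GROUPS) * NODES_PER_GROUP + t // NUM_GROUPS
    for t in range(NUM_TASKS)
}

for name, mapping in (("packed", packed), ("spread", spread)):
    loads = global_link_loads(mapping, messages, NODES_PER_GROUP)
    print(
        name,
        "busiest link:", max(loads.values()),
        "total inter-group messages:", sum(loads.values()),
    )
```

In this toy model the packed mapping sends all 64 of its inter-group messages over one link, while the spread mapping sends 96 inter-group messages in total but places only 16 on the busiest link. Spreading relieves the bottleneck at the cost of more global traffic, which is the tension the mapping algorithms must balance.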

By improving application performance on next-generation HPC systems, the project will help realize the full potential of these powerful systems. In addition, the project will heavily involve undergraduate student researchers, who will be trained as future leaders in science and engineering.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1423413
Program Officer: Marilyn McClure
Project Start:
Project End:
Budget Start: 2014-12-01
Budget End: 2019-05-31
Support Year:
Fiscal Year: 2014
Total Cost: $266,631
Indirect Cost:
Name: Knox College
Department:
Type:
DUNS #:
City: Galesburg
State: IL
Country: United States
Zip Code: 61401