The physical and economic principles that enabled Dennard scaling and Moore's law in the semiconductor industry have reached their breaking point. As the number of transistors that can be economically fabricated on a single chip plateaus, the processor industry has pivoted to single-package computing systems composed of multiple sub-components known as chiplets. Chiplets, which communicate via high-bandwidth on-package networks, offer the potential for transparent performance scaling into the next decade. However, they introduce challenging non-uniform memory access characteristics into single-package systems that have traditionally not been subject to these effects. This project develops techniques to overcome the challenges of non-uniform memory access on high-performance single- and multi-package systems without programmer intervention. Exploring programmer-transparent scaling mechanisms improves the portability and lifetime of programs, decreasing the cost and complexity of software. Through the creation of course content and undergraduate summer internships, the project fosters an understanding of how to program machines in a post-Moore world and how compute accelerators should be designed to minimize the impact on the end programmer as system complexity increases.

This project develops coordinated data placement and thread scheduling algorithms that leverage static information from the compiler and dynamic information from the runtime system to inform data placement and hardware-based thread scheduling. It advances the state of the art by developing an open-source Graphics Processing Unit (GPU) simulator with a hierarchical interconnect that can model both chiplet-based GPUs and multi-GPU systems. The researchers are exploring compiler-informed data placement and thread scheduling in GPUs. Initial results demonstrate that static analysis of the code can predict the data accessed by GPU threadblocks. Analysis shows that it is possible to determine which threads in a grid share memory pages, and the manner of that sharing, by building new static techniques that add an additional dimension to decades of work on compilers for sequential code. Using this static information in combination with runtime information provided by GPU drivers, the researchers are developing advanced data placement, prefetching, and thread scheduling algorithms. Both future chiplet-based designs and existing multi-GPU systems benefit from these algorithms.

Looking beyond the high-bandwidth memory used in GPUs today, the project explores the system-level implications of heterogeneous memory in a chiplet-based system. Data placement and thread scheduling are even more important in future GPU systems that combine high-bandwidth memory, traditional dynamic random-access memory, and non-volatile memory. The problem sizes in such systems are anticipated to be so large that opportunistic data placement and thread scheduling are even more critical than in conventional systems. The project uses sharing patterns based on the inter-kernel producer-consumer nature of machine learning workloads to change a program's code layout, runtime data placement, and threadblock scheduling algorithm to maximize locality in multi-node systems.
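The claim that static analysis can predict the data accessed by each threadblock is easiest to see for kernels whose memory indices are affine in the block and thread coordinates. The following minimal CUDA sketch is hypothetical and not from the project; the helper pages_for_block is invented for illustration. It shows how a threadblock's page footprint follows directly from such an index expression, without running the kernel:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel with an affine global index: i = blockIdx.x * blockDim.x + threadIdx.x.
__global__ void scale(const float *in, float *out, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * in[i];
}

// Because the index is affine in blockIdx.x, block b touches the byte range
// [b*block_dim*elem_size, (b+1)*block_dim*elem_size) of each array, so its
// first and last page are computable statically. (Hypothetical helper.)
static void pages_for_block(int b, int block_dim, size_t elem_size,
                            size_t page_size, size_t *first, size_t *last) {
    size_t lo = (size_t)b * block_dim * elem_size;
    size_t hi = lo + (size_t)block_dim * elem_size - 1;
    *first = lo / page_size;
    *last  = hi / page_size;
}

int main() {
    size_t first, last;
    // With 256-thread blocks over float data and 4 KiB pages, block 7's
    // footprint in each array is bytes 7168..8191, i.e., page 1.
    pages_for_block(7, 256, sizeof(float), 4096, &first, &last);
    printf("block 7 touches pages %zu..%zu of each array\n", first, last);
    return 0;
}
```

A runtime armed with such per-block footprints could place the corresponding pages on the chiplet or GPU where each block is scheduled, which is the kind of coordinated placement-and-scheduling decision described above.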

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Agency: National Science Foundation (NSF)
Institute: Division of Computing and Communication Foundations (CCF)
Type: Standard Grant
Application #: 1910924
Program Officer: Almadena Chtchelkanova
Budget Start: 2019-10-01
Budget End: 2022-09-30
Fiscal Year: 2019
Total Cost: $495,380
Name: Purdue University
City: West Lafayette
State: IN
Country: United States
Zip Code: 47907