Computational science, which involves modeling and simulating phenomena such as combustion in engines, has become the third pillar of science and engineering research, alongside theory and experiment. Computer simulations test design parameters much more cheaply than physical experiments. They also participate in a virtuous cycle with theory by enabling inexpensive experimentation with theoretical models. Mapping computer simulations to high-performance computer architectures is a challenging computer science problem: the mapping must achieve high performance and use computing resources effectively without overburdening scientific programmers. This challenge is becoming more severe as architectures continue to evolve in ways that make them ever more difficult to program. In this project, the PIs will reduce programmer burden by developing a programming abstraction called loop chaining, which enables architecture-specific program optimizations by compilers. This work will let scientists spend less time on low-level performance programming details and more time evolving the scientific models that help push science and engineering forward.
Exposing opportunities for parallelization while explicitly managing data locality is the primary challenge in porting and optimizing existing computational science simulation codes. The most popular programming models used in these codes, such as MPI (Message Passing Interface), require programmers to determine the data and computation distribution explicitly. This has led to good scaling across compute nodes, but parallelism and locality are needed within a node as well. There are many approaches to implementing shared-memory parallelism, but with most of them it is the programmer's responsibility to group computations to improve data locality. This project focuses on developing the loop chain abstraction to give compilers enough information to automate the tradeoff between parallelism and data locality. Preliminary results show that using the loop chain abstraction can significantly improve parallel scalability. The intellectual merit is that the loop chain abstraction will enable existing codes to maintain their software modularity while exposing information critical to performance optimizations that improve parallel scalability. Important contributions of this research include recasting existing program optimizations to take the loop chain abstraction as input and eventually incorporating the loop chain abstraction into parallel programming languages. The broader impacts include reducing the burden on scientists developing computational simulations, sharing the developed compiler prototypes as open-source software, and providing tutorials on source-to-source loop chain-based tiling transformations in C++ and Fortran code. The testbed for loop chaining will include atmospheric science, materials, and combustion codes; tunable versions of these applications will therefore be released.
Additionally, a new course module will be developed through which students will be trained in computational science and, specifically, in how to expose loop chains within simulation software.
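To make the idea of a loop chain concrete, the following is a minimal illustrative sketch, not the project's compiler, API, or tiling transformation. It assumes a hypothetical chain of two 3-point averaging loops over a one-dimensional array, and the function names run_series and run_fused are invented for this example. The first function runs the loops one after another, as modular code would be written; the second runs both loops in a single sweep with a shift of one iteration, one simple schedule that reuses intermediate values soon after they are produced. Both compute the same result.

```python
def run_series(a):
    """Original schedule: each loop in the chain sweeps the whole array."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    # Loop 1 of the chain: 3-point average of a into b.
    for i in range(1, n - 1):
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
    # Loop 2 of the chain: 3-point average of b into c.
    for i in range(2, n - 2):
        c[i] = (b[i - 1] + b[i] + b[i + 1]) / 3.0
    return c


def run_fused(a):
    """Alternative schedule (assumed for illustration): one sweep, loop 2 shifted by one."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    for i in range(1, n - 1):
        # Iteration i of loop 1.
        b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
        # Iteration j of loop 2 reads b[j-1], b[j], b[j+1]; with j = i - 1,
        # all three values have been computed by this point in the sweep.
        j = i - 1
        if j >= 2:
            c[j] = (b[j - 1] + b[j] + b[j + 1]) / 3.0
    return c
```

In this sketch the programmer chooses the schedule by hand. The purpose of the loop chain abstraction, as described above, is to give a compiler enough information about such a sequence of loops and the data they share to make this kind of choice automatically.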