Multi-core (parallel) processors are becoming ubiquitous. The use of such systems is key to science, engineering, finance, and other major areas of the economy. However, increased applications performance on such systems can only be achieved with advances in mapping such applications to multi-core machines. This task is made more difficult by the presence of complex memory organizations which is perhaps the key bottleneck to efficient execution, and which was not previously addressed effectively. This research involves making the mapping of the program to the machine aware of the complexities of the memory-hierarchy in all phases of the compilation process. This will ensure a good fit between the application code and the actual machine and thereby guarantee much more effective utilization of the hardware (and thus efficient/fast execution) than was previously possible.

Modern processors (multi-cores) employ increasingly complex memory hierarchies. Management of such hierarchies is becoming critical to the overall success of the compilation process since effective utilization of the memory hierarchy dominates overall performance. This research develops a new cache-hierarchy-aware compilation and runtime system (i.e., including compilation, scheduling, and static/dynamic processor mapping of parallel programs). These tasks have one thing in common: they all need accurate estimates of data element (iteration, task) computation and memory access times which are currently beyond the (cache-oblivious) state-of-the-art. This research thus develops new techniques for iteration space partitioning, scheduling, and synchronization which capture the variability due to cache, memory, and conditional statement behavior and their interaction. This research will have a broad impact on the computer industry as it will allow the ubiquitous multi-core systems of the future to be efficiently exploited by parallel programs.

Project Report

In a world where muti-core processors have become ubiquitous, execution of a given program on multiple cores (processors) is the key to further performance and productivity improvements. To achieve this, the program has to be broken up into multiple "chunks" which can then execute concurrently on different cores. Such breakup is exceedingly difficult to achieve, especially while significantly increasing performance, because the meaning of the overall program has to be preserved despite the complex changes made to the code and in the presence of the extreme complexity of the organization of modern computers. For example, modern processors have increasingly complex memory organization, and the growing number of cores dramatically complicates coordination of the interaction between the various chunks executing concurrently. This makes faster execution very hard to achieve, even when done by hand by experienced users – an extraordinarily tedious, time consuming and error-prone task. This project has developed several new techniques that automate these very difficult tasks allowing the user to focus on the scientific challenges rather than having to deal with such low-level details. Our results demonstrate the wide applicability of the techniques, often doubling the execution speed on a variety of important scientific and engineering programs on state of the art Intel multi-core processors.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Communication Foundations (CCF)
Application #
0811882
Program Officer
Almadena Y. Chtchelkanova
Project Start
Project End
Budget Start
2008-09-01
Budget End
2012-08-31
Support Year
Fiscal Year
2008
Total Cost
$355,717
Indirect Cost
Name
University of California Irvine
Department
Type
DUNS #
City
Irvine
State
CA
Country
United States
Zip Code
92697