In the multi-core processor era, microprocessors will only continue to scale in performance in the presence of abundant thread-level parallelism. Achieving this goal of continuously scaling software parallelism will clearly require the exploitation of new compilation, language, and execution paradigms. One huge barrier to the viability of many proposed execution paradigms, and to the introduction of new ones, is the inability of modern processor architectures to execute short threads efficiently. Many of these new execution models can be highly effective at exposing parallelism to the hardware if they have the freedom to identify and exploit opportunities for parallelism that are tens to hundreds of instructions long. However, current machines are not designed to execute short threads well. The goal of this research is to significantly reduce the startup cost for a new thread (or a thread new to a core). This in turn reduces the break-even point that determines whether a piece of code can profitably be parallelized.

The term "thread migration" is used here to cover a large number of parallel execution operations, all of which involve moving stored or cached state from one core to another. These operations include forked threads, threads migrated for thermal management or load balancing, loop-parallel threads, task-level parallelism, helper threading, transactional execution, and speculative multithreading - all of these operations will be accelerated to some degree by this research. This research will attack all sources of the thread startup cost, including software (e.g., operating system) overheads, branch predictor state, cached data and instruction state, the commit latency, and the overhead of transferring the primary thread state between cores. In addition to reducing parallel programming complexity, the broader impacts of this research include graduate and undergraduate student training and the availability of an open-source simulation infrastructure.

Project Report

As we progress into the manycore era, we become increasingly dependent on high levels of thread-level parallelism for performance scaling. This will require new programming and execution models that expose more parallelism. An important barrier to the viability of many proposed execution models is the inability to execute short threads efficiently, due to the overhead of copying a thread's cache state between cores. Several new execution models significantly decrease the mean core occupancy times of threads, or otherwise increase the frequency of thread state transfers between cores. This work explored both new architectures that better exploit execution paradigms employing short threads and new execution paradigms that could exploit such an architecture. These included:

1. Architecture Support for Fast Migration via Cache Working Set Migration -- The most significant source of lost performance when a thread migrates between cores is the loss of cache state. A significant boost in post-migration performance is possible if the cache working set can be moved, proactively, with the thread. This work accelerates thread startup performance after migration by predicting and prefetching the working set of the application into the new cache. Working set prediction as much as doubles the performance of short-lived threads.

2. Data-Triggered Threads -- This work introduces the concept of data-triggered threads. Unlike threads in conventional parallel programming models, these threads are initiated by a change to a memory location. This enables increased parallelism and the elimination of redundant, unnecessary computation. This work shows that 78% of all loads fetch redundant data, leading to a high incidence of redundant computation. By expressing computation through data-triggered threads, that computation is executed once when the data changes and is skipped whenever the data does not change.
Significant follow-up to this initial work is addressed in another NSF grant.

3. Inter-Core Prefetching -- Multicore processors have become ubiquitous in today's systems, but exploiting the parallelism they offer remains difficult, especially for legacy applications and applications with large serial components. The challenge, then, is to develop techniques that allow multiple cores to work in concert to accelerate a single thread. Inter-core prefetching uses one compute thread and one or more prefetching threads. The prefetching threads execute on cores that would otherwise be idle, prefetching the data that the compute thread will need. The compute thread then migrates between cores and finds the data already waiting for it. Inter-core prefetching improves performance by an average of 31 to 63%, speeds up some applications by as much as 2.8x, and reduces energy consumption by 11 to 26%.

4. Execution Migration in a Heterogeneous-ISA Chip Multiprocessor -- Prior research has shown that single-ISA heterogeneous chip multiprocessors have the potential for greater performance and energy efficiency than homogeneous CMPs. However, restricting the cores to a single ISA removes an important opportunity for greater heterogeneity. To take full advantage of a heterogeneous-ISA CMP, we must be able to migrate execution among heterogeneous cores in order to adapt to program phase changes and changing external conditions (e.g., system power state). This is non-trivial because program state is kept in an architecture-specific form; therefore, state transformation is necessary for migration. This work identifies large portions of program state whose form is not critical for performance; the compiler is modified to produce programs that keep most of their state in an architecture-neutral form, so that only a small number of data items must be repositioned and no pointers need to be changed.
The result is low migration cost with minimal sacrifice of non-migration performance.

5. Architecture Optimization of Multiple-ISA Heterogeneous Architectures -- The work described just above makes multiple-ISA heterogeneous architectures viable. This work demonstrates the inherent architectural advantages of such designs and develops first principles for their design, exploiting the diversity offered by three modern ISAs: Thumb, x86-64, and Alpha. This architecture has the potential to outperform the best single-ISA heterogeneous architecture by as much as 21%, with 23% energy savings and a 32% reduction in Energy Delay Product.

6. Coalition Threading -- Non-traditional parallelism provides parallel speedup for a single thread without the need to manually divide and coordinate computation. This work describes coalition threading, a technique that seeks the ideal combination of traditional and non-traditional threading to make the best use of available hardware parallelism. Coalition threading provides up to 2x gains over traditional parallel techniques on individual loops. This work provides a heuristic for identifying loops that benefit from a combination of traditional and non-traditional parallelism and those that will perform best with a single technique. Using this heuristic, coalition threading provides an average gain of 17% across all loops and an average speedup of 16.7% for full applications over traditional parallelism.
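The working-set migration idea described above can be illustrated with a toy software analogue. The sketch below is purely hypothetical and not the hardware mechanism evaluated in this work: the `Core` class, its LRU-ordered cache, and the `migrate_with_working_set` helper are illustrative names, and real working-set prediction operates on cache lines in hardware rather than Python dictionaries.

```python
from collections import OrderedDict

class Core:
    """Toy core with an LRU-ordered cache (illustrative only)."""
    def __init__(self):
        self.cache = OrderedDict()   # address -> data, least recent first
        self.misses = 0

    def access(self, addr, memory):
        if addr not in self.cache:
            self.misses += 1         # cold miss: fetch from memory
            self.cache[addr] = memory[addr]
        self.cache.move_to_end(addr) # mark as most recently used

def migrate_with_working_set(src, dst, n):
    """Proactively copy the n most-recently-used lines along with the thread."""
    for addr in list(src.cache)[-n:]:
        dst.cache[addr] = src.cache[addr]

memory = {a: a * 10 for a in range(8)}
old, new = Core(), Core()
for a in range(8):
    old.access(a, memory)            # thread builds its working set on the old core
migrate_with_working_set(old, new, 4)
for a in range(4, 8):
    new.access(a, memory)            # post-migration accesses hit the warmed cache
print(new.misses)                    # prints 0: no cold misses after migration
```

Without the proactive copy, the same four post-migration accesses would all miss in the new core's empty cache, which is the startup cost this work targets.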
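The data-triggered-threads concept can likewise be sketched in ordinary software. The `DataTriggeredCell` class below is a hypothetical analogue, not the mechanism from this work: it fires an ordinary Python thread on a changed write, whereas the actual proposal relies on architecture and compiler support to skip redundant computation.

```python
import threading

class DataTriggeredCell:
    """Sketch: a memory cell whose writes trigger a recomputation thread
    only when the stored value actually changes (illustrative only)."""
    def __init__(self, value, compute):
        self._value = value
        self._compute = compute              # derived computation over the cell
        self._result = compute(value)
        self.recomputations = 0

    def write(self, value):
        if value == self._value:
            return                           # redundant write: computation skipped
        self._value = value
        t = threading.Thread(target=self._recompute)
        t.start()                            # data-triggered thread fires on change
        t.join()                             # joined immediately to keep the sketch deterministic

    def _recompute(self):
        self.recomputations += 1
        self._result = self._compute(self._value)

    def read(self):
        return self._result                  # reads see the precomputed result

cell = DataTriggeredCell(3, lambda x: x * x)
cell.write(3)                                # same value: no thread triggered
cell.write(4)                                # changed value: triggers recomputation
print(cell.read(), cell.recomputations)      # prints: 16 1
```

The key property mirrors the text: the derived computation runs once when the data changes and is skipped entirely on redundant writes.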
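The inter-core prefetching pattern can also be approximated in software. In the hedged sketch below (the `inter_core_prefetch` function and its parameters are invented for illustration), a helper thread warms the next chunk of data while the compute thread consumes the current one, standing in for a prefetch thread warming the cache of the core the compute thread will migrate to.

```python
import threading, queue

def inter_core_prefetch(data, chunk_size, load):
    """Sketch: a prefetch thread warms chunks ahead of the compute thread."""
    warmed = queue.Queue(maxsize=1)      # at most one chunk warmed ahead

    def prefetcher():
        for start in range(0, len(data), chunk_size):
            end = min(start + chunk_size, len(data))
            # simulate warming a core's cache: pre-load values for the next chunk
            warmed.put([load(x) for x in data[start:end]])
        warmed.put(None)                 # signal end of data

    threading.Thread(target=prefetcher, daemon=True).start()

    total = 0
    while (chunk := warmed.get()) is not None:
        for v in chunk:                  # compute thread finds data already waiting
            total += v
    return total

print(inter_core_prefetch(list(range(10)), 4, lambda x: 2 * x))  # prints 90
```

The bounded queue captures the handoff: the prefetcher stays exactly one chunk ahead, just as the prefetch thread in this work runs ahead of the compute thread on an otherwise idle core.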

Budget Start: 2010-08-01
Budget End: 2014-07-31
Fiscal Year: 2010
Total Cost: $468,000
Name: University of California San Diego
City: La Jolla
State: CA
Country: United States
Zip Code: 92093