The AfterBurner project investigates how to improve single-thread performance on both simple and high-performance out-of-order cores in an energy-efficient way. Aside from explicit parallelism, this is the primary challenge of multi-core architectures going forward. The most energy-efficient way to improve single-thread performance is to accelerate low-performing program regions. This approach yields the greatest benefit. It also has a low cost because it does not require high-bandwidth execution, making it applicable to both simple and high-performance cores. Low single-thread performance is caused by squashes due to control and data mis-speculations and by long-latency loads and stores, which clog the pipeline. AfterBurner unifies two recently proposed techniques: speculative retirement, which can efficiently buffer large numbers of completed instructions, and selective re-execution, which can re-execute dynamically generated program subgraphs to back-patch program state. It uses them to tolerate all four classes of low-performance events. AfterBurner's multi-purpose infrastructure approach to performance reduces cost, simplifies design, and expands applicability to code that suffers from different low-performance events simultaneously.
In addition to education and student training, the AfterBurner project marks the beginning of a systems research collaboration between the University of Pennsylvania and Drexel University computer science departments.
With computing devices important to all aspects of society and increasingly used in mobile environments, it is essential that the next generation of microprocessors be energy-efficient and run the software applications users care about faster than previous processors. In addition, the design of energy-efficient processors requires cooperation between processor architects and circuit designers to implement designs that reduce energy consumption without running programs more slowly. This project invented new architecture techniques for the next generation of microprocessors. The project's three major results were: (1) finding that register reference counting can be implemented efficiently and used to manage register allocation; (2) saving energy through intelligent management of the processor's register file using new gating and allocation algorithms; (3) inventing an energy-efficient latency-tolerant microarchitecture. The project disseminated results through public conference publications and PhD dissertations, and it provided training for three supported PhD students now employed by the high-technology industry in the United States. The AfterBurner project was a collaboration between Professors Milo Martin and Amir Roth at the University of Pennsylvania, Professor Andrew Hilton at Duke University, and Professor Mark Hempstead at Drexel University under a subcontract. The goal of the AfterBurner project was to study new microarchitecture techniques to improve the performance of regions of code that exhibit small amounts of instruction-level parallelism. These regions were targeted because they are a key factor limiting program performance. The new microarchitecture techniques were evaluated in terms of both performance and energy efficiency.
Modern microprocessors contain large register files to store the temporary values of many simultaneous calculations, and the register file consumes a significant fraction of the total power consumed by a processor core. This project explored using register reference counting to track, with a single vector of bits, which registers have been allocated to software-visible registers. The work included the first detailed analysis of reference counting based on circuit and architecture simulations. The analysis indicates that reference counting can be used to assign registers to one region of the register file for more efficient power gating of the registers, and also to save registers by eliminating unnecessary move instructions. These and other techniques using reference counting improved performance and energy efficiency. This work was nominated for best paper at the conference on High Performance Computer Architecture (HPCA) 2012. The project also included studies on reducing the power consumption of a modern microprocessor by gating unused portions of the processor’s register file. These results are described in a manuscript published in the IEEE International Conference on Computer Design (ICCD) 2013, which analyzes several algorithms for allocating registers to register banks and several policies for when to disable and enable a register file bank, seeking the best power savings at the lowest cost. The study concluded that the best combination is to gate banks immediately when they are empty and to allocate each new register to the fullest (but not completely full) bank. Such approaches can save up to 50% of register file energy. The third contribution is the invention of an energy-efficient latency-tolerant microarchitecture. This architecture provides the performance benefits delivered by other latency-tolerant microarchitectures without the significant energy cost.
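The register-file gating policy described above (gate a bank as soon as it is empty, and allocate each new register to the fullest bank that still has room) can be illustrated with a small Python sketch. This is only a toy model under stated assumptions, not the project's implementation: the class name, the bank and register counts, and the tie-breaking rule (lowest bank index) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BankedRegisterFile:
    """Toy model of a banked register file with per-bank power gating."""
    num_banks: int = 4        # hypothetical size, not from the source
    regs_per_bank: int = 8    # hypothetical size, not from the source
    in_use: list = field(init=False)  # in_use[b] = set of live slots in bank b

    def __post_init__(self):
        self.in_use = [set() for _ in range(self.num_banks)]

    def powered_banks(self):
        # A bank is gated off the moment it holds no live registers,
        # so only non-empty banks are powered.
        return [b for b, slots in enumerate(self.in_use) if slots]

    def allocate(self):
        # Consider only banks that still have at least one free slot.
        candidates = [b for b in range(self.num_banks)
                      if len(self.in_use[b]) < self.regs_per_bank]
        if not candidates:
            return None  # register file is full
        # Pick the fullest non-full bank; ties go to the lowest index
        # (the tie-break is an assumption of this sketch).
        bank = max(candidates, key=lambda b: (len(self.in_use[b]), -b))
        slot = min(set(range(self.regs_per_bank)) - self.in_use[bank])
        self.in_use[bank].add(slot)
        return (bank, slot)

    def free(self, reg):
        bank, slot = reg
        self.in_use[bank].discard(slot)


rf = BankedRegisterFile()
regs = [rf.allocate() for _ in range(9)]  # 8 fill bank 0, the 9th spills to bank 1
print(rf.powered_banks())                 # [0, 1]
rf.free(regs[-1])                         # bank 1 becomes empty and is gated
print(rf.powered_banks())                 # [0]
```

Packing live registers into as few banks as possible is what lets the remaining banks stay gated; the model ignores wake-up latency and gating overheads, which a real policy evaluation would need to account for.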
The latency-tolerant microarchitecture predicts long-latency instructions, such as load instructions that miss in the cache, and sends these instructions and all of their dependent instructions to an in-order queue structure. This technique extends the instruction window of the microprocessor, allowing other instructions that are not dependent on the load instruction to execute. This work is further described in the dissertation of supported PhD student Steven Battle.
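The steering step described above can be sketched in Python as follows. This is a simplified illustration, not the design from the dissertation: the instruction tuple format, the `steer` function, the "poisoned register" bookkeeping, and the example program are all assumptions introduced here to show how a predicted long-latency load and its dependents might be separated from independent work.

```python
from collections import deque


def steer(instructions, predicted_long_latency):
    """Split a program-order instruction stream into work that can run now
    and a deferred in-order queue.

    instructions: list of (name, dest_reg_or_None, [source_regs])  (hypothetical format)
    predicted_long_latency: set of instruction names predicted to be slow
    """
    poisoned = set()    # registers whose value depends on a deferred instruction
    deferred = deque()  # in-order queue of deferred instructions
    executed = []       # independent instructions that proceed normally

    for name, dest, srcs in instructions:
        if name in predicted_long_latency or any(s in poisoned for s in srcs):
            # The instruction is slow itself or depends on a slow one: defer it
            # and mark its result as unavailable.
            deferred.append(name)
            if dest is not None:
                poisoned.add(dest)
        else:
            executed.append(name)
            if dest is not None:
                # A ready value overwrites the register, clearing the dependence.
                poisoned.discard(dest)
    return executed, deferred


program = [
    ("ld1",  "r1", ["r9"]),  # load predicted to miss in the cache
    ("add1", "r2", ["r1"]),  # depends on ld1
    ("mul1", "r3", ["r8"]),  # independent
    ("sub1", "r1", ["r3"]),  # overwrites r1 with an independent value
    ("add2", "r4", ["r1"]),  # reads the new r1, so it is independent
    ("st1",  None, ["r2"]),  # depends on add1
]
executed, deferred = steer(program, {"ld1"})
print(executed)        # ['mul1', 'sub1', 'add2']
print(list(deferred))  # ['ld1', 'add1', 'st1']
```

The deferred queue preserves program order among the dependent instructions, so they can be processed in order once the load's data arrives, while the independent instructions make progress in the meantime.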