Dennard's scaling, which governs the growth of power, voltage and frequency of CMOS integrated chips, has been as instrumental as Moore's law in enabling the exponential growth of the number of active transistors on a chip. Unfortunately, the recent slowing down of Dennard's scaling of the supply voltage in future multicores may result in dark silicon where an increasing number of cores must be kept powered down due to lack of power. One alternative is to improve power efficiency by customizing the cores for specific functionalities. While the dark silicon option obviously degrades performance, the customization option puts multicores on a potentially arduous path of increased effort for hardware design, verification, and test, and degraded programmability. The challenge that architects face is to design around the reality of the slowing of Dennard's scaling while avoiding either of the two harsh consequences (dark silicon, or the increased cost/effort of customized core design).

This project addresses the above challenge by pursuing an alternative, gentle (i.e., non-arduous) path for multicore scaling, while remaining within the power envelope imposed by the slowing of Dennard's scaling. The design employs successive frequency unscaling, where all the cores are kept powered and run at successively slower clocks every generation to stay within the power budget. An analytical model (developed as part of this project) for the performance of systems with and without successive frequency unscaling makes the surprising prediction that despite considerably slower clocks in later generations (e.g., sub-GHz), successive frequency unscaling would exceed the dark silicon performance limit. The key research goal of this project is to validate the predictions of the model with real applications and detailed system simulation. Validating an alternative, gentle path for multicore scaling has the potential to offer significant benefits for the microprocessor and computer industry. Beyond the research impacts, the project's integration of education components in both graduate and undergraduate curricula helps expand its educational impact.
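The intuition behind that prediction can be illustrated with a toy throughput calculation. This is a minimal sketch and not the project's analytical model; it rests on assumptions that are not in the text above: per-core power is taken to be proportional to clock frequency at a fixed supply voltage, static (leakage) power is ignored, memory latency is assumed independent of the core clock, and memory bandwidth is assumed never to be the bottleneck. The function names, parameters, and numbers are all hypothetical.

```python
def dark_silicon_throughput(n_cores, budget_cores, compute_ns, memory_ns):
    """Work items per ns when only `budget_cores` cores run at full clock.

    budget_cores: how many full-speed cores the power budget can feed
                  (hypothetical parameter for this sketch).
    compute_ns:   per-item time spent computing at full clock.
    memory_ns:    per-item time spent waiting on memory (clock-independent).
    """
    active = min(n_cores, budget_cores)
    return active / (compute_ns + memory_ns)


def unscaled_throughput(n_cores, budget_cores, compute_ns, memory_ns):
    """Work items per ns when all cores run at a clock slowed to fit the budget.

    Assumes power per core is proportional to frequency, so powering
    n_cores instead of budget_cores requires slowing each clock by
    n_cores / budget_cores.
    """
    slowdown = max(1.0, n_cores / budget_cores)
    # Only the compute portion stretches with the slower clock.
    return n_cores / (compute_ns * slowdown + memory_ns)


if __name__ == "__main__":
    # Illustrative numbers only: 64 cores, power for 16 at full clock.
    for memory_ns in (0.0, 3.0):
        dark = dark_silicon_throughput(64, 16, 1.0, memory_ns)
        slow = unscaled_throughput(64, 16, 1.0, memory_ns)
        print(f"memory_ns={memory_ns}: dark={dark:.2f} unscaled={slow:.2f}")
```

Under these assumptions the two options tie for a purely compute-bound workload (memory_ns = 0), while for a memory-bound workload the slower, fully powered chip comes out ahead because memory stall time does not stretch with the clock. Whether the effect holds with real applications, leakage power, and finite memory bandwidth is what the detailed simulation is meant to establish.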

Project Report

The slowing down of Dennard's scaling of the supply voltage imposes power limits that can potentially choke future multicore performance. The prevalent response to this slowdown is to keep an increasing number of cores powered down over technology generations (i.e., dark silicon) for lack of power, but this option imposes a low upper bound on performance. Our goals were (1) to study the dark silicon effect on multicores (i.e., the number of cores increases over technology generations but the total power budget remains constant, so an increasing number of cores have to be turned off to stay within the power budget), (2) to propose alternative ways to cope with the effect, (3) to educate graduate students in multicore power-performance issues, and (4) to disseminate the findings to the computer architecture community.

We proposed and evaluated two ideas. First, we proposed a general evolutionary path for multicores, called successive frequency unscaling (SFU). SFU keeps significantly more cores powered (compared to the option of keeping them "dark"), running at clock frequencies on the extended Pareto frontier that are successively lowered every generation to stay within the power budget. The higher active core count enables more memory-level parallelism, which non-linearly offsets the slower clock and yields more performance than the dark-silicon bound. Second, we proposed and evaluated a specific microarchitectural power optimization called PreTrans to reduce the power dissipation in processors.

Using detailed full-system simulations, the project's findings are as follows. For memory-intensive workloads (on-line transaction processing, Apache web server, and SPECjbb), full SFU, where all the cores are powered up, performs 46% better than the dark-silicon bound at the 11 nm technology node. For enterprise workloads where both throughput and response times are important, we employ controlled SFU (C-SFU), which moderately slows down the clock and powers many, but not all, cores to achieve 21% better throughput than the dark-silicon bound at the 11 nm technology node. The higher throughput non-linearly reduces queuing delays and thereby compensates for the slower clock, so that C-SFU's total response latency is within +/- 10% of that of a configuration which uses full-speed cores. We designed a simple pretranslation predictor that is correct 75% of the time for the ARM architecture and 52% of the time for the x86 architecture on data translation, and more than 99% of the time on instruction translation for both ARM and x86 architectures. The predictor enables a significant reduction in the number of accesses to highly associative TLBs, thus saving 90% and 85% of TLB energy, on average, for ARM and x86, respectively.

Our results show that SFU and PreTrans are simple yet effective schemes that enable a viable, evolutionary path of higher performance within the usual constant power budget for multicores with higher core counts at virtually no design effort or complexity. This path is attractive for the microprocessor industry. We have trained two graduate student research assistants in multicore power-performance issues and computer architecture principles. We have published PreTrans in ISLPED 2013. We have prepared a technical report on SFU and are in the process of submitting a paper on SFU to a computer architecture journal. Our results will likely be of value to the VLSI community in circuit design for future technology generations. Finally, our results may help the computer hardware industry continue to provide higher performance over technology generations while staying within a constant chip power budget. Such products will be of value to society at large.
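The queuing argument in the report (higher throughput capacity reducing queuing delay enough to offset a slower clock) can be illustrated with a textbook M/M/1 queue, whose mean response time is 1 / (service rate - arrival rate). This is a minimal sketch under assumptions that are not from the report: requests are split evenly across cores, each core behaves as an independent M/M/1 queue, and the core counts, service times, and loads are made-up numbers chosen only to show the shape of the trade-off.

```python
def mm1_response_time(total_arrival_rate, n_cores, service_time):
    """Mean response time of one core modeled as an M/M/1 queue.

    total_arrival_rate: requests per ms arriving at the whole chip,
                        assumed to be split evenly across cores.
    service_time:       mean ms a core needs to serve one request.
    """
    per_core_arrivals = total_arrival_rate / n_cores
    service_rate = 1.0 / service_time
    if per_core_arrivals >= service_rate:
        raise ValueError("queue is unstable: arrivals meet or exceed service rate")
    return 1.0 / (service_rate - per_core_arrivals)


if __name__ == "__main__":
    # Hypothetical configurations: 16 full-speed cores (4 ms per request)
    # versus 32 slower cores (5 ms per request).
    for load in (0.32, 3.2):  # total requests per ms
        fast = mm1_response_time(load, 16, 4.0)
        slow = mm1_response_time(load, 32, 5.0)
        print(f"load={load}: full-speed={fast:.2f} ms, slowed={slow:.2f} ms")
```

In this toy setup the full-speed configuration responds faster at light load, while at heavy load the configuration with more, slower cores responds faster because each of its queues is less congested. The +/- 10% figure reported above comes from the project's full-system simulations, not from this sketch.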

Budget Start: 2012-07-15
Budget End: 2014-03-31
Fiscal Year: 2012
Total Cost: $100,000
Name: Purdue University
City: West Lafayette
State: IN
Country: United States
Zip Code: 47907