CPA-CSA: Accelerating Architectural-level, Full-system Multiprocessor Simulations using FPGAs

Hoe, James

Abstract

The programmability challenges of general-purpose parallel computing call for computer hardware and software researchers to work together on solutions. However, the unacceptably slow speed of current full-system simulators limits multiprocessor and multicore research.

The proposed ProtoFlex project develops an FPGA-accelerated simulation technology that delivers the simulation performance needed for full-scale software research on top of simulated experimental architectures. ProtoFlex simulators are not FPGA prototypes. The ProtoFlex simulation architecture relies on hardware virtualization to achieve full-system fidelity and system scalability while mitigating the complexities associated with conventional FPGA prototyping. This project will develop a hybrid simulation with transplanting and with interleaving of multiple processor contexts, with the goal of decoupling the complexity of the hardware construction from the complexity of a very large target multiprocessor system. This project will also investigate in-hardware techniques for real-time, deep instrumentation and analysis of simulation events.

Project Report

Background

In the ProtoFlex project, we developed a practical approach to accelerate the simulation of a multicore processor using FPGAs. The ProtoFlex approach centers on the concept of 'virtualization' to decouple the complexity of the emulated multicore target from the implementation complexity of its FPGA-accelerated simulator. In the current effort presented below, we extended the ProtoFlex approach by expanding our simulation capabilities and addressing new challenges in simulating the "uncore" portion of the multicore.

Instrumenting FPGA-based Simulations [CPN09]

The ProtoFlex simulation architecture offers significant advantages over conventional software-based simulation tools when high-performance, fine-grain instrumentation is necessary. While conventional software-based full-system simulators suffer dramatic slowdowns (10x or more) under such instrumentation, FPGAs offer a parallel substrate for carrying out the monitoring activity with minimal interference to instruction execution performance. We developed new monitoring tools on top of the baseline ProtoFlex FPGA-accelerated simulator (which models a 16-way UltraSPARC III symmetric multiprocessor server on a single FPGA). The first form of instrumentation provides computer architects with the means to simulate realistic CMP functional cache configurations for the purposes of performance monitoring. In addition, such a tool can also be used to provide checkpointing of state for accelerated sampling-based cycle-accurate simulations. We demonstrated that such instrumentation can be carried out at very high speeds with minimal impact on the baseline simulator performance (less than 10% overhead).

Accelerating Network-on-Chip Simulation [PHO11]

We developed an FPGA-accelerated uncore simulation framework to be used in conjunction with the ProtoFlex core models to support uncore design evaluation and exploration using realistic full-system application workloads.
We have investigated a new hybrid analytical network-on-chip (NoC) simulation approach that delivers a desired level of accuracy at high throughput and low implementation cost. At an abstract level, any type of interconnection network, regardless of topology, can be decomposed into a set of routers connected by a set of links. At this abstract level, all buffering and logic within the network is contained inside the network routers, each of which can be viewed as a black box with a set of input and output ports. We borrow an idea from analytical network modeling in which, for a given configuration and traffic pattern, each network router is modeled only as a set of delay-load curves. These delay-load curves, obtained by off-line training using a high-fidelity software-based network simulator, relate the injection load at the input ports of a network router to the average latency of a packet going through that router. After a packet is injected at the edge of the network, it is processed by routing logic that determines which routers the packet traverses. The affected routers are updated to reflect the increased load caused by the packet. The packet is then assigned a network latency by summing the estimated delays returned by the affected routers. We have obtained good correspondence between this analytical approach and a traditional high-fidelity network simulation in determining packet latency and network hotspots under different loads and traffic patterns. We have implemented both software and FPGA-accelerated versions of this hybrid analytical simulation model. Our experiments show that the performance and accuracy of this hybrid analytical NoC simulation are well suited to supporting system-level multicore performance studies.
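
The packet-latency estimation steps described above can be illustrated with a minimal software sketch. It is not the project's implementation. The 2D mesh topology, the dimension-ordered (XY) routing, the sliding-window load estimate, the piecewise-linear curve representation, and every name and number in it (`CURVE`, `Router`, `MeshLatencyEstimator`, the window of 100 cycles) are assumptions made for illustration. In the approach described above, the curve points would come from off-line training against a high-fidelity network simulator.

```python
from bisect import bisect_right
from collections import deque

# Hypothetical delay-load curve: (load in packets/cycle, average router latency in cycles).
# These numbers are invented; real curves would be obtained by off-line training.
CURVE = [(0.0, 3.0), (0.2, 4.0), (0.4, 8.0), (0.5, 20.0)]


def curve_delay(curve, load):
    """Look up a delay by piecewise-linear interpolation, clamped at both ends."""
    loads = [point[0] for point in curve]
    if load <= loads[0]:
        return curve[0][1]
    if load >= loads[-1]:
        return curve[-1][1]
    i = bisect_right(loads, load)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (load - x0) / (x1 - x0)


class Router:
    """Black-box router: tracks recent load and returns a delay from its curve."""

    def __init__(self, curve, window=100):
        self.curve = curve
        self.window = window      # assumed sliding-window length in cycles
        self.arrivals = deque()   # cycle stamps of recently seen packets

    def record(self, cycle):
        self.arrivals.append(cycle)

    def load(self, cycle):
        # Drop packets that fell out of the window, then average over the window.
        while self.arrivals and self.arrivals[0] <= cycle - self.window:
            self.arrivals.popleft()
        return len(self.arrivals) / self.window

    def delay(self, cycle):
        return curve_delay(self.curve, self.load(cycle))


def xy_route(src, dst):
    """Routers visited under XY routing on a mesh, endpoints included."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


class MeshLatencyEstimator:
    def __init__(self, width, height, curve=CURVE, window=100):
        self.routers = {(x, y): Router(curve, window)
                        for x in range(width) for y in range(height)}

    def inject(self, cycle, src, dst):
        """Update the load of every traversed router, then sum their delays."""
        path = xy_route(src, dst)
        for node in path:
            self.routers[node].record(cycle)
        return sum(self.routers[node].delay(cycle) for node in path)


if __name__ == "__main__":
    noc = MeshLatencyEstimator(4, 4)
    print("lightly loaded:", noc.inject(0, (0, 0), (2, 1)))
    for cycle in range(1, 50):
        latency = noc.inject(cycle, (0, 0), (2, 1))
    print("heavily loaded:", latency)
```

In this sketch, repeated injections along the same path raise the load seen by the shared routers, so later packets receive a larger estimated latency. That is how a model of this kind can expose hotspots without simulating router internals cycle by cycle.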

Cycle-Accurate NoC Models for FPGA Accelerated Simulation [Pap11]

In addition to the hybrid analytical model above, we developed a parameterized generator of register-transfer-level NoC models that can be synthesized for FPGA-accelerated cycle-accurate performance simulation. The NoC simulator generator accepts common router design parameters as well as arbitrary topologies and routing algorithms. For small NoCs that can be mapped directly onto an FPGA, a flattened NoC model can achieve extremely high speedup (three orders of magnitude) relative to the reference cycle-accurate software model. For very large NoC designs, the generated NoC model supports resource virtualization, in which the FPGA's logic resources are time-multiplexed among many logical router models to extend beyond the FPGA's mapping capacity.

Bibliography

[ChH10] E. S. Chung and J. C. Hoe, "High-level Design and Validation of the BlueSPARC Multithreaded Processor," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Volume 29, Issue 10, pp. 1459-1470, October 2010.
[CPN09] E. S. Chung, M. K. Papamichael, E. Nurvitadhi, J. C. Hoe, B. Falsafi and K. Mai, "ProtoFlex: Towards Scalable, Full-System Multiprocessor Simulations Using FPGAs," ACM Transactions on Reconfigurable Technology and Systems (TRETS), Volume 2, Issue 2, June 2009.
[Pap11] M. K. Papamichael, "Fast Scalable FPGA-Based Network-on-Chip Simulation Models," Proc. Formal Methods and Models for Codesign (MEMOCODE), July 2011.
[PHO11] M. Papamichael, J. C. Hoe and O. Mutlu, "FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations," Proc. International Symposium on Networks-on-Chip (NOCS), May 2011.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 0811702
Program Officer: Ahmed Louri

Project Start
Project End
Budget Start: 2008-07-15
Budget End: 2012-06-30
Support Year
Fiscal Year: 2008
Total Cost: $314,000
Indirect Cost

Institution

Carnegie-Mellon University, Pittsburgh, PA, United States
