This research aims to overcome the extreme challenges involved in realizing a 1000-core (kilocore) processor. Processors with tens of cores are already in commercial products today. A kilocore processor could take us into the era of Server-on-Chip and Supercomputer-on-Chip. The on-chip network is the medium through which two nodes in a processor communicate, and it therefore constitutes the backbone of a kilocore processor. Unfortunately, current on-chip network solutions are inadequate: they do not scale in either power or performance beyond a few tens of cores. To reach the ambitious design goal of 1000+ cores within realistic power budgets, the interconnect technology needs to be at least 15 times more power efficient while providing at least the same throughput per core as today.

This project investigates three interrelated solutions to meet the above challenge in an evolutionary manner: (1) developing a low-power, energy-proportional interconnect architecture that employs a larger number of narrower networks, (2) using high-radix Swizzle-Switches as the building blocks for interconnecting the multiple networks, and (3) re-designing the network architecture with multiple networks and Swizzle-Switches using 3D integration with Through-Silicon-Vias to achieve scalability beyond 1000 cores. This project will demonstrate the feasibility of kilocore processors. If such processors can be built, they could have a tremendous impact on future exascale systems such as cloud computing servers and HPC systems, with applications including drug discovery, defense, information analysis, and social networking.

Project Report

Advancements in processor technology have enabled diverse applications of computing that have significantly influenced many facets of society, including health care and medical science, information technology, large-scale scientific research, human quality of life, finance and commerce, and the transportation and manufacturing industries. A number of societal projections and industrial roadmaps are driven by the expectation that these rates of improvement will continue. For the past three decades, Moore's Law (the doubling of transistors on a chip every 18 months) has been the cardinal driver of advances in processor technology. However, voltage scaling began to stagnate after the 180nm technology node, and as a result, accelerating single-core performance has become prohibitive from a power perspective. Designers have instead resorted to increasing the number of cores per die as a power-efficient approach to throughput scaling, leading to the dawn of the multicore era.

The paradigm shift towards manycore designs has led to renewed interest in interconnect design. Interconnects play a dominant role in shaping the power and performance profiles of manycore processors designed in deep submicron technologies. Without a good interconnect fabric, a manycore chip faces problems similar to the traffic chaos of a large city without a proper roadway infrastructure. The trend towards integrating hundreds of cores onto the same chip further accentuates the importance of on-chip interconnect design. In this project we focus on scaling networks for on-chip communication in manycore systems with up to 1000 cores, which will be the critical enabler for the exascale supercomputers and cloud computing platforms of the future.

To accomplish this ambitious goal, we must overcome several challenges, the largest of which concerns energy and power. As the number of cores increases, sustaining the current per-core bandwidth becomes a significant challenge given the super-linear increase in network power consumption; the interconnect power projected by scaling up existing network designs far exceeds any practical power budget. A low-power interconnect design is therefore imperative to the realization of a 1000-core processor. While meeting the power constraint is important, it is equally necessary to meet the performance goals in terms of network bandwidth and latency. We adopt a power-driven approach to scaling on-chip interconnects while meeting the bandwidth and latency targets necessary in high-performance processors.

Our research investigated two interrelated thrusts to scale the interconnect: (1) high-radix Swizzle-Switches and (2) fine-grained power gating to achieve energy proportionality. Existing on-chip network topologies require a large number of on-chip routers, and the performance and power cost of these routers becomes prohibitive as we scale up to a manycore system with hundreds of cores. Consolidating these routers into a few large but efficient high-radix switches is a potential solution to this problem: each high-radix switch should provide connectivity for 32 to 64 cores as well as connectivity to other high-radix switches. Such a design is generally considered impractical due to the power and area complexity of these large switches. Along with our colleagues at Michigan, we developed a novel high-radix switch design called the Swizzle-Switch, which challenges this conventional notion and readily scales up to a radix of 64.
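To make the consolidation argument concrete, the back-of-envelope sketch below compares a conventional 2D mesh with a clustered design built from radix-64 switches for a 1024-core chip. The 32/32 split of switch ports between cores and inter-switch links, and the fully connected switch-to-switch topology, are illustrative assumptions for this sketch rather than the project's actual asymmetric topologies.

```python
# Back-of-envelope comparison (illustrative assumptions, not project results)
# of router count and hop count for a 1024-core chip built as a 2D mesh
# versus clusters of cores attached to high-radix switches.

CORES = 1024

# --- Baseline: 32x32 2D mesh, one low-radix router per core ----------------
mesh_dim = int(CORES ** 0.5)              # 32
mesh_routers = CORES                      # 1024 routers, one per core
mesh_avg_hops = 2 * mesh_dim / 3          # average Manhattan distance ~ 2n/3

# --- Clustered design (hypothetical split of a radix-64 switch) ------------
cores_per_switch = 32                     # assume 32 ports face cores
inter_switch_ports = 32                   # and 32 ports face other switches
num_switches = CORES // cores_per_switch  # 32 switches
# With 32 inter-switch ports and only 31 peers, every pair of switches can be
# directly linked, so any packet traverses at most two switches.
max_switch_traversals = 2

print(f"2D mesh    : {mesh_routers} routers, ~{mesh_avg_hops:.1f} average hops")
print(f"High-radix : {num_switches} switches, <= {max_switch_traversals} switch traversals")
```

The point of the comparison is that a handful of large switches replaces a thousand small routers and collapses the hop count, provided the large switch itself can be built efficiently, which is what the Swizzle-Switch design addresses.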
Using Swizzle-Switches as building blocks, we architected asymmetric high-radix topologies that scale on-chip networks to 1000 cores. As a natural next step, we designed Hi-Rise, a 3D high-radix switch that extends scalability from the radix of 64 supported by 2D switches to a radix of 96 at the same operating frequency. 3D integration is a promising technology for performance scaling in future systems, and Hi-Rise can serve as an efficient communication fabric for 3D systems.

Energy-proportional computing requires that systems consume proportionally less power when the computational demand is lower. Energy proportionality can be achieved for NoCs by power-gating unused network components, which reduces leakage power. Power gating is challenging, however, for a distributed system like a network-on-chip: even under low network load with only a few active flows, a majority of the routers must be kept active to service those flows, which sharply reduces the opportunities for power gating and leads to significant performance degradation. We observed that, unlike a traditional network-on-chip design, a design with multiple parallel, narrower sub-networks is more amenable to power gating because an entire sub-network can be turned off without compromising the connectivity of the network. To exploit this opportunity, we proposed the Catnap network architecture; a sketch of the sub-network gating idea appears below. To further enable power gating in NoCs, we also developed an architecture that steers packets away from sleeping components through dynamic topology and routing reconfiguration.
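The following minimal sketch illustrates the sub-network gating idea: one sub-network stays on to preserve connectivity, and additional sub-networks are woken or put to sleep based on buffer occupancy. The controller class, thresholds, and wake/sleep policy are simplified assumptions for illustration and do not reproduce Catnap's actual decision logic.

```python
# Minimal sketch of sub-network power gating in the spirit of Catnap,
# assuming a simple occupancy-threshold policy; the class, thresholds, and
# wake/sleep rules are illustrative assumptions, not the published mechanism.

class SubnetGatingController:
    def __init__(self, num_subnets=4, wake_threshold=0.7, sleep_threshold=0.2):
        self.num_subnets = num_subnets
        self.wake_threshold = wake_threshold    # occupancy that triggers waking another subnet
        self.sleep_threshold = sleep_threshold  # occupancy below which the top subnet may sleep
        self.active = 1                         # subnet 0 stays on to preserve connectivity

    def select_subnet(self, occupancy):
        """Pick a sub-network for a new packet and adjust how many are powered on.

        `occupancy` lists the current buffer occupancy (0.0-1.0) of each
        sub-network, e.g. as reported by per-subnet counters.
        """
        # Wake one more sub-network when every active one is heavily loaded.
        if all(o > self.wake_threshold for o in occupancy[:self.active]) \
                and self.active < self.num_subnets:
            self.active += 1                    # released from power gating

        # Gate off the highest-index sub-network when it is nearly idle.
        elif self.active > 1 and occupancy[self.active - 1] < self.sleep_threshold:
            self.active -= 1                    # whole subnet sleeps; network stays connected

        # Route the packet on the least-loaded active sub-network.
        loads = occupancy[:self.active]
        return loads.index(min(loads))

# Example: light load keeps one subnet on; a burst wakes an additional one.
ctrl = SubnetGatingController()
print(ctrl.select_subnet([0.1, 0.0, 0.0, 0.0]))   # -> 0, only subnet 0 active
print(ctrl.select_subnet([0.9, 0.0, 0.0, 0.0]))   # wakes subnet 1, returns 1
```

Gating at the granularity of a whole sub-network is what makes the decision simple: connectivity is never broken, so the controller only has to track aggregate load rather than per-router wake-up dependencies.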

Project Start:
Project End:
Budget Start: 2012-09-15
Budget End: 2014-08-31
Support Year:
Fiscal Year: 2012
Total Cost: $89,948
Indirect Cost:
Name: Regents of the University of Michigan - Ann Arbor
Department:
Type:
DUNS #:
City: Ann Arbor
State: MI
Country: United States
Zip Code: 48109