Assuring the deadlines of embedded tasks on contemporary multicore architectures is becoming increasingly difficult. Real-time scheduling relies on task migration to exploit multicores, yet migration reduces timing predictability due to cache warm-up overheads and increased interconnect traffic.
This work promotes a fundamentally new approach to increasing the timing predictability of multicore architectures under task migration in embedded environments, making three major contributions:
1. The development of novel strategies to guide migration based on cost/benefit tradeoffs, exploiting both static and dynamic analyses (a decision sketch follows this list).
2. The design of mechanisms to increase timing predictability under task migration, providing explicit support for proactive and reactive real-time data movement across cores and their caches.
3. The promotion of rate- and bandwidth-adaptive mechanisms as well as monitoring capabilities to increase predictability under task migration.
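To make the cost/benefit tradeoff of contribution 1 concrete, the following minimal sketch (in C) shows one way a scheduler could weigh a statically derived benefit bound against dynamically measured migration costs. Every type, field, and threshold here is an illustrative assumption, not the project's actual interface.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-migration estimates, in processor cycles. */
    typedef struct {
        uint64_t cache_warmup_cost;   /* dynamic: refilling caches on the target core */
        uint64_t interconnect_cost;   /* dynamic: traffic for moving task state       */
        uint64_t schedulability_gain; /* static: slack recovered on the source core   */
    } migration_estimate_t;

    /* Migrate only when the statically bounded benefit exceeds the measured
     * costs by a safety margin, preserving the task's deadline guarantees. */
    static bool should_migrate(const migration_estimate_t *e, uint64_t margin)
    {
        uint64_t cost = e->cache_warmup_cost + e->interconnect_cost;
        return e->schedulability_gain > cost + margin;
    }

In the project's terms, a static analysis could bound the gain term offline, while lightweight runtime monitors supply the two cost terms.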
The work aims to initiate a novel research direction investigating the benefits of interaction between hardware and software for embedded multicores with respect to timing predictability. The project also contributes fundamentally to the research and educational infrastructure for the design and development of safety- and mission-critical embedded systems.
Many embedded systems use multicore and many-core processors. Communication among these cores occurs over on-chip interconnects, and the delay, power, and energy of the interconnect design have a significant impact on the overall system. This project aims to reduce the power and energy consumption of the on-chip interconnect and to improve its performance. During this report period, we proposed two techniques.

(1) NoC-Sprinting, which reduces the thermal impact and energy consumption of the network-on-chip. To maintain a constant power envelope, the fraction of a silicon chip that can be operated at full frequency drops exponentially with each generation of process technology. Consequently, a large portion of the silicon chip becomes dark or dim silicon, i.e., either idle or significantly under-clocked. Most previous work focuses on energy-efficient core/cache design and neglects the impact of the on-chip interconnect. In fact, the network-on-chip (NoC) plays a vital role in message passing and memory access, directly influencing the overall performance of many-core processors. Moreover, network components dissipate 10%-36% of total chip power. Interconnection network design is therefore critical to tackling the challenges of multicore scaling in the dark-silicon age.

Recently, the concept of computational sprinting was proposed: a chip improves its responsiveness to short bursts of computation by temporarily exceeding its sustainable thermal design power (TDP) budget. All cores are operated at the highest frequency/voltage to provide instant throughput during sprinting, after which the chip must return to single-core nominal operation to cool down. While this mechanism sheds light on how "dark" cores can be utilized for transient performance enhancement, it exposes two major design issues. First, the role of the interconnect is neglected. NoCs consume a significant portion of chip power when all cores are sprinting, and when the chip switches back to nominal mode, only a single core is active, yet the network routers and links cannot be completely powered down: a gated-off node would block packet forwarding and access to local but shared resources (e.g., cache and directory). As a result, the ratio of network power to chip power rises substantially, and NoC power may even exceed that of the single active core. Second, mode switching lacks flexibility, offering only two options: nominal single-core operation and maximum all-core sprinting. Depending on workload characteristics, an intermediate number of active cores may provide the optimal performance speedup with less power dissipation.

To address these two issues, we propose fine-grained sprinting, in which the chip can selectively sprint to any intermediate stage instead of directly activating all cores in response to short bursts of computation. The optimal number of cores depends on application characteristics: scalable applications may opt for many cores to support highly parallel computation, whereas largely sequential applications would rather execute on a few. Fine-grained sprinting can thus flexibly adapt to a variety of workloads. In addition, landing on an intermediate sprinting stage saves chip power and slows the heating process by power-gating the remaining inactive on-chip resources, sustaining a longer sprint for better system performance. The sketch below illustrates one possible selection policy.
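To illustrate how an intermediate sprinting stage might be chosen, this minimal sketch (in C) selects a core count from an Amdahl's-law speedup estimate and a linear per-core power model. Both models and all constants are simplifying assumptions for exposition, not the policy evaluated in this work.

    #include <stdio.h>

    #define MAX_CORES 16

    /* Amdahl's-law speedup for a workload with parallel fraction f. */
    static double speedup(double f, int cores)
    {
        return 1.0 / ((1.0 - f) + f / (double)cores);
    }

    /* Greedy sprint-level selection: accept a larger core count only when it
     * beats the incumbent's speedup by at least 5% and still fits within the
     * sprint power budget, so marginal gains do not cost extra power and heat. */
    static int pick_sprint_level(double parallel_fraction,
                                 double core_power_w, double budget_w)
    {
        int best = 1;
        double best_s = speedup(parallel_fraction, 1);
        for (int n = 2; n <= MAX_CORES; n++) {
            if (n * core_power_w > budget_w)
                break;                  /* would exceed the sprint budget */
            double s = speedup(parallel_fraction, n);
            if (s > best_s * 1.05) {
                best_s = s;
                best = n;
            }
        }
        return best;
    }

    int main(void)
    {
        /* A mostly sequential workload sprints on fewer cores ... */
        printf("f=0.30 -> %d cores\n", pick_sprint_level(0.30, 2.0, 40.0));
        /* ... than a highly scalable one. */
        printf("f=0.95 -> %d cores\n", pick_sprint_level(0.95, 2.0, 40.0));
        return 0;
    }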
(2) NoC-Delta, which compresses the information transferred over the network so that energy is saved and performance improved. Data are encoded prior to packet injection and decoded before ejection in the network interface: a packet is stored in the network-on-chip as a common base value plus an array of relative differences (a sketch of the encoding closes this section). The decreased network load improves overall network performance and yields energy savings. Moreover, the scheme requires no modification of the cache storage design and can be seamlessly integrated with any other optimization technique for the on-chip interconnect. Both techniques have been submitted to conferences and were presented by students at the Design Automation Conference and the Asia and South Pacific Design Automation Conference.
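As promised above, here is a minimal sketch of the base-plus-delta encoding performed at the network interface. The packet layout, word count, and field names are illustrative assumptions, not the actual NoC-Delta wire format.

    #include <stdbool.h>
    #include <stdint.h>

    #define WORDS_PER_PACKET 16

    typedef struct {
        uint32_t base;                     /* common base value          */
        int8_t   delta[WORDS_PER_PACKET];  /* small relative differences */
    } delta_packet_t;

    /* Encode before injection: succeeds only when every word lies within
     * an 8-bit signed delta of the first word. */
    static bool encode_delta(const uint32_t *words, delta_packet_t *out)
    {
        out->base = words[0];
        for (int i = 0; i < WORDS_PER_PACKET; i++) {
            int64_t d = (int64_t)words[i] - (int64_t)out->base;
            if (d < INT8_MIN || d > INT8_MAX)
                return false;       /* fall back to the raw packet */
            out->delta[i] = (int8_t)d;
        }
        return true;                /* 64 bytes shrink to 4 + 16 = 20 */
    }

    /* Decode before ejection; cache storage itself is untouched. */
    static void decode_delta(const delta_packet_t *in, uint32_t *words)
    {
        for (int i = 0; i < WORDS_PER_PACKET; i++)
            words[i] = in->base + (uint32_t)(int32_t)in->delta[i];
    }

Under these assumptions, a 64-byte payload whose words cluster near a common base travels as 20 bytes, while packets spanning a wider value range are simply sent uncompressed.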