Future integrated circuits will contain tens, hundreds, or even thousands of cores per chip. However, the technology downscaling that makes this possible may also make the underlying hardware less reliable because of an increasing number of defects and wear-out mechanisms. Reliability is therefore one of the major problems facing the design of multiprocessor systems-on-chip. Because either the cores or the network-on-chip (used for communication between the cores) can become a reliability bottleneck for these systems, it is imperative that reliability be addressed in a unified manner. To address this challenge, this research develops a novel unified theoretical framework for lifetime reliability modeling. The framework uses efficient Monte Carlo methods and treats multiprocessor systems-on-chip as a combination of computation and communication units. The goal of this research is to develop new dynamic reliability management techniques based on dynamic voltage and frequency scaling and application remapping. Drawing on control theory concepts, these techniques proactively improve the lifetime reliability of multicore systems.
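As a rough illustration of what a Monte Carlo lifetime estimate over computation and communication units can look like, the sketch below samples a time-to-failure for every core and every router and records when the first one fails. It is a minimal sketch under stated assumptions, not the framework developed in this project: the Weibull lifetime distribution, the series-system rule (the chip fails at its first component failure), and all function names and parameter values are hypothetical choices made for the example.

```python
import random


def sample_system_lifetime(rng, n_cores, n_routers, core_scale, router_scale, shape):
    """Draw one system lifetime (arbitrary time units).

    Assumption: each core and router fails independently with a Weibull
    time-to-failure, and the system fails as soon as any unit fails.
    """
    core_ttf = [rng.weibullvariate(core_scale, shape) for _ in range(n_cores)]
    router_ttf = [rng.weibullvariate(router_scale, shape) for _ in range(n_routers)]
    return min(core_ttf + router_ttf)


def estimate_mttf(n_trials, n_cores, n_routers, core_scale, router_scale,
                  shape, seed=0):
    """Estimate mean time to failure by averaging sampled system lifetimes."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    total = 0.0
    for _ in range(n_trials):
        total += sample_system_lifetime(
            rng, n_cores, n_routers, core_scale, router_scale, shape
        )
    return total / n_trials


if __name__ == "__main__":
    # Hypothetical 4-core chip with 4 routers; the scales are illustrative only.
    mttf = estimate_mttf(20000, n_cores=4, n_routers=4,
                         core_scale=10.0, router_scale=20.0, shape=1.0)
    print(f"Estimated system MTTF: {mttf:.3f}")
```

Because cores and routers are sampled in the same trial, the estimate reflects whichever class of unit fails first, which is the sense in which either one can become the reliability bottleneck. A reliability management policy could be explored in such a sketch by changing the per-unit scale parameters, for example to represent the reduced stress of a lower voltage and frequency setting; that mapping is not specified here and would itself be an assumption.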
The proposed dynamic reliability management techniques enable the development of more reliable multiprocessor systems-on-chip, which have a dramatic impact on society through applications ranging from entertainment and gaming to bio-engineering, military, and space. More broadly, the results of this project significantly impact the design of future integrated systems by advancing the understanding of the tradeoffs between reliability, a new design concern, and the traditional objectives of power consumption, performance, and area.