As technology scaling is coupled with increasing power densities, modern processors are prone to both soft errors and hard errors. The reliability analysis of multi-threaded processors, e.g., Simultaneous Multithreading (SMT) and Chip-Multiprocessors (CMP), where inter-thread resource contention exists, is a relatively unexplored area. Furthermore, the modeling complexity is exacerbated by two additional factors: (1) the increasing number of cores on a chip; and (2) the heterogeneity introduced by manufacturing process variation. On the software side, traditional compiler designs aim at high performance and, more recently, low power when generating object code. With increasing hardware vulnerabilities, however, high-performance computing programs suffer from unexpected errors and exceptions. Fault-tolerance techniques such as error detection and checkpointing can mitigate these failures, but they still ultimately hurt performance. In addition to relying on a reliable hardware platform, software designers can further improve system reliability by generating error-resilient code. Moreover, the analysis of software's architectural vulnerability is still at an ad hoc stage. Therefore, this project proposes a predictive framework that addresses the above challenges by employing modern statistical and machine learning methods. The outcomes of this project include a predictive framework that guides reliable software and hardware optimization, together with its applications to high-performance computing.
The broader impact plans include outreach activities and the training of undergraduate and graduate students. The interdisciplinary nature of the proposed work allows students to learn cutting-edge knowledge from different areas, broadening the scope of their training and enhancing their productivity. Students from under-represented groups will be encouraged to join the project and will be given priority in doing so.