All of computing today relies on an abstraction in which software expects the hardware to behave flawlessly for all inputs under all conditions. However, for emerging circuits and devices, the cost of maintaining this abstraction of flawless hardware will be prohibitive due to increasing variations, and we may need to rethink the correctness contract between hardware and software.
The primary focus of the project is application robustification: fundamental algorithmic methodologies to transform arbitrary applications so that they can continue to make forward progress in spite of errors produced by the hardware. In this project, our preliminary research effort is focused on a) techniques to convert different classes of application kernels into robust, efficiently solvable stochastic optimization problems that can tolerate hardware errors, b) techniques based on Krylov subspace methods, gradient projection, quasi-Newton approaches, stochastic approximation theory, preconditioning, and intelligent step sizing to reduce the cost of robustness for different forms of hardware variation, and c) low-overhead checksum-based techniques for robustifying sparse linear algebra libraries and graph algorithms. Broader impacts of this project include the development of a potentially promising approach to continue riding Moore's Law and the training of students in both the hardware and software aspects of computing in the face of errors. Broader educational impact will also be achieved through research artifacts (e.g., a library of error-tolerant kernels) that will be made available for research and education.
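As one illustration of the checksum idea behind c), the classic algorithm-based fault tolerance invariant for a matrix-vector product can be verified with a single precomputed checksum vector, at the cost of one extra dot product rather than a full recomputation. The sketch below is a minimal, hypothetical illustration; the dense test matrix, function names, and tolerance are assumptions for exposition, not the project's actual library code:

```python
import numpy as np

def make_checksum(A):
    # Precompute the column-sum checksum vector w = 1^T A once, offline.
    return A.sum(axis=0)

def checked_matvec(A, x, w, tol=1e-8):
    # Compute y = A x, then verify the scalar invariant sum(y) == w . x.
    # A mismatch flags a fault in the matvec without recomputing A x.
    y = A @ x
    ref = w @ x
    ok = abs(y.sum() - ref) <= tol * (1.0 + abs(ref))
    return y, ok

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
x = rng.standard_normal(50)
w = make_checksum(A)

y, ok = checked_matvec(A, x, w)          # fault-free run: check passes

y_bad = y.copy()
y_bad[7] += 1.0                           # simulate a silent corruption
ref = w @ x
ok_bad = abs(y_bad.sum() - ref) <= 1e-8 * (1.0 + abs(ref))  # check fails
```

The same invariant extends to sparse matrices, where the checksum vector is precomputed from the stored nonzeros and amortized across many matvecs.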
As late-CMOS process scaling leads to increasingly variable circuits and logic, and as most post-CMOS technologies in sight appear to have largely stochastic characteristics, hardware reliability has become a first-order design concern. This research focuses on making applications robust to hardware errors. One major goal of this project was the investigation of general methodologies for taking arbitrary applications and converting them into more error-resilient forms. Another major goal was the investigation of application-specific methodologies for improving error resilience by leveraging inherent application and algorithm characteristics (e.g., natural error resilience, spatial and temporal reuse, and fault containment). These approaches include a) application-specific techniques for low-overhead fault detection, which constituted the first work to focus on algorithmic error tolerance and were shown to reduce the performance overhead of traditional checks by up to 2x, b) an algorithmic approach to error correction using localization, which was shown to improve the performance of the conjugate gradient solver by 3x-4x and to increase the probability that the solver completes successfully within a fixed iteration limit by up to 60% as fault rates increased, and c) a numerical optimization-based methodology for converting applications into a more error-tolerant form, which was shown to be applicable to a large class of applications: since linear programming is P-complete and can be implemented in this way, any application in class P may, in principle, employ the proposed approach. This numerical optimization-based approach showed significant robustness benefits over the baseline; for example, the optimization formulation achieved 100% correctness even at error rates of up to 50%. For some application classes, the approach was shown to provide more than an order of magnitude in energy savings by exploiting hardware power/accuracy tradeoffs.
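To make the self-correcting behavior of such optimization-based formulations concrete, the sketch below solves a linear system by gradient descent and injects a large transient error midway through the iteration; because every subsequent step moves the iterate back toward the minimizer, the error is absorbed rather than propagated. The system size, step-size choice, and fault model here are illustrative assumptions only, not the project's actual formulation:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((20, 20))
A = M.T @ M + 20.0 * np.eye(20)      # symmetric positive definite system
b = rng.standard_normal(20)
x_true = np.linalg.solve(A, b)       # reference solution for comparison

# Gradient descent on f(x) = 0.5 x^T A x - b^T x, whose minimizer solves A x = b.
x = np.zeros(20)
step = 1.0 / np.linalg.norm(A, 2)    # safe fixed step size (< 2 / lambda_max)
for k in range(2000):
    x = x - step * (A @ x - b)       # one gradient step
    if k == 500:
        x[3] += 10.0                 # inject a large transient error mid-run

err = np.linalg.norm(x - x_true)     # iteration recovers despite the fault
```

The key property is that the iterate, not any intermediate state, carries the computation forward: a corrupted iterate is just a worse starting point for the remaining iterations, so correctness degrades gracefully into extra iterations rather than into a wrong answer.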
This research shows that application- and algorithm-awareness can significantly increase the robustness of computing systems while also reducing the cost of meeting reliability targets.