The Duke FaultFinder Project seeks to provide the first hardware mechanisms for dynamically verifying the correctness, rather than just necessary properties, of shared memory multiprocessor systems. The memory consistency model defines correctness for such a design. FaultFinder will dynamically detect violations of the specified memory consistency model, which is the highest level of error detection possible in hardware. FaultFinder mechanisms will detect hardware errors at the system level (e.g., a violation of consistency), unlike existing schemes that detect only localized errors (e.g., a bit flip in a message). Combining FaultFinder error detection with existing hardware mechanisms for checkpoint/recovery of shared memory multiprocessor systems enables the system to guarantee correct behavior.
As society has come to rely increasingly on computer systems for important infrastructure, computer engineers have not correspondingly improved the ability to detect faults in these systems. While recent advances in hardware checkpoint/recovery have improved computer system availability, a recovery mechanism can recover only from errors that are detected. Currently, computer systems cannot detect whether a memory system is behaving correctly. The Duke FaultFinder Project seeks to provide the first hardware mechanisms for comprehensive error detection in computer systems. Achieving this goal would provide a qualitative benefit to a society that depends on computer availability.