With the emergence of computer systems that have several processor cores on a single chip (called a "multicore chip") comes the promise of high performance never before experienced on a low-cost desktop computer. The availability of massive computing horsepower to everyday users would provide benefits in many domains, including scientific, consumer, and business, as more sophisticated applications could be created and used. Computationally intensive applications such as pharmaceutical development, scientific simulation, and financial forecasting could be run on a desktop computer, providing faster and cheaper access to the results of these vital applications. However, the small size of the transistors that will be used in future multicore chips with hundreds of processors will make the computer system exceptionally fragile. With small transistors, it becomes more difficult to avoid introducing defects and variations in operational characteristics during chip manufacturing. These defects and variations can cause hardware failures. This problem is particularly pronounced in transistor devices used to implement on-chip memory. Because a multicore chip will have hundreds of millions of memory transistors, any particular chip is likely not to operate correctly. Indeed, there will be few fully functional chips, and those will be expensive due to their scarcity. If future multicore chips are to attain their promise for low-cost desktop computing, the obstacles posed by failures in memory must be addressed.
This research proposes the new concept of "soft yield," where defects and operational variations remain during chip manufacture, but are virtually repaired after chip deployment. Based on soft yield, a novel approach, called Test and Continuous Adaptive Repair (T-CAR), is proposed to mitigate the impact of defects and operational variations in memory transistors. T-CAR plans for failed memory components by identifying the conditions that lead to failure and repairing the memory to account for those conditions. The approach makes repairs by reconfiguring the hardware while a software application runs, in a way that avoids harming application performance. The intellectual impact of this research will be to develop new test and repair algorithms, mechanisms for hardware repair, and models and metrics to evaluate the benefit of soft yield. The broader societal impact is to develop more capable, reliable, and lower-cost systems, which will lead to a new class of consumer, business, and scientific applications. The project will also train Ph.D. graduate students to serve as future educators, scientists, and engineers equipped to deal with this emerging reliability problem.