Adoption of each new generation of nano-scale technology is accompanied by lower yields, stagnant performance or increasing chip-to-chip performance variability, and decreasing robustness to environmental stress. The objective of this research is to identify new avenues for design, analysis, and testing to help compensate for these trends. The research takes a global view of the role of a module in the overall system architecture and focuses on two specific issues: (1) the impact of a fault in a module on overall system operation and performance, and (2) the ability to reconfigure one module to compensate for a fault in another module, preventing that fault from affecting the correct operation of any system or user task. Additional case studies will be conducted to further demonstrate that both these aspects of a global view significantly improve system yield and performance. A systematic approach will be developed to exploit such a global view to dramatically improve yield, performance, and robustness. A completely new framework (models, information, analysis, and algorithms) for assembling systems using faulty (and fault-free) components will also be developed.

The utilitarian gains to society of this project are likely to be substantial. First, without changing any existing design, the proposed analysis and test approaches will provide significant improvements in yield, performance, and robustness to soft errors. Second, the proposed analysis, design, test, and global compensation techniques will also help improve yields. Since the types of systems where this research is directly applicable include high-performance processors, the benefits provided will be amplified by the high price such processors fetch and the high volumes in which they are manufactured. Furthermore, since this research is orthogonal to much of the ongoing research on improving yield and performance, the improvements it provides can be combined with those provided by other approaches. Finally, this project will provide unique educational and training opportunities for USC students as well as working professionals in the field.

Project Report

Manufacturers of large chips are concerned with low yields (the percentage of fabricated chips that can be sold) because yield directly affects revenue. This concern is growing, since future technologies are forecast to suffer decreasing yields due to aggressive scaling of device sizes and operating voltage. In this project we identified several new opportunities for yield enhancement for emerging technologies and emerging system architectures. In contrast with existing design and test approaches that rely extensively on divide-and-conquer, our research took a global view of the role of a module in the overall system architecture. In addition, our approach exploited a basic characteristic of large chips, especially multicore CPUs, GPUs, and SoCs: these chips contain a large number of copies of a relatively small number of modules, such as cores and caches, which are typically connected using interconnects that have simple logical and physical topologies.

We focused on the physical characteristics of these chips: if a chip contains many copies of a module, then instead of adding a spare copy for each module, multiple modules may share one or more spares. We also focused on the functional implications of these characteristics, i.e., the manner in which most applications use these chips: most applications can obtain significant performance even from chips that are slightly imperfect in a variety of ways. We developed new physical models and a new notion of generalized spares sharing to derive new approaches for adding spares that significantly improve the ratio of chip yield to chip area, and hence significantly improve the revenue per wafer. For example, for a GPU at a high defect density expected in the near future, our approach achieves a yield-per-area that is about 72.5% of that obtained in the ideal scenario where defect density is zero and no spares are added.
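
The trade-off behind shared spares can be illustrated with a small sketch. It is not the project's physical model, which this report does not specify. It assumes a textbook Poisson defect model, identical modules, spares that any module can use, and no area overhead for the switching logic. All function and parameter names are hypothetical.

```python
import math


def module_yield(defect_density: float, module_area: float) -> float:
    """Poisson yield of one module: probability that it has zero defects."""
    return math.exp(-defect_density * module_area)


def chip_yield(n_required: int, n_spares: int, y: float) -> float:
    """Probability that at least n_required of the n_required + n_spares
    identical modules are defect-free (assumes any spare can replace any module)."""
    total = n_required + n_spares
    return sum(
        math.comb(total, good) * y**good * (1.0 - y) ** (total - good)
        for good in range(n_required, total + 1)
    )


def relative_yield_per_area(n_required, n_spares, defect_density, module_area):
    """Yield-per-area normalized to the ideal case (zero defects, no spares)."""
    y = module_yield(defect_density, module_area)
    # Spares enlarge the chip, so fewer chips fit on a wafer.
    area_penalty = n_required / (n_required + n_spares)
    return chip_yield(n_required, n_spares, y) * area_penalty


def best_spare_count(n_required, defect_density, module_area, max_spares=8):
    """Spare count in 0..max_spares that maximizes relative yield-per-area."""
    return max(
        range(max_spares + 1),
        key=lambda s: relative_yield_per_area(
            n_required, s, defect_density, module_area
        ),
    )


if __name__ == "__main__":
    # Illustrative numbers only: 16 modules, made-up defect density and area.
    for spares in range(4):
        print(spares, relative_yield_per_area(16, spares, 0.1, 1.0))
```

Under these assumptions, each added spare raises the probability that the chip works but also increases its area, so the normalized yield-per-area peaks at some finite number of shared spares that depends on the defect density.
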

We also identified new approaches for relaxing the functional specifications of these chips in multiple ways, developed new approaches for the design of spares, and demonstrated that these enhance the expected value of the total performance extracted from each wafer. For example, for a GPU, we showed that at a high level of defect density projected for the future, our new number-of-processors binning approach provides a figure of merit that is up to 84% of that of the ideal case, i.e., a process with zero defect density and a design with zero spares.

Our research on identifying fundamental properties of redundancy in hardware systems has provided new insights into the role of redundancy in networking and distributed computing. We are now harnessing these insights by developing new approaches in these domains. Our theory and conceptual framework are embodied in an extensive, effective, and extensible toolkit for defect-tolerant designs that maximize the total useful computation that can be provided to users from each fabricated wafer. We are making our new models and tools available to the research community.

Several doctoral students worked on this project and were trained through the development of the first systematic architecture-level framework for improving yields of processors fabricated using the highly non-ideal nano-scale processes of the future. We have given seminars at universities and companies, and reached out to K-12 students as well as entering freshmen from various departments and schools. By developing the foundations for deriving efficient designs to combat high defect rates, the results of our research will be integral to the design of all large chips fabricated in high volume in the near future. In this manner, our research will help provide affordable information processing devices and infrastructure to support a large class of applications of great societal importance.
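
The number-of-processors binning idea mentioned in this report can be sketched in the same spirit. The report does not define its figure of merit, so the version below is an assumption: each core works independently with the same probability, a chip is sold in the bin matching its count of working cores as long as that count meets a minimum, and a chip's value is proportional to its working cores. The function name and parameters are hypothetical.

```python
import math


def binned_figure_of_merit(n_cores: int, min_cores: int, core_yield: float) -> float:
    """Expected sellable working cores per chip, as a fraction of n_cores.

    A chip with `good` working cores is sold in bin `good` when
    good >= min_cores; otherwise it is discarded and contributes nothing.
    The ideal case (every core works on every chip) scores 1.0.
    """
    expected_cores = sum(
        good
        * math.comb(n_cores, good)
        * core_yield**good
        * (1.0 - core_yield) ** (n_cores - good)
        for good in range(min_cores, n_cores + 1)
    )
    return expected_cores / n_cores


if __name__ == "__main__":
    # Illustrative numbers only: an 8-core chip with a made-up per-core yield.
    for min_cores in (8, 7, 6):
        print(min_cores, binned_figure_of_merit(8, min_cores, 0.9))
```

Setting `min_cores` equal to `n_cores` corresponds to selling only perfect chips. Lowering it models a relaxed functional specification in which slightly imperfect chips still deliver useful performance.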

Project Start:
Project End:
Budget Start: 2010-09-01
Budget End: 2014-12-31
Support Year:
Fiscal Year: 2010
Total Cost: $449,997
Indirect Cost:
Name: University of Southern California
Department:
Type:
DUNS #:
City: Los Angeles
State: CA
Country: United States
Zip Code: 90089