Networked systems have always been designed to operate in the presence of failures, especially in communication links and storage. Until recently, the other components of such systems had relatively low failure probabilities, and most networked systems could achieve the desired level of resilience with minimal redundancy added in an ad hoc manner. Two opposing trends are likely to make resilience significantly harder to achieve in the coming years: (a) increasing hardware failure probabilities: with the move toward finer nano-scale fabrication, chips are increasingly vulnerable to soft errors caused by external noise and increasingly likely to fail early due to fatigue; (b) higher resilience requirements: as critical services continue to migrate to clouds, service providers are pushed into more stringent service-level agreements (SLAs), including higher reliability, higher availability, and tighter guarantees on service times. Together, these trends can dramatically increase the overhead of existing approaches to achieving the desired level of resilience.
Intellectual Merit: The first outcome of this project will be a holistic roadmap for the resilience of networked systems. This resilience roadmap will take the roadmaps for nano-scale CMOS (trends in chip cost, functionality, performance, power, and the resilience attainable at the chip level) and use them to realistically project the future cost of currently used networking and systems techniques for achieving a desired level of resilience. The second outcome will be resilience methods that scale gracefully in the face of increasing hardware failures. Such methods will use novel partitioned redundancy strategies that achieve reliability at different levels across the hardware and software layers.
Broader Impacts: The resilience roadmap will provide an unprecedented understanding of the trends in resilience and a uniquely realistic assessment of challenges and opportunities. This will significantly influence research in both the hardware and networking communities. A systematic design of scalable resilience methods will lead to significantly higher levels of resilience, lower costs, both capital (equipment) and recurring (especially energy), and/or higher levels of performance. The practical gains to society from the proposed project are likely to be substantial, since networked systems now constitute one of our most critical infrastructures and consume an increasingly large proportion of our resources.
This project will draw upon two disciplines, hardware architecture and networked systems, and will involve detailed case studies and the development of completely new theory and techniques. It will therefore provide unique educational and training opportunities for students and working professionals in these fields.
Budget Impact Statement: The item numbers in this paragraph refer to those in Figure 9 and Section 3.2 (entitled 'Proposed Research Tasks and Plan') of our original proposal. We will undertake all tasks and sub-tasks proposed in item-1 (and all its sub-items). In item-2, we will develop a general framework that considers all basic redundancy schemes and alternative ways of deploying them (sub-item-2.1). We will also characterize the associated tradeoffs (sub-item-2.2) and the consequences of realistic constraints (sub-item-2.3). However, we will pursue the development of prototype tools (as outlined in sub-item-2.4) only to the extent necessary to demonstrate the benefits of our approach and to conduct the case studies described in item-3. Finally, we will undertake the case studies as originally proposed in item-3.