Reliability, Performability and Scalability of Large-Scale Distributed Systems

Najjar, Walid

Abstract

Large-scale distributed multicomputer systems, consisting of several thousand processing elements, are rapidly demonstrating their potential as low-cost, high-performance supercomputers. Not only can these systems speed up program execution, but they also allow significantly larger problems to be addressed. Widespread use of these systems in mission-critical as well as commercial applications, however, depends on their demonstrated reliability, availability and scalability. The objective of this research project is to investigate the reliability, scalability and performability of large-scale distributed systems. As the number of elements in a system increases, the failure rate of the system is expected to increase, given a constant technology. System reliability and scalability are therefore important considerations in the design of large-scale systems. The research will focus on two essential issues: the analysis of network reliability and performability, and the evaluation of techniques that can exploit the inherent redundancy of these systems. The network reliability analysis will examine the effects of multiple node and link failures on the connectivity of the network and on its communication bandwidth, investigating the probability of network disconnection, saturation and communication bottlenecks. The inherent hardware redundancy of large-scale systems can be exploited to achieve higher reliability, albeit at the cost of reduced computing power. The second objective will be to investigate the achievable performance/reliability tradeoff and system scalability using various redundancy schemes. The research is essentially analytical in nature but will rely on simulation techniques whenever an exact analytical evaluation is not feasible.
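
The two quantitative ideas in the abstract, a system failure rate that grows with the number of elements and the probability of network disconnection under multiple link failures, can be illustrated with a minimal sketch. It is not the project's method. Everything specific here is an assumption made for illustration: the series model with independent, exponentially distributed node lifetimes, the binary hypercube topology, independent link failures with a fixed probability, perfectly reliable nodes, and all function names.

```python
import random


def system_failure_rate(num_nodes, node_rate):
    """Failure rate of a system that needs every node to work.

    Assumes a series model with independent exponential lifetimes,
    so the per-node rates simply add up.
    """
    return num_nodes * node_rate


def hypercube_links(dim):
    """Undirected links of a dim-dimensional binary hypercube (assumed topology)."""
    return [
        (u, u ^ (1 << bit))
        for u in range(1 << dim)
        for bit in range(dim)
        if u < (u ^ (1 << bit))  # list each link once
    ]


def is_connected(num_nodes, links):
    """Depth-first search from node 0; True if every node is reached."""
    neighbours = {u: [] for u in range(num_nodes)}
    for u, v in links:
        neighbours[u].append(v)
        neighbours[v].append(u)
    seen = {0}
    stack = [0]
    while stack:
        u = stack.pop()
        for v in neighbours[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == num_nodes


def disconnection_probability(dim, link_fail_prob, trials=2000, seed=0):
    """Monte Carlo estimate of the chance that link failures split the network.

    Only link failures are modelled; nodes are assumed never to fail.
    """
    rng = random.Random(seed)
    links = hypercube_links(dim)
    num_nodes = 1 << dim
    disconnected = 0
    for _ in range(trials):
        surviving = [link for link in links if rng.random() >= link_fail_prob]
        if not is_connected(num_nodes, surviving):
            disconnected += 1
    return disconnected / trials
```

Calling `disconnection_probability(dim, p)` for increasing `dim` at a fixed `p` gives a rough view of how connectivity scales with system size under these assumptions. Simulation of this kind stands in where an exact analytical evaluation is not feasible.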

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 9010240
Program Officer: Yechezkel Zalcstein

Project Start
Project End
Budget Start: 1990-07-01
Budget End: 1992-12-31
Support Year
Fiscal Year: 1990
Total Cost: $64,235
Indirect Cost

Institution

Colorado State University-Fort Collins, Fort Collins, CO, United States
