Parallel and distributed computing systems, consisting of a heterogeneous set of machines, software, and networks, frequently operate in environments where their performance degrades due to circumstances that change unpredictably, such as sudden machine failures or inaccuracies in the estimation of system parameters. An important question then arises: what extent of departure from the assumed circumstances will cause the performance to degrade to the point where the system cannot meet the specified requirements, i.e., how robust is the system? The focus of this work is the design of methodologies for generating robustness metrics and using them in resource management.
A resource allocation is defined to be robust if degradation in system performance remains within specified limits when certain perturbations in specified system parameters occur. Furthermore, a resource allocation's degree of robustness must be mathematically quantified, e.g., how many machines can fail, or how inaccurate can estimates of system parameters be, before a performance requirement violation occurs? Specifically, this research addresses the design of:
mathematically precise and widely applicable techniques for modeling and quantifying the robustness of a resource allocation against multiple perturbations in system components and environmental conditions.
resource allocation algorithms that continually plan and develop strategies for responding to potential faults, resource degradation, and other changes in system environment.
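As a purely illustrative sketch of what quantifying robustness can look like, consider one simple case: tasks are assigned to machines, each task has an estimated execution time, and the performance requirement is that every machine finishes within a time limit. Under those assumptions, the smallest Euclidean-norm error in the estimates that pushes a machine to the limit is the distance from the estimate vector to the hyperplane where that machine's finish time equals the limit. The function name, the input layout, and the choice of makespan as the performance feature are all assumptions made for this example; the abstract does not specify them.

```python
import math


def robustness_radius(assignment, est_times, time_limit):
    """Hypothetical helper: robustness of an allocation against
    errors in estimated task execution times.

    assignment: dict mapping machine name -> list of task ids
    est_times:  dict mapping task id -> estimated execution time
    time_limit: latest acceptable finish time for any machine

    Returns (overall_radius, per_machine_radii). A radius is the
    smallest Euclidean norm of estimate errors that would make that
    machine's finish time reach time_limit. A negative radius means
    the requirement is already violated by the estimates themselves.
    """
    per_machine = {}
    for machine, tasks in assignment.items():
        if not tasks:
            continue  # an idle machine cannot violate the limit
        finish = sum(est_times[t] for t in tasks)
        # Distance from the estimate vector to the hyperplane
        # sum(times of tasks on this machine) == time_limit.
        per_machine[machine] = (time_limit - finish) / math.sqrt(len(tasks))
    overall = min(per_machine.values()) if per_machine else math.inf
    return overall, per_machine


# Example with made-up numbers: two machines, limit of 30 time units.
assignment = {"m1": ["a", "b", "c", "d"], "m2": ["e"]}
est_times = {"a": 5, "b": 5, "c": 5, "d": 5, "e": 27}
overall, per_machine = robustness_radius(assignment, est_times, 30)
```

In this example the machine with the single long task is the least robust one, so it determines the overall value. The same pattern (find the smallest perturbation that violates a requirement, then take the minimum over all requirements) could be applied to other perturbations such as machine failures, though that would need a different distance measure.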
This work represents a partnership between university and industry/government laboratories that are committed to developing high-availability computing systems for industry and defense applications. Its results will be widely disseminated through presentations, publications, interdisciplinary workshops, and technology transfer.