Recent years have seen a growing deployment of distributed computing infrastructures such as Grids, PlanetLab, @home, and peer-to-peer systems that run a variety of Web, commercial, and scientific applications. Many of these infrastructures are unsupervised---they consist of a large number of loosely connected nodes that contribute computational and storage resources but are not centrally managed. Such unsupervised infrastructures are characterized by uncertainty in resource availability caused by failures, varying load conditions, and node churn, placing an undue burden on application writers and system administrators who must deploy and run applications successfully. This project is developing a self-managing resource allocation framework that hides infrastructure uncertainties and dynamics from applications while transparently adapting to changing conditions within the infrastructure. As part of this framework, the project is developing techniques for: (i) predictable resource aggregation, to provide resource guarantees to applications in the presence of dynamic loads and changing resource availability; (ii) reliability-aware resource management, to provide desired levels of reliability and availability; and (iii) system inference and prediction, to enable decentralized inference of global system conditions for proactive response to dynamic infrastructure changes. These techniques rely on cooperation and redundancy among nodes in the infrastructure to achieve scalability and decentralization. The proposed research will have significant impact on distributed computing by enabling effective deployment of large-scale scientific and commercial applications on resource-rich but unreliable infrastructures.

Project Report

The work carried out as part of the project has resulted in the development of computing techniques that provide reliability and manageability in large-scale computer systems. The goal of this project was to develop self-managing resource management techniques for large-scale distributed computing systems such as Clouds, Grids, and peer-to-peer networking systems. These techniques are meant to hide infrastructure uncertainties and dynamics (such as load and failures) from applications while transparently adapting to changing conditions within the infrastructure. The intellectual merit of the project lies in the development of novel resource scheduling algorithms that can tolerate faulty computers without sacrificing performance, determine the overall 'health' of a large distributed computing system in a scalable manner, and allocate resources to applications based on their requirements across multiple computing, storage, and communication resources. Specific research outcomes include a reliability-aware scheduling algorithm called reputation-based scheduling, predictable resource aggregation mechanisms called Resource Bundles, log-analysis-based failure analysis techniques for large-scale systems, a decentralized network-affinity-aware virtual machine migration technique called Starling, a scalable passive approach for network performance estimation called OPEN, and a framework for running co-hosted MapReduce applications in cloud environments called STEAMEngine. The project results have been primarily disseminated through academic journal, conference, and workshop publications. As part of student training and development, the students working on the project gained significant experience with diverse research methods, including algorithm development, experimental design, and data analysis. The students also gained experience implementing and deploying a large-scale computer system.
The educational goal of the project was to integrate research and student education by supplementing the teaching curriculum and by exposing students to research in real-world environments. As part of the project, new courses were designed and existing curricula were enhanced, with a focus on distributed systems topics and hands-on student programming projects. Active participation of undergraduates in research was achieved through course projects and additional undergraduate support for the research. As a result, several undergraduate students were exposed to a range of research methods and gained hands-on implementation experience in building large-scale computer systems. In terms of broader impact, the project has developed computational techniques for reliable and efficient execution of applications in large-scale systems. This work will enable effective deployment of large-scale scientific and commercial applications on resource-rich but unreliable infrastructures.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Application #: 0643505
Program Officer: D. Helen Gill
Project Start:
Project End:
Budget Start: 2007-01-01
Budget End: 2012-12-31
Support Year:
Fiscal Year: 2006
Total Cost: $416,000
Indirect Cost:
Name: University of Minnesota Twin Cities
Department:
Type:
DUNS #:
City: Minneapolis
State: MN
Country: United States
Zip Code: 55455