Ordinary PCs are widely employed for large scale scientific computing today. Released in 2004, BOINC middleware is managing over 500,000 volunteered PC nodes and providing computation power to around 30 scientific research projects including Rosetta@home, Climateprediction.net, and IBM World Community Grid. The main attraction of such "virtual clusters" is that computing is "free" as minimal additional hardware and personnel resources are needed. The potential for exploiting such idle cycles is immense since well under 1% of the world's 1 billion PCs are currently participating. Currently idle PCs are exploited for sequential and independent parallel tasks, but not communicating parallel tasks, a common HPC paradigm. The central goal of this project is to achieve robust execution of communicating parallel applications on networked ordinary PCs. The challenge is that ordinary desktops are "volatile", i.e., their availability changes suddenly and frequently based on desktop owner's actions. Checkpointing of parallel applications, the state of the art in fault tolerant scientific computing, is not sufficient for high failure rate environments. This project is developing the VolPEx framework (Parallel Execution on Volatile nodes) that employs managed redundancy as the core mechanism to achieve seamless forward application progress in the presence of routine failures. The canonical execution model consists of two or more concurrent replicas of each process with failed process replicas regenerated on-demand from healthy ones. The following communication APIs are provided for application development. 1. Virtual Dataspace: An abstract API for asynchronous anonymous Put/Get communication among tasks. The BOINC programming model of independent tasks is being extended with the dataspace API to allow inter-task communication 2. Volpex MPI: A subset implementation of MPI with a communication layer customized for execution on volatile nodes. The validation of this research will include execution of selected parallel applications on 100s of nodes across campus LANs and 1000s of nodes across the globe under Volpex/BOINC. The ability to transform ordinary PCs into a virtual cluster to run parallel codes will have wide ranging impact. Virtually all scientists will get access to HPC while the need for clusters and the cost of purchasing, operating, and maintaining clusters will diminish.

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Network Systems (CNS)
Type
Standard Grant (Standard)
Application #
0834807
Program Officer
Mohamed G. Gouda
Project Start
Project End
Budget Start
2008-09-01
Budget End
2011-08-31
Support Year
Fiscal Year
2008
Total Cost
$50,000
Indirect Cost
Name
University of California Berkeley
Department
Type
DUNS #
City
Berkeley
State
CA
Country
United States
Zip Code
94704