Project Proposed: This RAPID project investigates the use of virtualization technologies to aid the recovery of Information Technology (IT) infrastructure damaged by catastrophic events. The work covers the IT infrastructure needed to support recovery from damage to non-IT infrastructure and harm to human beings. Machine virtualization offers key mechanisms to move applications from one location (e.g., a data center) potentially affected by a disaster to another, safe location. The project responds to challenges such as:
- The inability to migrate Virtual Machines (VMs) from a disaster site to an unaffected site while maintaining live services;
- Severe limitations that power or network failures impose on the ability to perform live migrations;
- The need for coordination with recovery efforts to effectively prioritize critical services.
Machine virtualization also offers the ability to checkpoint VMs, enabling the creation of backups not only of data but also of partial application executions. VM checkpoints can be used to recover an IT infrastructure in a different location with minimal loss of data. The challenge lies in efficiently managing the massive amount of data and network traffic generated by the VM checkpoint process. With the main goals of keeping IT services alive as long as possible, and restoring recovery-critical IT services as quickly as possible during and after a disaster, the project focuses on:
- Analyzing data and events associated with IT services damaged by the Great East Japan Earthquake;
- Studying the scalability of wide-area VM live migration and backup/checkpoint; and
- Developing an architecture resilient to partial physical infrastructure failure in order to deploy IT infrastructures in virtualized and distributed datacenters.
The investigators collaborate with Dr. Satoshi Sekiguchi, Director of the Information Technology Research Institute (ITRI) within the National Institute of Advanced Industrial Science and Technology (AIST), an institution under the Ministry of Economy, Trade, and Industry (METI), Japan. This group has expertise in virtualization and has previously interacted with the Florida group.
Broader Impacts: The work develops an understanding of how well virtualized IT systems can cope with partial physical damage, of what changes in hardware, software, and general practice are needed, and of how to determine the best way to adopt them. In the long term, the project should inform the adoption of virtualized datacenters to host essential IT services. Hence, the project is likely to enable informed decisions and should also contribute to graduate student education.
In today’s society, Information Technology (IT) is applied in many critical infrastructures and systems, so it is key for IT services to recover quickly from damage caused by catastrophic events. This project conducted research on the use of virtualization technologies to architect IT infrastructures resilient to partial physical infrastructure failures. The key idea is to quickly move IT services affected by a disaster to a safe location, taking advantage of machine and network virtualization mechanisms that allow the migration of an entire IT infrastructure from one geographical location to another. This approach has the potential to be substantially more cost efficient, to be application independent, and to offer lower service downtime than traditional disaster recovery (DR) mechanisms, which require (a) applications to be modified for a particular DR implementation and (b) expensive on-line replication of data. Given the scale at which IT services are deployed, it is prohibitively expensive to protect all of them through traditional DR services; research on low-cost alternatives that can be invoked on demand is therefore needed. Many challenges need to be addressed for a virtualization-based DR approach to work: (a) massive amounts of data need to be moved; (b) power and network failures may severely limit the ability to perform VM migration; and (c) migration techniques designed for local-area network (LAN) environments need to be adapted to work efficiently in wide-area networks (WAN). This project conducted research to address these challenges in the context of the Great East Japan Earthquake. To assess the damage to IT systems caused by the earthquake, information about the IT resources available during and after the disaster, and about the timeline of events, was collected from datacenters operated by research institutions in the affected area (the eastern region of Japan).
The study revealed that physical damage to servers and network equipment was minimal, and that uninterruptible power supplies (UPSs) and power generators kept servers and network devices operational for tens of minutes. Informed by these findings, (1) a feedback-based system that controls the transfer of multiple VMs was designed and evaluated (improvements of up to 5.7 times compared to uncontrolled transfers), and (2) a WAN-optimized VM storage migration mechanism was implemented (the time required to relocate a VM from Japan to the US was reduced from 25 minutes to less than 40 seconds). The research results are available in 3 publications, with additional papers submitted or under review. This project established a research relationship between the National Institute of Advanced Industrial Science and Technology (Japan) and the University of Florida (USA) that will hopefully last beyond the NSF RAPID and JST J-RAPID support period.
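To illustrate what feedback-based control of multiple VM transfers can look like, the following is a minimal sketch and not the project's actual design. It assumes a simple hill-climbing controller that raises or lowers the number of concurrent transfers depending on whether the measured aggregate throughput improved. The function names (`tune_concurrency`, `simulated_link`) and every number in the link model are hypothetical and chosen only for illustration.

```python
from typing import Callable


def tune_concurrency(measure_throughput: Callable[[int], float],
                     max_concurrency: int,
                     rounds: int = 20) -> int:
    """Hill-climb the number of concurrent VM transfers using throughput feedback.

    measure_throughput(n) is assumed to return the aggregate throughput
    observed while n transfers run at the same time.
    """
    level = 1                      # current number of concurrent transfers
    step = 1                       # direction of the next probe (+1 or -1)
    best = measure_throughput(level)
    for _ in range(rounds):
        candidate = min(max(level + step, 1), max_concurrency)
        if candidate == level:     # hit a bound, probe the other direction
            step = -step
            continue
        observed = measure_throughput(candidate)
        if observed > best:        # feedback says the change helped: keep it
            level, best = candidate, observed
        else:                      # feedback says it hurt: reverse direction
            step = -step
    return level


def simulated_link(concurrency: int) -> float:
    """Hypothetical WAN model (illustrative numbers only).

    Each transfer achieves up to 100 Mbit/s, the link saturates at
    400 Mbit/s, and every transfer beyond 4 costs 20 Mbit/s in contention.
    """
    return min(100.0 * concurrency, 400.0) - 20.0 * max(0, concurrency - 4)


if __name__ == "__main__":
    print(tune_concurrency(simulated_link, max_concurrency=8))
```

In this toy model the controller settles on the concurrency level at which the link saturates, whereas an uncontrolled approach that starts all transfers at once would pay the contention penalty. A real controller would have to cope with noisy measurements and changing network conditions, which this sketch ignores.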