In today's high-end computing (HEC) systems, the parallel file system (PFS) is at the core of the storage infrastructure. PFS deployments are shared by many users and applications, but they currently make no provision for differentiated service: data access is provided in a best-effort manner. As systems scale, this limitation can prevent applications from efficiently utilizing HEC resources while achieving their desired performance, and it presents a hurdle to supporting large numbers of data-intensive applications concurrently. This NSF HECURA project tackles the challenges of quality-of-service (QoS) driven HEC storage management, aiming to support I/O bandwidth guarantees in PFSs by addressing the following four research aspects:

1. Per-application I/O bandwidth allocation based on PFS virtualization, where each application receives its specific I/O bandwidth share through a dynamically created virtual PFS (illustrated by the sketch following this list).
2. PFS management services that control the lifecycle and configuration of per-application virtual PFSs and support application I/O monitoring and storage resource reservation.
3. Efficient I/O bandwidth allocation through autonomic, fine-grained resource scheduling across applications, incorporating coordinated scheduling and optimizations based on profiling and prediction.
4. Scalable application checkpointing based on performance isolation and optimization on virtual PFSs customized for checkpointing I/O.
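To make the first aspect concrete: under proportional sharing, the aggregate PFS bandwidth is divided among the currently active applications according to their assigned weights. The sketch below is purely illustrative; the function name and numbers are hypothetical, not part of the vPFS code.

```python
# Illustrative sketch (not vPFS source): dividing aggregate PFS bandwidth
# among active applications in proportion to their assigned weights.

def proportional_shares(total_mb_s: float, weights: dict[str, float]) -> dict[str, float]:
    """Return each active application's bandwidth share in MB/s."""
    total_weight = sum(weights.values())
    return {app: total_mb_s * w / total_weight for app, w in weights.items()}

# Example: a 1000 MB/s storage system shared by three applications whose
# virtual PFSs were assigned weights 1, 2, and 5.
print(proportional_shares(1000.0, {"A": 1.0, "B": 2.0, "C": 5.0}))
# -> {'A': 125.0, 'B': 250.0, 'C': 625.0}
```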

Project Report

High-performance computing (HPC) systems are important platforms for solving challenging computational problems in many disciplines. They deliver high performance through parallel computing on large numbers of processors and parallel I/O across large numbers of storage devices. However, HPC applications are becoming increasingly data-intensive. On one hand, a rapidly growing number of data-driven applications rely on the processing and analysis of large volumes of data. On the other hand, as applications employ more processors to solve larger and harder problems, they must checkpoint more data to tolerate increasingly frequent failures. At the same time, HPC applications are increasingly deployed on shared computing and storage infrastructures because of the significant economic benefits that consolidation brings to both HPC users and providers. In addition, hosting large datasets on shared infrastructure allows the data to be shared efficiently by applications.

The combination of these trends makes HPC resource management, particularly the management of shared storage I/O resources, a critical and challenging problem. Although several techniques exist to partition the processors in an HPC system, parallel storage bandwidth is difficult to allocate because it must be time-shared by applications with varying I/O demands. Without proper isolation of competing I/Os, an application's performance may degrade in unpredictable ways under contention. Nonetheless, support for such storage management is generally lacking in HPC systems. Today's HPC storage stacks cannot recognize different applications' I/O workloads; they see only generic I/O requests arriving from compute nodes. Nor can they satisfy different applications' bandwidth needs, as they are typically architected to meet a throughput target for the entire system. These limitations prevent applications from achieving their desired performance while making efficient use of HPC resources.

This project takes a multi-faceted approach to these problems and provides application quality-of-service (QoS) driven storage management for HPC applications. First, it provides an application-specific storage bandwidth management framework based on the virtualization of the parallel file systems commonly used in HPC. The virtualization layer, named vPFS, transparently interposes parallel file system I/Os, differentiates them on a per-application basis, and schedules them according to the applications' I/O demands. Second, building on vPFS, the project enables the study of I/O scheduling algorithms for different storage management objectives; it is among the first to study proportional-share schedulers for managing both data and metadata I/Os in HPC storage systems, as well as a two-level scheduling architecture for achieving both I/O throughput and latency objectives. Finally, the project has contributed a novel parallel file system simulator (PFSsim) capable of simulating different I/O scheduling algorithms on current and future HPC architectures. A prototype of vPFS that virtualizes PVFS2, a widely used parallel file system, has been developed and evaluated with typical parallel computing and I/O benchmarks as well as real MPI applications.
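The report does not spell out the scheduling algorithms here, but start-time fair queuing (SFQ) is a representative proportional-share discipline for this setting, and its SFQ(D) variant is commonly used for storage servers. The following single-server sketch is illustrative only, not the vPFS implementation; a real PFS scheduler would additionally bound the number of outstanding requests per storage server and coordinate tags across parallel servers.

```python
import heapq

class SFQScheduler:
    """Minimal start-time fair queuing (SFQ) sketch: each application's
    requests are tagged and dispatched in start-tag order, so backlogged
    applications receive I/O service in proportion to their weights."""

    def __init__(self, weights):
        self.weights = weights                    # flow -> weight
        self.finish = {f: 0.0 for f in weights}   # last finish tag per flow
        self.vtime = 0.0                          # virtual time
        self.queue = []                           # min-heap of (start_tag, seq, flow, cost)
        self.seq = 0                              # tie-breaker for equal tags

    def enqueue(self, flow, cost):
        start = max(self.vtime, self.finish[flow])
        self.finish[flow] = start + cost / self.weights[flow]
        heapq.heappush(self.queue, (start, self.seq, flow, cost))
        self.seq += 1

    def dispatch(self):
        start, _, flow, cost = heapq.heappop(self.queue)
        self.vtime = start                        # advance virtual time
        return flow, cost

# Two backlogged applications with a 2:1 weight ratio submitting equal-cost
# requests: while both have requests queued, A is dispatched about twice as
# often as B.
s = SFQScheduler({"A": 2.0, "B": 1.0})
for _ in range(6):
    s.enqueue("A", 1.0)
    s.enqueue("B", 1.0)
print([s.dispatch()[0] for _ in range(12)])
# -> ['A', 'B', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'B', 'B']
```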
The results demonstrate that the overhead of the parallel file system virtualization framework is small (less than 3% in I/O throughput) compared to native PVFS2. They also show that the I/O schedulers enabled by vPFS achieve good proportional sharing of both data and metadata services (at least 96% of any given target sharing ratio) for competing applications with diverse I/O patterns.

The project's results have generated broader impacts in several key aspects. First, the QoS-driven storage management framework has been contributed to the broader community, including HPC users from different disciplines and HPC providers of current and future systems, as open-source software. In particular, the investigators have reached out to practitioners in leading HPC, cloud, and storage companies and federal laboratories and helped solve the practical problems they face. Second, the project has provided research experience to six PhD students, including two Hispanics, one African American, and one woman, and supported the development of their PhD dissertations. It has also provided research experience to 23 undergraduate students, including 18 Hispanics, three African Americans, four women, and four veterans, who were encouraged to pursue advanced studies and careers in computing; among them, five have continued into graduate study and 11 have become IT professionals. Third, a larger body of students has been educated in HPC-related topics through the undergraduate and graduate classes offered at FIU and UF, the virtualization-based educational infrastructure (vMoodle) created as part of the project, and the K-12 outreach events the investigators organized for students from the local Hispanic community.
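The report does not define exactly how the sharing accuracy above is computed; one plausible reading, with hypothetical throughput numbers, is the ratio of the measured bandwidth ratio to the target ratio:

```python
# Hypothetical reading of the "at least 96% of target sharing ratio" result.
# With a 4:1 target between applications A and B, 96% accuracy means the
# measured throughput ratio stays at or above 3.84:1.
target_ratio = 4.0
measured_a, measured_b = 392.0, 100.0    # MB/s; made-up numbers
achieved = (measured_a / measured_b) / target_ratio
print(f"{achieved:.1%} of target")       # 98.0% of target -> meets the 96% bar
```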

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 0938045
Program Officer: Almadena Y. Chtchelkanova
Project Start:
Project End:
Budget Start: 2009-09-01
Budget End: 2014-08-31
Support Year:
Fiscal Year: 2009
Total Cost: $456,343
Indirect Cost:
Name: Florida International University
Department:
Type:
DUNS #:
City: Miami
State: FL
Country: United States
Zip Code: 33199