The availability and robustness of the I/O system are crucial to large-scale applications that generate and analyze terabytes of data. Storage systems are vulnerable to numerous hardware failures, such as I/O and metadata server crashes, and account for as much as 25% of all system failures. Highly available data storage for high-end computing is therefore becoming increasingly critical as high-end computing systems scale up in size. A key challenge in achieving highly available storage systems is characterizing availability, in addition to performance, as a metric of these systems.
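As one common way to quantify such a metric (not necessarily the characterization adopted by this project), steady-state availability can be expressed in terms of mean time to failure (MTTF) and mean time to repair (MTTR) as A = MTTF / (MTTF + MTTR); for example, a storage system with an MTTF of 1000 hours and an MTTR of 10 hours would be about 99% available.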
This research investigates high-availability data and I/O services and their benchmarking. The investigators take an organized approach to developing a benchmarking framework that measures storage performance in consideration of availability under various faulty conditions. The research involves four tasks: 1) develop a fault/error model and design fault injection schemes for storage systems; 2) develop an innovative benchmarking framework for highly available distributed storage systems under different faulty conditions; 3) implement an Availability and Performance Evaluation Toolset (APET) that integrates the fault injection and stress-testing libraries and captures the raw block-level performance of storage systems under various faults; 4) validate the benchmarking framework using APET on block-level storage systems.
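To illustrate the general idea of combining fault injection with block-level performance measurement (this is only a minimal sketch, not APET itself, which injects faults at lower layers of the storage stack), the following Python fragment times random block reads from a test file while a simulated fault injector drops a fraction of the I/Os; the file path, block size, and fault rate are hypothetical parameters.

    # Minimal sketch: block-level read benchmark with a simulated fault injector.
    # Not the APET implementation; all names and parameters are illustrative.
    import os, time, random

    def benchmark_with_faults(path, block_size=4096, n_blocks=1024, fault_rate=0.05):
        """Issue n_blocks random block reads, dropping a fraction as injected faults."""
        ok, failed = 0, 0
        start = time.time()
        with open(path, "rb") as f:
            size = os.fstat(f.fileno()).st_size
            for _ in range(n_blocks):
                f.seek(random.randrange(0, max(1, size - block_size)))
                if random.random() < fault_rate:   # simulated fault: the I/O is lost
                    failed += 1
                    continue
                if f.read(block_size):
                    ok += 1
        elapsed = time.time() - start
        return {
            "throughput_MBps": ok * block_size / elapsed / 1e6,
            "success_ratio": ok / (ok + failed),   # crude availability proxy
        }

    if __name__ == "__main__":
        print(benchmark_with_faults("/tmp/testfile.bin"))  # hypothetical test file

Sweeping the fault rate and comparing the resulting throughput and success ratio against the fault-free baseline gives a rough sense of how a benchmarking framework of this kind can expose the performance cost of faults.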
This research contributes directly to understanding highly available data and I/O services for HEC systems and to establishing a general benchmarking framework for characterizing storage systems under faulty conditions, thereby benefiting society by guiding the development of high-availability-oriented distributed storage systems that are crucial to many applications.