With the advent of scalable parallel computing, thousands of devices are connected and managed collectively. This era faces a new challenge, performance failure: systems often perform worse than expected because of large-scale management issues such as hardware failures, software bugs, and configuration mistakes. This project targets one overlooked cause of performance failure: "lagging hardware", that is, hardware whose performance degrades significantly compared to its specification. Many reports indicate that a single piece of lagging hardware can easily trigger a cascade that collapses the performance of a whole cluster. When this happens, parallelism goes unexploited, productivity is reduced, the system is underutilized, and energy is wasted. The goal of the LigHTS project is to transform computing systems into Lagging-Hardware Tolerant Systems. The LigHTS project will bring many direct benefits to society: users in many areas (science, healthcare, business, education, military, and government) increasingly rely on large-scale storage and computation services. For these users, predictable performance is a key to success, and lagging-hardware tolerant computing is a critical ingredient in delivering it.
The LigHTS project has three major objectives. The first is lagging-hardware data analysis and instrumentation. To improve the robustness of future parallel systems, it is crucial to study the lagging characteristics exhibited by modern hardware and to devise new instrumentation methodologies that can collect cases of lagging hardware in deployment. The second is lagging-failure system analysis. It is important to rigorously analyze the impact of lagging hardware (including disks, networks, and processors) on currently deployed systems. The results will unearth design flaws and inform a reevaluation of how deployed systems should evolve. The third is LigHTS principles, design, and implementation. There is a need to establish foundational principles of lagging-hardware tolerant computing and to apply these principles in building prototypes of cross-layer LigHTS systems spanning distributed storage, computing frameworks, and operating and runtime systems.