Large-scale computing environments such as data centers and clouds are becoming the core computing infrastructure, which makes the availability of their services critical. However, these environments are increasingly vulnerable to both hardware and software failures. This project designs failure-aware techniques for modeling, prediction, and resource management in large-scale computing environments in the presence of hardware and software failures at various levels. Intellectually, this project develops a fundamental understanding of workload and reliability characteristics, and investigates how improved capacity planning models and prediction techniques can obtain useful information for system design and maintenance. The project further provides insight into the impact of software and hardware component failures on resource management.
The results of this project will include new capacity planning models that evaluate both the reliability and the performance of a given system, and new prediction techniques that forecast future failure occurrences by taking advantage of temporal dependence in failure events. Based on these modeling and prediction techniques, the project will develop new failure-aware runtime strategies for job scheduling, node allocation, and system maintenance, aiming to achieve high system performance and reliability in complex large-scale systems.
The main goal of this research is to address the fundamental performance and reliability challenges in large-scale cluster systems, such as data centers and cloud platforms. To accomplish this goal, we design and implement new resource management techniques and schemes for such cluster systems, with potential benefits for many industry sectors and for everyday life.
Intellectual Merit: The key contributions of this proposal include developing a fundamental understanding of workload and reliability characteristics in large-scale cluster systems, learning how improved performance models and prediction techniques can obtain useful information for system design and maintenance, and providing new runtime schemes for resource allocation and job scheduling in cluster systems.
Broader Impact: In this research project, we devote the outreach agenda to developing an education module centered on applications that have large-scale computing challenges. As a woman, the PI naturally attracts female undergraduate and graduate students interested in computer engineering. We actively encourage young female students to pursue science and engineering. We continue our participation in the Boston Area Girls STEM Collaborative through the College of Engineering at NEU and extend our outreach to middle school girls. As part of this project, we incorporate the concepts of failure-aware resource management into existing courses on computer architecture, capacity planning, and performance evaluation at both the graduate and undergraduate levels. The project also offers a natural vehicle for technology transfer through collaborations with a number of companies (e.g., HP, IBM, VMware).
Project Outcomes: During the project years, we achieved the following outcomes in both research and education.
First, we focused on developing new load balancing algorithms that efficiently balance the computational load among servers in a large-scale cluster system. Large-scale cluster systems are employed in many areas because they offer pools of fundamental resources. Efficient allocation of the shared resources in a cluster system is a critical but challenging issue that has been extensively studied in the past few years. However, we found that the performance benefits of existing policies (e.g., Join Shortest Queue and size-based policies) diminish when workloads are highly variable and temporally correlated. We therefore designed a new load balancing policy that partitions jobs according to their current sizes and ranks the servers based on their loads. By dispatching jobs of similar sizes to the correspondingly ranked servers, this scheduler adaptively balances user traffic and system load in a cluster and thus achieves significant performance benefits.
Second, we investigated new approaches for data placement and data migration in data centers with tiered storage systems. One popular way of leveraging Flash technology in virtual machine environments today is to use it as a second-level host-side cache. Although this approach delivers I/O acceleration for a single VM workload, it might not fully exploit the outstanding performance of Flash or justify the high cost per GB of Flash resources. We designed a new VMware Flash Resource Manager, which aims to maximize the utilization of Flash resources with minimal CPU, memory, and I/O cost for managing and operating Flash. It borrows the ideas of heating and cooling from thermodynamics to identify the data blocks that benefit most from being placed on Flash, and it lazily and asynchronously migrates data blocks between Flash and spinning disks.
Third, we focused on the scheduling problem for parallel data processing applications in a MapReduce cluster.
The MapReduce framework has become the de facto scheme for scalable processing of semi-structured and unstructured data in recent years. The Hadoop ecosystem has evolved into its second generation, Hadoop YARN, which adopts fine-grained resource management schemes for job scheduling. One of the primary performance concerns in Hadoop/YARN is how to minimize the makespan (i.e., the total completion time) of a set of MapReduce jobs. We proposed a class of new techniques and algorithms for MapReduce applications that leverage information about requested resources, resource capacities, and dependencies between tasks to schedule the execution of MapReduce jobs and determine the optimal cluster configuration.
Fourth, as part of this project, we continued to incorporate the concepts of resource management and performance modeling into a graduate-level course ("Simulation and Performance Evaluation") for senior undergraduates and graduate students. We also participated in the Boston Area Girls STEM Collaborative through the College of Engineering at NEU and joined a week-long summer enrichment program by leading a session for 30 middle school girls in summer 2013. In this session, the girls worked in pairs with the Machine Science kits, which are widely used in our Engineering and Computation classes, on a hands-on activity to understand basic queuing theory techniques for cluster management. Two undergraduate students (one of them female) helped prepare the kits, write the code, and give a tutorial to the girls on the day of their visit to NEU.
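To make the first research outcome above more concrete, the following is a minimal Python sketch of dispatching jobs by size class to load-ranked servers. It is an illustration under stated assumptions, not the policy implementation developed in this project: the function names (`size_class`, `rank_servers`, `dispatch`), the fixed size `boundaries`, the `rerank_every` parameter, and the choice to send the smallest jobs to the most loaded server and the largest jobs to the least loaded one are all hypothetical.

```python
from bisect import bisect_right


def size_class(job_size, boundaries):
    """Return the size class of a job.

    `boundaries` is an ascending list of thresholds (assumed, fixed here).
    Class k means that exactly k thresholds are <= job_size.
    """
    return bisect_right(boundaries, job_size)


def rank_servers(loads):
    """Return server indices ordered from most loaded to least loaded."""
    return sorted(range(len(loads)), key=lambda s: loads[s], reverse=True)


def dispatch(job_sizes, boundaries, loads, rerank_every=2):
    """Assign each job to the server whose rank matches the job's size class.

    Assumption: class 0 (smallest jobs) goes to the most loaded server and
    the last class (largest jobs) goes to the least loaded server.
    Servers are re-ranked every `rerank_every` arrivals.
    """
    if len(boundaries) + 1 != len(loads):
        raise ValueError("need exactly one size class per server")

    loads = list(loads)  # work on a copy of the current server loads
    ranking = rank_servers(loads)
    assignment = []

    for i, size in enumerate(job_sizes):
        if i % rerank_every == 0:
            ranking = rank_servers(loads)  # refresh ranks periodically
        server = ranking[size_class(size, boundaries)]
        loads[server] += size  # the job adds its size to that server's load
        assignment.append(server)

    return assignment, loads


if __name__ == "__main__":
    servers, final_loads = dispatch(
        job_sizes=[1, 50, 5, 200],
        boundaries=[10, 100],
        loads=[30, 0, 10],
    )
    print(servers, final_loads)
```

In this sketch the ranks are refreshed only periodically, so jobs of similar sizes keep going to the same server between refreshes while the mapping still follows changes in server load. A real policy would also need to derive the size boundaries from the observed workload rather than fix them in advance.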