High-End Computing systems, such as cloud computing setups, increasingly employ many-core compute resources and computational accelerators, e.g., GPUs and IBM Cell processors, to achieve high performance. However, the use of such components results in a performance and communication mismatch, which in turn makes large-scale systems with heterogeneous resources difficult to design, build, and program. Moreover, the increased data demand of modern advanced applications, coupled with the asymmetry between computation speed and data transmission speed, threatens the benefits of employing accelerators in such setups.
This project addresses the above problems by designing a flexible, scalable, and easy-to-use programming model, AMOCA. AMOCA supports innovative workload distribution techniques, which enable it to scale modern scientific and enterprise applications on high-end asymmetric clouds comprising heterogeneous accelerator-type compute nodes. Moreover, AMOCA utilizes component-capability matching and adaptive inter-component data transfers for parallel programming models, automatically handles heterogeneous resources, and auto-tunes the model parameters to the specific instance of resources on which it runs.
AMOCA lays the foundation for adapting the cloud computing paradigm to HPC, creates open-source and transformative technologies for scalable any-core system architectures, and is expected to improve the efficiency and performance of advanced applications in a broad range of disciplines that rely on simulation-based experimentation, including computational physics, biology, and chemistry. The project employs an integrated research and education approach for training both undergraduate and graduate researchers, especially those from underrepresented groups. The training will instill critical system development skills and increase the use of accelerator-based clouds in HPC.
Modern scientific and enterprise High Performance Computing (HPC) applications exhibit varying behavior. Such variations can benefit from the increasing degree of heterogeneity in the underlying computing infrastructure, especially in systems that support the emerging cloud computing model. For example, GPUs can speed up compute-intensive tasks, while a large number of cores and adaptive data placement can help I/O operations. In this project, we designed and developed an Adaptive programming MOdel for Clusters of Accelerators (AMOCA), a software substrate optimized for scaling applications on High-End Asymmetric Clouds (HEACs). AMOCA offers a flexible, scalable, and easy-to-use programming model that automatically handles component asymmetry and adapts to the varying capabilities of the system resources on which applications execute. We extended the MapReduce programming framework to program asymmetric clusters comprising traditional multicores and specialized accelerators such as the Cell and GPUs. For this purpose, we designed a custom, lightweight, easy-to-use library tailored to each target architecture. We further fine-tuned our implementation to mitigate slow I/O operations, using techniques such as multiple buffering to overlap I/O latency with computation. This yielded an efficient model that users can leverage to utilize asymmetric clusters in HPC via cloud-based programming techniques. Our investigation showed that the framework scales well with the number of compute nodes. Furthermore, it runs simultaneously on different types of accelerators, successfully adapts to the available resource capabilities, and performs 26.9% better on average for representative applications than a static execution approach.

Next, we designed a workflow scheduler for the widespread Hadoop framework that is aware of the execution behavior of applications and schedules tasks based on application-hardware affinity. The scheduler targets modern datacenters that may comprise several clusters, each with different hardware characteristics. Extant workflow schedulers are not aware of such heterogeneity and thus cannot ensure high performance in terms of execution time and resource consumption. Similarly, the Hadoop Distributed File System (HDFS), which forms the storage substrate for many large cluster deployments, is unable to exploit heterogeneity in the storage stack. To this end, we adopted a quantitative approach in which we first studied the detailed behavior of representative Hadoop applications running on different hardware configurations, and then incorporated this information into our hardware-aware scheduler to improve the resource-application match. Evaluation shows that our optimized task placement performs 18.7% faster, on average, than hardware-oblivious schedulers. We also designed Heterogeneity-Aware Tiered Storage (hatS), a novel redesign of HDFS into a multi-tiered storage system that seamlessly integrates heterogeneous storage technologies into the Hadoop ecosystem. Our evaluation showed that hatS improves I/O performance by 36% on average, which results in up to a 26% improvement in overall job completion time.

In the next phase, we optimized the HEAC I/O stack using the GPUs available at compute nodes and locality-aware scheduling. We designed a system that uses these GPUs to support RAID in the Lustre file system used in HPC centers.
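To illustrate the kind of work such a system moves to the accelerators, the following is a minimal sketch of RAID-5-style parity encoding and recovery. It is written with NumPy on the CPU for clarity; in the actual design the bulk XOR work would be offloaded to the node-local GPUs. The stripe width, block size, and function names are illustrative assumptions, not the project's implementation.

```python
import numpy as np

BLOCK_BYTES = 1 << 20  # 1 MiB stripe units (assumed block size)

def encode_parity(blocks):
    """RAID-5-style parity: XOR of all blocks in a stripe."""
    parity = np.zeros(BLOCK_BYTES, dtype=np.uint8)
    for block in blocks:
        # Bulk XOR over large buffers; this is the step a GPU kernel would perform.
        np.bitwise_xor(parity, block, out=parity)
    return parity

def recover_block(surviving_blocks, parity):
    """Rebuild one lost data block by XORing the survivors with the parity block."""
    return encode_parity(surviving_blocks + [parity])

# Usage: encode a 4+1 stripe, pretend one data block is lost, and rebuild it.
stripe = [np.random.randint(0, 256, BLOCK_BYTES, dtype=np.uint8) for _ in range(4)]
p = encode_parity(stripe)
rebuilt = recover_block(stripe[:2] + stripe[3:], p)  # block 2 "lost"
assert np.array_equal(rebuilt, stripe[2])
```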
We observed that this system can reduce the cost of HPC I/O systems and improve their overall efficiency. We also designed a cloud scheduler to better place HPC workload virtual machines (VMs) in a distributed cloud setting. The goal was to place a VM close to the data it needs, and to adapt the placement via migration as the application progresses and its characteristics or needs change. We employed a min-flow-based graph optimizer for this purpose and achieved high performance gains; a simplified sketch of such a flow formulation appears at the end of this report. Evaluation of our approach, using both real deployments and simulations, demonstrates the feasibility of the approach and the resulting improvement in I/O throughput.

The work on designing AMOCA is complete and has improved the efficiency of HPC workloads on heterogeneous resources. The use of GPUs for I/O management is currently being considered for a small-scale test deployment at Oak Ridge National Lab. We have also integrated our techniques with the open-source Hadoop programming framework to enable easy adoption of our approaches by other researchers and practitioners. The results from the experiments were disseminated via publications in appropriate venues and presentations at various high-quality peer-reviewed conferences.

The project has contributed to three PhD theses (one by a female student) and two MS theses (one by a female student). The PI also mentored five REU students, including female and minority students, during the course of the project. In addition, the PI created related introductory lectures for the Virginia Tech College of Engineering Freshmen Seminar. The research was leveraged and extended to develop courses introducing HPC and cloud computing, to integrate a file and storage systems component into the PI's undergraduate and graduate operating systems classes, and to enrich the systems and networking Capstone course. The participants (graduate and undergraduate students) were educated in HPC, distributed systems, storage, data management, and the principles of cloud computing.
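For concreteness, the following is a minimal sketch of the flow-based VM placement referenced above, assuming a min-cost flow formulation solved with the networkx library. The VM names, host names, slot capacities, and the data-locality cost model are illustrative assumptions rather than the project's actual optimizer: each VM supplies one unit of flow, each host drains at most its slot capacity to a sink, and edge weights favor hosts that already hold the VM's data; migration can be modeled by re-solving with updated costs as the application's needs change.

```python
import networkx as nx

def place_vms(vms, hosts, remote_data_gb, slots):
    """vms: VM ids; hosts: host ids; remote_data_gb[vm][host]: GB of the VM's
    data NOT local to that host (a proxy for transfer cost); slots[host]:
    number of VMs the host can accept. Returns a VM-to-host placement."""
    G = nx.DiGraph()
    for vm in vms:
        G.add_node(vm, demand=-1)           # each VM must be placed exactly once
    G.add_node("sink", demand=len(vms))     # all placements drain to the sink
    for host in hosts:
        G.add_edge(host, "sink", capacity=slots[host], weight=0)
        for vm in vms:
            # Cheaper edges point to hosts that already hold most of the VM's data.
            G.add_edge(vm, host, capacity=1, weight=int(remote_data_gb[vm][host]))
    flow = nx.min_cost_flow(G)
    return {vm: host for vm in vms for host in hosts if flow[vm].get(host, 0)}

if __name__ == "__main__":
    vms = ["vm1", "vm2", "vm3"]
    hosts = ["hostA", "hostB"]
    remote_data_gb = {"vm1": {"hostA": 2, "hostB": 40},
                      "vm2": {"hostA": 35, "hostB": 1},
                      "vm3": {"hostA": 10, "hostB": 12}}
    slots = {"hostA": 2, "hostB": 2}
    print(place_vms(vms, hosts, remote_data_gb, slots))
    # e.g. {'vm1': 'hostA', 'vm2': 'hostB', 'vm3': 'hostA'}
```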