Commodity processors are highly programmable, but their need to support general-purpose computation limits both peak and sustained performance. Such observations have motivated the use of "accelerator" boards: co-processing elements that interface with the host server through a standard hardware bus such as PCI-Express but have their own computational engine and, typically, their own memory as well. Unlike the main processor, the accelerator does not support general applications; instead, its hardware and software are tuned for specific types of computations. Accelerators can offload the most demanding parts of an application from the host processor, speeding up the desired computation with their specialized resources. This improved performance enables various forms of high-performance computing (HPC), but comes at a high cost in programmability.
This research targets high-performance computing that uses PC-based clusters for cost and scalability, combined with accelerators for high performance. The Purdue Everest project encompasses several related efforts toward high performance, low power consumption, and high programmability for highly heterogeneous systems. Acquiring a 30-node Gigabit Ethernet-based cluster of multicore PC-based workstations equipped with various accelerator boards (e.g., GPU, Cell, FPGA, Crypto) will enable research into effective and highly programmable use of accelerator-based clusters. Supporting multiple accelerators per node allows applications to use different accelerator boards in different phases, and the cluster also allows fair apples-to-apples comparisons of different accelerators by holding all other system factors constant. This research also investigates the use of multiple concurrency domains, with parallelism across the cluster, across the cores in a single node, among the host processors and accelerators in a single node, and across the processing elements of a given accelerator, as sketched below.
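To make these concurrency domains concrete, the following minimal sketch combines MPI across cluster nodes, OpenMP across the host cores of a node (here, one thread per accelerator), and a CUDA kernel across the processing elements of each accelerator. The kernel, problem size, and thread-to-device mapping are assumptions chosen for illustration rather than a description of the Everest runtime.

    #include <mpi.h>
    #include <omp.h>
    #include <cuda_runtime.h>

    /* Hypothetical kernel: parallelism across the processing elements of one accelerator. */
    __global__ void scale(float *x, int n, float a) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);                 /* parallelism across cluster nodes */
        int rank, ndev = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaGetDeviceCount(&ndev);
        if (ndev == 0) { MPI_Finalize(); return 0; }   /* no accelerator on this node */

        /* Parallelism across host cores: one host thread drives each accelerator board. */
        #pragma omp parallel num_threads(ndev)
        {
            int dev = omp_get_thread_num();
            cudaSetDevice(dev);

            const int n = 1 << 20;              /* illustrative per-device problem size */
            float *d;
            cudaMalloc((void **)&d, n * sizeof(float));
            cudaMemset(d, 0, n * sizeof(float));
            scale<<<(n + 255) / 256, 256>>>(d, n, 1.0f + rank);
            cudaDeviceSynchronize();
            cudaFree(d);
        }

        MPI_Finalize();
        return 0;
    }

A real application would, of course, carry work and exchange data at each level; the sketch only shows how the levels of parallelism nest.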
Automatic compiler techniques have been developed that fit large-data computations into the limited memory of GPUs and other accelerator devices. The techniques use pipelining to overlap computation on the device with the communication needed to move data to and from it. As a result, computations that had previously failed on GPUs due to insufficient memory can now run, and the new pipelining methods yield significant speedups (a schematic sketch of such a pipeline appears at the end of this section).

A key challenge faced by users of public clouds today is how to request the right amount of resources in a production datacenter to satisfy a target performance for a given cloud application. An obvious approach is to develop a performance model for a class of applications such as MapReduce. However, several recent studies have shown that even for the class of well-studied MapReduce jobs, running times can be seriously affected by numerous external factors, ranging from the dozen or so configuration parameters, to physical machine characteristics (CPU, memory, disk, and network bandwidth), to implementation artifacts such as Java garbage collection. These factors make direct performance modeling extremely difficult. In this study, we proposed a more practical, systematic methodology to solve this problem. In particular, we developed a projection model that prescribes the right amount of resources for MapReduce jobs to meet a given job completion time. The model is based on insights into the performance bottlenecks of MapReduce jobs and their scaling properties, and it is parameterized with component running times obtained by profiling on small clusters with sampled inputs. We developed the projection model using the CAP testbed and then evaluated its effectiveness with a wide variety of MapReduce benchmarks running on CAP. Our evaluation shows that the projection model can predict job running times to within 2.7% when scaling to 32 nodes on the CAP testbed. CAP has proven valuable for this project: our experience confirmed that evaluation on Amazon EC2 suffers from performance unpredictability because different jobs compete for shared network resources, and current Amazon EC2 does not provide network isolation.
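The exact form of the projection model is not reproduced here, but the following sketch illustrates the general shape of such a methodology: component running times (map, shuffle, reduce) are measured by profiling on a small cluster with sampled inputs and then scaled to a target cluster size. The structure, names, and linear scaling assumptions below are hypothetical and purely illustrative; they are not the model developed on CAP.

    /* Hypothetical sketch of a resource-projection model for MapReduce jobs.
       Component times come from profiling a small cluster on sampled inputs;
       the simple wave/bandwidth scaling below is an illustrative assumption. */
    struct Profile {
        double map_time_per_split;    /* avg. running time of one map task (s) */
        double reduce_time_per_part;  /* avg. running time of one reduce task (s) */
        double shuffle_bytes_per_sec; /* observed shuffle bandwidth per node (B/s) */
        double fixed_overhead;        /* job setup and teardown cost (s) */
    };

    /* Project the completion time of the full job on a target cluster size. */
    double project_job_time(const Profile &p,
                            long num_splits, long num_partitions,
                            double shuffle_bytes,
                            int nodes, int slots_per_node) {
        long slots = (long)nodes * slots_per_node;
        /* map and reduce tasks execute in waves over the available task slots */
        double map_waves    = (double)((num_splits + slots - 1) / slots);
        double reduce_waves = (double)((num_partitions + slots - 1) / slots);
        double shuffle_time = shuffle_bytes / (p.shuffle_bytes_per_sec * nodes);
        return p.fixed_overhead
             + map_waves    * p.map_time_per_split
             + shuffle_time
             + reduce_waves * p.reduce_time_per_part;
    }

A model of this shape can then be inverted: sweep the node count and report the smallest cluster whose projected completion time meets the target deadline.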
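Returning to the GPU pipelining techniques described at the start of this section, the sketch below shows the kind of double-buffered pipeline such a compiler could generate: a data set too large to keep resident on the device is staged through two device buffers on two CUDA streams, so host-device transfers for one chunk overlap with kernel execution for another. The kernel, the chunk size, and the two-buffer depth are assumptions for illustration, not the compiler's actual output.

    #include <cuda_runtime.h>

    __global__ void process(float *buf, int n) {    /* hypothetical per-chunk computation */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) buf[i] = buf[i] * buf[i] + 1.0f;
    }

    int main() {
        const int N = 1 << 26;        /* data set larger than we want to keep resident */
        const int CHUNK = 1 << 22;    /* chunk small enough to fit in device memory */

        float *h;                     /* pinned host memory enables asynchronous copies */
        cudaMallocHost((void **)&h, N * sizeof(float));
        for (int i = 0; i < N; i++) h[i] = (float)i;

        float *d[2];
        cudaStream_t s[2];
        for (int b = 0; b < 2; b++) {
            cudaMalloc((void **)&d[b], CHUNK * sizeof(float));
            cudaStreamCreate(&s[b]);
        }

        /* Double-buffered pipeline: each chunk's copy-in, kernel, and copy-out are
           queued on one stream, alternating buffers so transfers on one stream
           overlap with computation on the other. */
        for (int off = 0, b = 0; off < N; off += CHUNK, b ^= 1) {
            cudaMemcpyAsync(d[b], h + off, CHUNK * sizeof(float),
                            cudaMemcpyHostToDevice, s[b]);
            process<<<(CHUNK + 255) / 256, 256, 0, s[b]>>>(d[b], CHUNK);
            cudaMemcpyAsync(h + off, d[b], CHUNK * sizeof(float),
                            cudaMemcpyDeviceToHost, s[b]);
        }
        cudaDeviceSynchronize();

        for (int b = 0; b < 2; b++) { cudaFree(d[b]); cudaStreamDestroy(s[b]); }
        cudaFreeHost(h);
        return 0;
    }

While one stream is copying a chunk in or out, the other stream's chunk can be computing, hiding much of the transfer time behind useful work.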