With the emergence of multicore architectures comes the promise of integrating enormous computing power into a single chip, enabling parallel computing on all types of platforms, from handheld computers to desktop machines. Proper software support for applications is critical to harnessing the true power of these architectures. An inherent characteristic of multicores that presents a significant obstacle is runtime variation: because of reliability issues, energy/thermal behavior, and process variation, identically designed components of a multicore will behave differently, negatively impacting application power consumption and performance. Runtime variation has been identified as one of the key problems that could block further scaling of circuits if not properly addressed.
This research project is developing an advanced execution system, called the Robust Execution Environment (REEact), that dynamically mediates, controls, and adapts an application's execution to the runtime resource landscape produced by runtime variations. REEact combines techniques for adapting both the hardware resources and the application software to overcome the impact of runtime variations. At the hardware level, it adapts resources, for example by setting the speed/voltage of a node on the multicore. At the software level, it dynamically optimizes code, taking into account the performance and power consumption effects of runtime variations. REEact enlists the help of the OS in deciding which resources to use when running the application, and informs the OS of what it dynamically discovers about latency, power, and application behavior. It is built as a multi-layer hierarchical runtime system that interacts with the parallel application, the OS, and the underlying multicore architecture to ensure that maximum performance is achieved.
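The hardware-level adaptation described above can be sketched as a simple policy that boosts the frequency of cores falling behind their peers. The frequency table, thresholds, and function names below are illustrative assumptions for exposition, not REEact's actual policy.

```python
# Hypothetical sketch of REEact-style hardware adaptation: pick a DVFS
# level for a core from its measured slowdown relative to the fastest
# core, so slower (e.g. hotter or process-degraded) cores get boosted.
# The P-state table and thresholds are invented for illustration.

FREQ_LEVELS_MHZ = [1200, 1800, 2400, 3000]  # assumed available P-states

def pick_frequency(slowdown_pct):
    """Map a core's slowdown (percent, vs. the fastest core) to a
    frequency level: the further behind, the higher the boost."""
    if slowdown_pct >= 30:
        return FREQ_LEVELS_MHZ[3]   # maximum boost to rebalance progress
    if slowdown_pct >= 15:
        return FREQ_LEVELS_MHZ[2]
    if slowdown_pct >= 5:
        return FREQ_LEVELS_MHZ[1]
    return FREQ_LEVELS_MHZ[0]       # on pace: run at a power-saving level

# On Linux, the chosen level could be applied through the cpufreq sysfs
# interface (scaling_setspeed / scaling_max_freq), which requires root.
```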
With chip multiprocessors (CMPs) comes the promise of high-performance computing on the desktop. CMPs affect the design and implementation of high-performance applications and the way they execute. These applications, which have become increasingly complex, larger in scale, and handle huge data sets, can benefit greatly from CMPs; they are expected to use parallel systems with tens to several hundreds of nodes to handle their ever-growing problem sizes. For example, simulating complex ocean circulation models requires exploiting significant parallelism. Similarly, emerging applications in biomedical computing, automated surgery, and data mining have inherent parallelism, and CMPs can increase their performance severalfold.

With the shift to CMPs, managing shared resources has become a critical issue in realizing their full potential. In this research, we focused on contention for memory resources in a CMP. To develop approaches that reduce shared-resource contention for emerging multi-threaded applications, we studied how their performance is affected by contention for particular shared resources. We developed a general methodology for characterizing multi-threaded applications by determining the effect of shared-resource contention on performance, and applied it to the PARSEC benchmark suite for shared-memory resource contention. The characterization revealed several interesting results. Three of the twelve PARSEC benchmarks exhibit no contention for cache resources. Nine exhibit contention for the L2 cache, but only three exhibit contention among their own threads; most contention arises from competition with a co-runner. Interestingly, contention for the front-side bus is a major factor, degrading performance by more than 11%. Effective resource and application management on CMPs requires consideration of user-specific requirements and dynamic adaptation of management decisions based on the actual run-time environment.
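The core measurement behind this characterization methodology can be sketched as follows: run a benchmark alone, run it again alongside a co-runner that stresses one shared resource, and report the slowdown. The function names, the 5% threshold, and the example timings are illustrative assumptions, not the study's actual parameters.

```python
# Minimal sketch of co-runner-based contention characterization:
# sensitivity to a shared resource is the percent slowdown suffered
# when sharing that resource with a stressing co-runner.

def contention_sensitivity(t_solo, t_corun):
    """Percent slowdown: solo runtime vs. runtime with a co-runner
    thrashing one shared resource (e.g. the L2 cache)."""
    return (t_corun - t_solo) / t_solo * 100.0

def classify(slowdown_pct, threshold=5.0):
    """Label a benchmark contention-sensitive for this resource if its
    slowdown exceeds a chosen threshold (5% here, as an assumption)."""
    return "sensitive" if slowdown_pct > threshold else "insensitive"

# Example with made-up timings: 10.0 s alone, 11.2 s with an
# L2-thrashing co-runner gives a ~12% slowdown for the L2 cache.
slowdown = contention_sensitivity(10.0, 11.2)
label = classify(slowdown)
```

Repeating this per resource (L2, front-side bus, prefetchers, ...) and per benchmark yields a sensitivity profile like the one summarized above for PARSEC.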
However, designing an algorithm to manage resources and applications that can dynamically adapt to the run-time environment is difficult, because most resource and application management and monitoring facilities are available only at the OS level. We developed REEact, an infrastructure that provides the capability to specify user-level management policies with dynamic adaptation. REEact is a virtual execution environment that provides a framework and core services to quickly enable the design of custom policies for dynamically managing resources and applications. We evaluated REEact on three case studies, each illustrating the use of REEact to apply a specific dynamic management policy on a real CMP. Through these case studies, we demonstrated that REEact can effectively and efficiently implement policies that dynamically manage resources and adapt application execution.

Previous research has shown that thread mapping is a powerful tool for resource management. However, the difficulty of simultaneously managing multiple hardware resources and the varying nature of workloads have limited the effectiveness of thread-mapping algorithms. We performed an in-depth analysis of the PARSEC benchmarks under different thread mappings to investigate how the mappings interact with microarchitectural resources, including the L1 I/D-caches, I/D TLBs, L2 caches, hardware prefetchers, off-chip memory interconnects, branch predictors, memory disambiguation units, and the cores themselves. Our experiments show that when only memory resources are considered, thread mapping improves an application's performance by as much as 14% over the default Linux scheduler. When both memory and processor resources are considered, the mapping algorithm achieves performance improvements of as much as 28%. We also developed ReSense, the first run-time system that uses application characteristics to dynamically map multi-threaded applications from dynamic workloads.
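The sensitivity-score-driven mapping that a system like ReSense performs can be sketched as a greedy assignment: the applications most sensitive to shared-resource contention are given the most isolated cores first. The scores, core ordering, and greedy strategy below are invented for illustration and are not ReSense's actual algorithm or values.

```python
# Hedged sketch of sensitivity-driven thread mapping: rank applications
# by a pre-calculated contention-sensitivity score and hand the most
# sensitive ones the cores that share the fewest resources.
# Scores, workload names, and core ordering are illustrative assumptions.

def map_by_sensitivity(apps, free_cores):
    """apps: list of (name, sensitivity_score) pairs.
    free_cores: core ids ordered from most to least isolated
    (e.g. no shared L2 first). Returns {name: core}."""
    ranked = sorted(apps, key=lambda a: a[1], reverse=True)
    return {name: core for (name, _), core in zip(ranked, free_cores)}

# Example workload with made-up scores (higher = more sensitive):
workload = [("streamcluster", 0.9), ("blackscholes", 0.1), ("canneal", 0.6)]
placement = map_by_sensitivity(workload, free_cores=[0, 2, 4])

# On Linux, such a placement could be applied per thread with
# os.sched_setaffinity(tid, {core}); a dynamic system would recompute
# the mapping as applications enter and leave the workload.
```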
ReSense mitigates contention for shared resources in the memory hierarchy by applying a novel thread-mapping algorithm that dynamically adjusts the mapping of threads from dynamic workloads using a pre-calculated sensitivity score. On three different-sized dynamic workloads, ReSense improved average response time by up to 27.03%, 20.89%, and 29.34%, and throughput by up to 19.97%, 46.56%, and 29.86%, respectively, over the native OS on real hardware.

As multicore processors with growing core counts continue to dominate the server market, the overall utilization of the class of datacenters known as warehouse-scale computers (WSCs) depends heavily on colocating multiple workloads on each server to take advantage of the computational power of modern processors. However, many of the applications running in WSCs are user-facing and have quality-of-service (QoS) requirements. We developed ReQoS, a static/dynamic compilation approach that enables low-priority applications to adaptively manipulate their own contentiousness to ensure the QoS of high-priority co-runners. Applying ReQoS to SPEC2006 and SmashBench workloads, we improved machine utilization by more than 70% in many cases, and by more than 50% on average, while enforcing a 90% QoS threshold. We also improved energy efficiency by 47% on average over a policy of disallowing colocations.

The broader impacts of this project include the development and training of two Ph.D. students and an REU student. We also held a workshop for women graduate students and faculty interested in learning about runtime systems.