The microprocessor industry has moved toward multicore designs to leverage increasing transistor counts in the face of physical and microarchitectural limitations. Unfortunately, providing multiple cores does not by itself translate into better performance for most applications. Rather than pushing the entire burden onto programmers, this project advocates an implicitly parallel programming model that eliminates the laborious and error-prone process of explicit parallel programming. Implicit parallel programming leverages sequential languages to shorten development and debugging cycles, and relies on automatic tools, both static compilers and run-time systems, to identify parallelism and tailor it to the target platform. Implicit parallelism can be systematically extracted using: (1) decoupled software pipelining, a technique for extracting the pipeline parallelism found in many sequential applications (sketched below); (2) low-frequency, high-confidence speculation to overcome the limitations of memory dependence analysis; (3) whole-program parallelization scope to eliminate analysis boundaries; (4) simple extensions to the sequential programming model that give the programmer the power to refine the meaning of a program; and (5) dynamic adaptation to maintain efficiency across changing environments. This project is developing the set of technologies needed to realize an implicitly parallel programming system with scalable, lifelong thread extraction and dynamic adaptation. More broadly, the implicitly parallel programming approach will free programmers to focus on the problems they are trying to solve, rather than forcing them to compensate for the processor industry's inability to continue scaling single-thread performance. This approach will keep computing accessible, allowing it to have the same increasingly positive impact on other fields.
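The project's publications define decoupled software pipelining (DSWP) precisely; the following is only a minimal illustrative sketch of the idea in item (1), assuming a pointer-chasing loop split into a traversal stage and a work stage that communicate through a software queue. All names here (node_t, work, QSIZE) are hypothetical, not the project's implementation.

```c
/* Minimal DSWP-style sketch: the loop "for (n = head; n; n = n->next) work(n);"
 * is split into two concurrent pipeline stages linked by a bounded queue. */
#include <pthread.h>
#include <stdlib.h>

typedef struct node { struct node *next; int payload; } node_t;

#define QSIZE 64
static node_t *queue[QSIZE];
static int head_i = 0, tail_i = 0;      /* ring-buffer indices */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;

static void enqueue(node_t *n) {        /* blocking produce */
    pthread_mutex_lock(&lock);
    while ((tail_i + 1) % QSIZE == head_i)
        pthread_cond_wait(&not_full, &lock);
    queue[tail_i] = n;
    tail_i = (tail_i + 1) % QSIZE;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

static node_t *dequeue(void) {          /* blocking consume */
    pthread_mutex_lock(&lock);
    while (head_i == tail_i)
        pthread_cond_wait(&not_empty, &lock);
    node_t *n = queue[head_i];
    head_i = (head_i + 1) % QSIZE;
    pthread_cond_signal(&not_full);
    pthread_mutex_unlock(&lock);
    return n;
}

static void work(node_t *n) { n->payload *= 2; /* stand-in for real work */ }

/* Stage 1: the pointer-chasing traversal, which carries the loop's
 * critical recurrence, runs alone on one core. */
static void *traverse_stage(void *arg) {
    for (node_t *n = (node_t *)arg; n != NULL; n = n->next)
        enqueue(n);
    enqueue(NULL);                      /* sentinel: end of list */
    return NULL;
}

/* Stage 2: the independent per-node work runs concurrently on another
 * core, decoupled from the traversal by the queue. */
static void *work_stage(void *arg) {
    (void)arg;
    node_t *n;
    while ((n = dequeue()) != NULL)
        work(n);
    return NULL;
}

int main(void) {
    node_t nodes[8];                    /* build a small list */
    for (int i = 0; i < 8; i++) {
        nodes[i].payload = i;
        nodes[i].next = (i + 1 < 8) ? &nodes[i + 1] : NULL;
    }
    pthread_t t1, t2;
    pthread_create(&t1, NULL, traverse_stage, &nodes[0]);
    pthread_create(&t2, NULL, work_stage, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

The key property this sketch illustrates is that the loop-carried dependence (the pointer chase) stays entirely within one thread, so communication flows in only one direction through the queue; that one-way flow is what lets pipelined stages tolerate communication latency.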
With chip multiprocessors (CMPs) comes the promise of high-performance computing on a desktop. CMPs affect how high-performance applications are designed, implemented, and executed. These applications, which have grown increasingly complex, larger in scale, and more data-intensive, can benefit greatly from CMPs and are expected to use parallel systems with tens to several hundreds of nodes to handle ever-growing problem sizes. For example, simulating complex ocean circulation models requires exploiting significant parallelism. Similarly, emerging applications in biomedical computing, automated surgery, and data mining have inherent parallelism, and CMPs can improve their performance by several factors. With the shift to CMPs, managing shared resources has become a critical issue in realizing their full potential.

In this research, we focused on contention for memory resources in a CMP. To develop approaches that reduce shared-resource contention for emerging multi-threaded applications, we studied how their performance is affected by contention for particular shared resources. We developed a general methodology for characterizing multi-threaded applications by determining the effect of shared-resource contention on performance, and applied it to the PARSEC benchmark suite. The characterization revealed several interesting aspects. Three of the twelve PARSEC benchmarks exhibit no contention for cache resources. Nine exhibit contention for the L2 cache, but only three exhibit contention among their own threads; most contention arises from competition with a co-runner. Interestingly, contention for the front-side bus is a major factor, degrading performance by more than 11%.

Effective resource and application management on CMPs requires consideration of user-specific requirements and dynamic adaptation of management decisions based on the actual run-time environment. However, designing a resource- and application-management algorithm that adapts dynamically to the run-time environment is difficult because most management and monitoring facilities are available only at the OS level. We developed REEact, an infrastructure that provides the capability to specify user-level management policies with dynamic adaptation. REEact is a virtual execution environment that provides a framework and core services to quickly enable the design of custom policies for dynamically managing resources and applications. We evaluated REEact on three case studies, each illustrating the use of REEact to apply a specific dynamic management policy on a real CMP. Through these case studies, we demonstrated that REEact can effectively and efficiently implement policies that dynamically manage resources and adapt application execution.

Previous research has shown that thread mapping is a powerful tool for resource management. However, the difficulty of simultaneously managing multiple hardware resources and the varying nature of workloads has limited the effectiveness of thread-mapping algorithms. We performed an in-depth analysis of the PARSEC benchmarks running under different thread mappings to investigate how the mappings interact with microarchitectural resources, including L1 instruction and data caches, instruction and data TLBs, L2 caches, hardware prefetchers, off-chip memory interconnects, branch predictors, memory disambiguation units, and the cores themselves. A sketch of the basic mapping mechanism appears below.
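The mapping algorithms themselves are described in the associated publications; the following is only a minimal sketch of the mechanism they rely on, pinning each thread of an application to a chosen core on Linux through the standard pthread affinity interface. The four-thread workload and the mapping array are illustrative assumptions.

```c
/* Minimal thread-to-core mapping sketch on Linux. Only the pinning
 * mechanism is shown; the policy that picks the cores is assumed. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the given thread to a single core. */
static int pin_thread(pthread_t thread, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
}

static void *worker(void *arg) {
    (void)arg;
    /* ... application work ... */
    return NULL;
}

int main(void) {
    /* Hypothetical mapping decision: spread threads across cores that
     * do not share an L2 cache, to reduce cache contention. */
    int mapping[4] = {0, 2, 4, 6};
    pthread_t threads[4];

    for (int i = 0; i < 4; i++) {
        pthread_create(&threads[i], NULL, worker, NULL);
        if (pin_thread(threads[i], mapping[i]) != 0)
            fprintf(stderr, "failed to pin thread %d\n", i);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```

Whether threads should share or avoid a cache-sharing core pair depends on the workload's contention characteristics, which is exactly what the characterization study above measures.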
Our experiments show that when only memory resources are considered, thread mapping improves an application's performance by as much as 14% over the default Linux scheduler. In contrast, when both memory and processor resources are considered, the mapping algorithm achieves performance improvements of as much as 28%.

We also developed ReSense, the first run-time system that uses application characteristics to dynamically map multi-threaded applications from dynamic workloads. ReSense mitigates contention for the shared resources in the memory hierarchy by applying a novel thread-mapping algorithm that dynamically adjusts the mapping of threads from dynamic workloads using a pre-calculated sensitivity score. Using three dynamic workloads of different sizes, ReSense improved the workloads' average response times by up to 27.03%, 20.89%, and 29.34%, and their throughput by up to 19.97%, 46.56%, and 29.86%, respectively, over the native OS on real hardware. A hypothetical sketch of a sensitivity-guided placement decision follows below.

As multicore processors with growing core counts continue to dominate the server market, the overall utilization of the class of datacenters known as warehouse-scale computers (WSCs) depends heavily on colocating multiple workloads on each server to take advantage of the computational power provided by modern processors. However, many of the applications running in WSCs are user-facing and have quality-of-service (QoS) requirements. We developed ReQoS, a static/dynamic compilation approach that enables low-priority applications to adaptively manipulate their own contentiousness to ensure the QoS of high-priority co-runners. Applying ReQoS to SPEC2006 and SmashBench workloads, we improved machine utilization by more than 70% in many cases, and by more than 50% on average, while enforcing a 90% QoS threshold. We also improved energy efficiency by 47% on average over a policy of disallowing co-locations.

The broader impacts include the development and training of two Ph.D. students, one of whom is a woman. Both will continue their careers in the USA.
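ReSense's actual sensitivity scores and mapping algorithm are defined in the corresponding publications; the sketch below only illustrates the flavor of a sensitivity-guided placement decision. The function choose_placement, the THRESHOLD value, and the sample scores are all invented for illustration and are not the project's algorithm.

```c
/* Hypothetical sensitivity-guided placement sketch, loosely in the
 * spirit of ReSense: a higher score is assumed to mean the application
 * suffers more from sharing the resource (here, an L2 cache). */
#include <stdio.h>

enum placement { SHARE_L2, SPREAD_L2 };

#define THRESHOLD 0.5   /* illustrative tuning parameter, not published */

/* Decide placement for an arriving application given its pre-calculated
 * sensitivity score and that of the current co-runner. */
static enum placement choose_placement(double new_score, double corunner_score) {
    /* If either application is highly sensitive to L2 contention,
     * keep the pair on cores with private L2 caches. */
    if (new_score > THRESHOLD || corunner_score > THRESHOLD)
        return SPREAD_L2;
    /* Otherwise, co-locating on an L2-sharing pair frees other
     * cores for more sensitive workloads. */
    return SHARE_L2;
}

int main(void) {
    double scores[][2] = { {0.8, 0.2}, {0.1, 0.3}, {0.4, 0.9} };
    for (int i = 0; i < 3; i++) {
        enum placement p = choose_placement(scores[i][0], scores[i][1]);
        printf("workload %d -> %s\n", i,
               p == SPREAD_L2 ? "spread across L2 domains"
                              : "share an L2 domain");
    }
    return 0;
}
```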