The increasing problems of power, heat dissipation, and design complexity have caused a shift in processor technology to favor multicore multiprocessors. Along with that shift, the sharing of the memory hierarchy has become deeper, more heterogeneous, and more complex, causing cache contention and increased conflicts, but also opening opportunities for synergistic sharing. Without an understanding of the implications of this change, current multicore systems suffer from considerable performance degradation, poor performance isolation, and inferior fairness guarantees. The urgency of these issues grows as the degree of processor-level parallelism increases rapidly.
Prior studies, mostly in computer architecture and operating systems, rely on simple heuristics to estimate the cache requirements of co-running programs; the inaccuracy and overhead of those heuristics limit their scalability and effectiveness. This work tackles these challenges from the compiler perspective: constructing predictive behavior models for co-running processes, developing cache-sharing-aware program transformations and loop scheduling, and combining the program-level knowledge of programming systems with proactive resource management by runtime systems. Specifically, this work proposes inclusive reuse signatures to characterize inclusive locality (the memory behavior of co-running programs on shared caches) and inter-thread affinity models to capture data locality among parallel threads. It tackles the challenges facing the measurement, prediction, and exploitation of inclusive locality. The analysis opens new opportunities for shared-cache optimizations by both compilers and runtime systems. The PI develops a series of program transformations, such as inter-thread memory reorganization and cache-sharing-aware loop scheduling, to increase inter-thread spatial locality and ameliorate conflicts, contention, and false sharing. For runtime systems, this work introduces proactive cache management, which partitions caches or schedules processes according to predicted inclusive locality, overcoming the limitations of current reactive schemes in scalability, accuracy, and effectiveness.
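To make the notion of a reuse signature concrete, the sketch below computes a reuse-distance histogram for a memory access trace and for a naive interleaving of two co-running traces. This is only an illustrative model, assuming a fully associative LRU cache and uniform interleaving of the two programs; the function names and toy traces are hypothetical and are not taken from the project's actual tools.

```python
from collections import Counter

def reuse_signature(trace):
    """Reuse-distance histogram of an access trace.

    The reuse distance of an access is the number of distinct elements
    touched since the previous access to the same element (infinite for
    first-time accesses). The histogram of these distances -- the reuse
    signature -- predicts misses for any fully associative LRU cache size.
    """
    stack = []              # distinct elements, most recently used first
    histogram = Counter()
    for addr in trace:
        if addr in stack:
            depth = stack.index(addr)     # distinct elements since last use
            histogram[depth] += 1
            stack.pop(depth)
        else:
            histogram['inf'] += 1         # cold (first-time) access
        stack.insert(0, addr)
    return histogram

def interleave(trace_a, trace_b):
    """Uniformly interleave two co-runners' traces to model a shared cache."""
    merged = []
    for a, b in zip(trace_a, trace_b):
        merged += [('A', a), ('B', b)]    # tag to keep address spaces distinct
    return merged

# Toy example: co-running dilates each program's standalone reuse distances.
t1 = ['x', 'y', 'x', 'z', 'y', 'x']
t2 = ['p', 'q', 'p', 'q', 'p', 'q']
print(reuse_signature(t1))                  # standalone locality of program A
print(reuse_signature(interleave(t1, t2)))  # locality under shared-cache co-run
```

The key observation the sketch exposes is that the second histogram shifts toward larger distances: accesses from the co-runner interleave between a program's reuses, which is the effect the inclusive locality models aim to predict without measuring the co-run directly.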
Modern processors have shown a trend toward deep sharing of memory systems among many cores. The sharing is heterogeneous and increasingly complex, seriously hindering the exploitation of the full power of modern computing devices. In this project, Dr. Xipeng Shen and his team conducted the first set of fundamental explorations into the properties of this sharing from the compilation perspective. They proposed several novel predictive models for analyzing the effects of the sharing on co-running tasks. They developed new ways to transform programs so that, rather than competing for the shared resources in the memory system, co-running jobs can help each other by loading data synergistically. Dr. Shen's team proposed and materialized the idea of proactive cache management, which partitions caches (the fast-access memory close to processors) or schedules processes according to the predicted effects of the shared memory system on two or more co-running tasks. The approach overcomes the limitations of current reactive schemes in scalability, accuracy, and effectiveness. They further extended the study beyond traditional parallel processors to Graphics Processing Units, revealing a series of fundamental properties of the memory systems in these emerging, massively parallel devices. Their work has produced more than 20 research papers, fruitful collaborations with industry, and the first open-source runtime library that offers comprehensive support for maximizing memory performance on a heterogeneous computing device on the fly.

The outcome of this project provides the first principled understanding of modern memory systems that feature non-uniform sharing. It is important for bridging the gap between the rapidly growing processor throughput and the slowly expanding memory bandwidth in modern systems. Ultimately, it offers a critical component for tapping into the full power of modern computing devices, and hence for maximizing cost effectiveness and efficiency in a broad range of computing tasks, including those for scientific discovery, social network analysis, finance, health, and many other aspects of human life.
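As a rough illustration of the proactive idea described above, the sketch below chooses a way partition for two co-runners by minimizing their combined predicted misses, using per-program miss curves that would be derived ahead of time (for example, from reuse signatures) rather than measured reactively online. The function name, the curves, and the way counts are all hypothetical; this is a minimal model of the decision step, not the project's implementation.

```python
def proactive_partition(miss_curve_a, miss_curve_b, total_ways):
    """Pick the way split (ways_a, ways_b) that minimizes the predicted
    combined miss count, given each program's miss curve indexed by the
    number of cache ways it is allocated."""
    best_a = min(range(1, total_ways),
                 key=lambda w: miss_curve_a[w] + miss_curve_b[total_ways - w])
    return best_a, total_ways - best_a

# Hypothetical predicted miss curves for an 8-way shared cache (index = ways).
curve_a = [100, 80, 40, 20, 15, 12, 11, 10, 10]   # cache-sensitive program
curve_b = [50, 48, 47, 46, 46, 45, 45, 45, 45]    # streaming program
print(proactive_partition(curve_a, curve_b, 8))    # chosen (ways_a, ways_b)
```

Because the decision is driven by predicted behavior, it can be made before contention degrades performance, which is the contrast with reactive schemes that must first observe misses and then adjust.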