CPA-DA: Formal Methods for Multi-core Shared Memory Protocol Design

Gopalakrishnan, Ganesh

Abstract

The human society crucially depends on computing devices: from embedded computers in phones to peta-scale computing systems that can perform a million billion multiplications every second, and help simulate everything from car crashes to hurricanes. The performance of a computer must increase each year, without which the information-based human society will cease to advance. Unfortunately, past methods to increase the performance of a computer ? namely increasing the clock frequency and the functional unit complexity -- cease to be effective. These techniques now produce only a miniscule performance increase, while causing huge increases in the energy consumption. Already computing equipments consume more than 5% of the nation's electricity! The only available energy-efficient method of increasing computer performance is through the use of multiple central processing units (CPUs). Unfortunately, such organizations (called "multi-core CPUs") require that the accesses to the central memory be extremely efficient - requiring the use of highly complex protocols - called cache coherence protocols. Unfortunately these protocols must be hand-crafted for high performance, and hence are extremely error-prone. Previous methods to verify cache coherence protocols were already at the limits of the capabilities of verification tools. With the advent of multi-core CPUs, the complexity has become out of reach of all published techniques. The PI and his team are the only academic group to have developed techniques to verify, using mathematically sound computer algorithms, hierarchical multi-core CPU cache coherence protocols. Unfortunately, their methods to date have involved expert humans and often cause considerable tedium. The proposed methods in this proposal are expected to: (1) reduce the burden of verifying cache coherence protocols, and (2) help bridge two central abstraction gaps, thus minimizing the chances of errors in microprocessors: (i) high-level to low-level behavioral modeling gap, and (ii) the low behavioral level to hardware implementation level gap. It will help train valuable manpower - including undergraduates and under-represented groups. It will help sustain the technological momentum of the US, as the availability of sustained high performance computing power is no less important to the nation than its other basic needs such as water, clean air, and energy. The verification tools developed in this project are expected to be technology transferred to the computer industry. Last but not least, the students trained in this project will join the national and international high-technology labor force.

Project Report

This research targeted the fundamental challenge of ensuring that computing systems that are optimized for performance are also devoid of logical errors in their design. Beginning around 2005, computer manufacturers catering to virtually all application classes have adopted multiple central processing units (CPUs) in their design. Adopting this style of multicores is the only foreseeable way toward achieving high computing performance at low energy costs. Unfortunately, this massive-scale usage of parallel processing requires a significant amount of programmer training, new methods to program safely without losing efficiency, as well as incisive verification tools to detect and root-cause correctness errors. Our research made contributions in these regards in four specific areas. VERIFICATION OF HIERARCHICAL CACHE COHERENCE PROTOCOLS Multicore systems are often designed by employing a small cluster of CPU cores that are tightly interconnected and a hierarchical arrangement of such clusters into a larger ensemble. It is crucial to ensure a consistent view of memory both within the cluster as well as across the cluster. This is achieved by using suitable cache coherence protocols. Since the memory sharing patterns as well as electrical distances are different between the CPUs within a cluster as well as across the cluster, different cache coherence protocols are employed within a cluster and spanning multiple clusters. Our research developed techniques for developing ``overlapping abstractions.'' Intuitively, the degree of coupling between the the intra-cluster and intercluster protocols is limited. This observation helps us develop two simpler models, one simplifying the intra-cluster protocol and the other simplifying the inter-cluster model. We successfully demonstrated that non-trivial hierarchical cache coherence protocols can be verified using this approach. A PROVER OF USER GPU PROGRAMS The use of multiple CPU cores takes two forms in practice: (1) the use of traditional CPU cores that are eficient at handling control-flow, and (2) the use of specialized processors that are oriented toward large-scale data processing, and controlled by traditional CPUs for offloading computations. A popular type of data-oriented processor suitable for use in the ``offload model'' is the Graphical Processing Unit (GPU). When we began our research on GPU programming, it came to our attention that no verification methods that offered firm guarantees were available for GPUs. We developed a symbolic analysis method that employed the power of symbolic execution and invariant generation to ensure that GPU programs do not have insidious bugs such as data racs (conflicting read/write accesses that led to unpredictable answers). VERIFICATION METHODS FOR MULTICORE COMMUNICATION APIS As system complexity escalates, the need for multicore chips that communicate using both message passing and shared memory also escalates. A recent standardization effort by the industry has led to the creation of the multicore communications API (MCAPI). Our research helped investigate these proposed APIs and our efforts led to the creation of a dynamic analysis tool called MCC that orchestrates concurrency in these two regimes AN EXTENSIBLE UTAH MULTICORE DESIGN The last component of our research on multicore programming and correctness analysis was centered around building an actual multicore system that communicates as per MCAPI and can be made to operate on field programmable gate arrays (FPGA). This design is called eXtensible Utah Multicore (XUM), and has been released to Opencores.Org in 2011. There have, to date, been over a thousand downloads of XUM. CONCLUDING REMARKS AND OUTCOMES In the hierarchical cache coherence verification project, a female PhD student, Dr. Xiaofang Chen, graduated in 2008, working on this project. A US Citizen Mr. Michael DeLisi graduated, initially as an NSF REU and later as a BS/MS student. Mr. DeLisi was judged to be in the shortlist of the Outstanding CRA Undergraduate selection held nationally (his national ranking was 6). Industrial relationships with Intel Corporation were furthered in this collaboration, including direct involvement by Dr. Ching-Tsun Chou of Intel, Santa Clara. In the PUG project a tool Prover of User GPU program (PUG). Our work directly influenced several subsequent efforts on GPU program correctness checking, including efforts at UC San Diego (``Test Amplificiation'') and a collaboration between Imperial College and Microsoft Research (``GPUVerify''). A PhD student, namely Dr. Guodong Li, graduated while working on PUG. Three papers resulted from this work. This project had direct involvement of Nvidia Inc. in collaborative research, giving us directions. We released the Multicore Communications Checker (MCC) tool, and also the graduation of PhD student Subodh Sharma. Three papers directly resulted from the effort on MCC. This project had direct involvement of Freescale Inc. The eXtensible Utah Multicore (XUM) design was worked on by two students (both US Citizens), namely Benjamin Meakin and Grant Ayers (initially as an NSF REU and subsequently as a BS/MS student; then joining Stanford University as a PhD student). Grant Ayers received Honorable Mention from the CRA Outstanding Undergraduates selection oprocess of 2011. The design of XUM is available on Opencores.org and is perhaps the only MIPS processor with full functionality needed for use with modern compilers and operating systems. One PhD student (Vishal Sharma) and undergraduate REU student Kevin Avery are working on extending XUM. Two additional undergraduates (both US citizens, namely Michael Scutt and Montgomery Carter) have expressed interest in furthering XUM. This project had direct involvement of Freescale Inc. in our research.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Application #: 0811429
Program Officer: Sankar Basu

Project Start
Project End
Budget Start: 2008-07-01
Budget End: 2013-06-30
Support Year
Fiscal Year: 2008
Total Cost: $298,000
Indirect Cost

CPA-DA: Formal Methods for Multi-core Shared Memory Protocol Design
Gopalakrishnan, Ganesh
University of Utah, Salt Lake City, UT, United States

Abstract

Project Report

Funding Agency

Institution

Comments

Recent in Grantomics:

Recently viewed grants:

Recently added grants:

Abstract

Project Report

Funding Agency

Institution

Comments