High-end embedded systems such as smart phones, game consoles, GPS-enabled automotive systems, home entertainment centers, and other "ambient intelligence" systems are becoming increasingly important in everyday life. Making such systems energy-efficient presents new challenges with broad implications for the economy and the environment. Such high-end embedded systems are multicore architectures, which require management of resources such as memory connectivity and scheduling. This proposal investigates the energy implications of system-level concurrency issues in high-end embedded systems that are not limited by real-time constraints. In particular, it aims to develop energy-efficient techniques for synchronizing memory accesses and to understand the optimal division of tasks between hardware and software.

Embedded systems are an integral component of modern life and a continually growing market. As the computational needs of the products in this market become more sophisticated, it will be increasingly challenging to meet the tight constraints these systems impose. Improvements in the performance, and in particular the energy efficiency, of such devices would have a substantial impact in terms of improved functionality, device longevity, and resource conservation. This proposal involves collaboration between two disciplines, computer engineering and computer science, and two institutions. Broader impacts of the proposal include the development of workshops focused on multicore and parallel computing, with special emphasis on encouraging women and under-represented minorities to participate. In addition, the findings of this project will be integrated into existing courses, specifically aiming to introduce cross-cutting issues between the computer science and engineering courses.

Project Report

High-end embedded systems such as smart phones, game consoles, GPS-enabled automotive systems, and other "ambient intelligence" systems are becoming increasingly important in everyday life. Making such systems energy-efficient presents new challenges with broad implications for the economy and the environment. Such high-end embedded systems are multicore architectures, which require management of resources such as memory connectivity and scheduling. This project investigates the energy implications of system-level concurrency issues in high-end embedded multicore systems. In particular, it aims to develop energy-efficient techniques for synchronizing memory accesses and to understand the optimal division of tasks between hardware and software.

Because energy consumption and complexity are driving concerns in the design of embedded systems, we focused on adapting simple hardware transactional memory (HTM) schemes in our architectural designs. Several different cache structures and contention management schemes to support HTM were proposed and evaluated in terms of energy, performance, and complexity. Many proposals involving hardware transactional memory require modifications to the underlying cache coherency protocol, which can have undesirable side effects. In this project, we proposed a specialized hardware module, known as the Bloom Filter Module, that decouples conflict detection from cache coherency. Because it is separate from the cache coherence protocol of the processor cores, the Bloom Filter Module configuration requires much simpler logic. An additional benefit of the centralized nature of the Bloom Filter Module is that it can use any algorithm to decide which cores to abort in case of a data conflict.

Ease of use and transparency of the internal functionality are first-class design constraints of the embedded transactional memory system.
As such, we considered an integrated hardware/software solution for transactional programming on embedded multiprocessor systems. This involved developing a low-level transactional application programming interface, implemented as an extension to OpenMP, that allows us to support speculative task-level and data-level parallelism.

Finally, in the last year of the project we explored how speculative synchronization could be deployed for applications that were written using conventional non-speculative constructs (e.g., locks). This approach, known as "speculative lock elision", is appealing because it promises to increase concurrency without the need to retrofit code, so programmers can take full advantage of the underlying speculative hardware support even when running code written using traditional locks. While other researchers have previously proposed the idea of speculative lock elision (and transactional lock removal), our work differs in two key respects. First, it explores a lightweight hardware solution that is evaluated not just in terms of improved throughput, but also in terms of energy efficiency, which is particularly important for embedded platforms. Second, we explore flexible contention management alternatives that go beyond these other proposals.

Outcomes: In our initial simulation results, we found that even simple transactional memory designs outperform locking with respect to both energy and performance. The level of improvement is workload-dependent; however, overall our experimental findings show that ignoring energy considerations can lead to poor design choices, particularly for resource-constrained embedded platforms. When evaluating our Bloom Module, we found that for benchmarks that spend any significant amount of time executing transactions, our transactional memory scheme achieved significantly better performance and energy results than locking.
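
The idea of a centralized module that detects conflicts outside the cache coherence protocol can be illustrated with a small software model. This is a minimal sketch, not the project's hardware design: the names `BloomSignature`, `BloomFilterModule`, `abort_requester`, and `abort_others`, the 256-bit signature size, the two BLAKE2b-based hash functions, and the use of separate read and write signatures per core are all assumptions made for illustration.

```python
import hashlib


class BloomSignature:
    """Fixed-size Bloom filter summarizing a set of memory addresses."""

    def __init__(self, num_bits=256, num_hashes=2):  # sizes are assumptions
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, address):
        # Derive one bit position per hash function by salting BLAKE2b.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                address.to_bytes(8, "little"), digest_size=8, salt=bytes([i])
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, address):
        for pos in self._positions(address):
            self.bits |= 1 << pos

    def may_contain(self, address):
        # May report false positives, never false negatives.
        return all((self.bits >> pos) & 1 for pos in self._positions(address))

    def clear(self):
        self.bits = 0


def abort_requester(requester, conflicting):
    """Hypothetical policy: the core that triggered the conflict aborts."""
    return {requester}


def abort_others(requester, conflicting):
    """Hypothetical policy: the requester wins, conflicting cores abort."""
    return set(conflicting)


class BloomFilterModule:
    """Central conflict detector holding one read/write signature per core."""

    def __init__(self, num_cores, policy=abort_requester):
        self.read_sigs = [BloomSignature() for _ in range(num_cores)]
        self.write_sigs = [BloomSignature() for _ in range(num_cores)]
        self.policy = policy  # any function choosing which cores abort

    def access(self, core, address, is_write):
        """Check a transactional access; return the set of cores to abort."""
        conflicting = set()
        for other in range(len(self.read_sigs)):
            if other == core:
                continue
            # Any access conflicts with another core's write;
            # a write also conflicts with another core's read.
            hit = self.write_sigs[other].may_contain(address)
            if is_write:
                hit = hit or self.read_sigs[other].may_contain(address)
            if hit:
                conflicting.add(other)
        victims = self.policy(core, conflicting) if conflicting else set()
        for victim in victims:
            self.end_transaction(victim)  # aborted cores lose their sets
        if core not in victims:
            sigs = self.write_sigs if is_write else self.read_sigs
            sigs[core].add(address)
        return victims

    def end_transaction(self, core):
        """Clear a core's signatures on commit or abort."""
        self.read_sigs[core].clear()
        self.write_sigs[core].clear()
```

Because the module sees every core's signatures in one place, the abort decision reduces to a single pluggable function, and the cores themselves need no changes to their coherence logic. In this model each access is checked against the signatures of all other cores.
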
However, as the number of cores increases, we found that the contribution of the Bloom Module to the total energy consumption also increases. This is expected, since the number of hashing operations grows linearly with the number of cores. In the worst case (i.e., with 8 cores) the Bloom Module consumes about 5% of the system energy.

As an integrated HW/SW solution, our experimental results confirm that our transactional memory system is a viable and cost-effective solution for embedded multiprocessor systems in terms of energy, performance, and productivity. Thanks to the low-overhead and highly reactive hardware support of our system, we are capable of extracting high degrees of parallelism across tasks in an energy-efficient manner. We also show high performance gains for data-level parallelism when most of the transactions' execution time is spent outside dependent program regions. If dependent code dominates execution time, our system minimizes the impact on performance by promptly serializing frequently conflicting transactions.

Finally, we found that our implementation of speculative lock elision can improve the energy-delay product for most of the benchmarks under many of the configurations we consider, especially when more than two cores are available. The benefits of speculation are sensitive to the critical section size, the degree of lock contention, the retry policy, and the underlying hardware transactional memory's contention management policy. We found that placing cores in sleep mode rather than spinning on the lock increases runtime, but that this is compensated for by energy savings in most cases. We also concluded that some of the parameters (such as contention management and retry policy) are workload-dependent, and there is no clear winning choice.
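
For reference, the energy-delay product used above is conventionally defined as energy consumed multiplied by execution time. The symbols below are introduced here for illustration and do not come from the report; the comparison assumes that the sleep-versus-spin trade-off is judged on energy-delay product, which the report does not state explicitly.

```latex
\mathrm{EDP} = E \cdot T
\qquad\text{and}\qquad
E_{\text{sleep}}\, T_{\text{sleep}} < E_{\text{spin}}\, T_{\text{spin}}
\;\Longleftrightarrow\;
\frac{E_{\text{sleep}}}{E_{\text{spin}}} < \frac{T_{\text{spin}}}{T_{\text{sleep}}}
```

Under that assumption, sleeping is the better choice whenever the fractional reduction in energy outweighs the fractional increase in runtime.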

Budget Start: 2009-08-01
Budget End: 2013-07-31
Fiscal Year: 2009
Total Cost: $268,892
Name: Brown University
City: Providence
State: RI
Country: United States
Zip Code: 02912