The project will explore novel software technologies to improve the time to solution of applications by increasing memory efficiency on emerging distributed systems. The approaches to be pursued include a combination of simulation and hardware assists. The memory performance of distributed simulations must improve if the benefits of emerging high-end systems are to be realized. To hide the effects of memory latency, the complexity of the systems used as building blocks (e.g., SMPs) in large-scale clusters has increased substantially. The combination of memory speed, node complexity, and system scale often leads to poor performance. Years of intense research aimed at improving the performance of applications on parallel and distributed systems have led to average efficiencies of only 5-10%. The continued exponential increase in complexity makes maintaining even these efficiencies through tuning challenging; improving efficiency dramatically will require innovation. The project will improve distributed simulation performance on SMP-based clusters through the creation of a new set of software tools that provide access to coordinated hardware counter information across system components (e.g., CPU and NIC). Preliminary results indicate these multi-component techniques are inherently scalable while providing system-wide memory performance information previously unavailable without the aid of sophisticated, intrusive software profiling or simulation.
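To make the counter-based approach concrete, the following is a minimal sketch of per-node CPU-side counter sampling, assuming a PAPI-style hardware counter interface; the project's actual tools and the NIC-side counter interface are not specified here, so the event choices are illustrative only.

```c
/*
 * Minimal sketch: sample CPU hardware counters around a profiled region,
 * assuming a PAPI-style interface. NIC-side counters would be gathered
 * separately and correlated with these readings by the framework.
 */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

int main(void)
{
    int event_set = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        exit(EXIT_FAILURE);

    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_L2_TCM);   /* L2 cache misses (illustrative) */
    PAPI_add_event(event_set, PAPI_TOT_CYC);  /* total cycles */

    PAPI_start(event_set);
    /* ... region of the distributed simulation being profiled ... */
    PAPI_stop(event_set, counts);

    printf("L2 misses: %lld, cycles: %lld\n", counts[0], counts[1]);
    return 0;
}
```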
The project will leverage hardware profiling, parallel and distributed performance evaluation, statistical data reduction and analysis, analytical modeling techniques, and tool development with emerging hardware counter technologies on commodity CPUs and NICs to produce a software framework for locality-aware application profiling, analysis, and optimization. The resulting framework will provide a complete picture of local and remote memory accesses in large-scale, high-end distributed systems.
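As a sketch of how coordinated CPU- and NIC-side counts might feed a locality-aware view, the fragment below combines the two streams into a simple locality ratio. The counter names and the structure holding them are hypothetical stand-ins for whatever coordinated counter interface the framework ultimately exposes.

```c
/*
 * Sketch: derive a per-node locality metric from two counter streams,
 * CPU-side local memory accesses and NIC-side remote accesses. The
 * field names and values are hypothetical and for illustration only.
 */
#include <stdio.h>

/* Hypothetical readings gathered on one node over a profiled interval. */
struct mem_counts {
    long long local_accesses;   /* e.g., CPU memory-bus transactions */
    long long remote_accesses;  /* e.g., NIC DMA reads/writes for this node */
};

/* Fraction of memory traffic that stayed on the local node. */
static double locality_ratio(const struct mem_counts *c)
{
    long long total = c->local_accesses + c->remote_accesses;
    return total ? (double)c->local_accesses / (double)total : 1.0;
}

int main(void)
{
    struct mem_counts c = { 900000, 100000 };  /* illustrative values only */
    printf("locality = %.2f\n", locality_ratio(&c));
    return 0;
}
```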