The goal of this project is to improve the utilization of HPC machines at NSF centers and elsewhere via a lightweight performance profiling tool that can identify performance bottlenecks in full-scale applications during production runs. Investments by NSF in Tier1 and Tier2 computers, as well as the ever-growing popularity of smaller clusters in university and industrial settings, offer tremendous opportunities for new scientific discoveries through computational science. Yet experience suggests that many users do not make effective use of these machines, often relying on algorithms, programming tools, or libraries that encounter removable performance bottlenecks. High concurrency, complex processor architectures, fragile compiler optimizations, low-degree network topologies, deep memory hierarchies, load imbalance, and unpredictable performance due to OS noise are among the architectural and system features that make performance bottlenecks simultaneously easy to encounter and hard to find.

We propose research and development to provide users and system administrators with a tool for identifying performance bottlenecks in production. We will extend and deploy our Integrated Performance Monitoring (IPM) tool for identifying communication bottlenecks, memory system bottlenecks, load imbalances, and other performance problems on systems ranging from small clusters to the petascale. We developed IPM as an ultra-lightweight performance profiling system, and the current version is in use at NSF, DOE, and DOD HPC centers. IPM has unique features that make it effective for ongoing monitoring of application performance by system administrators as well as application scientists. The key features of IPM include: a performance profiling strategy that is highly scalable and perturbs performance by less than 5%; integration with a performance database that allows easy and immediate comparisons across application runs and users; and ease of use, requiring no recompilation of application codes. Via further development we will provide: 1) a tool for capturing a program's performance data, with special emphasis on low overhead and scalability to millions of processors; 2) easy-to-understand application profiles that capture communication volumes and patterns, processor and memory system counter information, and topology-aware counters from network adapters and switches; 3) a database backend for workload characterization and architecture analytics; and 4) support for community-driven enhancements through our portable, extensible, open-source software.
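
To illustrate the kind of mechanism that makes low-overhead profiling possible without recompiling an application, the sketch below interposes on MPI calls through the standard MPI profiling interface (PMPI) from a preloaded or pre-linked shared library. The wrapper structure and counter names here are illustrative assumptions for exposition, not IPM's actual internals.

    /* Minimal sketch of MPI profiling-interface (PMPI) interposition, the
     * standard mechanism lightweight profilers use to observe MPI calls
     * without recompiling the application. Build as a shared library and
     * load it with LD_PRELOAD, or link it ahead of the MPI library. */
    #include <mpi.h>
    #include <stdio.h>

    static long long send_calls  = 0;   /* number of MPI_Send invocations */
    static long long send_bytes  = 0;   /* total payload volume in bytes  */
    static double    send_time_s = 0.0; /* time spent inside MPI_Send     */

    /* Wrapper with the same signature as MPI_Send (MPI-3); the real work
     * is done by the name-shifted entry point PMPI_Send that every MPI
     * library provides. */
    int MPI_Send(const void *buf, int count, MPI_Datatype type,
                 int dest, int tag, MPI_Comm comm)
    {
        int size = 0;
        double t0 = MPI_Wtime();
        int rc = PMPI_Send(buf, count, type, dest, tag, comm);
        send_time_s += MPI_Wtime() - t0;

        PMPI_Type_size(type, &size);
        send_calls += 1;
        send_bytes += (long long)count * size;
        return rc;
    }

    /* Report per-rank totals when the application shuts MPI down. */
    int MPI_Finalize(void)
    {
        int rank = 0;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d: %lld MPI_Send calls, %lld bytes, %.3f s\n",
               rank, send_calls, send_bytes, send_time_s);
        return PMPI_Finalize();
    }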

Intellectual Merit: We will extend IPM's breadth by making it run on more and larger machines and by including additional important performance information. We will thereby enable domain scientists to pinpoint performance issues affecting their applications running on machines with deep memory hierarchies, complex network topologies, and hierarchical parallelism. We will help scientists quickly answer questions such as, "What are the factors affecting the performance of my scientific application?" In addition, our infrastructure will answer fundamental questions about the benefits of architectural features such as one-sided communication, high-degree networks, memory system structures, and processor accelerators. It will also support application performance analysis across petascale systems, automatic and manual performance tuning, and "in situ" analysis of algorithm scalability using a full machine and real input data.

Broader Impact: Through our scalable, portable, and extensible approach, we will bring transparency to performance analysis with low overhead. We will enable all HPC stakeholders to assess and improve both applications and architectures, educate users on performance features, and ensure that parallel machines are used productively to answer basic questions in science and engineering. In addition, by fostering a close working relationship between domain scientists, NSF centers, and HPC vendors, this project will educate students trained in the many facets that impact HPC software and hardware design. A byproduct will be an increased understanding of how to make optimal use of the current and upcoming NSF HPC Tier1 and Tier2 systems portfolio.
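
As a concrete illustration of the processor and memory-system counter data referred to above, the sketch below gathers hardware counters around a code region using the PAPI library. The chosen events, the stand-in kernel, and the reported quantities are illustrative assumptions; event availability varies by hardware.

    /* Minimal sketch: gather processor/memory hardware counters around a
     * code region with PAPI. Event choices are illustrative and depend on
     * what the underlying hardware exposes. */
    #include <papi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for an application region of interest. */
    static void compute_kernel(double *a, int n)
    {
        for (int i = 1; i < n; i++)
            a[i] = 0.5 * (a[i] + a[i - 1]);
    }

    int main(void)
    {
        int evset = PAPI_NULL;
        long long counts[2];
        int n = 1 << 20;
        double *a = calloc(n, sizeof(double));

        if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
            fprintf(stderr, "PAPI init failed\n");
            return 1;
        }
        PAPI_create_eventset(&evset);
        PAPI_add_event(evset, PAPI_TOT_CYC);  /* total cycles         */
        PAPI_add_event(evset, PAPI_L2_TCM);   /* level-2 cache misses */

        PAPI_start(evset);
        compute_kernel(a, n);
        PAPI_stop(evset, counts);

        printf("cycles = %lld, L2 misses = %lld\n", counts[0], counts[1]);
        free(a);
        return 0;
    }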

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 0721397
Program Officer: Daniel Katz
Project Start:
Project End:
Budget Start: 2007-09-01
Budget End: 2012-08-31
Support Year:
Fiscal Year: 2007
Total Cost: $1,963,548
Indirect Cost:
Name: University of California San Diego
Department:
Type:
DUNS #:
City: La Jolla
State: CA
Country: United States
Zip Code: 92093