This NSF award to the University of Illinois at Urbana-Champaign and the University of Tennessee at Knoxville funds U.S. researchers participating in a project competitively selected by the G8 Research Councils Initiative on Multilateral Research through the Interdisciplinary Program on Application Software towards Exascale Computing for Global Scale Issues. This is a pilot collaboration among the U.S. National Science Foundation, the Canadian Natural Sciences and Engineering Research Council (NSERC), the French Agence Nationale de la Recherche (ANR), the German Deutsche Forschungsgemeinschaft (DFG), the Japan Society for the Promotion of Science (JSPS), the Russian Foundation for Basic Research (RFBR), and the United Kingdom Research Councils (RC-UK), supporting collaborative research projects, selected on a competitive basis, composed of researchers from at least three of the partner countries.

This interdisciplinary project across six countries focuses on three research topics that address limitations in numerical modeling of the physics, chemistry, and biology in the NCAR Community Earth System Model Version 1 (CESM1) and similar codes used by other countries. These research topics include new approaches to resilience, node-level optimization, and system-level scalability. This research will enable the development of more scalable model ensembles, which will allow better evaluation of climate sensitivity and climate feedback processes, better quantification of model uncertainty, and a better understanding of the effects of natural variability.

The project will provide essential knowledge toward scaling climate codes to exascale, and thus reducing current uncertainties about climate evolution; it will foster interactions between computer scientists and climate scientists; it will foster international collaborations in the areas of climate simulation and exascale computing; and it will educate a new generation of researchers who understand both the application domain of climate simulation and high-performance computing.

Project Report

Policy decisions for both mitigating and adapting to climate change are subjects of great discussion in the G8 countries and throughout the world. Uninformed decisions, including decisions to take no action, could have a very high cost in money and lives. It is therefore essential to reduce, as soon as possible, the current uncertainties about future climate change. Numerical models of the physics, chemistry, and biology affecting the Earth-atmosphere climate system are key tools for these projections. Today, even as we prepare to run complex models of the Earth's climate system on petascale machines, we realize that, despite the extensive capabilities petascale will enable, a number of critical limitations in modeling the climate system require exascale capability.

One of the areas where research is necessary to allow climate models to produce meaningful results is fault tolerance. The size and structure of future large-scale architectures increase the likelihood that a larger number of soft and hard errors will affect an application's execution, especially for long-running applications like the NCAR Community Earth System Model Version 1 (CESM1). Because of the complexity of this climate model and the large number of mathematical algorithms involved, a holistic, multi-stage approach is necessary. The expectation is that no single software approach, whether system-level (checkpoint/restart) or application-level (algorithm-based), is likely to succeed in all respects, pointing instead toward hybrid approaches in which distinct solutions are applied at each execution stage.

One of the most critical requirements of the NCAR codes used in this project is bit-wise reproducibility of the scientific result. While such a requirement is somewhat unusual in the scientific community, where some margin of error is generally tolerable, in the context of this project it cannot be underestimated. Unfortunately, such a strict constraint on the numerical result automatically disqualifies algorithm-based fault tolerance, where a trade-off between numerical stability and performance is a crucial piece of the puzzle. Because of the bit-wise reproducibility requirement, instead of making application recovery the major component, we loosened the requirements and focused on a simpler, yet important, target: accurately validating partial results, bounded by a predefined accuracy, during the application's execution. Such protection is already a significant step beyond today's practice, where the result is validated only upon completion of the computation. Thus, instead of building algorithms capable of tolerating hard and soft errors, we build validators capable of soft-error detection as a first step, supplemented with checkpoint/restart-based approaches that allow the application to restart from correct data (data that will lead to a bit-wise reproducible result).

The interest of this approach is twofold. On one side, the validator can stop the execution of the current instance of the climate application as soon as the mathematical condition is not satisfied; this ensures that any execution that reaches completion is a valid execution and that the result is indeed usable.
On the other side, by stopping the execution early, we minimize the cost, in terms of energy and time, added to the time-to-solution of the target application. Building on the observation that the cost of execution can be improved, we began constructing models that represent the different fault-management approaches. By instantiating these models for a known architecture, we can estimate the costs and overheads of resilience approaches based not only on the execution environment but also on intrinsic properties of the application. We also began studying ways to aggregate several methods, supplementing error detection with restart capabilities. Moreover, with these models we can not only understand what happens on existing platforms, but also predict what will happen on future platforms and how a specific application will behave under particular hardware constraints.
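To make the validator-plus-checkpoint idea concrete, the following is a minimal sketch; the invariant, tolerance, and time-stepping function are hypothetical placeholders, not taken from CESM1 or the project's actual software.

```python
# A minimal sketch of the validator-plus-checkpoint idea described above.
# The invariant, tolerance, and step function are illustrative assumptions,
# not taken from CESM1 or the project's actual code.
import pickle

def run_with_validation(state, step, invariant, tol, n_steps, ckpt_every,
                        ckpt_path="checkpoint.pkl"):
    """Advance `state` with `step`, checking a conserved `invariant` each step.

    If the invariant drifts beyond `tol` (a possible soft error), the run is
    rolled back to the last validated checkpoint instead of continuing with
    corrupted data, so any run that completes satisfies the condition.
    """
    reference = invariant(state)                  # value the invariant should keep
    with open(ckpt_path, "wb") as f:
        pickle.dump((0, state), f)                # initial, validated checkpoint
    i = 0
    while i < n_steps:
        state = step(state)
        i += 1
        if abs(invariant(state) - reference) > tol:
            with open(ckpt_path, "rb") as f:      # validation failed: roll back
                i, state = pickle.load(f)
            continue
        if i % ckpt_every == 0:                   # checkpoint only validated data
            with open(ckpt_path, "wb") as f:
                pickle.dump((i, state), f)
    return state
```

In the project's setting the mathematical condition would come from the climate solver itself and the checkpoint/restart layer from a system-level facility; the sketch only illustrates how early detection confines the damage of a silent error to the interval since the last validated checkpoint.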
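The report does not spell out the cost models themselves. As an illustration of the kind of estimate such models produce, the snippet below uses the classic first-order Young/Daly approximation for periodic checkpoint/restart, in which the optimal checkpoint period follows from the checkpoint cost and the platform's mean time between failures; the numbers in the example are arbitrary.

```python
# An illustrative cost model of the kind described above: the classic
# first-order Young/Daly approximation for periodic checkpoint/restart.
# It is not the project's actual model; it only shows how an execution-time
# overhead can be estimated from platform and application parameters.
import math

def optimal_checkpoint_period(ckpt_cost_s, mtbf_s):
    """Young's first-order optimum: T_opt ~ sqrt(2 * C * MTBF)."""
    return math.sqrt(2.0 * ckpt_cost_s * mtbf_s)

def expected_overhead(period_s, ckpt_cost_s, mtbf_s, restart_cost_s):
    """Fraction of wall time lost to checkpoints, restarts, and re-computation."""
    ckpt_overhead = ckpt_cost_s / period_s                # time spent writing checkpoints
    waste_per_failure = restart_cost_s + period_s / 2.0   # average rework after a failure
    return ckpt_overhead + waste_per_failure / mtbf_s     # failures amortized over MTBF

# Example: 5-minute checkpoints, 10-minute restarts, 1-day system MTBF.
T = optimal_checkpoint_period(300, 86400)
print(T / 60.0, expected_overhead(T, 300, 86400, 600))
```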

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Application #: 1063019
Program Officer: Daniel Katz
Project Start:
Project End:
Budget Start: 2011-03-01
Budget End: 2014-02-28
Support Year:
Fiscal Year: 2010
Total Cost: $150,000
Indirect Cost:
Name: University of Tennessee Knoxville
Department:
Type:
DUNS #:
City: Knoxville
State: TN
Country: United States
Zip Code: 37916