Trends in execution concurrency make a compelling case for methods that automatically and efficiently model and mitigate irreproducibility beyond petascale architectures and into the exascale. High performance computers at the exascale are expected to exhibit a level of concurrency a factor of 10,000 greater than on current platforms, which will move computer simulations from bulk-synchronous executions to multithreaded approaches and asynchronous I/O. Simulation calculations and analysis routines will also be tightly coupled on exascale platforms, so both workflow components must operate at extremely high levels of concurrency. As concurrency levels increase, the impact of rounding errors on numerical reproducibility also increases, ultimately limiting the ability of scientific simulations to reproduce program executions and numerical results. Under these circumstances, a scientific community that expects reproducible behavior may not trust irreproducible results, and any attempt to pursue reproducibility may come at too high a cost in performance.
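The link between concurrency and rounding error can be illustrated with a small sketch. Floating-point addition is not associative, so a reduction that is split across a different number of workers may combine the same inputs in a different order and return a different result. The function names and the input values below are hypothetical and chosen only for illustration; the "workers" are simulated sequentially and no real threads are involved.

```python
import math


def naive_sum(values):
    """Plain left-to-right accumulation in double precision."""
    total = 0.0
    for x in values:
        total += x
    return total


def chunked_sum(values, n_workers):
    """Mimic a parallel reduction: each hypothetical worker sums one
    contiguous chunk, then the partial sums are combined in order."""
    size = math.ceil(len(values) / n_workers)
    partials = [naive_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return naive_sum(partials)


# Illustrative input whose exact sum is 2.0.
values = [1e16, 1.0, -1e16, 1.0]

one_worker = chunked_sum(values, 1)
two_workers = chunked_sum(values, 2)
print(one_worker, two_workers)
```

With this input the two decompositions disagree with each other and with the exact sum of 2.0, even though every individual addition is correctly rounded. The explicit loop is used instead of the built-in `sum` because some Python versions apply a compensated algorithm to floats, which would hide the effect.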

This "high risk-high payoff" project studies the impact of rounding errors on result reproducibility when the number of concurrent executions surges and workflow determinism vanishes on cutting-edge multicore architectures. To this end, the project models rounding errors in scientific applications with a mathematical method called "composite precision floating-point arithmetic" and shows how this method can mitigate error drift. A benchmark suite used in preliminary work is extended to cover a wider range of application patterns and is used to assess the mitigating effect of composite precision on new generations of multicore architectures. Lastly, the project quantifies the cost of the proposed method and how much it mitigates error propagation across the diverse benchmarks and platforms.
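The abstract does not define composite precision in detail. As a minimal sketch of the general idea, and assuming that a number is carried as a value together with an accumulated rounding-error term, the code below uses Knuth's two-sum error-free transformation to track the error of each addition and fold it back in at the end. The names `two_sum` and `composite_sum` are hypothetical, and this is not presented as the project's actual implementation.

```python
def two_sum(a, b):
    """Knuth's two-sum: return the rounded sum s and the rounding
    error err such that a + b == s + err exactly (barring overflow)."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err


def composite_sum(values):
    """Accumulate a (value, error) pair and combine the two at the end.
    Assumption: this pairing stands in for a composite-precision number."""
    total, err = 0.0, 0.0
    for x in values:
        total, e = two_sum(total, x)
        err += e  # the error term is itself accumulated in double precision
    return total + err


# Illustrative input whose exact sum is 2.0, summed in two different orders.
values = [1e16, 1.0, -1e16, 1.0]
forward = composite_sum(values)
backward = composite_sum(list(reversed(values)))
print(forward, backward)
```

For this input both orderings recover the exact sum, whereas plain left-to-right accumulation does not. The extra floating-point operations per addition are the kind of overhead the project sets out to quantify against the mitigation gained.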

The project will advance knowledge and understanding in numerical reproducibility at the exascale by developing and disseminating effective software solutions to the rounding error propagation problem for a broad set of applications and their codes when executed with high degrees of concurrency on massively parallel systems.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Communication Foundations (CCF)
Type: Standard Grant (Standard)
Application #: 1446794
Program Officer: Almadena Chtchelkanova
Project Start:
Project End:
Budget Start: 2014-06-15
Budget End: 2016-05-31
Support Year:
Fiscal Year: 2014
Total Cost: $89,998
Indirect Cost:

Name: University of Delaware
Department:
Type:
DUNS #:
City: Newark
State: DE
Country: United States
Zip Code: 19716