Large-scale scientific simulations are used in a range of application domains, including materials science, climate modeling and combustion. These simulations are often limited by the hardware on which they run, including the capacity of computational platforms and storage systems. Their goals may include finding rare but interesting events in the simulation output, discovering common sequences of events, and discovering causality among events. Often, a scientific simulation consumes all available computational resources on a high-performance computing platform during a run and is forced to rely on data sampling techniques to reduce the size of its output so that the data can be stored, transferred and post-processed. Such sampling reduces the quality of the science results, since not all available data are used during analysis. This project aims to greatly improve the scale and quality of scientific simulation results using innovative "in situ" algorithms and machine learning techniques for rare event detection. The research will be validated using a large-scale materials science simulation of a self-healing nanomaterial system capable of sensing and repairing damage in harsh chemical environments and under high-temperature, high-pressure operating conditions. Self-healing is significant because it can improve the reliability and lifetime of materials while reducing the cost of manufacturing, monitoring and maintaining high-temperature turbines, wind and solar energy systems, and lighting systems. The research can be generalized to other scientific simulation domains that share the common goals of discovering rare and interesting events, sequences of events, and causality among events. Finally, the research concepts and results will be incorporated into graduate-level courses taught by the research team.
The goal of the project is to demonstrate the feasibility, performance and scalability of the proposed research approaches for greatly improving the quality of exascale scientific simulations using in situ machine learning algorithms within a well-defined, reusable in situ software framework. The scope of the project includes selecting a simplified but representative long-time material process suitable for super-state parallel replica dynamics (SPRD); developing in situ machine learning algorithms for rare-event detection of super-state transitions; and studying library-based approaches that support high-performance coupling of exascale simulations with in situ machine learning algorithms. To accomplish the project goals, three objectives are defined: 1) prove the feasibility, performance and scalability of in situ SPRD simulation for predicting long-time material processes; 2) prove the feasibility, performance and scalability of in situ machine learning algorithms for rare-event detection of super-state transitions; and 3) prove the feasibility, performance and scalability of in situ library-based approaches to coupling exascale simulations and machine learning algorithms.