A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations
Authors:
Nils Kohl,
Johannes Hötzer,
Florian Schornbaum,
Martin Bauer,
Christian Godenschwager,
Harald Köstler,
Britta Nestler,
Ulrich Rüde
Abstract:
Realistic simulations in engineering or in the materials sciences can consume enormous computing resources and thus require the use of massively parallel supercomputers. The probability of a failure increases both with the runtime and with the number of system components. For future exascale systems it is therefore considered critical that strategies are developed to make software resilient agains…
▽ More
Realistic simulations in engineering or in the materials sciences can consume enormous computing resources and thus require the use of massively parallel supercomputers. The probability of a failure increases both with the runtime and with the number of system components. For future exascale systems it is therefore considered critical that strategies are developed to make software resilient against failures. In this article, we present a scalable, distributed, diskless, and resilient checkpointing scheme that can create and recover snapshots of a partitioned simulation domain. We demonstrate the efficiency and scalability of the checkpoint strategy for simulations with up to $40$ billion computational cells executing on more than $400$ billion floating point values. A checkpoint creation is shown to require only a few seconds and the new checkpointing scheme scales almost perfectly up to more than $260\,000$ ($2^{18}$) processes. To recover from a diskless checkpoint during runtime, we realize the recovery algorithms using ULFM MPI. The checkpointing mechanism is fully integrated in a state-of-the-art high-performance multi-physics simulation framework. We demonstrate the efficiency and robustness of the method with a realistic phase-field simulation originating in the material sciences and with a lattice Boltzmann method implementation.
△ Less
Submitted 29 January, 2018; v1 submitted 28 August, 2017;
originally announced August 2017.
Massively Parallel Phase-Field Simulations for Ternary Eutectic Directional Solidification
Authors:
Martin Bauer,
Johannes Hötzer,
Philipp Steinmetz,
Marcus Jainta,
Marco Berghoff,
Florian Schornbaum,
Christian Godenschwager,
Harald Köstler,
Britta Nestler,
Ulrich Rüde
Abstract:
Microstructures forming during ternary eutectic directional solidification processes have significant influence on the macroscopic mechanical properties of metal alloys. For a realistic simulation, we use the well established thermodynamically consistent phase-field method and improve it with a new grand potential formulation to couple the concentration evolution. This extension is very compute in…
▽ More
Microstructures forming during ternary eutectic directional solidification processes have significant influence on the macroscopic mechanical properties of metal alloys. For a realistic simulation, we use the well established thermodynamically consistent phase-field method and improve it with a new grand potential formulation to couple the concentration evolution. This extension is very compute intensive due to a temperature dependent diffusive concentration. We significantly extend previous simulations that have used simpler phase-field models or were performed on smaller domain sizes. The new method has been implemented within the massively parallel HPC framework waLBerla that is designed to exploit current supercomputers efficiently. We apply various optimization techniques, including buffering techniques, explicit SIMD kernel vectorization, and communication hiding. Simulations utilizing up to 262,144 cores have been run on three different supercomputing architectures and weak scalability results are shown. Additionally, a hierarchical, mesh-based data reduction strategy is developed to keep the I/O problem manageable at scale.
△ Less
Submitted 4 June, 2015;
originally announced June 2015.