Search | arXiv e-print repository

doi 10.1016/j.parco.2022.102904

Scalable communication for high-order stencil computations using CUDA-aware MPI

Authors: Johannes Pekkilä, Miikka S. Väisälä, Maarit J. Käpylä, Matthias Rheinhardt, Oskar Lappi

Abstract: Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, as arithmetic performance has been observed to increase at a faster rate relative to memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated with the introduction of graphics processing units, which can provide by multiple factors higher throughput in data-parallel tasks than central processing units. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge-Kutta integration. We put particular focus on improving intra-node locality of workloads. Our GPU implementation scales strongly from one to $64$ devices at $50\%$--$87\%$ of the expected efficiency based on a theoretical performance model. Compared with a multi-core CPU solver, our implementation exhibits $20$--$60\times$ speedup and $9$--$12\times$ improved energy efficiency in compute-bound benchmarks on $16$ nodes.

Submitted 10 May, 2022; v1 submitted 2 March, 2021; originally announced March 2021.

Comments: 15 pages, 15 figures. Updated with the accepted manuscript. More extensive tests added and wording clarified in several places. Please refer to the published article for the most polished version

Journal ref: Parallel Computing, Volume 111, 2022, 102904

arXiv:2012.08758 [pdf, other]

doi 10.3847/1538-4357/abceca

Interaction of large- and small-scale dynamos in isotropic turbulent flows from GPU-accelerated simulations

Authors: Miikka S. Väisälä, Johannes Pekkilä, Maarit J. Käpylä, Matthias Rheinhardt, Hsien Shang, Ruben Krasnopolsky

Abstract: Magnetohydrodynamical (MHD) dynamos emerge in many different astrophysical situations where turbulence is present, but the interaction between large-scale (LSD) and small-scale dynamos (SSD) is not fully understood. We performed a systematic study of turbulent dynamos driven by isotropic forcing in isothermal MHD with magnetic Prandtl number of unity, focusing on the exponential growth stage. Both helical and non-helical forcing was employed to separate the effects of LSD and SSD in a periodic domain. Reynolds numbers (Rm) up to $\approx 250$ were examined and multiple resolutions used for convergence checks. We ran our simulations with the Astaroth code, designed to accelerate 3D stencil computations on graphics processing units (GPUs) and to employ multiple GPUs with peer-to-peer communication. We observed a speedup of $\approx 35$ in single-node performance compared to the widely used multi-CPU MHD solver Pencil Code. We estimated the growth rates both from the averaged magnetic fields and their power spectra. At low Rm, LSD growth dominates, but at high Rm SSD appears to dominate in both helically and non-helically forced cases. Pure SSD growth rates follow a logarithmic scaling as a function of Rm. Probability density functions of the magnetic field from the growth stage exhibit SSD behaviour in helically forced cases even at intermediate Rm. We estimated mean-field turbulence transport coefficients using closures like the second-order correlation approximation (SOCA). They yield growth rates similar to the directly measured ones and provide evidence of $\alpha$ quenching. Our results are consistent with the SSD inhibiting the growth of the LSD at moderate Rm, while the dynamo growth is enhanced at higher Rm.

Submitted 16 December, 2020; originally announced December 2020.

Comments: 22 pages, 23 figures, 2 tables, Accepted for publication in the Astrophysical Journal

Report number: NORDITA-2020-067

arXiv:2009.08231 [pdf, other]

doi 10.21105/joss.02807

The Pencil Code, a modular MPI code for partial differential equations and particles: multipurpose and multiuser-maintained

Authors: A. Brandenburg, A. Johansen, P. A. Bourdin, W. Dobler, W. Lyra, M. Rheinhardt, S. Bingert, N. E. L. Haugen, A. Mee, F. Gent, N. Babkovskaia, C.-C. Yang, T. Heinemann, B. Dintrans, D. Mitra, S. Candelaresi, J. Warnecke, P. J. Käpylä, A. Schreiber, P. Chatterjee, M. J. Käpylä, X.-Y. Li, J. Krüger, J. R. Aarnes, G. R. Sarson, et al. (12 additional authors not shown)

Abstract: The Pencil Code is a highly modular physics-oriented simulation code that can be adapted to a wide range of applications. It is primarily designed to solve partial differential equations (PDEs) of compressible hydrodynamics and has lots of add-ons ranging from astrophysical magnetohydrodynamics (MHD) to meteorological cloud microphysics and engineering applications in combustion. Nevertheless, the framework is general and can also be applied to situations not related to hydrodynamics or even PDEs, for example when just the message passing interface or input/output strategies of the code are to be used. The code can also evolve Lagrangian (inertial and noninertial) particles, their coagulation and condensation, as well as their interaction with the fluid.

Submitted 17 September, 2020; originally announced September 2020.

Comments: 7 pages, submitted to the Journal for Open Source Software (JOSS)

Report number: NORDITA-2020-087

Journal ref: Journal of Open Source Software 6, 2807 (2021)

arXiv:1707.08900 [pdf, ps, other]

doi 10.1016/j.cpc.2017.03.011

Methods for compressible fluid simulation on GPUs using high-order finite differences

Authors: Johannes Pekkilä, Miikka S. Väisälä, Maarit J. Käpylä, Petri J. Käpylä, Omer Anjum

Abstract: We focus on implementing and optimizing a sixth-order finite-difference solver for simulating compressible fluids on a GPU using third-order Runge-Kutta integration. Since graphics processing units perform well in data-parallel tasks, this makes them an attractive platform for fluid simulation. However, high-order stencil computation is memory-intensive with respect to both main memory and the caches of the GPU. We present two approaches for simulating compressible fluids using 55-point and 19-point stencils. We seek to reduce the requirements for memory bandwidth and cache size in our methods by using cache blocking and decomposing a latency-bound kernel into several bandwidth-bound kernels. Our fastest implementation is bandwidth-bound and integrates $343$ million grid points per second on a Tesla K40t GPU, achieving a $3.6 \times$ speedup over a comparable hydrodynamics solver benchmarked on two Intel Xeon E5-2690v3 processors. Our alternative GPU implementation is latency-bound and achieves the rate of $168$ million updates per second.

Submitted 27 July, 2017; originally announced July 2017.

Comments: 14 pages, 7 figures

Journal ref: Computer Physics Communications, Volume 217, August 2017, Pages 11-22

Showing 1–4 of 4 results for author: Pekkilä, J