-
Cornerstone: Octree Construction Algorithms for Scalable Particle Simulations
Authors:
Sebastian Keller,
Aurélien Cavelan,
Rubén Cabezon,
Lucio Mayer,
Florina M. Ciorba
Abstract:
This paper presents an octree construction method, called Cornerstone, that facilitates global domain decomposition and interactions between particles in mesh-free numerical simulations. Our method is based on algorithms developed for 3D computer graphics, which we extend to distributed high performance computing (HPC) systems. Cornerstone yields global and locally essential octrees and is able to operate on all levels of tree hierarchies in parallel. The resulting octrees are suitable for supporting the computation of various kinds of short and long range interactions in N-body methods, such as Barnes-Hut and the Fast Multipole Method (FMM). While we provide a CPU implementation, Cornerstone may run entirely on GPUs. This results in significantly faster tree construction compared to execution on CPUs and serves as a powerful building block for the design of simulation codes that move beyond an offloading approach, where only numerically intensive tasks are dispatched to GPUs. With data residing exclusively in GPU memory, Cornerstone eliminates data movements between CPUs and GPUs. As an example, we employ Cornerstone to generate locally essential octrees for a Barnes-Hut treecode running on almost the full LUMI-G system with up to 8 trillion particles.
Submitted 12 July, 2023;
originally announced July 2023.
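Octree construction techniques that originate in 3D computer graphics commonly sort particles along a space-filling curve so that each octree node corresponds to a contiguous key range. The abstract does not say which curve or key width Cornerstone uses, so the following is only a minimal sketch that assumes Morton (Z-order) keys on the unit cube; `quantize`, `morton3d`, and `octant_at_level` are hypothetical helper names, not part of the Cornerstone API.

```python
def quantize(x: float, bits: int) -> int:
    """Map a coordinate in [0, 1) to an integer cell index with `bits` bits."""
    cells = 1 << bits
    return min(int(x * cells), cells - 1)


def morton3d(ix: int, iy: int, iz: int, bits: int) -> int:
    """Interleave the bits of three cell indices into one Morton (Z-order) key.

    Bit b of ix, iy, iz lands at key positions 3b+2, 3b+1, 3b respectively,
    so each group of three key bits selects one of eight octants.
    """
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b + 2)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b)
    return key


def octant_at_level(key: int, level: int, bits: int) -> int:
    """Return the octant index (0..7) that `key` falls into at tree level `level` (1-based)."""
    return (key >> (3 * (bits - level))) & 7


# Assumed example: three particles in the unit cube, keyed with 2 bits per dimension.
BITS = 2
particles = [(0.75, 0.25, 0.25), (0.10, 0.10, 0.10), (0.10, 0.60, 0.10)]
keys = [morton3d(*(quantize(c, BITS) for c in p), BITS) for p in particles]
order = sorted(range(len(particles)), key=lambda i: keys[i])
```

Once the particles are sorted by key, all particles inside any octree node are adjacent in memory, which is why every tree level can be processed with data-parallel scans instead of pointer chasing.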
-
An Execution Fingerprint Dictionary for HPC Application Recognition
Authors:
Thomas Jakobsche,
Nicolas Lachiche,
Aurélien Cavelan,
Florina M. Ciorba
Abstract:
Applications running on HPC systems waste time and energy if they: (a) use resources inefficiently, (b) deviate from allocation purpose (e.g. cryptocurrency mining), or (c) encounter errors and failures. It is important to know which applications are running on the system, how they use the system, and whether they have been executed before. To recognize known applications during execution on a noisy system, we draw inspiration from the way Shazam recognizes known songs playing in a crowded bar. Our contribution is an Execution Fingerprint Dictionary (EFD) that stores execution fingerprints of system metrics (keys) linked to application and input size information (values) as key-value pairs for application recognition. Related work often relies on extensive system monitoring (many system metrics collected over large time windows) and employs machine learning methods to identify applications. Our solution only uses the first 2 minutes and a single system metric to achieve F-scores above 95 percent, providing comparable results to related work but with a fraction of the necessary data and a straightforward mechanism of recognition.
Submitted 10 September, 2021;
originally announced September 2021.
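The key-value idea behind an execution fingerprint dictionary can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the fingerprint here is simply the tuple of coarsely rounded window means of one metric, and the window length, rounding step, class name, and sample values are all hypothetical.

```python
from statistics import mean


def fingerprint(samples, window=4, step=10.0):
    """Reduce a metric time series to a tuple of coarsely rounded window means.

    Rounding to multiples of `step` is the (assumed) mechanism that makes
    two noisy runs of the same application map to the same key.
    """
    keys = []
    for start in range(0, len(samples) - window + 1, window):
        window_mean = mean(samples[start:start + window])
        keys.append(int(round(window_mean / step)))
    return tuple(keys)


class FingerprintDictionary:
    """Maps fingerprints (keys) to application name and input size (values)."""

    def __init__(self):
        self._table = {}

    def add(self, samples, app, input_size):
        self._table[fingerprint(samples)] = (app, input_size)

    def recognize(self, samples):
        # Returns None when the fingerprint has never been seen.
        return self._table.get(fingerprint(samples))


# Made-up metric samples: one known run, one noisy repeat, one unknown run.
efd = FingerprintDictionary()
efd.add([100, 102, 98, 100, 50, 52, 48, 50], app="app_a", input_size="large")
noisy_repeat = [101, 103, 97, 101, 51, 49, 52, 50]
unknown_run = [10, 10, 10, 10, 10, 10, 10, 10]
```

A lookup is a single hash-table access, which is what makes this style of recognition cheap compared with training and querying a machine learning model.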
-
A Smoothed Particle Hydrodynamics Mini-App for Exascale
Authors:
Aurélien Cavelan,
Rubén M. Cabezón,
Michal Grabarczyk,
Florina M. Ciorba
Abstract:
Smoothed Particle Hydrodynamics (SPH) is a particle-based, meshfree, Lagrangian method used to simulate multidimensional fluids with arbitrary geometries, most commonly employed in astrophysics, cosmology, and computational fluid dynamics (CFD). It is expected that these computationally-demanding numerical simulations will significantly benefit from the up-and-coming Exascale computing infrastructures, which will perform 10^18 FLOP/s. In this work, we review the status of a novel SPH-EXA mini-app, which is the result of an interdisciplinary co-design project between the fields of astrophysics, fluid dynamics and computer science, whose goal is to enable SPH simulations to run on Exascale systems. The SPH-EXA mini-app merges the main characteristics of three state-of-the-art parent SPH codes (namely ChaNGa, SPH-flow, SPHYNX) with state-of-the-art (parallel) programming, optimization, and parallelization methods. The proposed SPH-EXA mini-app is a lightweight and flexible C++14 header-only code with no external software dependencies. Parallelism is expressed via multiple programming models, which can be chosen at compilation time with or without accelerator support, for a hybrid process+thread+accelerator configuration. Strong and weak-scaling experiments on a production supercomputer show that the SPH-EXA mini-app can be efficiently executed with up to 267 million particles and up to 65 billion particles in total on 2,048 hybrid CPU-GPU nodes.
Submitted 6 May, 2020;
originally announced May 2020.
-
Two-level Dynamic Load Balancing for High Performance Scientific Applications
Authors:
Ali Mohammed,
Aurelien Cavelan,
Florina M. Ciorba,
Ruben M. Cabezon,
Ioana Banicesu
Abstract:
Scientific applications are often complex, irregular, and computationally-intensive. To accommodate the ever-increasing computational demands of scientific applications, high-performance computing (HPC) systems have become larger and more complex, offering parallelism at multiple levels (e.g., nodes, cores per node, threads per core). Scientific applications need to exploit all the available multilevel hardware parallelism to harness the available computational power. The performance of applications executing on such HPC systems may be adversely affected by load imbalance at multiple levels, caused by problem, algorithmic, and systemic characteristics. Nevertheless, most existing load balancing methods do not simultaneously address load imbalance at multiple levels. This work investigates the impact of load imbalance on the performance of three scientific applications at the thread and process levels. We jointly apply and evaluate selected dynamic loop self-scheduling (DLS) techniques at both levels. Specifically, we employ the extended LaPeSD OpenMP runtime library at the thread level and extend the DLS4LB MPI-based dynamic load balancing library at the process level. This approach is generic and applicable to any multiprocess-multithreaded computationally-intensive application (programmed using MPI and OpenMP). We conduct an exhaustive set of experiments to assess and compare six DLS techniques at the thread level and eleven at the process level. The results show that improved application performance, by up to 21%, can only be achieved by jointly addressing load imbalance at the two levels. We offer insights into the performance of the selected DLS techniques and discuss the interplay of load balancing at the thread level and process level.
Submitted 15 November, 2019;
originally announced November 2019.
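To illustrate what a dynamic loop self-scheduling technique computes, the sketch below implements the chunk-size rule of guided self-scheduling (GSS), a classic DLS technique in which each requesting worker receives the remaining iterations divided by the number of workers, rounded up. The abstract does not list which techniques were evaluated, so choosing GSS here is an assumption made for illustration; `gss_chunks` is a hypothetical helper and not part of the LaPeSD or DLS4LB libraries.

```python
import math


def gss_chunks(n_iterations: int, n_workers: int) -> list:
    """Return the sequence of chunk sizes handed out by guided self-scheduling.

    Each request receives ceil(remaining / n_workers) iterations, so chunks
    start large (low scheduling overhead) and shrink towards the end
    (fine-grained balancing of the tail).
    """
    chunks = []
    remaining = n_iterations
    while remaining > 0:
        size = math.ceil(remaining / n_workers)
        chunks.append(size)
        remaining -= size
    return chunks


# Example: 100 loop iterations shared by 4 workers.
sizes = gss_chunks(100, 4)
```

In a two-level setting the same rule can, in principle, be applied twice: once to split iterations among processes, and again inside each process to split its chunk among threads.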
-
Finding Neighbors in a Forest: A b-tree for Smoothed Particle Hydrodynamics Simulations
Authors:
Aurélien Cavelan,
Rubén M. Cabezón,
Jonas H. M. Korndorfer,
Florina M. Ciorba
Abstract:
Finding the exact close neighbors of each fluid element in mesh-free computational hydrodynamical methods, such as Smoothed Particle Hydrodynamics (SPH), often becomes a main bottleneck for scaling their performance beyond a few million fluid elements per computing node. Tree structures are particularly suitable for SPH simulation codes, which rely on finding the exact close neighbors of each fluid element (or SPH particle). In this work we present a novel tree structure, named b-tree, which features an adaptive branching factor to reduce the depth of the neighbor search. Depending on the particle spatial distribution, finding neighbors using b-tree has an asymptotic best case complexity of $O(n)$, as opposed to $O(n \log n)$ for other classical tree structures such as octrees and quadtrees. We also present the proposed tree structure as well as the algorithms to build it and to find the exact close neighbors of all particles. We assess the scalability of the proposed tree-based algorithms through an extensive set of performance experiments in a shared-memory system. Results show that b-tree is up to $12\times$ faster for building the tree and up to $1.6\times$ faster for finding the exact neighbors of all particles when compared to its octree form. Moreover, we apply b-tree to an SPH code and show its usefulness over the existing octree implementation, where b-tree is up to $5\times$ faster for finding the exact close neighbors compared to the legacy code.
Submitted 18 May, 2020; v1 submitted 7 October, 2019;
originally announced October 2019.
-
Algorithm-Based Fault Tolerance for Parallel Stencil Computations
Authors:
Aurélien Cavelan,
Florina M. Ciorba
Abstract:
The increase in HPC system size and complexity, together with increasing on-chip transistor density, power limitations, and number of components, renders modern HPC systems subject to soft errors. Silent data corruptions (SDCs) are typically caused by such soft errors in the form of bit-flips in the memory subsystem and hinder the correctness of scientific applications. This work addresses the problem of protecting a class of iterative computational kernels, called stencils, against SDCs when executing on parallel HPC systems. Existing SDC detection and correction methods are in general either inaccurate, inefficient, or targeting specific application classes that do not include stencils. This work proposes a novel algorithm-based fault tolerance (ABFT) method to protect scientific applications that contain arbitrary stencil computations against SDCs. The ABFT method can be applied both online and offline to accurately detect and correct SDCs in 2D and 3D parallel stencil computations. We present a formal model for the proposed method including theorems and proofs for the computation of the associated checksums as well as error detection and correction. We experimentally evaluate the use of the proposed ABFT method on a real 3D stencil-based application (HotSpot3D) via a fault-injection, detection, and correction campaign. Results show that the proposed ABFT method achieves less than 8% overhead compared to the performance of the unprotected stencil application. Moreover, it accurately detects and corrects SDCs. While the offline ABFT version corrects errors more accurately, it may incur slightly more overhead than its online counterpart.
Submitted 2 September, 2019;
originally announced September 2019.
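The general idea of checksum-based protection for stencils can be illustrated with a simplified invariant: for a linear stencil with constant weights and periodic boundaries, the sum of the output grid equals the sum of the stencil weights times the sum of the input grid. The sketch below uses that invariant for detection only. It is not the formal model of the paper (whose checksums also support correction and other boundary conditions); the 2D five-point stencil, the weights, the tolerance, and all function names are assumptions.

```python
def stencil_step(grid, w_center, w_neighbor):
    """Apply one sweep of a 2D five-point stencil with periodic boundaries."""
    n, m = len(grid), len(grid[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            neighbors = (grid[(i - 1) % n][j] + grid[(i + 1) % n][j]
                         + grid[i][(j - 1) % m] + grid[i][(j + 1) % m])
            out[i][j] = w_center * grid[i][j] + w_neighbor * neighbors
    return out


def checksum(grid):
    """Sum of all grid values."""
    return sum(sum(row) for row in grid)


def predicted_checksum(grid, w_center, w_neighbor):
    """Checksum the output must have, computed from the input alone.

    With periodic boundaries every input value is read once as a center and
    four times as a neighbor, hence the factor (w_center + 4 * w_neighbor).
    """
    return (w_center + 4 * w_neighbor) * checksum(grid)


def detect_sdc(observed, predicted, tol=1e-9):
    """Flag a silent data corruption when the checksums disagree beyond `tol`."""
    return abs(observed - predicted) > tol * max(1.0, abs(predicted))


# Example: one clean sweep, then the same result with an injected error.
grid = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
expected = predicted_checksum(grid, 0.5, 0.125)
clean = stencil_step(grid, 0.5, 0.125)
corrupted = [row[:] for row in clean]
corrupted[1][1] += 1.0  # stands in for a bit-flip in memory
```

Computing the predicted checksum costs one reduction over the input, which is small compared with the sweep itself; this is the reason checksum schemes can keep overhead low relative to full replication.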
-
rDLB: A Novel Approach for Robust Dynamic Load Balancing of Scientific Applications with Parallel Independent Tasks
Authors:
Ali Mohammed,
Aurelien Cavelan,
Florina M. Ciorba
Abstract:
Scientific applications often contain large and computationally intensive parallel loops. Dynamic loop self-scheduling (DLS) is used to achieve a balanced load execution of such applications on high performance computing (HPC) systems. Large HPC systems are vulnerable to processor or node failures and perturbations in the availability of resources. Most self-scheduling approaches do not consider fault-tolerant scheduling or depend on failure or perturbation detection and react by rescheduling failed tasks. In this work, a robust dynamic load balancing (rDLB) approach is proposed for the robust self-scheduling of independent tasks. The proposed approach is proactive and does not depend on failure or perturbation detection. The theoretical analysis of the proposed approach shows that it is linearly scalable and that its cost decreases quadratically as the system size increases. rDLB is integrated into an MPI DLS library to evaluate its performance experimentally with two computationally intensive scientific applications. Results show that rDLB enables the tolerance of up to (P minus one) processor failures, where P is the number of processors executing an application. In the presence of perturbations, rDLB boosted the robustness of DLS techniques up to 30 times and decreased application execution time up to 7 times compared to their counterparts without rDLB.
Submitted 4 October, 2019; v1 submitted 20 May, 2019;
originally announced May 2019.
-
SPH-EXA: Enhancing the Scalability of SPH codes Via an Exascale-Ready SPH Mini-App
Authors:
Danilo Guerrera,
Aurélien Cavelan,
Rubén M. Cabezón,
David Imbert,
Jean-Guillaume Piccinali,
Ali Mohammed,
Lucio Mayer,
Darren Reed,
Florina M. Ciorba
Abstract:
Numerical simulations of fluids in astrophysics and computational fluid dynamics (CFD) are among the most computationally-demanding calculations, in terms of sustained floating-point operations per second, or FLOP/s. It is expected that these numerical simulations will significantly benefit from the future Exascale computing infrastructures, which will perform 10^18 FLOP/s. The performance of the SPH codes is, in general, adversely impacted by several factors, such as multiple time-stepping, long-range interactions, and/or boundary conditions. In this work an extensive study of three SPH implementations, SPHYNX, ChaNGa, and XXX, is performed to gain insights and to expose any limitations and characteristics of the codes. These codes are the starting point of an interdisciplinary co-design project, SPH-EXA, for the development of an Exascale-ready SPH mini-app. We implemented a rotating square patch as a joint test simulation for the three SPH codes and analyzed their performance on a modern HPC system, Piz Daint. The performance profiling and scalability analysis conducted on the three parent codes exposed their performance issues, such as load imbalance, both in MPI and OpenMP. Two-level load balancing has been successfully applied to SPHYNX to overcome its load imbalance. The performance analysis shapes and drives the design of the SPH-EXA mini-app towards the use of efficient parallelization methods, fault-tolerance mechanisms, and load balancing approaches.
Submitted 29 April, 2019;
originally announced May 2019.
-
Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations
Authors:
Aurélien Cavelan,
Rubén M. Cabezón,
Florina M. Ciorba
Abstract:
Silent data corruptions (SDCs) hinder the correctness of long-running scientific applications on large scale computing systems. Selective particle replication (SPR) is proposed herein as the first particle-based replication method for detecting SDCs in smoothed particle hydrodynamics (SPH) simulations. SPH is a mesh-free Lagrangian method commonly used to perform hydrodynamical simulations in astrophysics and computational fluid dynamics. SPH performs interpolation of physical properties over neighboring discretization points (called SPH particles) that dynamically adapt their distribution to the mass density field of the fluid. When a fault (e.g., a bit-flip) strikes the computation or the data associated with a particle, the resulting error is silently propagated to all nearest neighbors through such interpolation steps. SPR replicates the computation and data of a few carefully selected SPH particles. SDCs are detected when the data of a particle differs, due to corruption, from its replicated counterpart. SPR is able to detect many DRAM SDCs as they propagate by ensuring that all particles have at least one neighbor that is replicated. The detection capabilities of SPR were assessed through a set of error-injection and detection experiments and the overhead of SPR was evaluated via a set of strong-scaling experiments conducted on an HPC system. The results show that SPR achieves detection rates of 91-99.9% with no false positives, at an overhead of 1-10%.
Submitted 23 April, 2019;
originally announced April 2019.
-
Towards a Mini-App for Smoothed Particle Hydrodynamics at Exascale
Authors:
Danilo Guerrera,
Rubén M. Cabezón,
Jean-Guillaume Piccinali,
Aurélien Cavelan,
Florina M. Ciorba,
David Imbert,
Lucio Mayer,
Darren Reed
Abstract:
The smoothed particle hydrodynamics (SPH) technique is a purely Lagrangian method, used in numerical simulations of fluids in astrophysics and computational fluid dynamics, among many other fields. SPH simulations with detailed physics represent computationally-demanding calculations. The parallelization of SPH codes is not trivial due to the absence of a structured grid. Additionally, the performance of the SPH codes can be, in general, adversely impacted by several factors, such as multiple time-stepping, long-range interactions, and/or boundary conditions. This work presents insights into the current performance and functionalities of three SPH codes: SPHYNX, ChaNGa, and SPH-flow. These codes are the starting point of an interdisciplinary co-design project, SPH-EXA, for the development of an Exascale-ready SPH mini-app. To gain such insights, a rotating square patch test was implemented as a common test simulation for the three SPH codes and analyzed on two modern HPC systems. Furthermore, to stress the differences with the codes stemming from the astrophysics community (SPHYNX and ChaNGa), an additional test case, the Evrard collapse, has also been carried out. This work extrapolates the common basic SPH features in the three codes for the purpose of consolidating them into a pure-SPH, Exascale-ready, optimized, mini-app. Moreover, the outcome of this work serves as direct feedback to the parent codes, to improve their performance and overall scalability.
Submitted 21 September, 2018;
originally announced September 2018.