-
Code Generation and Performance Engineering for Matrix-Free Finite Element Methods on Hybrid Tetrahedral Grids
Authors:
Fabian Böhm,
Daniel Bauer,
Nils Kohl,
Christie Alappat,
Dominik Thönnes,
Marcus Mohr,
Harald Köstler,
Ulrich Rüde
Abstract:
This paper introduces a code generator designed for node-level optimized, extreme-scalable, matrix-free finite element operators on hybrid tetrahedral grids. It optimizes the local evaluation of bilinear forms through various techniques including tabulation, relocation of loop invariants, and inter-element vectorization - implemented as transformations of an abstract syntax tree. A key contribution is the development, analysis, and generation of efficient loop patterns that leverage the local structure of the underlying tetrahedral grid. These significantly enhance cache locality and arithmetic intensity, mitigating bandwidth-pressure associated with compute-sparse, low-order operators. The paper demonstrates the generator's capabilities through a comprehensive educational cycle of performance analysis, bottleneck identification, and emission of dedicated optimizations. For three differential operators ($-Δ$, $-\nabla \cdot (k(\mathbf{x})\, \nabla\,)$, $α(\mathbf{x})\, \mathbf{curl}\ \mathbf{curl} + β(\mathbf{x}) $), we determine the set of most effective optimizations. Applied by the generator, they result in speed-ups of up to 58$\times$ compared to reference implementations. Detailed node-level performance analysis yields matrix-free operators with a throughput of 1.3 to 2.1 GDoF/s, achieving up to 62% peak performance on a 36-core Intel Ice Lake socket. Finally, the solution of the curl-curl problem with more than a trillion ($ 10^{12}$) degrees of freedom on 21504 processes in less than 50 seconds demonstrates the generated operators' performance and extreme-scalability as part of a full multigrid solver.
Submitted 12 April, 2024;
originally announced April 2024.
-
A Continuous Benchmarking Infrastructure for High-Performance Computing Applications
Authors:
Christoph Alt,
Martin Lanser,
Jonas Plewinski,
Atin Janki,
Axel Klawonn,
Harald Köstler,
Michael Selzer,
Ulrich Rüde
Abstract:
For scientific software, especially software used for large-scale simulations, achieving good performance and efficiently using the available hardware resources is essential. It is important to regularly perform benchmarks to ensure the efficient use of hardware and software when systems are changing and the software evolves. However, this can quickly become very tedious when many options for parameters, solvers, and hardware architectures are available. We present a continuous benchmarking strategy that automates benchmarking new code changes on high-performance computing clusters. This makes it possible to track how each code change affects the performance and how it evolves.
Submitted 3 March, 2024;
originally announced March 2024.
-
waLBerla-wind: a lattice-Boltzmann-based high-performance flow solver for wind energy applications
Authors:
Helen Schottenhamml,
Ani Anciaux-Sedrakian,
Frédéric Blondel,
Harald Köstler,
Ulrich Rüde
Abstract:
This article presents the development of a new wind turbine simulation software to study wake flow physics. To this end, the design and development of waLBerla-wind, a new simulator based on the lattice-Boltzmann method that is known for its excellent performance and scaling properties, will be presented. Here it will be used for large eddy simulations (LES) coupled with actuator wind turbine models. Due to its modular software design, waLBerla-wind is flexible and extensible with regard to turbine configurations. Additionally, it is performance portable across different hardware architectures, another critical design goal. The new solver is validated by presenting force distributions and velocity profiles and comparing them with experimental data and a vortex solver. Furthermore, waLBerla-wind's performance is compared to a theoretical peak performance and analysed with weak and strong scaling benchmarks on CPU and GPU systems. This analysis demonstrates the suitability for large-scale applications and future cost-effective full wind farm simulations.
Submitted 8 December, 2023;
originally announced February 2024.
-
Fundamental Data Structures for Matrix-Free Finite Elements on Hybrid Tetrahedral Grids
Authors:
Nils Kohl,
Daniel Bauer,
Fabian Böhm,
Ulrich Rüde
Abstract:
This paper presents efficient data structures for the implementation of matrix-free finite element methods on block-structured, hybrid tetrahedral grids. It provides a complete categorization of all geometric sub-objects that emerge from the regular refinement of the unstructured, tetrahedral coarse grid and describes efficient iteration patterns and analytical linearization functions for the mapping of coefficients to memory addresses. This foundation enables the implementation of fast, extreme-scalable, matrix-free, iterative solvers, and in particular geometric multigrid methods by design. Their application to the variable-coefficient Stokes system subject to an enriched Galerkin discretization and to the curl-curl problem discretized with Nédélec edge elements showcases the flexibility of the implementation. Eventually, the solution of a curl-curl problem with $1.6 \cdot 10^{11}$ (more than one hundred billion) unknowns on more than $32000$ processes with a matrix-free full multigrid solver demonstrates its extreme-scalability.
Submitted 3 August, 2023;
originally announced August 2023.
-
Model-Based Performance Analysis of the HyTeG Finite Element Framework
Authors:
Dominik Thönnes,
Ulrich Rüde
Abstract:
In this work, we present how code generation techniques significantly improve the performance of the computational kernels in the HyTeG software framework. This HPC framework combines the performance and memory advantages of matrix-free multigrid solvers with the flexibility of unstructured meshes. The pystencils code generation toolbox is used to replace the original abstract C++ kernels with highly optimized loop nests. The performance of one of those kernels (the matrix-vector multiplication) is thoroughly analyzed using the Execution-Cache-Memory (ECM) performance model. We validate these predictions by measurements on the SuperMUC-NG supercomputer. The experiments show that the performance mostly matches the predictions. In cases where the prediction does not match, we discuss the discrepancies. Additionally, we conduct a node-level scaling study which shows the expected behavior for a memory-bound compute kernel.
Submitted 24 May, 2023;
originally announced May 2023.
-
Advanced Automatic Code Generation for Multiple Relaxation-Time Lattice Boltzmann Methods
Authors:
Frederik Hennig,
Markus Holzer,
Ulrich Rüde
Abstract:
The scientific code generation package lbmpy supports the automated design and the efficient implementation of lattice Boltzmann methods (LBMs) through metaprogramming. It is based on a new, concise calculus for describing multiple relaxation-time LBMs, including techniques that enable the numerically advantageous subtraction of the constant background component from the populations. These techniques are generalized to a wide range of collision spaces and equilibrium distributions. The article contains an overview of lbmpy's front-end and its code generation pipeline, which implements the new LBM calculus by means of symbolic formula manipulation tools and object-oriented programming. The generated codes have only a minimal number of arithmetic operations. Their automatic derivation rests on two novel Chimera transforms that have been specifically developed for efficiently computing raw and central moments. Information contained in the symbolic representation of the methods is further exploited in a customized sequence of algebraic simplifications, further reducing computational cost. When combined, these algebraic transformations lead to concise and compact numerical kernels. Specifically, with these optimizations, the advanced central moment- and cumulant-based methods can be realized with only little additional cost compared with the simple BGK method. The effectiveness and flexibility of the new lbmpy code generation system are demonstrated in simulating Taylor-Green vortex decay and the automatic derivation of an LBM algorithm to solve the shallow water equations.
Submitted 4 November, 2022;
originally announced November 2022.
-
A massively parallel Eulerian-Lagrangian method for advection-dominated transport in viscous fluids
Authors:
Nils Kohl,
Marcus Mohr,
Sebastian Eibl,
Ulrich Rüde
Abstract:
Motivated by challenges in Earth mantle convection, we present a massively parallel implementation of an Eulerian-Lagrangian method for the advection-diffusion equation in the advection-dominated regime. The advection term is treated by a particle-based, characteristics method coupled to a block-structured finite-element framework. Its numerical and computational performance is evaluated in multiple, two- and three-dimensional benchmarks, including curved geometries, discontinuous solutions, pure advection, and it is applied to a coupled non-linear system modeling buoyancy-driven convection in Stokes flow. We demonstrate the parallel performance in a strong and weak scaling experiment, with scalability to up to $147,456$ parallel processes, solving for more than $5.2 \times 10^{10}$ (52 billion) degrees of freedom per time-step.
Submitted 3 March, 2021;
originally announced March 2021.
-
Highly Efficient Lattice-Boltzmann Multiphase Simulations of Immiscible Fluids at High-Density Ratios on CPUs and GPUs through Code Generation
Authors:
Markus Holzer,
Martin Bauer,
Ulrich Rüde
Abstract:
A high-performance implementation of a multiphase lattice Boltzmann method based on the conservative Allen-Cahn model supporting high-density ratios and high Reynolds numbers is presented. Metaprogramming techniques are used to generate optimized code for CPUs and GPUs automatically. The coupled model is specified in a high-level symbolic description and optimized through automatic transformations. The memory footprint of the resulting algorithm is reduced through the fusion of compute kernels. A roofline analysis demonstrates the excellent efficiency of the generated code on a single GPU. The resulting single GPU code has been integrated into the multiphysics framework waLBerla to run massively parallel simulations on large domains. Communication hiding and GPUDirect-enabled MPI yield near-perfect scaling behaviour. Scaling experiments are conducted on the Piz Daint supercomputer with up to 2048 GPUs, simulating several hundred fully resolved bubbles. Further, validation of the implementation is shown in a physically relevant scenario: a three-dimensional rising air bubble in water.
Submitted 11 December, 2020;
originally announced December 2020.
-
Textbook efficiency: massively parallel matrix-free multigrid for the Stokes system
Authors:
Nils Kohl,
Ulrich Rüde
Abstract:
We employ textbook multigrid efficiency (TME), as introduced by Achi Brandt, to construct an asymptotically optimal monolithic multigrid solver for the Stokes system. The geometric multigrid solver builds upon the concept of hierarchical hybrid grids (HHG), which is extended to higher-order finite-element discretizations, and a corresponding matrix-free implementation. The computational cost of the full multigrid (FMG) iteration is quantified, and the solver is applied to multiple benchmark problems. Through a parameter study, we suggest configurations that achieve TME for both stabilized equal-order and Taylor-Hood discretizations. The excellent node-level performance of the relevant compute kernels is presented via a roofline analysis. Finally, we demonstrate the weak and strong scalability to up to $147,456$ parallel processes and solve Stokes systems with more than $3.6 \times 10^{12}$ (trillion) unknowns.
Submitted 26 October, 2020;
originally announced October 2020.
-
Resiliency in Numerical Algorithm Design for Extreme Scale Simulations
Authors:
Emmanuel Agullo,
Mirco Altenbernd,
Hartwig Anzt,
Leonardo Bautista-Gomez,
Tommaso Benacchio,
Luca Bonaventura,
Hans-Joachim Bungartz,
Sanjay Chatterjee,
Florina M. Ciorba,
Nathan DeBardeleben,
Daniel Drzisga,
Sebastian Eibl,
Christian Engelmann,
Wilfried N. Gansterer,
Luc Giraud,
Dominik Goeddeke,
Marco Heisig,
Fabienne Jezequel,
Nils Kohl,
Xiaoye Sherry Li,
Romain Lion,
Miriam Mehl,
Paul Mycek,
Michael Obersteiner,
Enrique S. Quintana-Orti
, et al. (11 additional authors not shown)
Abstract:
This work is based on the seminar titled ``Resiliency in Numerical Algorithm Design for Extreme Scale Simulations'' held March 1-6, 2020 at Schloss Dagstuhl, that was attended by all the authors.
Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated.
More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors.
The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge.
Submitted 26 October, 2020;
originally announced October 2020.
-
lbmpy: Automatic code generation for efficient parallel lattice Boltzmann methods
Authors:
Martin Bauer,
Harald Köstler,
Ulrich Rüde
Abstract:
Lattice Boltzmann methods are a popular mesoscopic alternative to macroscopic computational fluid dynamics solvers. Many variants have been developed that vary in complexity, accuracy, and computational cost. Extensions are available to simulate multi-phase, multi-component, turbulent, or non-Newtonian flows. In this work we present lbmpy, a code generation package that supports a wide variety of different methods and provides a generic development environment for new schemes as well. A high-level domain-specific language allows the user to formulate, extend and test various lattice Boltzmann schemes. The method specification is represented in a symbolic intermediate representation. Transformations that operate on this intermediate representation optimize and parallelize the method, yielding highly efficient lattice Boltzmann compute kernels not only for single- and two-relaxation-time schemes but also for multi-relaxation-time, cumulant, and entropically stabilized methods. An integration into the HPC framework waLBerla makes massively parallel, distributed simulations possible, which is demonstrated through scaling experiments on the SuperMUC-NG supercomputing system.
Submitted 11 April, 2020; v1 submitted 31 January, 2020;
originally announced January 2020.
-
Parallel solution of saddle point systems with nested iterative solvers based on the Golub-Kahan Bidiagonalization
Authors:
Carola Kruse,
Masha Sosonkina,
Mario Arioli,
Nicolas Tardieu,
Ulrich Ruede
Abstract:
We present a scalability study of Golub-Kahan bidiagonalization for the parallel iterative solution of symmetric indefinite linear systems with a 2x2 block structure. The algorithms have been implemented within the parallel numerical library PETSc. Since a nested inner-outer iteration strategy may be necessary, we investigate different choices for the inner solvers, including parallel sparse direct and multigrid accelerated iterative methods. We show the strong and weak scalability of the Golub-Kahan bidiagonalization based iterative method when applied to a two-dimensional Poiseuille flow and to two- and three-dimensional Stokes test problems.
Submitted 28 January, 2020;
originally announced January 2020.
-
waLBerla: A block-structured high-performance framework for multiphysics simulations
Authors:
Martin Bauer,
Sebastian Eibl,
Christian Godenschwager,
Nils Kohl,
Michael Kuron,
Christoph Rettinger,
Florian Schornbaum,
Christoph Schwarzmeier,
Dominik Thönnes,
Harald Köstler,
Ulrich Rüde
Abstract:
Programming current supercomputers efficiently is a challenging task. Multiple levels of parallelism on the core, on the compute node, and between nodes need to be exploited to make full use of the system. Heterogeneous hardware architectures with accelerators further complicate the development process. waLBerla addresses these challenges by providing the user with highly efficient building blocks for developing simulations on block-structured grids. The block-structured domain partitioning is flexible enough to handle complex geometries, while the structured grid within each block allows for highly efficient implementations of stencil-based algorithms. We present several example applications realized with waLBerla, ranging from lattice Boltzmann methods to rigid particle simulations. Most importantly, these methods can be coupled together, enabling multiphysics simulations. The framework uses meta-programming techniques to generate highly efficient code for CPUs and GPUs from a symbolic method formulation. To ensure software quality and performance portability, a continuous integration toolchain automatically runs an extensive test suite encompassing multiple compilers, hardware architectures, and software configurations.
Submitted 30 September, 2019;
originally announced September 2019.
-
Stencil scaling for vector-valued PDEs on hybrid grids with applications to generalized Newtonian fluids
Authors:
Daniel Drzisga,
Ulrich Rüde,
Barbara Wohlmuth
Abstract:
Matrix-free finite element implementations for large applications provide an attractive alternative to standard sparse matrix data formats due to the significantly reduced memory consumption. Here, we show that they are also competitive with respect to the run time in the low order case if combined with suitable stencil scaling techniques. We focus on variable coefficient vector-valued partial differential equations as they arise in many physical applications. The presented method is based on scaling constant reference stencils originating from a linear finite element discretization instead of evaluating the bilinear forms on-the-fly. This method assumes the usage of hierarchical hybrid grids, and it may be applied to vector-valued second-order elliptic partial differential equations directly or as a part of more complicated problems. We provide theoretical and experimental performance estimates showing the advantages of this new approach compared to the traditional on-the-fly integration and stored matrix approaches. In our numerical experiments, we consider two specific mathematical models, namely linear elastostatics and incompressible Stokes flow. The final example considers a non-linear shear-thinning generalized Newtonian fluid. For this type of non-linearity, we present an efficient approach to compute a regularized strain rate which is then used to define the node-wise viscosity. Depending on the compute architecture, we could observe maximum speedups of 64% and 122% compared to the on-the-fly integration. The largest considered example involved solving a Stokes problem with 12288 compute cores on the state-of-the-art supercomputer SuperMUC-NG.
Submitted 18 March, 2020; v1 submitted 23 August, 2019;
originally announced August 2019.
-
A Modular and Extensible Software Architecture for Particle Dynamics
Authors:
Sebastian Eibl,
Ulrich Rüde
Abstract:
Creating a highly parallel and flexible discrete element software requires an interdisciplinary approach, where expertise from different disciplines is combined. On the one hand domain specialists provide interaction models between particles. On the other hand high-performance computing specialists optimize the code to achieve good performance on different hardware architectures. In particular, the software must be carefully crafted to achieve good scaling on massively parallel supercomputers. Combining all this in a flexible and extensible, widely usable software is a challenging task. In this article we outline the design decisions and concepts of a newly developed particle dynamics code MESA-PD that is implemented as part of the waLBerla multi-physics framework. Extensibility, flexibility, but also performance and scalability are primary design goals for the new software framework. In particular, the new modular architecture is designed such that physical models can be modified and extended by domain scientists without understanding all details of the parallel computing functionality and the underlying distributed data structures that are needed to achieve good performance on current supercomputer architectures. This goal is achieved by combining the high performance simulation framework waLBerla with code generation techniques. All code and the code generator are released as open source under GPLv3 within the publicly available waLBerla framework (www.walberla.net).
Submitted 26 June, 2019;
originally announced June 2019.
-
Computational Study of Ultrathin CNT Films with the Scalable Mesoscopic Distinct Element Method
Authors:
Igor Ostanin,
Traian Dumitrică,
Sebastian Eibl,
Ulrich Rüde
Abstract:
In this work we present a computational study of the small strain mechanics of freestanding ultrathin CNT films under in-plane loading. The numerical modeling of the mechanics of representatively large specimens with realistic micro- and nanostructure is presented. Our simulations utilize the scalable implementation of the mesoscopic distinct element method of the waLBerla multi-physics framework. Within our modeling approach, CNTs are represented as chains of interacting rigid segments. Neighboring segments in the chain are connected with elastic bonds, resolving tension, bending, shear and torsional deformations. These bonds represent covalent bonding within the CNT surface and utilize the Enhanced Vector Model (EVM) formalism. Segments of the neighboring CNTs interact with a realistic coarse-grained anisotropic vdW potential, enabling relative slip of CNTs in contact. The advanced simulation technique allowed us to gain useful insights into the behavior of CNT materials. In particular, it was established that the energy dissipation during CNT sliding leads to extended load transfer that conditions material-like mechanical response of the weakly bonded assemblies of CNTs.
Submitted 19 October, 2019; v1 submitted 13 May, 2019;
originally announced May 2019.
-
Dynamic Load Balancing Techniques for Particulate Flow Simulations
Authors:
Christoph Rettinger,
Ulrich Rüde
Abstract:
Parallel multiphysics simulations often suffer from load imbalances originating from the applied coupling of algorithms with spatially and temporally varying workloads. It is thus desirable to minimize these imbalances to reduce the time to solution and to better utilize the available hardware resources. Taking particulate flows as an illustrating example application, we present and evaluate load balancing techniques that tackle this challenging task. This involves a load estimation step in which the currently generated workload is predicted. We describe in detail how such a workload estimator can be developed. In a second step, load distribution strategies like space-filling curves or graph partitioning are applied to dynamically distribute the load among the available processes. To compare and analyze their performance, we apply these techniques to a benchmark scenario and observe a reduction of the load imbalances by almost a factor of four. This results in a decrease of the overall runtime by 14% for space-filling curves.
Submitted 30 November, 2018;
originally announced November 2018.
-
An iterative generalized Golub-Kahan algorithm for problems in structural mechanics
Authors:
Mario Arioli,
Carola Kruse,
Ulrich Ruede,
Nicolas Tardieu
Abstract:
This paper studies the Craig variant of the Golub-Kahan bidiagonalization algorithm as an iterative solver for linear systems with saddle point structure. Such symmetric indefinite systems in 2x2 block form arise in many applications, but standard iterative solvers are often found to perform poorly on them and robust preconditioners may not be available. Specifically, such systems arise in structural mechanics, when a semidefinite finite element stiffness matrix is augmented with linear multi-point constraints via Lagrange multipliers. Engineers often use such multi-point constraints to introduce boundary or coupling conditions into complex finite element models. The article will present a systematic convergence study of the Golub-Kahan algorithm for a sequence of test problems of increasing complexity, including concrete structures enforced with pretension cables and the coupled finite element model of a reactor containment building. When the systems are suitably transformed using augmented Lagrangians on the semidefinite block and when the constraint equations are properly scaled, the Golub-Kahan algorithm is found to exhibit excellent convergence that depends only weakly on the size of the model. The new algorithm is found to be robust in practical cases that are otherwise considered to be difficult for iterative solvers.
Submitted 23 August, 2018;
originally announced August 2018.
-
A Systematic Comparison of Dynamic Load Balancing Algorithms for Massively Parallel Rigid Particle Dynamics
Authors:
Sebastian Eibl,
Ulrich Rüde
Abstract:
As compute power increases with time, more involved and larger simulations become possible. However, it gets increasingly difficult to efficiently use the provided computational resources. Especially in particle-based simulations with a spatial domain partitioning large load imbalances can occur due to the simulation being dynamic. Then a static domain partitioning may not be suitable. This can deteriorate the overall runtime of the simulation significantly. Sophisticated load balancing strategies must be designed to alleviate this problem. In this paper we conduct a systematic evaluation of the performance of six different load balancing algorithms. Our tests cover a wide range of simulation sizes, and employ one of the largest supercomputers available. In particular we study the runtime and memory complexity of all components of the simulation carefully. When progressing to extreme scale simulations it is essential to identify bottlenecks and to predict the scaling behaviour. Scaling experiments are shown for up to over one million processes. The performance of each algorithm is analyzed with respect to the quality of the load balancing and its runtime costs. For all tests, the waLBerla multiphysics framework is employed.
Submitted 2 August, 2019; v1 submitted 2 August, 2018;
originally announced August 2018.
-
A Scalable and Modular Software Architecture for Finite Elements on Hierarchical Hybrid Grids
Authors:
Nils Kohl,
Dominik Thönnes,
Daniel Drzisga,
Dominik Bartuschat,
Ulrich Rüde
Abstract:
In this article, a new generic higher-order finite-element framework for massively parallel simulations is presented. The modular software architecture is carefully designed to exploit the resources of modern and future supercomputers. Combining an unstructured topology with structured grid refinement facilitates high geometric adaptability and matrix-free multigrid implementations with excellent performance. Different abstraction levels and fully distributed data structures additionally ensure high flexibility, extensibility, and scalability. The software concepts support sophisticated load balancing and flexibly combining finite element spaces. Example scenarios with coupled systems of PDEs show the applicability of the concepts to performing geophysical simulations.
Submitted 25 May, 2018;
originally announced May 2018.
-
Adaptive control in rollforward recovery for extreme scale multigrid
Authors:
Markus Huber,
Ulrich Rüde,
Barbara Wohlmuth
Abstract:
With the increasing number of compute components, failures in future exa-scale computer systems are expected to become more frequent. This motivates the study of novel resilience techniques. Here, we extend a recently proposed algorithm-based recovery method for multigrid iterations by introducing an adaptive control. After a fault, the healthy part of the system continues the iterative solution process, while the solution in the faulty domain is re-constructed by an asynchronous on-line recovery. The computations in both the faulty and healthy subdomains must be coordinated in a sensitive way, in particular, both under and over-solving must be avoided. Both of these waste computational resources and will therefore increase the overall time-to-solution. To control the local recovery and guarantee an optimal re-coupling, we introduce a stopping criterion based on a mathematical error estimator. It involves hierarchical weighted sums of residuals within the context of uniformly refined meshes and is well-suited in the context of parallel high-performance computing. The re-coupling process is steered by local contributions of the error estimator. We propose and compare two criteria which differ in their weights. Failure scenarios when solving up to $6.9\cdot10^{11}$ unknowns on more than 245\,766 parallel processes will be reported on a state-of-the-art peta-scale supercomputer demonstrating the robustness of the method.
Submitted 17 April, 2018;
originally announced April 2018.
-
A local parallel communication algorithm for polydisperse rigid body dynamics
Authors:
Sebastian Eibl,
Ulrich Rüde
Abstract:
The simulation of large ensembles of particles is usually parallelized by partitioning the domain spatially and using message passing to communicate between the processes handling neighboring subdomains. The particles are represented as individual geometric objects and are associated to the subdomains. Handling collisions and migrating particles between subdomains, as required for proper parallel execution, requires a complex communication protocol. Typically, the parallelization is restricted to handling only particles that are smaller than a subdomain. In many applications, however, particle sizes may vary drastically with some of them being larger than a subdomain. In this article we propose a new communication and synchronization algorithm that can handle the parallelization without size restrictions on the particles. Despite the additional complexity and extended functionality, the new algorithm introduces only minimal overhead. We demonstrate the scalability of the previous and the new communication algorithms up to almost two million parallel processes and for handling ten billion (1e10) geometrically resolved particles on a state-of-the-art petascale supercomputer. Different scenarios are presented to analyze the performance of the new algorithm and to demonstrate its capability to simulate polydisperse scenarios, where large individual particles can extend across several subdomains.
Submitted 2 August, 2018; v1 submitted 8 February, 2018;
originally announced February 2018.
-
A Coupled Lattice Boltzmann Method and Discrete Element Method for Discrete Particle Simulations of Particulate Flows
Authors:
Christoph Rettinger,
Ulrich Rüde
Abstract:
Discrete particle simulations are widely used to study large-scale particulate flows in complex geometries where particle-particle and particle-fluid interactions require an adequate representation but the computational cost has to be kept low. In this work, we present a novel coupling approach for such simulations. A lattice Boltzmann formulation of the generalized Navier-Stokes equations is used to describe the fluid motion. This promises efficient simulations suitable for high performance computing and, since volume displacement effects by the solid phase are considered, our approach is also applicable to non-dilute particulate systems. The discrete element method is combined with an explicit evaluation of interparticle lubrication forces to simulate the motion of individual submerged particles. Drag, pressure and added mass forces determine the momentum transfer by fluid-particle interactions. A stable coupling algorithm is presented and discussed in detail. We demonstrate the validity of our approach for dilute as well as dense systems by predicting the settling velocity of spheres over a broad range of solid volume fractions in good agreement with semi-empirical correlations. Additionally, the accuracy of particle-wall interactions in a viscous fluid is thoroughly tested and established. Our approach can thus be readily used for various particulate systems and can be extended straightforwardly, e.g., to non-spherical particles.
Submitted 1 November, 2017;
originally announced November 2017.
-
A stencil scaling approach for accelerating matrix-free finite element implementations
Authors:
Simon Bauer,
Daniel Drzisga,
Marcus Mohr,
Ulrich Ruede,
Christian Waluga,
Barbara Wohlmuth
Abstract:
We present a novel approach to fast on-the-fly low order finite element assembly for scalar elliptic partial differential equations of Darcy type with variable coefficients optimized for matrix-free implementations. Our approach introduces a new operator that is obtained by appropriately scaling the reference stiffness matrix from the constant coefficient case. Assuming sufficient regularity, an a priori analysis shows that solutions obtained by this approach are unique and have asymptotically optimal order convergence in the $H^1$- and the $L^2$-norm on hierarchical hybrid grids. For the pre-asymptotic regime, we present a local modification that guarantees uniform ellipticity of the operator. Cost considerations show that our novel approach requires roughly one third of the floating-point operations compared to a classical finite element assembly scheme employing nodal integration. Our theoretical considerations are illustrated by numerical tests that confirm the expectations with respect to accuracy and run-time. A large scale application with more than a hundred billion ($1.6\cdot10^{11}$) degrees of freedom executed on 14,310 compute cores demonstrates the efficiency of the new scaling approach.
Submitted 23 July, 2018; v1 submitted 20 September, 2017;
originally announced September 2017.
-
A Scalable Multiphysics Algorithm for Massively Parallel Direct Numerical Simulations of Electrophoresis
Authors:
Dominik Bartuschat,
Ulrich Rüde
Abstract:
In this article we introduce a novel coupled algorithm for massively parallel direct numerical simulations of electrophoresis in microfluidic flows. This multiphysics algorithm employs an Eulerian description of fluid and ions, combined with a Lagrangian representation of moving charged particles. The fixed grid facilitates efficient solvers and the employed lattice Boltzmann method can efficiently handle complex geometries. Validation experiments with more than $70\,000$ time steps are presented, together with scaling experiments with over ${4\cdot10^{6}}$ particles and ${1.96\cdot10^{11}}$ grid cells for both hydrodynamics and electric potential. We achieve excellent performance and scaling on up to $65\,536$ cores of a current supercomputer.
Submitted 25 May, 2018; v1 submitted 29 August, 2017;
originally announced August 2017.
-
A Scalable and Extensible Checkpointing Scheme for Massively Parallel Simulations
Authors:
Nils Kohl,
Johannes Hötzer,
Florian Schornbaum,
Martin Bauer,
Christian Godenschwager,
Harald Köstler,
Britta Nestler,
Ulrich Rüde
Abstract:
Realistic simulations in engineering or in the materials sciences can consume enormous computing resources and thus require the use of massively parallel supercomputers. The probability of a failure increases both with the runtime and with the number of system components. For future exascale systems it is therefore considered critical that strategies are developed to make software resilient against failures. In this article, we present a scalable, distributed, diskless, and resilient checkpointing scheme that can create and recover snapshots of a partitioned simulation domain. We demonstrate the efficiency and scalability of the checkpoint strategy for simulations with up to $40$ billion computational cells executing on more than $400$ billion floating point values. A checkpoint creation is shown to require only a few seconds and the new checkpointing scheme scales almost perfectly up to more than $260\,000$ ($2^{18}$) processes. To recover from a diskless checkpoint during runtime, we realize the recovery algorithms using ULFM MPI. The checkpointing mechanism is fully integrated in a state-of-the-art high-performance multi-physics simulation framework. We demonstrate the efficiency and robustness of the method with a realistic phase-field simulation originating in the material sciences and with a lattice Boltzmann method implementation.
Submitted 29 January, 2018; v1 submitted 28 August, 2017;
originally announced August 2017.
-
The Maximum Dissipation Principle in Rigid-Body Dynamics with Purely Inelastic Impacts
Authors:
Tobias Preclik,
Sebastian Eibl,
Ulrich Rüde
Abstract:
Formulating a consistent theory for rigid-body dynamics with impacts is an intricate problem. Twenty years ago Stewart published the first consistent theory with purely inelastic impacts and an impulsive friction model analogous to Coulomb friction. In this paper we demonstrate that the consistent impact model can exhibit multiple solutions with a varying degree of dissipation even in the single-contact case. Replacing the impulsive friction model based on Coulomb friction by a model based on the maximum dissipation principle resolves the non-uniqueness in the single-contact impact problem. The paper constructs the alternative impact model and presents integral equations describing rigid-body dynamics with a non-impulsive and non-compliant contact model and an associated purely inelastic impact model maximizing dissipation. An analytic solution is derived for the single-contact impact problem. The models are then embedded into a time-stepping scheme. The macroscopic behaviour is compared to Coulomb friction in a large-scale granular flow problem.
Submitted 1 June, 2017;
originally announced June 2017.
-
Extreme-Scale Block-Structured Adaptive Mesh Refinement
Authors:
Florian Schornbaum,
Ulrich Rüde
Abstract:
In this article, we present a novel approach for block-structured adaptive mesh refinement (AMR) that is suitable for extreme-scale parallelism. All data structures are designed such that the size of the meta data in each distributed processor memory remains bounded independent of the processor number. In all stages of the AMR process, we use only distributed algorithms. No central resources such as a master process or replicated data are employed, so that an unlimited scalability can be achieved. For the dynamic load balancing in particular, we propose to exploit the hierarchical nature of the block-structured domain partitioning by creating a lightweight, temporary copy of the core data structure. This copy acts as a local and fully distributed proxy data structure. It does not contain simulation data, but only provides topological information about the domain partitioning into blocks. Ultimately, this approach enables an inexpensive, local, diffusion-based dynamic load balancing scheme.
We demonstrate the excellent performance and the full scalability of our new AMR implementation for two architecturally different petascale supercomputers. Benchmarks on an IBM Blue Gene/Q system with a mesh containing 3.7 trillion unknowns distributed to 458,752 processes confirm the applicability for future extreme-scale parallel machines. The algorithms proposed in this article operate on blocks that result from the domain partitioning. This concept and its realization support the storage of arbitrary data. In consequence, the software framework can be used for different simulation methods, including mesh based and meshless methods. In this article, we demonstrate fluid simulations based on the lattice Boltzmann method.
Submitted 13 April, 2018; v1 submitted 22 April, 2017;
originally announced April 2017.
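The abstract above mentions a local, diffusion-based dynamic load balancing scheme without giving details. As a rough illustration of the general idea only, the following sketch implements a generic first-order diffusion iteration on a process graph: in every step each process moves a fraction `alpha` of the load difference to or from each direct neighbor, using only local information. The function names, the ring topology in the example, and the choice of `alpha` are assumptions for illustration; real loads consist of indivisible blocks, whereas this sketch treats load as a continuous quantity.

```python
def diffusion_step(loads, neighbors, alpha):
    """One diffusion sweep: every process exchanges load with its neighbors.

    loads: list of floats, one entry per process.
    neighbors: list of lists; neighbors[i] holds the ranks adjacent to i.
               The relation is assumed symmetric, so the total load is
               conserved.
    alpha: fraction of each pairwise difference that is transferred.
    """
    new_loads = list(loads)
    for i, adjacent in enumerate(neighbors):
        for j in adjacent:
            # Load flows from the more loaded to the less loaded process.
            new_loads[i] += alpha * (loads[j] - loads[i])
    return new_loads


def diffuse(loads, neighbors, alpha=0.25, steps=50):
    """Repeat the local exchange for a fixed number of steps."""
    current = [float(value) for value in loads]
    for _ in range(steps):
        current = diffusion_step(current, neighbors, alpha)
    return current


# Usage: four processes connected in a ring, all load initially on rank 0.
ring = [[1, 3], [0, 2], [1, 3], [2, 0]]
balanced = diffuse([8.0, 0.0, 0.0, 0.0], ring)
print(balanced)
```

For this plain iteration to remain stable, `alpha` should stay below the reciprocal of the largest neighbor count; no global communication is needed at any point, which is what makes such schemes attractive at extreme scale.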
-
A comparative study of fluid-particle coupling methods for fully resolved lattice Boltzmann simulations
Authors:
Christoph Rettinger,
Ulrich Rüde
Abstract:
The direct numerical simulation of particulate systems offers a unique approach to study the dynamics of fluid-solid suspensions by fully resolving the submerged particles and without introducing empirical models. For the lattice Boltzmann method, different variants exist to incorporate the fluid-particle interaction into the simulation. This paper provides a detailed and systematic comparison of two different methods, namely the momentum exchange method and the partially saturated cells method by Noble and Torczynski. Three subvariants of each method are used in the benchmark scenario of a single heavy sphere settling in ambient fluid to study their characteristics and accuracy for particle Reynolds numbers from 185 up to 365. The sphere must be resolved with at least 24 computational cells per diameter to achieve velocity errors below 5%. The momentum exchange method is found to be more accurate in predicting the streamwise velocity component whereas the partially saturated cells method is more accurate in the spanwise components. The study reveals that the resolution should be chosen with respect to the coupling dynamics, and not only based on the flow properties, to avoid large errors in the fluid-particle interaction.
Submitted 16 February, 2017;
originally announced February 2017.
-
Research and Education in Computational Science and Engineering
Authors:
Ulrich Rüde,
Karen Willcox,
Lois Curfman McInnes,
Hans De Sterck,
George Biros,
Hans Bungartz,
James Corones,
Evin Cramer,
James Crowley,
Omar Ghattas,
Max Gunzburger,
Michael Hanke,
Robert Harrison,
Michael Heroux,
Jan Hesthaven,
Peter Jimack,
Chris Johnson,
Kirk E. Jordan,
David E. Keyes,
Rolf Krause,
Vipin Kumar,
Stefan Mayer,
Juan Meza,
Knut Martin Mørken,
J. Tinsley Oden
, et al. (8 additional authors not shown)
Abstract:
Over the past two decades the field of computational science and engineering (CSE) has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs computational experiments to answer questions that neither theory nor experiment alone is equipped to answer. CSE provides scientists and engineers of all persuasions with algorithmic inventions and software systems that transcend disciplines and scales. Carried on a wave of digital technology, CSE brings the power of parallelism to bear on troves of data. Mathematics-based advanced computing has become a prevalent means of discovery and innovation in essentially all areas of science, engineering, technology, and society; and the CSE community is at the core of this transformation. However, a combination of disruptive developments---including the architectural complexity of extreme-scale computing, the data revolution that engulfs the planet, and the specialization required to follow the applications to new frontiers---is redefining the scope and reach of the CSE endeavor. This report describes the rapid expansion of CSE and the challenges to sustaining its bold advances. The report also presents strategies and directions for CSE research and education for the next decade.
Submitted 31 December, 2017; v1 submitted 8 October, 2016;
originally announced October 2016.
-
Scheduling massively parallel multigrid for multilevel Monte Carlo methods
Authors:
Björn Gmeiner,
Daniel Drzisga,
Ulrich Ruede,
Robert Scheichl,
Barbara Wohlmuth
Abstract:
The computational complexity of naive, sampling-based uncertainty quantification for 3D partial differential equations is extremely high. Multilevel approaches, such as multilevel Monte Carlo (MLMC), can reduce the complexity significantly, but to exploit them fully in a parallel environment, sophisticated scheduling strategies are needed. Often fast algorithms that are executed in parallel are essential to compute fine level samples in 3D, whereas to compute individual coarse level samples only moderate numbers of processors can be employed efficiently. We make use of multiple instances of a parallel multigrid solver combined with advanced load balancing techniques. In particular, we optimize the concurrent execution across the three layers of the MLMC method: parallelization across levels, across samples, and across the spatial grid. The overall efficiency and performance of these methods will be analyzed. Here the scalability window of the multigrid solver is revealed as being essential, i.e., the property that the solution can be computed with a range of process numbers while maintaining good parallel efficiency. We evaluate the new scheduling strategies in a series of numerical tests, and conclude the paper demonstrating large 3D scaling experiments.
Submitted 12 July, 2016;
originally announced July 2016.
-
Microswimming with inertia
Authors:
Jayant Pande,
Kristina Pickl,
Oleg Trosman,
Ulrich Rüde,
Ana-Sunčana Smith
Abstract:
Microswimmers, especially in theoretical treatments, are generally taken to be completely inertia-free, since inertial effects on their motion are typically small and assuming their absence simplifies the problem considerably. Yet in nature there is no discrete break between swimmers for which inertia is negligibly small and for which it is detectable. Here we study a microswimming model for which the effect of inertia is calculated explicitly in the regime of transition between the Stokesian and the non-Stokesian flow limits, which we term the intermediate regime. The model in the inertialess limit is the bead-spring swimmer. We first show that in the intermediate regime a mechanical microswimmer exhibits damped inertial coasting like an underdamped harmonic oscillator. We then calculate analytically the swimmer's velocity by including a mass-acceleration term in the equations of motion which are otherwise based on the Stokes flow. We show that this hybrid treatment combining aspects of underdamped and overdamped dynamics provides an accurate description of the motion in the intermediate regime, as verified here by comparison to simulations using the lattice Boltzmann method, and is a significant improvement over the results from the inertialess theory when either the mass of the swimmer or the forces driving its motion are large enough.
Submitted 6 November, 2016; v1 submitted 15 March, 2016;
originally announced March 2016.
-
A Python Extension for the Massively Parallel Multiphysics Simulation Framework waLBerla
Authors:
Martin Bauer,
Florian Schornbaum,
Christian Godenschwager,
Matthias Markl,
Daniela Anderl,
Harald Köstler,
Ulrich Rüde
Abstract:
We present a Python extension to the massively parallel HPC simulation toolkit waLBerla. waLBerla is a framework for stencil based algorithms operating on block-structured grids, with the main application field being fluid simulations in complex geometries using the lattice Boltzmann method. Careful performance engineering results in excellent node performance and good scalability to over 400,000 cores. To increase the usability and flexibility of the framework, a Python interface was developed. Python extensions are used at all stages of the simulation pipeline: They simplify and automate scenario setup, evaluation, and plotting. We show how our Python interface outperforms the existing text-file-based configuration mechanism, providing features like automatic nondimensionalization of physical quantities and handling of complex parameter dependencies. Furthermore, Python is used to process and evaluate results while the simulation is running, leading to smaller output files and the possibility to adjust parameters dependent on the current simulation state. C++ data structures are exported such that a seamless interfacing to other numerical Python libraries is possible. The expressive power of Python and the performance of C++ make development of efficient code with low time effort possible.
Submitted 23 November, 2015;
originally announced November 2015.
-
A quantitative performance analysis for Stokes solvers at the extreme scale
Authors:
Björn Gmeiner,
Markus Huber,
Lorenz John,
Ulrich Rüde,
Barbara Wohlmuth
Abstract:
This article presents a systematic quantitative performance analysis for large finite element computations on extreme scale computing systems. Three parallel iterative solvers for the Stokes system, discretized by low order tetrahedral elements, are compared with respect to their numerical efficiency and their scalability running on up to $786\,432$ parallel threads. A genuine multigrid method for the saddle point system using an Uzawa-type smoother provides the best overall performance with respect to memory consumption and time-to-solution. The largest system solved on a Blue Gene/Q system has more than ten trillion ($1.1 \cdot 10 ^{13}$) unknowns and requires about 13 minutes compute time. Despite the matrix free and highly optimized implementation, the memory requirement for the solution vector and the auxiliary vectors is about 200 TByte. Brandt's notion of "textbook multigrid efficiency" is employed to study the algorithmic performance of iterative solvers. A recent extension of this paradigm to "parallel textbook multigrid efficiency" makes it possible to assess also the efficiency of parallel iterative solvers for a given hardware architecture in absolute terms. The efficiency of the method is demonstrated for simulating incompressible fluid flow in a pipe filled with spherical obstacles.
Submitted 6 November, 2015;
originally announced November 2015.
-
Massively Parallel Algorithms for the Lattice Boltzmann Method on Non-uniform Grids
Authors:
Florian Schornbaum,
Ulrich Rüde
Abstract:
The lattice Boltzmann method exhibits excellent scalability on current supercomputing systems and has thus increasingly become an alternative method for large-scale non-stationary flow simulations, reaching up to a trillion grid nodes. Additionally, grid refinement can lead to substantial savings in memory and compute time. These savings, however, come at the cost of much more complex data structures and algorithms. In particular, the interface between subdomains with different grid sizes must receive special treatment. In this article, we present parallel algorithms, distributed data structures, and communication routines that are implemented in the software framework waLBerla in order to support large-scale, massively parallel lattice Boltzmann-based simulations on non-uniform grids. Additionally, we evaluate the performance of our approach on two current petascale supercomputers. On an IBM Blue Gene/Q system, the largest weak scaling benchmarks with refined grids are executed with almost two million threads, demonstrating not only near-perfect scalability but also an absolute performance of close to a trillion lattice Boltzmann cell updates per second. On an Intel-based system, the strong scaling of a simulation with refined grids and a total of more than 8.5 million cells is demonstrated to reach a performance of less than one millisecond per time step. This enables simulations with complex, non-uniform grids and four million time steps per hour compute time.
Submitted 21 January, 2016; v1 submitted 31 August, 2015;
originally announced August 2015.
-
Pore-scale lattice Boltzmann simulation of laminar and turbulent flow through a sphere pack
Authors:
Ehsan Fattahia,
Christian Waluga,
Barbara Wohlmuth,
Ulrich Rüde,
Michael Manhart,
Rainer Helmig
Abstract:
The lattice Boltzmann method can be used to simulate flow through porous media with full geometrical resolution. With such a direct numerical simulation, it becomes possible to study fundamental effects which are difficult to assess either by developing macroscopic mathematical models or experiments. We first evaluate the lattice Boltzmann method with various boundary handling schemes for the solid wall and various collision operators to assess their suitability for large scale direct numerical simulation of porous media flow. A periodic pressure drop boundary condition is used to mimic the pressure driven flow through the simple sphere pack in a periodic domain. The evaluation of the method is done in the Darcy regime and the results are compared to a semi-analytic solution. Taking into account computational cost and accuracy, we choose the most efficient combination of the solid boundary condition and collision operator. We apply this method to perform simulations for a wide range of Reynolds numbers from Stokes flow over seven orders of magnitude to turbulent flow. Contours and streamlines of the flow field are presented to show the flow behavior in different flow regimes. Moreover, unknown parameters of the Forchheimer, the Barree--Conway and friction factor models are evaluated numerically for the considered flow regimes.
Submitted 12 August, 2015;
originally announced August 2015.
-
Large scale lattice Boltzmann simulation for the coupling of free and porous media flow
Authors:
Ehsan Fattahi,
Christian Waluga,
Barbara Wohlmuth,
Ulrich Rüde
Abstract:
In this work, we investigate the interaction of free and porous media flow by large scale lattice Boltzmann simulations. We study the transport phenomena at the porous interface on multiple scales, i.e., we consider both computationally generated pore-scale geometries and homogenized models at a macroscopic scale. The pore-scale results are compared to those obtained by using different transmission models. Two-domain approaches with sharp interface conditions, e.g., of Beavers--Joseph--Saffman type, as well as a single-domain approach with a porosity-dependent viscosity are taken into account. For the pore-scale simulations, we use a highly scalable communication-reducing scheme with a robust second order boundary handling. We comment on computational aspects of the pore-scale simulation and on how to generate pore-scale geometries. The two-domain approaches depend sensitively on the choice of the exact position of the interface, whereas a well-designed single-domain approach can recover the averaged pore-scale results significantly better.
Submitted 23 July, 2015;
originally announced July 2015.
-
Resilience for Multigrid Software at the Extreme Scale
Authors:
Markus Huber,
Björn Gmeiner,
Ulrich Rüde,
Barbara Wohlmuth
Abstract:
Fault tolerant algorithms for the numerical approximation of elliptic partial differential equations on modern supercomputers play a more and more important role in the future design of exa-scale enabled iterative solvers. Here, we combine domain partitioning with highly scalable geometric multigrid schemes to obtain fast and fault-robust solvers in three dimensions. The recovery strategy is based on a hierarchical hybrid concept where the values on lower dimensional primitives such as faces are stored redundantly and thus can be recovered easily in case of a failure. The lost volume unknowns in the faulty region are re-computed approximately with multigrid cycles by solving a local Dirichlet problem on the faulty subdomain. Different strategies are compared and evaluated with respect to performance, computational cost, and speed up. Especially effective are strategies in which the local recovery in the faulty region is executed in parallel with global solves and when the local recovery is additionally accelerated. This results in an asynchronous multigrid iteration that can fully compensate faults. Excellent parallel performance on a current peta-scale system is demonstrated.
Submitted 19 June, 2015;
originally announced June 2015.
-
Massively Parallel Phase-Field Simulations for Ternary Eutectic Directional Solidification
Authors:
Martin Bauer,
Johannes Hötzer,
Philipp Steinmetz,
Marcus Jainta,
Marco Berghoff,
Florian Schornbaum,
Christian Godenschwager,
Harald Köstler,
Britta Nestler,
Ulrich Rüde
Abstract:
Microstructures forming during ternary eutectic directional solidification processes have significant influence on the macroscopic mechanical properties of metal alloys. For a realistic simulation, we use the well established thermodynamically consistent phase-field method and improve it with a new grand potential formulation to couple the concentration evolution. This extension is very compute intensive due to a temperature dependent diffusive concentration. We significantly extend previous simulations that have used simpler phase-field models or were performed on smaller domain sizes. The new method has been implemented within the massively parallel HPC framework waLBerla that is designed to exploit current supercomputers efficiently. We apply various optimization techniques, including buffering techniques, explicit SIMD kernel vectorization, and communication hiding. Simulations utilizing up to 262,144 cores have been run on three different supercomputing architectures and weak scalability results are shown. Additionally, a hierarchical, mesh-based data reduction strategy is developed to keep the I/O problem manageable at scale.
Submitted 4 June, 2015;
originally announced June 2015.
-
Two Computational Models for Simulating the Tumbling Motion of Elongated Particles in Fluids
Authors:
Dominik Bartuschat,
Ellen Fischermeier,
Katarina Gustavsson,
Ulrich Rüde
Abstract:
Suspensions with fiber-like particles in the low Reynolds number regime are modeled by two different approaches that both use a Lagrangian representation of individual particles. The first method is the well-established formulation based on Stokes flow that is formulated as integral equations. It uses a slender body approximation for the fibers to represent the interaction between them directly without explicitly computing the flow field. The second is a new technique using the 3D lattice Boltzmann method on parallel supercomputers. Here the flow computation is coupled to a computational model of the dynamics of rigid bodies using fluid-structure interaction techniques. Both methods can be applied to simulate fibers in fluid flow. They are carefully validated and compared against each other, exposing systematically their strengths and weaknesses regarding their accuracy, the computational cost, and possible model extensions.
Submitted 23 March, 2015;
originally announced March 2015.
-
Resilience for Exascale Enabled Multigrid Methods
Authors:
Markus Huber,
Björn Gmeiner,
Ulrich Rüde,
Barbara Wohlmuth
Abstract:
With the increasing number of components and further miniaturization the mean time between faults in supercomputers will decrease. System level fault tolerance techniques are expensive and cost energy, since they are often based on redundancy. Also classical check-point-restart techniques reach their limits when the time for storing the system state to backup memory becomes excessive. Therefore, algorithm-based fault tolerance mechanisms can become an attractive alternative. This article investigates the solution process for elliptic partial differential equations that are discretized by finite elements. Faults that occur in the parallel geometric multigrid solver are studied in various model scenarios. In a standard domain partitioning approach, the impact of a failure of a core or a node will affect one or several subdomains. Different strategies are developed to compensate the effect of such a failure algorithmically. The recovery is achieved by solving a local subproblem with Dirichlet boundary conditions using local multigrid cycling algorithms. Additionally, we propose a superman strategy where extra compute power is employed to minimize the time of the recovery process.
Submitted 29 January, 2015;
originally announced January 2015.
-
Ultrascale Simulations of Non-smooth Granular Dynamics
Authors:
Tobias Preclik,
Ulrich Rüde
Abstract:
This article presents new algorithms for massively parallel granular dynamics simulations on distributed memory architectures using a domain partitioning approach. Collisions are modelled with hard contacts in order to hide their micro-dynamics and thus to extend the time and length scales that can be simulated. The multi-contact problem is solved using a non-linear block Gauss-Seidel method that conforms to the subdomain structure. The parallel algorithms employ a sophisticated protocol between processors that delegates algorithmic tasks such as contact treatment and position integration uniquely and robustly to the processors. Communication overhead is minimized through aggressive message aggregation, leading to excellent strong and weak scaling. The robustness and scalability are assessed on three clusters including two peta-scale supercomputers with up to 458752 processor cores. The simulations can reach unprecedented resolution of up to ten billion non-spherical particles and contacts.
Submitted 23 January, 2015;
originally announced January 2015.
-
Parallel Multiphysics Simulations of Charged Particles in Microfluidic Flows
Authors:
Dominik Bartuschat,
Ulrich Rüde
Abstract:
The article describes parallel multiphysics simulations of charged particles in microfluidic flows with the waLBerla framework. To this end, three physical effects are coupled: rigid body dynamics, fluid flow modelled by a lattice Boltzmann algorithm, and electric potentials represented by a finite volume discretisation. For solving the finite volume discretisation for the electrostatic forces, a cell-centered multigrid algorithm is developed that conforms to the lattice Boltzmann meshes and the parallel communication structure of waLBerla. The new functionality is validated with suitable benchmark scenarios. Additionally, the parallel scaling and the numerical efficiency of the algorithms are analysed on an advanced supercomputer.
Submitted 24 October, 2014;
originally announced October 2014.
-
A Scala Prototype to Generate Multigrid Solver Implementations for Different Problems and Target Multi-Core Platforms
Authors:
Harald Koestler,
Christian Schmitt,
Sebastian Kuckuk,
Frank Hannig,
Juergen Teich,
Ulrich Ruede
Abstract:
Many problems in computational science and engineering involve partial differential equations and thus require the numerical solution of large, sparse (non)linear systems of equations. Multigrid is known to be one of the most efficient methods for this purpose. However, the concrete multigrid algorithm and its implementation highly depend on the underlying problem and hardware. Therefore, changes in the code or many different variants are necessary to cover all relevant cases. In this article we provide a prototype implementation in Scala for a framework that allows abstract descriptions of PDEs, their discretization, and their numerical solution via multigrid algorithms. From these, one is able to generate data structures and implementations of multigrid components required to solve elliptic PDEs on structured grids. Two different test problems showcase our proposed automatic generation of multigrid solvers for both CPU and GPU target platforms.
Submitted 20 June, 2014;
originally announced June 2014.
-
Numerical Investigations on Hatching Process Strategies for Powder Bed Based Additive Manufacturing using an Electron Beam
Authors:
Matthias Markl,
Regina Ammer,
Ulrich Rüde,
Carolin Körner
Abstract:
This paper investigates hatching process strategies for additive manufacturing using an electron beam by numerical simulations. The underlying physical model and the corresponding three dimensional thermal free surface lattice Boltzmann method of the simulation software are briefly presented. The simulation software has already been validated on the basis of experiments up to 1.2 kW beam power by hatching a cuboid with a basic process strategy, whereby the results are classified into `porous', `good' and `uneven', depending on their relative density and top surface smoothness. In this paper we study the limitations of this basic process strategy in terms of higher beam powers and scan velocities to exploit the future potential of high power electron beam guns up to 10 kW. Subsequently, we introduce modified process strategies, which circumvent these restrictions, to build the part as fast as possible under the restriction of a fully dense part with a smooth top surface. These process strategies are suitable to reduce the build time and costs, maximize the beam power usage and therefore use the potential of high power electron beam guns.
Submitted 30 March, 2015; v1 submitted 13 March, 2014;
originally announced March 2014.
-
Validation Experiments for LBM Simulations of Electron Beam Melting
Authors:
Regina Ammer,
Matthias Markl,
Vera Jüchter,
Carolin Körner,
Ulrich Rüde
Abstract:
This paper validates 3D simulation results of electron beam melting (EBM) processes by comparing experimental and numerical data. The physical setup is presented and discretized by a three dimensional (3D) thermal lattice Boltzmann method (LBM). An experimental process window is used for the validation depending on the line energy injected into the metal powder bed and the scan velocity of the electron beam. In the process window the EBM products are classified into the categories porous, good, and swelling, depending on the quality of the surface. The same parameter sets are used to generate a numerical process window. A comparison of numerical and experimental process windows shows a good agreement. This validates the EBM model and justifies simulations for future improvements of EBM processes. In particular, numerical simulations can be used to explain future process window scenarios and find the best parameter set for a good surface quality and dense products.
Submitted 11 February, 2014;
originally announced February 2014.
-
Model-guided Performance Analysis of the Sparse Matrix-Matrix Multiplication
Authors:
Tobias Scharpff,
Klaus Iglberger,
Georg Hager,
Ulrich Ruede
Abstract:
Achieving high efficiency with numerical kernels for sparse matrices is of utmost importance, since they are part of many simulation codes and tend to use most of the available compute time and resources. In addition, especially in large scale simulation frameworks the readability and ease of use of mathematical expressions are essential components for the continuous maintenance, modification, and extension of software. In this context, the sparse matrix-matrix multiplication is of special interest. In this paper we thoroughly analyze the single-core performance of sparse matrix-matrix multiplication kernels in the Blaze Smart Expression Template (SET) framework. We develop simple models for estimating the achievable maximum performance, and use them to assess the efficiency of our implementations. Additionally, we compare these kernels with several commonly used SET-based C++ libraries, which, just as Blaze, aim at combining the requirements of high performance with an elegant user interface. For the different sparse matrix structures considered here, we show that our implementations are competitive with or faster than those of the other SET libraries for most problem sizes on a current Intel multicore processor.
Submitted 6 May, 2013; v1 submitted 7 March, 2013;
originally announced March 2013.
-
Liquid-gas-solid flows with lattice Boltzmann: Simulation of floating bodies
Authors:
Simon Bogner,
Ulrich Rüde
Abstract:
This paper presents a model for the simulation of liquid-gas-solid flows by means of the lattice Boltzmann method. The approach is built upon previous works for the simulation of liquid-solid particle suspensions on the one hand, and on a liquid-gas free surface model on the other. We show how the two approaches can be unified by a novel set of dynamic cell conversion rules. For evaluation, we concentrate on the rotational stability of non-spherical rigid bodies floating on a plane water surface - a classical hydrostatic problem known from naval architecture. We show the consistency of our method for this kind of flow and obtain convergence towards the ideal solution for the measured heeling stability of a floating box.
Submitted 1 January, 2012;
originally announced January 2012.
-
All good things come in threes - Three beads learn to swim with lattice Boltzmann and a rigid body solver
Authors:
Kristina Pickl,
Jan Götz,
Klaus Iglberger,
Jayant Pande,
Klaus Mecke,
Ana-Suncana Smith,
Ulrich Rüde
Abstract:
We simulate the self-propulsion of devices in a fluid in the regime of low Reynolds numbers. Each device consists of three bodies (spheres or capsules) connected with two damped harmonic springs. Sinusoidal driving forces compress the springs which are resolved within a rigid body physics engine. The latter is consistently coupled to a 3D lattice Boltzmann framework for the fluid dynamics. In simulations of three-sphere devices, we find that the propulsion velocity agrees well with theoretical predictions. In simulations where some or all spheres are replaced by capsules, we find that the asymmetry of the design strongly affects the propelling efficiency.
Submitted 3 August, 2011;
originally announced August 2011.
-
Expression Templates Revisited: A Performance Analysis of the Current ET Methodology
Authors:
Klaus Iglberger,
Georg Hager,
Jan Treibig,
Ulrich Ruede
Abstract:
In the last decade, Expression Templates (ET) have gained a reputation as an efficient performance optimization tool for C++ codes. This reputation builds on several ET-based linear algebra frameworks focused on combining both elegant and high-performance C++ code. However, on closer examination the assumption that ETs are a performance optimization technique cannot be maintained. In this paper we demonstrate and explain the inability of current ET-based frameworks to deliver high performance for dense and sparse linear algebra operations, and introduce a new "smart" ET implementation that truly allows the combination of high performance code with the elegance and maintainability of a domain-specific language.
Submitted 9 April, 2011;
originally announced April 2011.