-
Safe, Seamless, And Scalable Integration Of Asynchronous GPU Streams In PETSc
Authors:
Jacob Faibussowitsch,
Mark F. Adams,
Richard Tran Mills,
Stefano Zampini,
Junchao Zhang
Abstract:
Leveraging Graphics Processing Units (GPUs) to accelerate scientific software has proven to be highly successful, but in order to extract more performance, GPU programmers must overcome the high latency costs associated with their use. One method of reducing or hiding this latency cost is to use asynchronous streams to issue commands to the GPU. While performant, the streams model is an invasive abstraction, and has therefore proven difficult to integrate into general-purpose libraries. In this work, we enumerate the difficulties specific to library authors in adopting streams, and present recent work on addressing them. Finally, we present a unified asynchronous programming model for use in the Portable, Extensible, Toolkit for Scientific Computation (PETSc) to overcome these challenges. The new model shows broad performance benefits while remaining ergonomic to the user.
Submitted 30 June, 2023;
originally announced June 2023.
-
A Numerical Study of Landau Damping with PETSc-PIC
Authors:
Daniel S. Finn,
Matthew G. Knepley,
Joseph V. Pusztay,
Mark F. Adams
Abstract:
We present a study of the standard plasma physics test, Landau damping, using the Particle-In-Cell (PIC) algorithm. The Landau damping phenomenon consists of the damping of small oscillations in plasmas without collisions. In the PIC method, a hybrid discretization is constructed with a grid of finitely supported basis functions to represent the electric, magnetic and/or gravitational fields, and a distribution of delta functions to represent the particle field. Approximations to the dispersion relation are found to be inadequate in accurately calculating values for the electric field frequency and damping rate when parameters of the physical system, such as the plasma frequency or thermal velocity, are varied. We present a full derivation and numerical solution for the dispersion relation, and verify the PETSc-PIC numerical solutions to the Vlasov-Poisson system for a large range of wave numbers and charge densities.
Submitted 22 March, 2023;
originally announced March 2023.
-
A bespoke multigrid approach for magnetohydrodynamics models of magnetized plasmas in PETSc
Authors:
Mark F. Adams,
Matthew K. Knepley
Abstract:
Fully realizing the potential of multigrid solvers often requires custom algorithms for a given application model, discretizations and even regimes of interest, despite considerable effort from the applied math community to develop fully algebraic multigrid (AMG) methods for almost 40 years. Classic geometric multigrid (GMG) has been effectively applied to challenging, non-elliptic problems in engineering and scientifically relevant codes, but application specific algorithms are generally required that do not lend themselves to deployment in numerical libraries. However, tools in libraries that support discretizations, distributed mesh management and high performance computing (HPC) can be used to develop such solvers.
This report develops a magnetohydrodynamics (MHD) code in PETSc (Portable Extensible Toolkit for Scientific computing) with a fully integrated GMG solver that is designed to demonstrate the potential of our approach to providing fast and robust solvers for production applications. These applications must, however, be able to provide, in addition to the Jacobian matrix and residual of a pure AMG solver, a hierarchy of meshes and knowledge of the application's equations and discretization. We present an example of a 2D, two-field reduced resistive MHD model, built with existing tools in PETSc and verified with a "tilt" instability problem that is well documented in the literature; it is available as an example in the PETSc repository (src/ts/tutorials/ex48.c). Preliminary CPU-only performance data demonstrate that the solver can be robust and scalable for the model problem when it is pushed into a regime with highly localized current sheets, which generate strong, localized nonlinearity that is a challenge for iterative solvers.
Submitted 21 February, 2023;
originally announced February 2023.
-
A performance portable, fully implicit Landau collision operator with batched linear solvers
Authors:
Mark F. Adams,
Peng Wang,
Jacob Merson,
Kevin Huck,
Matthew G. Knepley
Abstract:
Modern accelerators use hierarchical parallel programming models that enable massive multithreading within a processing element (PE), with multiple PEs per device driven by traditional processes. Batching is a technique for exposing PE-level parallelism in algorithms that have traditionally run on MPI processes or multiple threads within a single process. Opportunities for batching arise in, for example, kinetic discretizations of magnetized plasmas where collisions are advanced in velocity space at each spatial point independently.
This paper builds on previous work on a high-performance, fully nonlinear, Landau collision operator by batching the linear solver, as well as batching the spatial point problems and adding new support for multiple grids for multiscale, multi-species problems. An anisotropic relaxation verification test that agrees well with previously published results and analytical models is presented. The performance results from NVIDIA A100 and AMD MI250X nodes are presented with hardware utilization analysis for each architecture. The entire implicit Landau operator time advance is implemented in Kokkos for performance portability, runs entirely on the device, and is available in the PETSc numerical library.
Submitted 8 July, 2024; v1 submitted 7 September, 2022;
originally announced September 2022.
-
Conservative Projection Between Finite Element and Particle Bases
Authors:
Joseph V. Pusztay,
Matthew G. Knepley,
Mark F. Adams
Abstract:
Particle-in-Cell (PIC) methods employ particle representations of unknown fields, but also employ continuum fields for other parts of the problem. Thus projection between particle and continuum bases is required. Moreover, we often need to enforce conservation constraints on this projection. We derive a mechanism for enforcement based on weak equality, and implement it in the PETSc libraries. Scalability is demonstrated to more than 1B particles.
Submitted 12 May, 2022;
originally announced May 2022.
-
Verification of a Fully Implicit Particle-in-Cell Method for the $v_\parallel$ Formalism of Electromagnetic Gyrokinetics in the XGC Code
Authors:
Benjamin J. Sturdevant,
S. Ku,
L. Chacón,
Y. Chen,
D. Hatch,
M. D. J. Cole,
A. Y. Sharma,
M. F. Adams,
C. S. Chang,
S. E. Parker,
R. Hager
Abstract:
A fully implicit particle-in-cell method for handling the $v_\parallel$-formalism of electromagnetic gyrokinetics has been implemented in XGC. By choosing the $v_\parallel$-formalism, we avoid introducing the non-physical skin terms in Ampère's law, which are responsible for the well-known "cancellation problem" in the $p_\parallel$-formalism. The $v_\parallel$-formalism, however, is known to suffer from a numerical instability when explicit time integration schemes are used due to the appearance of a time derivative in the particle equations of motion from the inductive component of the electric field. Here, using the conventional $δf$ scheme, we demonstrate that our implicitly discretized algorithm can provide numerically stable simulation results with accurate dispersive properties. We verify the algorithm using a test case for shear Alfvén wave propagation in addition to a case demonstrating the ITG-KBM transition. The ITG-KBM transition case is compared to results obtained from other $δf$ gyrokinetic codes/schemes, whose verification has already been archived in the literature.
Submitted 11 June, 2021;
originally announced June 2021.
-
Exascale Landau collision operator in the Cuda programming model applied to thermal quench plasmas
Authors:
M. F. Adams,
D. P. Brennan,
M. G. Knepley,
P. Wang
Abstract:
Collisional processes are critical in the understanding of non-Maxwellian plasmas. The Landau form of the Fokker-Planck equation is the gold standard for modeling collisions in most plasmas; however, its O(N^2) work complexity inhibits its widespread use. We show that with advanced numerical methods and GPU hardware this cost can be effectively mitigated. This paper extends previous work on a conservative, high order accurate, finite element discretization with adaptive mesh refinement of the Landau operator, with extensions to GPU hardware and implementations in both the CUDA and Kokkos programming languages. This work focuses on the Landau kernels and on NVIDIA hardware; however, preliminary results on AMD and Fujitsu/ARM hardware, as well as end-to-end performance of a velocity space model of a plasma thermal quench, are also presented. Both the fully implicit Landau time integrator and the plasma thermal quench model are publicly available in PETSc (Portable, Extensible, Toolkit for Scientific computing).
Submitted 18 May, 2022; v1 submitted 7 April, 2021;
originally announced April 2021.
-
Implementation of higher-order velocity mapping between marker particles and grid in the particle-in-cell code XGC
Authors:
Albert Mollén,
M. F. Adams,
M. G. Knepley,
R. Hager,
C. S. Chang
Abstract:
The global total-$f$ gyrokinetic particle-in-cell code XGC, used to study transport in magnetic fusion plasmas, implements a continuum grid to perform the dissipative operations, such as plasma collisions. To transfer the distribution function between marker particles and a rectangular velocity-space grid, XGC employs a bilinear mapping. The conservation of particle density and momentum is accurate enough in this bilinear operation, but the error in the particle energy conservation can become undesirably large in special conditions. In the present work we update XGC to use a novel mapping technique, based on the calculation of a pseudo-inverse, to exactly preserve moments up to the order of the discretization space. We describe the details of the implementation and we demonstrate the reduced interpolation error for a neoclassical tokamak test case by using $1^{\mathrm{st}}$- and $2^{\mathrm{nd}}$-order elements with the pseudo-inverse method and comparing to the bilinear mapping.
Submitted 21 December, 2020;
originally announced December 2020.
-
Toward Performance-Portable PETSc for GPU-based Exascale Systems
Authors:
Richard Tran Mills,
Mark F. Adams,
Satish Balay,
Jed Brown,
Alp Dener,
Matthew Knepley,
Scott E. Kruger,
Hannah Morgan,
Todd Munson,
Karl Rupp,
Barry F. Smith,
Stefano Zampini,
Hong Zhang,
Junchao Zhang
Abstract:
The Portable Extensible Toolkit for Scientific computation (PETSc) library delivers scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization. The PETSc design for performance portability addresses fundamental GPU accelerator challenges and stresses flexibility and extensibility by separating the programming model used by the application from that used by the library, and it enables application developers to use their preferred programming model, such as Kokkos, RAJA, SYCL, HIP, CUDA, or OpenCL, on upcoming exascale systems. A blueprint for using GPUs from PETSc-based codes is provided, and case studies emphasize the flexibility and high performance achieved on current GPU-based systems.
Submitted 29 September, 2021; v1 submitted 1 November, 2020;
originally announced November 2020.
-
Landau Collision Integral Solver with Adaptive Mesh Refinement on Emerging Architectures
Authors:
M. F. Adams,
E. Hirvijoki,
M. G. Knepley,
J. Brown,
T. Isaac,
R. Mills
Abstract:
The Landau collision integral is an accurate model for the small-angle dominated Coulomb collisions in fusion plasmas. We investigate a high order accurate, fully conservative, finite element discretization of the nonlinear multi-species Landau integral with adaptive mesh refinement using the PETSc library (www.mcs.anl.gov/petsc). We develop algorithms and techniques to efficiently utilize emerging architectures with an approach that minimizes memory usage and movement and is suitable for vector processing. The Landau collision integral is vectorized with Intel AVX-512 intrinsics and the solver sustains as much as 22% of the theoretical peak flop rate of the Second Generation Intel Xeon Phi, Knights Landing, processor.
Submitted 28 February, 2017; v1 submitted 27 February, 2017;
originally announced February 2017.
-
Verification of long wavelength electromagnetic modes with a gyrokinetic-fluid hybrid model in the XGC code
Authors:
Robert Hager,
Jianying Lang,
Choong-Seock Chang,
Seung-Hoe Ku,
Yang Chen,
Scott E. Parker,
Mark F. Adams
Abstract:
As an alternative option to kinetic electrons, the gyrokinetic total-f particle-in-cell (PIC) code XGC1 has been extended to the MHD/fluid type electromagnetic regime by combining gyrokinetic PIC ions with massless drift-fluid electrons analogous to Chen and Parker, Physics of Plasmas 8, 441 (2001). Two representative long wavelength modes, shear Alfvén waves and resistive tearing modes, are verified in cylindrical and toroidal magnetic field geometries.
Submitted 16 February, 2017;
originally announced February 2017.
-
Scalable smoothing strategies for a geometric multigrid method for the immersed boundary equations
Authors:
Amneet Pal Singh Bhalla,
Matthew G. Knepley,
Mark F. Adams,
Robert D. Guy,
Boyce E. Griffith
Abstract:
The immersed boundary (IB) method is a widely used approach to simulating fluid-structure interaction (FSI). Although explicit versions of the IB method can suffer from severe time step size restrictions, these methods remain popular because of their simplicity and generality. In prior work (Guy et al., Adv Comput Math, 2015), some of us developed a geometric multigrid preconditioner for a stable semi-implicit IB method under Stokes flow conditions; however, this solver methodology used a Vanka-type smoother that presented limited opportunities for parallelization. This work extends this Stokes-IB solver methodology by developing smoothing techniques that are suitable for parallel implementation. Specifically, we demonstrate that an additive version of the Vanka smoother can yield an effective multigrid preconditioner for the Stokes-IB equations, and we introduce an efficient Schur complement-based smoother that is also shown to be effective for the Stokes-IB equations. We investigate the performance of these solvers for a broad range of material stiffnesses, both for Stokes flows and flows at nonzero Reynolds numbers, and for thick and thin structural models. We show here that linear solver performance degrades with increasing Reynolds number and material stiffness, especially for thin interface cases. Nonetheless, the proposed approaches promise to yield effective solution algorithms, especially at lower Reynolds numbers and at modest-to-high elastic stiffnesses.
Submitted 7 December, 2016;
originally announced December 2016.
-
Segmental Refinement: A Multigrid Technique for Data Locality
Authors:
Mark F. Adams,
Jed Brown,
Matt Knepley,
Ravi Samtaney
Abstract:
We investigate a domain decomposed multigrid technique, segmental refinement, for solving general nonlinear elliptic boundary value problems. Brandt and Diskin first proposed this method in 1994; we continue this work by analytically and experimentally investigating its complexity. We confirm that communication of traditional parallel multigrid can be eliminated on fine grids with modest amounts of extra work and storage while maintaining the asymptotic exactness of full multigrid, although we observe a dependence on an additional parameter not considered in the original analysis. We present a communication complexity analysis that quantifies the communication costs ameliorated by segmental refinement and report performance results with up to 64K cores of a Cray XC30.
Submitted 11 August, 2015; v1 submitted 30 June, 2014;
originally announced June 2014.
-
A low memory, highly concurrent multigrid algorithm
Authors:
Mark F. Adams
Abstract:
We examine what is an efficient and scalable nonlinear solver, with low work and memory complexity, for many classes of discretized partial differential equations (PDEs) - matrix-free Full multigrid (FMG) with a Full Approximation Storage (FAS) - in the context of current trends in computer architectures. Brandt proposed an extremely low memory FMG-FAS algorithm over 25 years ago that has several attractive properties for reducing costs on modern, memory-centric machines and has not been developed to our knowledge. This method, segmental refinement (SR), has very low memory requirements because the finest grids need not be held in memory at any one time but can be "swept" through, computing coarse grid correction and any quantities of interest, allowing for orders of magnitude reduction in memory usage. This algorithm has two useful ideas for effectively exploiting future architectures: improved data locality and reuse via "vertical" processing of the multigrid algorithms and the method of $τ$-corrections, which allows for not storing the entire fine grids at any one time. This report develops this algorithm for a model problem and a parallel generalization of the original sweeping technique. We show that FMG-FAS-SR can work as originally predicted, solving systems accurately enough to maintain the convergence rate of the discretization with one FMG iteration, and that the parallel algorithm provides a natural approach to fully exploiting the available parallelism of FMG.
Submitted 8 November, 2012; v1 submitted 28 July, 2012;
originally announced July 2012.