Safe, Seamless, And Scalable Integration Of Asynchronous GPU Streams In PETSc

Authors: Jacob Faibussowitsch, Mark F. Adams, Richard Tran Mills, Stefano Zampini, Junchao Zhang

Abstract: Leveraging Graphics Processing Units (GPUs) to accelerate scientific software has proven to be highly successful, but in order to extract more performance, GPU programmers must overcome the high latency costs associated with their use. One method of reducing or hiding this latency cost is to use asynchronous streams to issue commands to the GPU. While performant, the streams model is an invasive abstraction, and has therefore proven difficult to integrate into general-purpose libraries. In this work, we enumerate the difficulties specific to library authors in adopting streams, and present recent work on addressing them. Finally, we present a unified asynchronous programming model for use in the Portable, Extensible, Toolkit for Scientific Computation (PETSc) to overcome these challenges. The new model shows broad performance benefits while remaining ergonomic to the user.

Submitted 30 June, 2023; originally announced June 2023.

arXiv:2303.12620 [pdf, other]

A Numerical Study of Landau Damping with PETSc-PIC

Authors: Daniel S. Finn, Matthew G. Knepley, Joseph V. Pusztay, Mark F. Adams

Abstract: We present a study of the standard plasma physics test, Landau damping, using the Particle-In-Cell (PIC) algorithm. The Landau damping phenomenon consists of the damping of small oscillations in plasmas without collisions. In the PIC method, a hybrid discretization is constructed with a grid of finitely supported basis functions to represent the electric, magnetic and/or gravitational fields, and a distribution of delta functions to represent the particle field. Approximations to the dispersion relation are found to be inadequate in accurately calculating values for the electric field frequency and damping rate when parameters of the physical system, such as the plasma frequency or thermal velocity, are varied. We present a full derivation and numerical solution for the dispersion relation, and verify the PETSc-PIC numerical solutions to the Vlasov-Poisson for a large range of wave numbers and charge densities.

Submitted 22 March, 2023; originally announced March 2023.

Comments: 14 pages, 7 figures

arXiv:2302.10242 [pdf, other]

A bespoke multigrid approach for magnetohydrodynamics models of magnetized plasmas in PETSc

Authors: Mark F. Adams, Matthew G. Knepley

Abstract: Fully realizing the potential of multigrid solvers often requires custom algorithms for a given application model, discretizations and even regimes of interest, despite considerable effort from the applied math community to develop fully algebraic multigrid (AMG) methods for almost 40 years. Classic geometric multigrid (GMG) has been effectively applied to challenging, non-elliptic problems in engineering and scientifically relevant codes, but application specific algorithms are generally required that do not lend themselves to deployment in numerical libraries. However, tools in libraries that support discretizations, distributed mesh management and high performance computing (HPC) can be used to develop such solvers. This report develops a magnetohydrodynamics (MHD) code in PETSc (Portable Extensible Toolkit for Scientific computing) with a fully integrated GMG solver that is designed to demonstrate the potential of our approach to providing fast and robust solvers for production applications. These applications must, however, be able to provide, in addition to the Jacobian matrix and residual of a pure AMG solver, a hierarchy of meshes and knowledge of the application's equations and discretization. An example of a 2D, two field reduced resistive MHD model, using existing tools in PETSc that is verified with a "tilt" instability problem that is well documented in the literature is presented and is an example in the PETSc repository (src/ts/tutorials/ex48.c). Preliminary CPU-only performance data demonstrates that the solver can be robust and scalable for the model problem that is pushed into a regime with highly localized current sheets, which generates strong, localized non-linearity, that is a challenge for iterative solvers.

Submitted 21 February, 2023; originally announced February 2023.

arXiv:2209.03228 [pdf, other]

A performance portable, fully implicit Landau collision operator with batched linear solvers

Authors: Mark F. Adams, Peng Wang, Jacob Merson, Kevin Huck, Matthew G. Knepley

Abstract: Modern accelerators use hierarchical parallel programming models that enable massive multithreading within a processing element (PE), with multiple PEs per device driven by traditional processes. Batching is a technique for exposing PE-level parallelism in algorithms that have traditionally run on MPI processes or multiple threads within a single process. Opportunities for batching arise in, for example, kinetic discretizations of magnetized plasmas where collisions are advanced in velocity space at each spatial point independently. This paper builds on previous work on a high-performance, fully nonlinear, Landau collision operator by batching the linear solver, as well as batching the spatial point problems and adding new support for multiple grids for multiscale, multi-species problems. An anisotropic relaxation verification test that agrees well with previous published results and analytical models is presented. The performance results from NVIDIA A100 and AMD MI250X nodes are presented with hardware utilization analysis for each architecture. The entire implicit Landau operator time advance is implemented in Kokkos for performance portability, running entirely on the device and is available in the PETSc numerical library.

Submitted 8 July, 2024; v1 submitted 7 September, 2022; originally announced September 2022.

arXiv:2205.06402 [pdf, other]

doi 10.1137/21M1454079

Conservative Projection Between Finite Element and Particle Bases

Authors: Joseph V. Pusztay, Matthew G. Knepley, Mark F. Adams

Abstract: Particle-in-Cell (PIC) methods employ particle representations of unknown fields, but also employ continuum fields for other parts of the problem. Thus projection between particle and continuum bases is required. Moreover, we often need to enforce conservation constraints on this projection. We derive a mechanism for enforcement based on weak equality, and implement it in the PETSc libraries. Scalability is demonstrated to more than 1B particles.

Submitted 12 May, 2022; originally announced May 2022.

arXiv:2106.06681 [pdf, other]

doi 10.1063/5.0047842

Verification of a Fully Implicit Particle-in-Cell Method for the $v_\parallel$ Formalism of Electromagnetic Gyrokinetics in the XGC Code

Authors: Benjamin J. Sturdevant, S. Ku, L. Chacón, Y. Chen, D. Hatch, M. D. J. Cole, A. Y. Sharma, M. F. Adams, C. S. Chang, S. E. Parker, R. Hager

Abstract: A fully implicit particle-in-cell method for handling the $v_\parallel$-formalism of electromagnetic gyrokinetics has been implemented in XGC. By choosing the $v_\parallel$-formalism, we avoid introducing the non-physical skin terms in Ampère's law, which are responsible for the well-known "cancellation problem" in the $p_\parallel$-formalism. The $v_\parallel$-formalism, however, is known to suffer from a numerical instability when explicit time integration schemes are used due to the appearance of a time derivative in the particle equations of motion from the inductive component of the electric field. Here, using the conventional $δf$ scheme, we demonstrate that our implicitly discretized algorithm can provide numerically stable simulation results with accurate dispersive properties. We verify the algorithm using a test case for shear Alfvén wave propagation in addition to a case demonstrating the ITG-KBM transition. The ITG-KBM transition case is compared to results obtained from other $δf$ gyrokinetic codes/schemes, whose verification has already been archived in the literature.

Submitted 11 June, 2021; originally announced June 2021.

arXiv:2104.10000 [pdf, other]

doi 10.1109/ipdps53621.2022.00020

Exascale Landau collision operator in the Cuda programming model applied to thermal quench plasmas

Authors: M. F. Adams, D. P. Brennan, M. G. Knepley, P. Wang

Abstract: Collisional processes are critical in the understanding of non-Maxwellian plasmas. The Landau form of the Fokker-Planck equation is the gold standard for modeling collisions in most plasmas, however O(N^2) work complexity inhibits its widespread use. We show that with advanced numerical methods and GPU hardware this cost can be effectively mitigated. This paper extends previous work on a conservative, high order accurate, finite element discretization with adaptive mesh refinement of the Landau operator, with extensions to GPU hardware and implementations in both the CUDA and Kokkos programming languages. This work focuses on the Landau kernels and on NVIDIA hardware, however preliminary results on AMD and Fujitsu/ARM hardware, as well as end-to-end performance of a velocity space model of a plasma thermal quench, are also presented. Both the fully implicit Landau time integrator and the plasma thermal quench model are publicly available in PETSc (Portable, Extensible, Toolkit for Scientific computing).

Submitted 18 May, 2022; v1 submitted 7 April, 2021; originally announced April 2021.

Journal ref: IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2022

arXiv:2012.11764 [pdf, other]

doi 10.1017/S0022377821000441

Implementation of higher-order velocity mapping between marker particles and grid in the particle-in-cell code XGC

Authors: Albert Mollén, M. F. Adams, M. G. Knepley, R. Hager, C. S. Chang

Abstract: The global total-$f$ gyrokinetic particle-in-cell code XGC, used to study transport in magnetic fusion plasmas, implements a continuum grid to perform the dissipative operations, such as plasma collisions. To transfer the distribution function between marker particles and a rectangular velocity-space grid, XGC employs a bilinear mapping. The conservation of particle density and momentum is accurate enough in this bilinear operation, but the error in the particle energy conservation can become undesirably large in special conditions. In the present work we update XGC to use a novel mapping technique, based on the calculation of a pseudo-inverse, to exactly preserve moments up to the order of the discretization space. We describe the details of the implementation and we demonstrate the reduced interpolation error for a neoclassical tokamak test case by using $1^{\mathrm{st}}$- and $2^{\mathrm{nd}}$-order elements with the pseudo-inverse method and comparing to the bilinear mapping.

Submitted 21 December, 2020; originally announced December 2020.

Comments: 21 pages, 7 figures

arXiv:2011.00715 [pdf, other]

Toward Performance-Portable PETSc for GPU-based Exascale Systems

Authors: Richard Tran Mills, Mark F. Adams, Satish Balay, Jed Brown, Alp Dener, Matthew Knepley, Scott E. Kruger, Hannah Morgan, Todd Munson, Karl Rupp, Barry F. Smith, Stefano Zampini, Hong Zhang, Junchao Zhang

Abstract: The Portable Extensible Toolkit for Scientific computation (PETSc) library delivers scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization. The PETSc design for performance portability addresses fundamental GPU accelerator challenges and stresses flexibility and extensibility by separating the programming model used by the application from that used by the library, and it enables application developers to use their preferred programming model, such as Kokkos, RAJA, SYCL, HIP, CUDA, or OpenCL, on upcoming exascale systems. A blueprint for using GPUs from PETSc-based codes is provided, and case studies emphasize the flexibility and high performance achieved on current GPU-based systems.

Submitted 29 September, 2021; v1 submitted 1 November, 2020; originally announced November 2020.

Comments: 15 pages, 10 figures, 2 tables

Report number: ANL/MCS-P9401-1020 MSC Class: 65F10; 65F50; 68N99; 68W10 ACM Class: G.4

arXiv:1702.08880 [pdf, other]

doi 10.1137/17M1118828

Landau Collision Integral Solver with Adaptive Mesh Refinement on Emerging Architectures

Authors: M. F. Adams, E. Hirvijoki, M. G. Knepley, J. Brown, T. Isaac, R. Mills

Abstract: The Landau collision integral is an accurate model for the small-angle dominated Coulomb collisions in fusion plasmas. We investigate a high order accurate, fully conservative, finite element discretization of the nonlinear multi-species Landau integral with adaptive mesh refinement using the PETSc library (www.mcs.anl.gov/petsc). We develop algorithms and techniques to efficiently utilize emerging architectures with an approach that minimizes memory usage and movement and is suitable for vector processing. The Landau collision integral is vectorized with Intel AVX-512 intrinsics and the solver sustains as much as 22% of the theoretical peak flop rate of the Second Generation Intel Xeon Phi, Knights Landing, processor.

Submitted 28 February, 2017; v1 submitted 27 February, 2017; originally announced February 2017.

Journal ref: SIAM Journal on Scientific Computing, 39 (6), 2017

arXiv:1702.05182 [pdf, other]

doi 10.1063/1.4983320

Verification of long wavelength electromagnetic modes with a gyrokinetic-fluid hybrid model in the XGC code

Authors: Robert Hager, Jianying Lang, Choong-Seock Chang, Seung-Hoe Ku, Yang Chen, Scott E. Parker, Mark F. Adams

Abstract: As an alternative option to kinetic electrons, the gyrokinetic total-f particle-in-cell (PIC) code XGC1 has been extended to the MHD/fluid type electromagnetic regime by combining gyrokinetic PIC ions with massless drift-fluid electrons analogous to Chen and Parker, Physics of Plasmas 8, 441 (2001). Two representative long wavelength modes, shear Alfvén waves and resistive tearing modes, are verified in cylindrical and toroidal magnetic field geometries.

Submitted 16 February, 2017; originally announced February 2017.

arXiv:1612.02208 [pdf, ps, other]

Scalable smoothing strategies for a geometric multigrid method for the immersed boundary equations

Authors: Amneet Pal Singh Bhalla, Matthew G. Knepley, Mark F. Adams, Robert D. Guy, Boyce E. Griffith

Abstract: The immersed boundary (IB) method is a widely used approach to simulating fluid-structure interaction (FSI). Although explicit versions of the IB method can suffer from severe time step size restrictions, these methods remain popular because of their simplicity and generality. In prior work (Guy et al., Adv Comput Math, 2015), some of us developed a geometric multigrid preconditioner for a stable semi-implicit IB method under Stokes flow conditions; however, this solver methodology used a Vanka-type smoother that presented limited opportunities for parallelization. This work extends this Stokes-IB solver methodology by developing smoothing techniques that are suitable for parallel implementation. Specifically, we demonstrate that an additive version of the Vanka smoother can yield an effective multigrid preconditioner for the Stokes-IB equations, and we introduce an efficient Schur complement-based smoother that is also shown to be effective for the Stokes-IB equations. We investigate the performance of these solvers for a broad range of material stiffnesses, both for Stokes flows and flows at nonzero Reynolds numbers, and for thick and thin structural models. We show here that linear solver performance degrades with increasing Reynolds number and material stiffness, especially for thin interface cases. Nonetheless, the proposed approaches promise to yield effective solution algorithms, especially at lower Reynolds numbers and at modest-to-high elastic stiffnesses.

Submitted 7 December, 2016; originally announced December 2016.

arXiv:1406.7808 [pdf, other]

doi 10.1137/140975127

Segmental Refinement: A Multigrid Technique for Data Locality

Authors: Mark F. Adams, Jed Brown, Matt Knepley, Ravi Samtaney

Abstract: We investigate a domain decomposed multigrid technique, segmental refinement, for solving general nonlinear elliptic boundary value problems. Brandt and Diskin first proposed this method in 1994; we continue this work by analytically and experimentally investigating its complexity. We confirm that communication of traditional parallel multigrid can be eliminated on fine grids with modest amounts of extra work and storage while maintaining the asymptotic exactness of full multigrid, although we observe a dependence on an additional parameter not considered in the original analysis. We present a communication complexity analysis that quantifies the communication costs ameliorated by segmental refinement and report performance results with up to 64K cores of a Cray XC30.

Submitted 11 August, 2015; v1 submitted 30 June, 2014; originally announced June 2014.

Journal ref: J. Sci. Comput. (2016) 38(4) C426-C440

arXiv:1207.6720 [pdf, ps, other]

doi 10.1137/140975127

A low memory, highly concurrent multigrid algorithm

Authors: Mark F. Adams

Abstract: We examine what is an efficient and scalable nonlinear solver, with low work and memory complexity, for many classes of discretized partial differential equations (PDEs) - matrix-free Full multigrid (FMG) with a Full Approximation Storage (FAS) - in the context of current trends in computer architectures. Brandt proposed an extremely low memory FMG-FAS algorithm over 25 years ago that has several attractive properties for reducing costs on modern, memory-centric machines and has not been developed to our knowledge. This method, segmental refinement (SR), has very low memory requirements because the finest grids need not be held in memory at any one time but can be "swept" through, computing coarse grid correction and any quantities of interest, allowing for orders of magnitude reduction in memory usage. This algorithm has two useful ideas for effectively exploiting future architectures: improved data locality and reuse via "vertical" processing of the multigrid algorithms and the method of $τ$-corrections, which allows for not storing the entire fine grids at any one time. This report develops this algorithm for a model problem and a parallel generalization of the original sweeping technique. We show that FMG-FAS-SR can work as originally predicted, solving systems accurately enough to maintain the convergence rate of the discretization with one FMG iteration, and that the parallel algorithm provides a natural approach to fully exploiting the available parallelism of FMG.

Submitted 8 November, 2012; v1 submitted 28 July, 2012; originally announced July 2012.

Journal ref: SIAM Journal on Scientific Computing, 38(4), 2016

Showing 1–14 of 14 results for author: Adams, M F