\corrauth

Richard Tran Mills, Mathematics and Computer Science Division, Argonne National Laboratory, Lemont, Illinois, USA.

PETSc/TAO Developments for Early Exascale Systems

Richard Tran Mills11affiliationmark:    Mark Adams22affiliationmark:    Satish Balay11affiliationmark:    Jed Brown33affiliationmark:    Jacob Faibussowitsch11affiliationmark:    Toby Isaac11affiliationmark:    Matthew Knepley44affiliationmark:    Todd Munson11affiliationmark:    Hansol Suh11affiliationmark:    Stefano Zampini55affiliationmark:    Hong Zhang11affiliationmark:    and Junchao Zhang11affiliationmark: 11affiliationmark: Argonne National Laboratory, Lemont, Illinois, USA
22affiliationmark: Lawrence Berkeley National Laboratory, Berkeley, California, USA
33affiliationmark: University of Colorado Boulder, Colorado, USA
44affiliationmark: University at Buffalo, Buffalo, New York, USA
55affiliationmark: King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
[email protected]
Abstract

The Portable Extensible Toolkit for Scientific Computation (PETSc) library provides scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization via the Toolkit for Advanced Optimization (TAO). PETSc is used in dozens of scientific fields and is an important building block for many simulation codes. During the U.S. Department of Energy’s Exascale Computing Project, the PETSc team has made substantial efforts to enable efficient utilization of the massive fine-grain parallelism present within exascale compute nodes and to enable performance portability across exascale architectures. We recap some of the challenges that designers of numerical libraries face in such an endeavor, and then discuss the many developments we have made, which include the addition of new GPU backends, features supporting efficient on-device matrix assembly, better support for asynchronicity and GPU kernel concurrency, and new communication infrastructure. We evaluate the performance of these developments on some pre-exascale systems as well the early exascale systems Frontier and Aurora, using compute kernel, communication layer, solver, and mini-application benchmark studies, and then close with a few observations drawn from our experiences on the tension between portable performance and other goals of numerical libraries.

keywords:
PETSc, GPU, Exascale Computing Project (ECP), performance portability, algebraic solvers

1 Introduction

In little more than a decade, we have seen a shift in high-performance computing (HPC) from complete reliance on central processing units (CPUs) to the incorporation of graphics processing units (GPUs) that provide the bulk of the computing power in most supercomputers. This kind of heterogeneous architecture with CPUs and GPUs is exemplified by the several high-end pre- or early exascale machines funded by the U.S. Department of Energy (DOE), including Summit at the Oak Ridge Leadership Computing Facility (OLCF) and Perlmutter at the National Energy Research Scientific Computing Center (NERSC) with NVIDIA GPUs, Frontier at OLCF and El Capitan at Lawrence Livermore National Laboratory (LLNL) with AMD GPUs, and Aurora at the Argonne Leadership Computing Facility (ALCF) with Intel GPUs. Exascale machines, with the capability to perform 1018superscript101810^{18}10 start_POSTSUPERSCRIPT 18 end_POSTSUPERSCRIPT floating point operations per second, represent a 1000x leap from previous petascale machines. The DOE Exascale Computing Project (ECP) (Kothe et al., 2019) aimed to prepare DOE mission-critical applications, an integrated software stack, and exascale hardware technology to form a capable exascale computing ecosystem. Although GPUs, sometimes referred to as “accelerators” or “devices” (in contrast to CPU “hosts”), have higher throughput and energy-efficiency than CPUs, they also bring profound programming challenges due to their architectural departures from CPUs and still-evolving software environments. GPUs use massive fine-grain parallelism, and thus programmers must write massively parallel code at the intra-node level in order to effectively utilize the compute power. GPUs usually have discrete device memory in addition to the main memory used by CPUs, and often the device and host cannot directly dereference pointers for the other memory. Programmers have to manage these two kinds of memory and differentiate pointers between them, e.g., when passing pointer arguments to routines. Though unified shared memory (USM), available on modern heterogeneous systems, enables the creation of a single address space accessible to both the CPU and GPU, simplifying memory management and data sharing between these processors, it may not provide transparent performance: programmers often must prefetch data to avoid performance penalties. More importantly, computational kernels on device are executed asynchronously with respect to the host process on CPUs, and kernels can be launched to different streams (like work queues) on a GPU and executed independently. Programmers must be aware of the asynchronicity and GPU streams, and synchronize their computations on host or on device properly to maintain correct code.

To program GPUs, developers can use vendor-provided programming models and libraries, such as CUDA, cuBLAS, and cuSparse from NVIDIA (NVIDA, 2024); HIP, hipBLAS, and hipSparse from AMD (AMD, 2024); and SYCL and OneAPI from Intel (Khronos SYCL Working Group, 2020). CUDA is proprietary and currently the most popular programming model for general-purpose processing on accelerators thanks to NVIDIA’s large market share. HIP, syntactically very similar to CUDA, allows developers to create portable applications for AMD and NVIDIA GPUs from single source code. On AMD GPUs, hipBLAS and hipSparse are simply wrappers over the rocBLAS and rocSparse libraries underneath from the AMD ROCm software stack for GPU programming. SYCL is a stark departure from CUDA and HIP and is advertised as an open standard allowing developers to use a C++ single source to program a wide range of targets, including CPUs, GPUs, digital signal processors (DSPs) and field programmable gate arrays (FPGAs). AMD and Intel also provide translation tools, such as hipify from AMD and SYCLomatic from Intel, to automatically convert source from CUDA to HIP or SYCL, respectively. Though these migration tools are helpful, none can guarantee automatic translation. Usually, programmers have to intervene to fix untranslatable code. To enhance performance portability across various GPUs and CPUs, ECP funded several projects such as RAJA (Beckingsale et al., 2019), Kokkos (Trott et al., 2022) and OpenMP target offload (OpenMP Architecture Review Board, 2021) compilers. RAJA and Kokkos use C++ templates and lambda functions to provide a portability layer in the front end but rely on vendor programming models and compilers in the backend. OpenMP target offload uses preprocessor directives (pragmas) in source code and thus requires compiler support. OpenMP has been a popular choice for Fortran applications, as Fortran support is poor in the other models. With all these options available, ECP applications and libraries were not required to use a certain programming model. The variety of programming languages, code complexity, varying package dependencies and large amount of legacy code precludes a one-size-fits-all solution.

PETSc/TAO (PETSc for short) (Balay et al., 2024), one of the math libraries funded by ECP, features scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization. PETSc is mainly written in the C programming language, though employing an object-oriented design. PETSc provides Fortran and Python bindings and interfaces with dozens of external packages in various languages, such as the direct solvers MUMPS (in Fortran) (Amestoy et al., 2001) and SuperLU (in C) (Demmel et al., 2024), the multigrid solver Hypre (in C) (Falgout, 2023), and the adaptive mesh refinement framework AMReX (in C++) (Zhang et al., 2019). PETSc, which has formed an ecosystem with applications, libraries and frameworks built upon it (Adams et al., 2022a), is widely used in academia, government laboratories, and industry and is an important building block for many simulation codes. Likewise, PETSc is part of the xSDK (xSDK Team, 2024) and E4S (E4S Team, 2024) ecosystems of complementary HPC packages.

Because PETSc functions as numerical infrastructure for many application packages, the library design completely separates the programming models used by applications (or external packages) and the models used by PETSc for its backend computational kernels. This flexibility is accomplished by sharing data between the application and PETSc programming models, but not sharing the programming models’ internal data structures. Because the data is shared, there are no copies between models and no loss of efficiency. This separation allows PETSc users from C/C++, Fortran, or Python to employ their preferred GPU programming model, whether it be CUDA, HIP, or Kokkos. In all cases, users can rely on PETSc’s large assortment of composable, hierarchical, and nested solvers (Brown et al., 2012), as well as advanced time-step** and adjoint capabilities (Zhang et al., 2022) and numerical optimization methods running on the GPU. For example, an application for solving time-dependent partial differential equations may compute the Jacobian using Kokkos and then call PETSc’s time-step** routines and algebraic solvers that use CUDA. Applications can mix and match programming models, allowing, for example, some application code in HIP and some in CUDA.

In Mills et al. (2021), we provided a blueprint for porting PETSc applications to use GPUs, surveyed challenges in develo** efficient and portable mathematical libraries for GPU systems, and introduced the PETSc backend developments meant to meet these challenges at that time. With the sunset of ECP and the availability of early exascale computers such as Frontier and Aurora, we summarize the accomplishments of the PETSc/TAO ECP project, and we present the latest developments on performance portability through PETSc, illustrated by results on exascale computers. This paper is organized as follows. We first recap some GPU programming challenges and PETSc’s responses in Section 2, then provide an update in Section 3 to the PETSc application GPU-porting blueprint. Afterwards, we introduce in detail three low-level PETSc developments for GPUs, including communication, matrix assembly, and asynchronous solvers in Sections 4 to 6. Next, we move higher in the PETSc software hierarchy and present PETSc GPU developments for algebraic multigrid, batched linear solvers, and dense reformulations of the limited memory BFGS method (L-BFGS) (Liu and Nocedal, 1989) in Sections 7 to 9. In Section 10, we move to the top of software stack and present a mini-application case study, examining a fully GPU-enabled implementation of time evolution of the Landau collision integral using PETSc. We conclude in Section 11. Throughout the paper, we use the term programming model to refer to both the model and its supporting runtime. With one exception, our experiments were conducted on the four (pre-) exascale machines with configurations shown in Table 1.

Table 1: Configurations of the (pre-)exascale machines we used for experiments. aNote that we treat the two graphics compute dies (GCDs) in an AMD MI250X on Frontier or the two tiles in an Intel PVC on Aurora as two separate GPUs. bWe report unidirectional bandwidth numbers. cMeasured by a benchmark since we are unaware of the theoretical bandwidth.
{tblr}

width=—Q[m]—Q[m]—Q[m]—Q[m]—Q[m]— Machine Summit@OLCF Perlmutter@NERSC Frontier@OLCF Aurora@ALCF
CPUs per node 2x IBM Power9 1x AMD EPYC-7763 1x AMD EPYC-7763 2x Intel Xeon SPR
GPUs per node 6x NVIDIA Tesla V100 4x NVIDIA A100 4x AMD MI250X
(2 GCDs per GPU) 6x Intel GPU PVC
(2 tiles per PVC)
HBM per GPUa 16 GB, 900 GB/s 40 GB, 1.5 TB/s 64 GB, 1.6 TB/s 64 GB, 1.6 TB/s
GPU-GPU linkb NVLINK, 50 GB/s NVLINK-3, 100 GB/s Infinity Fabric, 50similar-to\sim200 GB/s Xe Link, 15similar-to\sim196 GB/sc
Network InfiniBand, Fat-tree Slingshot-11, Dragonfly Slingshot-11, Dragonfly Slingshot-11, Dragonfly
Software IBM Spectrum MPI-10.3,
cuda-10.1, gcc-6.4
NVSHMEM-2.1.2 Cray MPICH-8.1.28,
cuda-12.2, gcc-12.3,
Kokkos-4.3 Cray MPICH-8.1.23,
rocm-5.4, gcc-12.2,
Kokkos-4.3 Aurora MPICH-52.2,
oneapi-2023.12.15.001,
Kokkos-4.3

2 GPU programming challenges and PETSc’s responses

In Mills et al. (2021), we enumerated three fundamental GPU programming challenges:

  1. F1

    Portability of application codes: write code that is portable across different hardware.

  2. F2

    Algorithms for high-throughput systems: design parallel algorithms exploiting high GPU-concurrentcy.

  3. F3

    Utilizing all GPU and CPU compute power: reduce idle and waiting time to improve hardware utilization.

In addition, we talked about seven ancillary challenges:

  1. A1

    Managing the kernel queue: pipeline many, not-too-small kernels in GPU streams to keep GPUs busy.

  2. A2

    Network communication: communication should be stream-aware without the need to synchronize GPUs.

  3. A3

    Over- and undersubscription: overcome the mismatch between the number of CPU cores and GPUs.

  4. A4

    CPU-GPU communication time: lower cost by reducing the communication amount or overlap** communication with computation.

  5. A5

    Multiple memory types: manage different types of memory either explicitly or implicitly via USM.

  6. A6

    Use of multiple streams from libraries: manage streams within a library or across libraries while maintaining data dependence.

  7. A7

    Multiprecision on the GPU: take advantage of the higher compute power of GPUs with lower precision.

Roughly speaking, while F1 concerns portability, A5 and A6 are about correctness, and F2, F3, A1–A4 and A7 are related to performance. To meet these challenges, during ECP we greatly improved PETSc’s GPU capability around these three aspects with new GPU backends and constructs, new GPU-friendly application programming interfaces (APIs), and new algorithms for high-throughput computation. One can refer to Mills et al. (2021) for categorized PETSc responses to these challenges. Here we further explain the issues of portability and correctness with respect to the latest developments in PETSc.

On PETSc’s code portability:

Before ECP, we had a legacy PETSc/CUDA backend for NVIDIA GPUs, and a PETSc/OpenCL backend via ViennaCL (Rupp et al., 2016) for OpenCL devices. We implemented vector and matrix sub-types for these backends, such as VECCUDA and MATAIJVIENNACL, and encouraged users to use PETSc’s options database from the command line, e.g., -vec_type cuda or -mat_type aijviennacl to set GPU-specific object types at runtime. Using these options, the same PETSc application source code can work with either a CUDA device or an OpenCL device. We did not write many device kernels ourselves. Instead we relied on vendor libraries, such as NVIDIA’s cuBLAS, cuSparse, etc. to provide basic vector and matrix operations at the process level. PETSc mainly managed the MPI parallelism and host-device synchronization. During ECP, we also needed to support AMD and Intel GPUs. Ideally we should use a unified, portable programming model for all devices, but unfortunately no suitable model existed at the time. OpenCL has portable performance as an overarching goal, but it is verbose and implementation quality lags behind the standard needed for many application codes (Pennycook et al., 2013). Therefore, for NVIDIA GPUs, we further optimized and expanded the PETSc/CUDA backend with new features. Later, staff from AMD helped us add a PETSc/HIP backend for AMD GPUs, which was very similar to the PETSc/CUDA backend thanks to the similarity between HIP and CUDA. We partially combined the PETSc/CUDA and the PETSc/HIP backends via a C++ abstraction layer we developed named CUPM (CUDA-like Programing Model). Further work is needed to fully combine the two backends. We did not use hipify, the AMD CUDA to HIP migration tool, because PETSc supports multiple versions of CUDA and contains many CUDA library calls to cuBLAS, cuSparse etc., which makes the translation more difficult. In addition to these vendor-native models, we developed a new GPU backend using the Kokkos programming model and employing Kokkos-Kernels (a math library providing BLAS, sparse BLAS etc.) for basic math operations. We currently do not have a RAJA implementation, as it lacks a math kernel library similar to Kokkos-Kernels. The PETSc/Kokkos backend is able to run on various CPUs and GPUs, including ones from NVIDIA, AMD and Intel. Going forward, as portable device programming models mature, we hope to consolidate these PETSc GPU backends to reduce maintenance burden and avoid code duplication. However, internal reorganization will not affect existing PETSc users working with PETSc objects. As mentioned before, we share data between an application and the PETSc programming model, but do not share the programming models’ internal data structures. We do provide utility APIs for PETSc users employing some popular GPU programming models. For example, for Kokkos users, we have APIs returning a Kokkos::View from a PETSc vector; for CUDA users, given the same PETSc vector, we have APIs returning a device pointer pointing to the vector storage in device memory.

On PETSc’s memory types:

PETSc assumes discrete device memory and maintains two copies of data, one on host and the other on device, for PETSc GPU vectors (Vec) and matrices (Mat). These objects have an internal mask indicating where the latest data resides. If a given Vec or Mat operation is not implemented on device, we simply synchronize the copy on host and perform the operation there instead, so that we always have full API compliance. PETSc Vec or Mat APIs identify themselves as either read only, write only, or read and write on parameters. With this information, we can update the mask accordingly and avoid unnecessary memory copying between the host and device. For users who want to directly obtain the device array used by PETSc objects, we provide APIs such as VecGetArrayAndMemType{Read,Write}(Vec x, PetscScalar **a, PetscMemType *mtype), with the returned pointer a pointing to the vector array and mtype being one of PETSC_MEMTYPE_{HOST, CUDA, HIP, SYCL}. For PETSc GPU objects, calling these routines always returns the latest data on GPUs, while for CPU objects, it returns the array on host. Callers can retrieve a’s memory type from the mtype argument. Similarly, we also provide APIs for creating PETSc device objects with user-provided device arrays. This approach will work even when CPUs and GPUs share the same physical memory, as in the upcoming El Capitan supercomputer at LLNL. In this case, we will have a single copy of data, making host-device memory copying a no-op.

On PETSc’s GPU streams:

Streams are an important mechanism on GPUs to hide serial kernel launch latency by pipelining kernel launches. Streams can also improve resource utilization with extra parallelism by potentially executing kernels in different streams concurrently. In the latter sense, GPU streams are like CPU threads, but the programming paradigm is very different. Suppose that there are two functions on the host, a caller and a callee. With CPU threads, the callee is automatically executed on the same thread as caller. With GPU streams, the caller must pass the callee the stream that it uses by some mechanism, e.g., via an argument, through a global variable, or they may both assume a default stream. For an existing library written in C++, one might have the luxury to overload all its API functions with an extra stream argument without breaking existing code. Though doable, it is not only daunting but also error-prone because users might call an original API when they are supposed to call the one with a stream argument, resulting in mismatched streams being used. For a C library like PETSc, we cannotafford to break the API or bloat it with a host of new functions merely adding an additional stream argument. Thus, at the very basic level, we use a variable storing a PETSc global stream and wrap it into a generic object of type PetscDeviceContext. PetscDeviceContext provides uniform APIs across different PETSc GPU backends such as CUDA, HIP and SYCL. Users can query the device type to get a handle to the stream, and then cast the handle to the specific stream type according to the device type, as shown in the code below. For Kokkos users, we provide a utility function PetscGetKokkosExecutionSpace() to directly return a Kokkos execution space instance with the stream.

PetscDeviceContext dctx;
PetscDeviceType type;
void *handle;
PetscDeviceContextGetCurrentContext(&dctx);
PetscDeviceContextGetDeviceType(dctx, &type);
PetscDeviceContextGetStreamHandle(dctx, &handle);
if (type == PETSC_DEVICE_CUDA) {
cudaStream_t stream = *(cudaStream_t *)handle;
myKernel<<<64, 128, 0, stream>>>();
} else if (type == PETSC_DEVICE_HIP) {
hipStream_t stream = *(hipStream_t *)handle;
...
}

From the user’s point of view, PETSc works as if using one global GPU stream. PETSc users should query this stream with the above methods when they want to coordinate with PETSc streams. The type of stream can be changed globally with the command line option -root_device_context_stream_type. We are exploring multi-stream parallelism with the PetscDeviceContext construct, see Section 6.2.

3 Porting PETSc applications to GPUs

Mills et al. (2021) introduced general strategies for porting PETSc applications to GPUs and recommended an incremental approach facilitating easy comparison of GPU and CPU results. Here, we summarize the approach we presented, with some updates reflecting the latest PETSc performance portability design. We mainly focus on GPU work and use an application using Kokkos as an example.

Listing 1 displays an excerpt of a typical PETSc main application program for solving a nonlinear set of equations on a structured grid using Newton’s method. This example illustrates the common usage pattern that applies when using components of PETSc, whether time integrators, nonlinear solvers, linear solvers, etc.:

  • Setup application data, meshes, initial state, etc., (here, a SNES solver object, a data management object DM for a 1-D domain, a vector of degrees of freedom Vec, and a Mat to hold the Jacobian),

  • provide a callback for the Function that defines the problem (the nonlinear residual),

  • provide a callback for the Jacobian of the Function,

  • call the PETSc solver (SNESSolve()), possibly in a loop.

This pattern holds whether targeting CPUs or GPUs for execution. When porting GPUs, the creation and manipulation of solver, matrix, and vector objects does not change, but users will need to 1) write code to ensure that needed data structures are either copied from CPU memory to device memory or constructed directly on the GPU and 2) provide Function and (optionally) Jacobian routines that call GPU kernels. We recommend adopting an incremental approach, in which portions of the computation are moved to GPU and then evaluated for correctness and performance; porting can be considered complete when observed GPU speedup is sufficient, relative to the cost of the rest of the simulation run.

SNESCreate(PETSC_COMM_WORLD,&snes);
DMDACreate1d(PETSC_COMM_WORLD,...,&ctx.da);
DMCreateGlobalVector(ctx.da,&x);
VecDuplicate(x,&r);
DMCreateMatrix(ctx.da,&J);
SNESSetFunction(snes,r,KokkosFunction,&ctx);
SNESSetJacobian(snes,J,J,KokkosJacobian,&ctx);
SNESSolve(snes,NULL,x);
Listing 1: Main application code, with no syntactic changes from typical CPU version.

Listing 2 shows a Kokkos implementation of Function, which is similar to the traditional CPU version. For simplicity, we assume periodic boundary conditions in one dimension, but the pattern is similar in more dimensions and with general boundary conditions. DMDAVecGetKokkosOffsetView(da, xl, &X) returns a Kokkos OffsetView X from a PETSc vector xl. Because X and xl share data, we wrap but do not copy xl’s data. Kokkos OffsetViews are essentially multi-dimensional arrays with non-zero start indices. PETSc uses them so that users can access a locally owned array with conceptually simpler global indices. Via the mesh management object da, PETSc knows the dimension, start and end indices of the OffsetView so that the returned object is properly populated. Then, users can write parallel device code of their choice with these OffsetViews. A critical issue here is that users must be aware of the asynchronous nature of GPU computations, and therefore must know the current GPU stream used by PETSc. PETSc Kokkos users have the PetscGetKokkosExecutionSpace() to get a Kokkos execution space instance that wraps the stream. With that, one can construct a Kokkos RangePolicy object, for example in a Kokkos parallel_for dispatch, so that the user’s device code and PETSc’s share the same stream, and to enforce data dependency.

DMGetLocalVector(da,&xl);
DMGlobalToLocal(da,x,INSERT_VALUES,xl);
DMDAVecGetKokkosOffsetView(da,xl,&X); // no copy
DMDAVecGetKokkosOffsetViewWrite(da,r,&R); // R is W-only
DMDAVecGetKokkosOffsetView(da,f,&F);
PetscInt xs = R.begin(0), xm = R.end(0);
auto exec = PetscGetKokkosExecutionSpace();
Kokkos::parallel_for(
Kokkos::RangePolicy<>(exec,xs,xm),[=](PetscInt i){
R(i) = d*(X(i-1)-2*X(i)+X(i+1))+X(i)*X(i)-F(i);});
Listing 2: Function callback code. xl, x, r, f are PETSc Vecs, while X, R, F are Kokkos OffsetViews. Marking R write-only offers PETSc an optimization hint. Note that via exec user shares the GPU stream that PETSc is using.

The Jacobian computation departs from the block-oriented MatSetValues() approach traditionally used in CPU-based PETSc codes, as it is not possible to efficiently implement on GPUs. We discuss this topic in detail in Section 5, and here simply note that to use the new coordinate-based MatSetValuesCOO() approach each GPU thread places the non-zeros it contributes into a device buffer and later calls MatSetValuesCOO() to have PETSc post-process the buffer, as shown in Listing 3.

DMDAVecGetKokkosOffsetView(da,x,&X);
PetscInt xs = R.begin(0), xm = R.end(0);
auto exec = PetscGetKokkosExecutionSpace();
auto v = ctx.v; // a Kokkos::View used as a device buffer
Kokkos::parallel_for(
Kokkos::RangePolicy<>(exec,xs,xm),[=](PetscInt i){
PetscInt ofst = (i-xs)*3; // offset in v for thread i
v(ofst+0) = d;
v(ofst+1) = -2*d + 2*X(i);
v(ofst+2) = d;});
MatSetValuesCOO(J,v.data());
Listing 3: Jacobian callback code. v, the Kokkos View managed by users, works as a device buffer to store unprocessed nonzeros calculated by GPU threads.

4 Communication on GPUs

As PETSc’s strength lies in distributed computation with sparse and irregular data, the library has to take care of complex communication among processes, such as those in sparse matrix-vector multiplication, sparse matrix-matrix multiplication, or irregular mesh partitioning. In PETSc, we designed a module named PetscSF to encapsulate frequently used communication operations. Underneath a unified interface, PetscSF can have different communication implementations, though the default uses MPI. In this section, we first introduce PetscSF’s design and the extensions to allow MPI communication with device data. Next we introduce an experimental PetscSF implementation using NVIDIA NVSHMEM (NVIDIA, 2024) that can overcome some limitations imposed by MPI. Finally, we use a microbenchmark written with PetscSF to study the communication performance on the four target machines.

4.1 The star-forest (SF) abstraction

PetscSF uses star-forests to abstract communication patterns. A star is a simple tree consisting of one root vertex connected to zero or more leaves. A star forest is a disjoint union of stars, see examples in Figure 1.

Refer to caption
Figure 1: Two star-forest examples. The left example has three MPI ranks, while the right has two. Vertical dashed lines separate MPI ranks. Colored boxes are roots (leaves). Enclosed numbers are indices of the roots (leaves) in their index space. Dashed boxes represent holes in the spaces not belonging to the SF.

Leaves are locally indexed with integers, while roots are globally indexed via tuples of (owner rank, offset). A PetscSF is created collectively by specifying, for each leaf on the current process, the owner rank and an offset of the corresponding root on the owner. PETSc analyzes the graph and derives the communication pattern. We provide APIs to communicate between roots and leaves, for example:

PetscSFBcastBegin/End(PetscSF sf, MPI_Datatype unit,
const void *rootdata, void *leafdata, MPI_Op op);
PetscSFReduceBegin/End(PetscSF sf, MPI_Datatype unit,
const void *leafdata, void *rootdata, MPI_Op op);

The former broadcasts root values to leaves, while the latter reduces leaf values into roots, with both taking a reduction operation specified by an MPI_Op argument to add the source values or to replace the destination values (with op=MPI_REPLACE). The Begin/End split-phase design allows users to insert computations in between to potentially overlap computation with communication. It is common in irregular applications that leaves and roots are not consecutive in their respective index space. In that case, PETSc will call its pack or unpack kernels to put data into internally managed buffers for send and receive. The same PetscSF can be applied to different root/leaf data or data types. Depending on the communication pattern, PetscSF can perform the operation using MPI point to point or collective operations. For the most common sparse neighborhood communication pattern, we use persistent MPI sends and receives by default, but we also support MPI neighborhood collectives.

4.2 The mismatch between MPI and GPUs

As more and more computations are offloaded to device for acceleration, it is desirable to directly communicate data on device instead of staging data on host just for the purpose of communication, because moving data between host and device is expensive. GPU-aware MPI was introduced to solve this problem by allowing MPI calls to accept device buffers. However, one has to be aware that GPU computations or kernels are queued in streams and are executed asynchronously with respect to the host process. MPI routines, as a host side API, must make sure the send data on device, which could be the output of some previous GPU kernels, is ready to be sent. We would like to inform MPI of the stream in use so that it can maintain the data dependency. Unfortunately, as of this writing, the MPI standard does not support this functionality, since MPI routines do not take a GPU stream argument. As a result, MPI users must synchronize the GPU stream to have the send data ready: In other words, force all kernels producing the send data to complete before launching new kernels. On the receiver side, an MPI receive must block subsequent GPU kernel launches, i.e., synchronize the device. In PETSc, we have a default device stream. With that, Figure 2 gives a typical data path when using CUDA for PetscSFReduce(), which reduces sparse leaf data to root data with the help of the Pack/Unpack kernels.

Refer to caption
Figure 2: A typical data path of PetscSFReduce() with CUDA, assuming all parts except MPI work on a common device stream local to the calling process. Note the stream synchronization before MPI_Isend().

Because kernel launches are expensive, the GPU runtime provides users the stream mechanism for pipelining kernel launches, so that kernel launch latency can be hidden by the execution of previous kernels. MPI incurs a device synchronization that stalls the pipeline, making the cost of kernel launches harder to mitigate. To quantify the launch latency, we designed a microbenchmark, in which we launch an empty GPU kernel many times, marking on the CPU before and after each launch, and then calculate the average latency per launch (i.e., per iteration). We had two variants: in one (Asynchronous), there were no stream synchronizations at all; in the other (Synchronous), we inserted a stream synchronization after each launch. The results are shown in Table 2, from which we can see the huge impact of synchronization on kernel launch latency.

Table 2: Average latency (μs𝜇𝑠\mu sitalic_μ italic_s) of launching an empty kernel asynchronously or synchronously on (pre-)exascale machines
Platform Summit Perlmutter Frontier Aurora
Asynchronous 4.9 2.3 1.9 3.3
Synchronous 12.8 7.1 7.8 6.2

4.3 Synchronization-free PetscSF with NVSHMEM

To avoid the device synchronizations imposed by MPI, we developed an experimental implementation of PetscSF using NVSHMEM (NVIDIA, 2024). NVSHMEM is NVIDIA’s implementation of OpenSHMEM (Open Source Software Solutions, Inc., 2020) on CUDA devices. The OpenSHMEM standard specifies an API for partitioned global address space (PGAS) parallel programming, providing one-sided shared-memory style put/get APIs to access remote objects. NVSHMEM supports point-to-point and collective communications between GPUs within a node or over networks. Communication can be initiated either on the host or on the device. Unlike MPI, NVSHMEM host-side APIs take a stream argument. Remotely accessible objects (aka symmetric objects) are collectively allocated over all processing elements (PEs, like processes in MPI) from a special heap (called the symmetric heap). All PEs must allocate a symmetric object with the same size, so that the object always appears at the same offset in their symmetric heap. PEs access remote data by referencing a symmetric address and the rank of the remote PE. A symmetric address is the address of a symmetric object on the local PE, plus an offset if needed. In the NVSHMEM implementation of PetscSF, we still pack/unpack data in send/receive buffers, but those buffers are now allocated as symmetric objects on device. We launch, on the PETSc default stream, a CUDA kernel which then calls the device function nvshmem_putmem_nbi to put data from send buffers into receive buffers. NVSHMEM can be used with MPI. In our design we use MPI to communicate information, such as sizes and offsets of buffers, to help set up auxiliary data structures. We use NVSHMEM only in real communication. We found implementing PetscSF in NVSHMEM no less complex than in MPI. Symmetric allocation forced us to take the maximal size of send/receive buffers across all PEs, though their real sizes could vary. Shared-memory style APIs forced us to design complex communication protocols even for simple patterns, such as knowing when data in a send buffer has already been sent and thus is safe to overwrite, or when data in a receive buffer is ready for use. The detailed design was published in Zhang et al. (2021). We hope in the near future that the MPI Forum will provide GPU-initiated communication support in the MPI standard, such as in the proposal of Zhou et al. (2022), which would greatly simplify implementation. Currently, our PetscSF implementation is capable of doing stream-aware communication across CUDA devices, free of device synchronization. In Section 6.1, we have used it to implement a parallel asynchronous conjugate gradient (CG) solver.

4.4 GPU message passing latency on (pre-)exascale machines

To measure GPU-aware MPI communication performance on various machines, we designed a microbenchmark shown in Listing 4. One of the star-forests used is illustrated in the right of Figure 1, where there are n consecutive roots on the first process and n consecutive leaves on the second. The leaves are connected to the roots one-on-one in order. The code works as if the two processes keep bouncing a message of n*sizeof(double) bytes to the other.

MPI_Op op = MPI_REPLACE; // or MPI_SUM
for (i=0; i<niter; i++) {
PetscSFBcastBegin(sf,MPI_DOUBLE,rdata,ldata,op);
PetscSFBcastEnd(sf,MPI_DOUBLE,rdata,ldata,op);
PetscSFReduceBegin(sf,MPI_DOUBLE,ldata,rdata,op);
PetscSFReduceEnd(sf,MPI_DOUBLE,ldata,rdata,op);
}
Listing 4: PetscSF code measuring message latency

When op is MPI_REPLACE, the two buffers, rdata for root data and ldata for leaf data, are directly used as MPI send/receive buffers. The communication in this case is very similar to the osu_latency test from the OSU Micro-Benchmarks (Panda et al., 2024), with the exception that we synchronize the device before sending messages for reasons mentioned in Section 4.2, while osu_latency does not. We named the test SF-**pong and measured the one-way latency of a message, with results on the four (pre-)exascale machines shown in Figure 3. Note that on Frontier and Aurora, we treated the two GCDs (or tiles) within a GPU package as two separate GPUs. In all tests, we used the two closest GPUs within a compute node, in other words, the two are either in the same package or connected to the same CPU socket (as on Summit or Perlmutter). For comparison, we also give the intra-node CPU to CPU message latency on Perlmutter measured with the same code with host data. From Figure 3 we can see MPI latency with device data is generally much higher than that with host data for small and medium messages, which implies GPU operations have a high start-up cost. We can see that Perlmutter has much lower latency than Summit thanks to the hardware and software upgrade listed in Table 1.

Refer to caption
Figure 3: SF-**pong test: MPI latency between two closest GPUs on the four (pre-)exascale machines, and between two CPU cores within a compute node on Perlmutter. In the legend parentheses are GPU vendor names. Note the strikingly better performance on Frontier with small messages, compared with other machines.

The GPU-Frontier performance in Figure 3 stands out as its latency looks quite good for small messages, e.g., even as good as Perlmutter’s CPU MPI latency for 256-byte messages. Further investigation revealed that was misleading. We changed the op in Listing 4 to MPI_SUM, which let the code add roots to leaves on broadcast and vice versa on reduce. PETSc would allocate a buffer alongside the roots (leaves) and call an unpack kernel to add values in the buffer to the roots (leaves). We named the test SF-unpack and measured again the one way latency of a message, with results shown in Figure 4. We can see for small messages (therefore light unpack kernels) on Frontier, the latency increased dramatically from Figure 3.

Refer to caption
Figure 4: SF-unpack test: the test is similar to Figure 3, except the latency contains the execution time of the unpack kernel after receiving data. Note the performance on Frontier will small messages did not stand out anymore as in Figure 3. Also note GPUs have much better performance than CPUs in the unpack kernel with big messages thanks to their higher memory bandwidth.

Further analysis revealed that Cray-MPICH on Frontier would stage GPU messages on host for messages smaller than a threshold and do host-to-host message passing instead, while maintaining cache coherency between the CPU and GPU. In the SF-**pong test, we did not change the root (leaf) data, therefore the cached data on host was used in subsequent iterations of message passing, giving an ultra-low average latency. Meanwhile, in the SF-unpack test, the root (leaf) data was updated in each iteration, making host-caching useless. In real applications, the data sent usually changes between iterations. Therefore we deem the latency in Figure 4 closer to the real MPI latency for small messages on Frontier. From the figure, we can also observe that GPUs have much better performance than CPUs in the unpack kernel with big messages thanks to their higher memory bandwidth.

5 Portable matrix assembly on GPUs

In Section 3, we previewed some new PETSc matrix assembly APIs for GPUs in the Jacobian computation. In this section we discuss them in detail. PETSc has a rich set of APIs for CPU (host) matrix assembly. In the past, to achieve best performance users needed to call MatXAIJSetPreallocation() in advance to preallocate memory for the matrix, but the introduction of a hash table-based matrix assembly approach has rendered explicit preallocation no longer necessary in most cases. Next, users call MatSetValues() to insert blocks of values, e.g. element matrices from finite elements, into the matrix using global indices. These values could be local (i.e., owned by the calling MPI process) or remote (i.e., owned by other processes), determined by the layout of the matrix. For local values, PETSc directly inserts them, while for remote ones, PETSc might stash them until MatAssemblyBegin()/End() are called, when MPI communication is used to distribute stashed values and the matrix is finally assembled. While these APIs are convenient, they are all host APIs working with host memory, and can be called only by a single CPU thread. Directly porting MatSetValues() and related functions to a GPU is not feasible, since if multiple GPU threads call MatSetValues(), PETSc would need to stash remote values (and potentially allocate new stash area on device), do binary searches to find locations to insert local values, and use atomics to prevent data races when multiple GPU threads insert to the same non-zero location. Though these operations keep memory utilization low, they are too latency-sensitive for efficient computing on GPUs. Various approaches, like coloring to avoid atomics or using lookup tables to avoid binary search, were proposed (Cecka et al., 2011; Trotter et al., 2023). But these were for a single process and in effect shifted the burden to application developers. To support MPI parallelism and provide users convenience close to MatSetValues(), we designed a new set of coordinate-based (COO) matrix assembly APIs for matrices in the popular compressed sparse row (CSR) format. In both PETSc’s native and GPU formats, as well as Hypre’s ParCSR, MPI parallel matrices are distributed row-wise across MPI processes with diagonal (intra-process coupling) and off-diagonal (inter-process coupling) blocks stored separately as two sequential CSR matrices. The classic COO format consists of three arrays, i[], j[], v[] of equal length, in which the assembled matrix A𝐴Aitalic_A is defined as the sum of each contribution v[k] to entry ai[k],j[k]subscript𝑎i[k]j[k]a_{\texttt{i[k]},\texttt{j[k]}}italic_a start_POSTSUBSCRIPT i[k] , j[k] end_POSTSUBSCRIPT, so that each index k𝑘kitalic_k represents a distinct nonzero. In nonlinear and transient solves, one needs to repeatedly assemble matrices with the same nonzero pattern but different numeric values.

PETSc splits COO assembly into a symbolic preallocation stage with MatSetPreallocationCOO(A,n,i,j), and one or more numeric stages with MatSetValuesCOO (A,v,mode). During preallocation, it analyzes indices in i/j[] of length n on the host, exchanges information about remote entries, finalizes the sparsity pattern of the diagonal and off-diagonal blocks, preallocates device memory, and builds MPI communication plans. The arrays i/j[] can be freed after this stage. We allow negative indices in i/j[], meaning the corresponding entries will be ignored, a convenient way to handle boundary conditions. MatSetValuesCOO() sets elements of the matrix, where v[] is an array on device that has the same length and follows the same order as i/j[]. Each entry (with non-negative indices) is destined for the owned diagonal, owned off-diagonal block, or a send buffer. The implementation first calls a kernel to fill the send buffer on device and initiates the GPU-aware MPI communication, then calls two asynchronous kernels filling nonzeros in the diagonal and off-diagonal blocks, in which each thread accumulates into a single nonzero entry. After completing communication, the implementation calls two similar kernels unpacking entries from the receive buffer. With knowledge of the sparsity pattern and insertion order in advance, our implementation avoids data races and atomics completely with some helper data structures.

With PETSc’s COO matrix assembly, users need to provide i[], j[] and v[] at once. In finite element assembly, they can loop over elements and precompute i/j[] on host, assuming entries in the same element matrix will be stored contiguously in the arrays. Meanwhile, to prepare for the concurrent computing of element matrices on device, one also needs to know the offset of each element in the array v[]. With homogeneous elements so that each element matrix is the same size, the offsets can be computed analytically. Otherwise, one can create an auxiliary array, say offset[], to store the information and also copy it to the device. Then in the Jacobian, each GPU thread in charge of one element conveniently gets a pointer by &v[offset[tid]], to store element matrix values, where tid is the thread id.

The above COO matrix assembly APIs are portable across host and device, so that the value array v[] can be on device or on the host, depending on the type of the matrix. For example, if A𝐴Aitalic_A’s type is MATAIJ, a PETSc host CSR matrix type, then v[] needs to be on host. If it is MATAIJCUSPARSE, a PETSc matrix type for CUDA devices, then v[] needs to be in CUDA device memory. If it is MATAIJKOKKOS, a PETSc matrix type provided by the PETSc/Kokkos backend, then v[] needs to be in the Kokkos default memory space. To be complete, we also provide VecSetPreallocationCOO(x,n,i) and VecSetValuesCOO(x,v,mode) for COO vector assembly on device.

6 Asynchronous solvers on GPUs

6.1 Distributed asynchronous CG with PetscSF over NVSHMEM

In Section 4.3 we introduced a synchronization-free stream-aware PetscSF implementation with NVSHMEM. Taking advantage of that, we were able to adapt CG, the conjugate gradient Krylov solver in PETSc, to a prototype asynchronous version CGAsync. The full design was presented in Zhang et al. (2021). CGAsync runs with multiple MPI ranks and GPUs, does all its computation and communication on device, and does not need any synchronization on host. With a modular design, the PETSc CG implementation contains calls to PETSc vector and matrix operations and MPI collectives. In adapting CG to CGAsync, we had to make some changes. Key implementation differences between CG and CGAsync include: (1) A handful of PETSc routines they call are different. There are two categories. The first includes routines with scalar output parameters, for example, vector dot product. CG calls VecDot(Vec x, Vec y, double *a) with a being a host address, while CGAsync calls VecDotAsync(Vec x, Vec y, double *a) with a being a device address. In VecDot, each process calls cuBLAS routines to compute a partial dot product and then copies it back to the host, where it calls MPI_Allreduce to get the final dot product and stores it in the host buffer. Thus VecDot synchronizes the host and the device. While in VecDotAsync, once the partial dot product from cuBLAS is computed, each process calls a NVSHMEM reduction operation on PETSc’s default stream to get the final result and stores it in the device buffer. The second category of differences includes routines with scalar input parameters, such as VecAXPY(Vec y,double a, Vec x) calculating y += a*x. CG calls VecAXPY while CGAsync calls VecAXPYAsync(Vec y,double *a,Vec x) with a being a device pointer, so that VecAXPYAsync can be queued to a stream and a is computed on the device. (2) CG does scalar arithmetic (e.g., divide two scalars) on the CPU, while CGAsync does them with tiny scalar kernels on the device. (3) CG checks convergence (by comparison) in every iteration on the host to determine whether it should stop while CGAsync does not. Users need to specify the maximum number of iterations; nevertheless, this could be improved by checking for convergence every few (e.g., 20) iterations.

We tested CG and CGAsync without preconditioning on a single Summit compute node with two sparse matrices from the SuiteSparse Matrix Collection (Davis and Hu, 2011). CG was run with PetscSF over the IBM Spectrum CUDA-aware MPI, and CGAsync was run with PetscSF over NVSHMEM. The first matrix was Bump_2911 with about 3M rows and 128M nonzero entries. We ran both algorithms 10 iterations with 6 MPI ranks and one GPU per rank. Fig. 5 shows their timeline through the profiler NVIDIA NSight Systems. The kernel launches (labeled CUDA API) in CG were spread over the 10 iterations. The reason was that in each iteration, there were multiple MPI calls (mainly from distributed matrix-vector multiplication, vector dot and vector norm operations), which constantly blocked the kernel launch pipeline. In CGAsync, however, while the GPU was executing the 8th iteration (with profiling), the CPU had launched all kernels for the 10 iterations. The long red bar cudaMemcpyAsync indicates that after the kernel launches, the CPU was idle, waiting for the final result from the GPU.

Refer to caption
(a)
Refer to caption
(b)
Figure 5: Timeline of CG (top) and CGAsync (bottom) on rank 2. Each ran ten iterations. The blue csr… bars are csrMV (i.e., SpMV) kernels in cuSPARSE, and the red c… bars are cudaMemcpyAsync() copying data from device to host.

Test results show that the time per iteration for CG and CGAsync was about 690 μs𝜇𝑠\mu sitalic_μ italic_s and 676 μs𝜇𝑠\mu sitalic_μ italic_s, respectively. CGAsync gave merely a 2% improvement. This small improvement is because the matrix in question was huge, and computation took the vast majority of the time. From profiling, we knew matrix-vector multiplication alone (excluding communication) took 420 μs𝜇𝑠\mu sitalic_μ italic_s. If one removes the computational time, the improvement in communication time is substantial. Unfortunately, because of bugs in the NVSHMEM library with multiple nodes, we could not scale to more compute nodes. Instead, we used a smaller matrix, Kuu from SuiteSparse, of about 7K rows and 340K nonzero entries to see how CGAsync would perform in a strong-scaling sense. In the new tests, time per iteration for CG and CGAsync was about 300 μs𝜇𝑠\mu sitalic_μ italic_s and 250 μs𝜇𝑠\mu sitalic_μ italic_s. CGAsync exhibited an improvement of 16.7%. Note that this improvement was achieved even though PetscSF/NVSHMEM had much higher message latency than PetscSF/MPI (detailed in Zhang et al. (2021) but not shown here). Thus one can reasonably predict asynchronous solvers enabled by stream-aware communication have good potential.

6.2 Improve asynchronicity with PetscDeviceContext and managed memory

In addition to the ability to hide kernel launch latency, streams are useful for achieving kernel concurrency. Multiple kernels including memory copy operations can be assigned to different streams, and these streams can run at the same time if there are sufficient resources. Although using multiple streams in a confined scope might be manageable, doing so at library level is hard, as one has to pass the stream information through call chains, and maintain data dependence across streams. In Section 2 we discussed the stream usage in PETSc. To exploit kernel concurrency and manage asynchronicity in math library solvers, we proposed a preliminary infrastructure for safe, seamless, and scalable integration of asynchronous GPU streams in PETSc, detailed in Faibussowitsch et al. (2023). The infrastructure consists of two new PETSc types, a C PetscDeviceContext and a C++ Petsc::ManagedMemory (device context and managed memory thereafter).

Conceptually, device context is an abstraction for vendor GPU streams. It is comprised of the device over which it presides and the stream that it manages, but also bears other responsibilities. We provide APIs for one to synchronize a device context, or make one context wait for another. With vendor programming models, the inter-stream data dependence is usually done via event objects (e.g., cudaEvent_t) where a data producer records an event on a stream and a data consumer waits for the event on another stream. But using event objects requires users to store or pass the objects around, easily messing up the library call chain. Instead, we hide the complexity within PETSc device contexts and only expose high-level context control APIs. We provide memory registration APIs to assign memory regions a PETSc object ID. With that, we can connect PETSc GPU vector/matrix objects and memory regions with device contexts. It is done via a pair of Begin/End marking APIs that take arguments of a device context, a PETSc object ID, and an access mode as below:

PetscDeviceContextMarkIntentFromIDBegin/End(PetscDeviceContext dctx, PetscObjectId obj, PetscMemoryAccessMode mode, const char descr[]);

They notify the runtime the PETSc object (or memory region) in the enclosed scope is accessed read-only, write-only or read-write. Once an object has been marked, subsequent accesses to it are ordered according to its data dependencies. Operations using the same device context are ordered and executed as they appear in the source code. On the other hand, operations using separate device contexts which access a common object are strongly write-ordered. That is, operations are serialized by the order in which they are executed. Formally, given two operations A and B and an object, if A writes and B reads the object, and A appears before B in source code, then B shall not begin before A completes. The rule also holds for write-after-write or write-after-read. To enforce write-ordering, PETSc will internally create event objects for all input and output objects of an operation. The latest write event or all the read events since the last write are recorded after the operation has launched; they are used as semaphores for subsequent launches. If an object has previously recorded events, and the new operation conflicts with the previous operation (by rules defined above), then PETSc will ensure that the current operation does not begin until those events have completed. It is important to note that waiting for the previous operation is done asynchronously on the device whenever possible.

PETSc managed memory is used to facilitate managing a small collection of scalar variables. These may be needed to mutate or scale larger objects or, in linear solvers, for example, to determine algorithm convergence. Often, they are produced by some routine, trivially modified or inspected, and then passed directly on to a consumer. In the subsection above, we used raw device pointers for scalar variables. Managed memory models use a std::future for an opaque, host/device mirrored dual array of values. Based on that, subclass ManagedReal is used to manipulate real-valued scalar variables on device or host. ManagedReal has the ability to symbolically represent expressions and convert them into the corresponding GPU or CPU kernels by using C++ expression templates. These symbolic expressions are then executed on the GPU or CPU, depending on runtime options. An example asynchronously normalizing a vector is shown in Listing 5.

PetscDeviceContext dctx;
Petsc::ManagedReal alpha; // subclass of ManagedMemory
// Retrieve the current device context,
PetscDeviceContextGetCurrentContext(&dctx);
// Asynchronously compute the norm on the GPU,
// storing its future value in alpha
VecNormAsync(v, NORM_2, &alpha, dctx);
// Evaluate the reciprocal expression asynchronously
alpha = Petsc::Eval(1.0 / alpha, dctx);
// Complete the normalization, asynchronously
VecScaleAsync(v, alpha, dctx);
// Materialize the value of alpha, copying device-to-host
// and synchronizing the stream transparently
PetscReal alpha_host = alpha;
printf("Value of reciprocal norm was %"
Listing 5: Asynchronous normalization of a PETSc vector v

With help of the PETSc device context and managed memory, we implemented the CG solver and the Transpose-Free Quasi-Minimal Residual (TFQMR) solver (Freund, 1993), and tested on one NVIDIA A100 GPU on the Polaris computer at ALCF, a machine very similar to Perlmutter and employing identical GPUs. We used a finite difference PDE (fd) problem to compare two configurations:

  • main_gpu_dim_N_fd_M: Synchronous solvers as the baseline. This is the default option in PETSc.

  • async_dim_N_fd_M: Asynchronous solvers implemented with PETSc device context and managed memory.

We used a 5/9-points (𝐌=0/1𝐌01{\bf M}=0/1bold_M = 0 / 1) stencil for 2-D domains (𝐍=2𝐍2{\bf N}=2bold_N = 2), and 9/27-points (𝐌=0/1𝐌01{\bf M}=0/1bold_M = 0 / 1) stencil for 3-D domains (𝐍=3𝐍3{\bf N}=3bold_N = 3). The average solve time is shown in Figure 6, with respect to the problem size represented by the number of nonzeros (NNZ) of the system. The async implementation is faster than the baseline at every measured point, starting from speed-ups in the order of 2222X for relatively small problem sizes and with diminishing returns as the problem size increases, when matrix-vector multiplication tends to dominate the overall runtime. TFQMR shows a greater improvement than CG; for smaller problems (NNZ 106less-than-or-similar-toabsentsuperscript106\lesssim 10^{6}≲ 10 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT) it is nearly twice as fast versus only an 80% improvement in CG. In future work, we will integrate these newly implemented solvers with stream-aware PetscSF to exploit asynchronous solvers in multi-node multi-GPU environments.

Refer to caption
Refer to caption
Figure 6: Synchronous vs asynchronous execution time of a solve versus the number of nonzeros of the system: CG (top) and TFQMR (bottom)

7 Algebraic multigrid on GPUs

Performance portability is demonstrated by a scaling study with PETSc’s built-in algebraic multigrid (AMG) solver, PCGAMG, using the PETSc/Kokkos backend over CUDA and HIP. PCGAMG uses the PETSc multigrid PCMG framework and can thus take advantage of optimized backend operations. This ability to abstract the AMG algorithm with standard sparse linear algebra has facilitated its widespread use in the PETSc and wider computational science communities. PETSc’s built-in functionality for finite elements is used to discretize the Laplacian operator with second-order elements. Each MPI process has a logical cube of hexahedral cells, with 64 such processes per node (e.g., 16 MPI tasks per A100). Increasingly larger grids are generated by uniform refinements.

Figure 7 shows performance data for the solve phase with several subdomain sizes as a function of the number of nodes, kee** the same number of cells per MPI task, that is, weak scaling where horizontal lines are perfect, on Perlmutter and Frontier. This shows that MPI parallel scaling is fairly good (there is a slight increase in iteration counts that is folded into the inefficiency) because the lines are almost flat, up to 512 nodes.

Refer to caption
Refer to caption
Figure 7: Solve time (sec.) for 1 solve of a 3D Laplacian with Q2 elements, a relative residual tolerance of 1012superscript101210^{-12}10 start_POSTSUPERSCRIPT - 12 end_POSTSUPERSCRIPT: Frontier (left) and Perlmutter (right).

The “setup” phase of AMG consists of two parts: the “mesh” setup, which constructs the coarse grid spaces (mostly graph work), and the “matrix” setup that constructs the coarse grid operators (a sparse matrix triple product). The mesh setup phase is fully amortized for long simulations without adaptive mesh refinement; mesh setup is required for each new mesh. The matrix setup is amortized for linear or secant Newton types of algorithms where the matrix does not change. Matrix setup can sometimes be further amortized by lagging, where coarse grid operators are used from old fine grid operators. Figure 8 shows performance data for the setup phase with several subdomain sizes as a function of the number of nodes, kee** the same number of cells per MPI task.

Refer to caption
Refer to caption
Figure 8: Setup time (sec.) for 1 solve of a 3D Laplacian with Q2 elements: Frontier (left) and Perlmutter (right).

This data was generated with a test harness111src/snes/tests/ex13.c that uses a two-level partitioning of Cartesian grids in DM, first to nodes and then to MPI tasks on each node. These experiments were run with 64 processes per node.

The matrix setup, mostly the sparse matrix triple product PTAPsuperscript𝑃𝑇𝐴𝑃P^{T}APitalic_P start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_A italic_P (or RAP𝑅𝐴𝑃RAPitalic_R italic_A italic_P) construction of the coarse grid operator (AIJcoarse=PiIAijfinePjJsubscriptsuperscript𝐴𝑐𝑜𝑎𝑟𝑠𝑒𝐼𝐽subscript𝑃𝑖𝐼subscriptsuperscript𝐴𝑓𝑖𝑛𝑒𝑖𝑗subscript𝑃𝑗𝐽A^{coarse}_{IJ}=P_{iI}A^{fine}_{ij}P_{jJ}italic_A start_POSTSUPERSCRIPT italic_c italic_o italic_a italic_r italic_s italic_e end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_I italic_J end_POSTSUBSCRIPT = italic_P start_POSTSUBSCRIPT italic_i italic_I end_POSTSUBSCRIPT italic_A start_POSTSUPERSCRIPT italic_f italic_i italic_n italic_e end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT italic_P start_POSTSUBSCRIPT italic_j italic_J end_POSTSUBSCRIPT), has relatively high arithmetic intensity and a high degree of parallelism, but is a challenge to optimize for GPUs. The PETSc/Kokkos backend supports effective implementations of this operator. We use the Kokkos-Kernels spgemm (sparse matrix-matrix multiplication) interface wrap** around vendors’ implementations (e.g., cuSparse, rocSparse) as the building block within an MPI process, and use PetscSF to carry out the complex communication between processes.

8 Batched linear solvers

Batching is a technique for exposing PE-level parallelism in algorithms that previously ran on entire processes or multiple threads within a single MPI process. Kinetic discretizations of magnetized plasmas, for example, advance the Vlasov-Maxwell system, which is then followed by a fully implicit time advance of a collision operator. These collision advances are independent at each spatial point and are well suited to batch processing. The full implicit time integrator in our Landau operator (Section 10) requires linear solves that can effectively run these many small systems on GPUs. PETSc has developed batched version of two Krylov methods, TFQMR and BiCG, with diagonal preconditioning. Batched iterative solvers of this form have also been deployed in the Kokkos-Kernels and Ginkgo libraries (Liegeois et al., 2023; Anzt et al., 2022). Figure 9 shows the solves per second, as a function of batch size, of the batched and “ensemble” solvers. Ensemble solvers simply stack the individual systems in one large system and use standard PETSc solvers. A 3X performance increase is observed with batching. For details see Adams et al. (2024).

Refer to caption
Figure 9: Solver throughput vs problems size (batch size) for batched and ensemble solvers

9 Dense reformulations

Reformulating algorithms to replace level-1 BLAS operations with level-2 or level-3 operations generally improves performance, but dense reformulations are more challenging to formulate for flexible algorithms that use callbacks that cannot be vectorized. In PETSc/TAO we have developed a dense reformulation of the limited-memory BFGS method (L-BFGS, Liu and Nocedal (1989)), one of the most popular optimization algorithms, that improves performance while maintaining the flexibility of the original implementation.

At iteration k𝑘kitalic_k of the L-BFGS algorithm to minimize f(x)𝑓𝑥f(x)italic_f ( italic_x ), the two main steps are (1) updating the approximation from Hk1subscript𝐻𝑘1H_{k-1}italic_H start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT to Hksubscript𝐻𝑘H_{k}italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, where Hksubscript𝐻𝑘H_{k}italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT is an approximation of the inverse Hessian (f(xk))1superscript𝑓subscript𝑥𝑘1(\nabla f(x_{k}))^{-1}( ∇ italic_f ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT, and (2) computing the matrix-vector product p=Hkg𝑝subscript𝐻𝑘𝑔p=H_{k}gitalic_p = italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_g. The reference recursive algorithm for step (2) consists of m𝑚mitalic_m rank-1 updates, followed by the application of a base inverse Hessian H0(k)superscriptsubscript𝐻0𝑘H_{0}^{(k)}italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT, followed by m𝑚mitalic_m more rank-1 updates:

 

for j[k1,,km]𝑗𝑘1𝑘𝑚j\in[k-1,\dots,k-m]italic_j ∈ [ italic_k - 1 , … , italic_k - italic_m ] do
       αj(sjTg)/djsubscript𝛼𝑗superscriptsubscript𝑠𝑗𝑇𝑔subscript𝑑𝑗\alpha_{j}\leftarrow(s_{j}^{T}g)/d_{j}italic_α start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ← ( italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_g ) / italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT
        // dot(n𝑛nitalic_n)
      
      ggαjyj𝑔𝑔subscript𝛼𝑗subscript𝑦𝑗g\leftarrow g-\alpha_{j}y_{j}italic_g ← italic_g - italic_α start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT
        // axpy(n𝑛nitalic_n)
      
pH0(k)g𝑝superscriptsubscript𝐻0𝑘𝑔p\leftarrow H_{0}^{(k)}gitalic_p ← italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT italic_g
  // base operator
for j[km,,k1]𝑗𝑘𝑚𝑘1j\in[k-m,\dots,k-1]italic_j ∈ [ italic_k - italic_m , … , italic_k - 1 ] do
       βj(yjTp)/djsubscript𝛽𝑗superscriptsubscript𝑦𝑗𝑇𝑝subscript𝑑𝑗\beta_{j}\leftarrow(y_{j}^{T}p)/d_{j}italic_β start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ← ( italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_p ) / italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT
        // dot(n𝑛nitalic_n)
      
      pp+(αjβj)yj𝑝𝑝subscript𝛼𝑗subscript𝛽𝑗subscript𝑦𝑗p\leftarrow p+(\alpha_{j}-\beta_{j})y_{j}italic_p ← italic_p + ( italic_α start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT - italic_β start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ) italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT
        // axpy(n𝑛nitalic_n)
      
return p𝑝pitalic_p

 

The history size m𝑚mitalic_m is fixed, and the method requires the storage of history vectors Sk=[skm||sk1]subscript𝑆𝑘delimited-[]subscript𝑠𝑘𝑚subscript𝑠𝑘1S_{k}=[s_{k-m}|\cdots|s_{k-1}]italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = [ italic_s start_POSTSUBSCRIPT italic_k - italic_m end_POSTSUBSCRIPT | ⋯ | italic_s start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT ] and Yk=[ykm||yk1]subscript𝑌𝑘delimited-[]subscript𝑦𝑘𝑚subscript𝑦𝑘1Y_{k}=[y_{k-m}|\cdots|y_{k-1}]italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = [ italic_y start_POSTSUBSCRIPT italic_k - italic_m end_POSTSUBSCRIPT | ⋯ | italic_y start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT ] and the precomputation of the dot products dj:=sjTyjassignsubscript𝑑𝑗superscriptsubscript𝑠𝑗𝑇subscript𝑦𝑗d_{j}:=s_{j}^{T}y_{j}italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT := italic_s start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT. In the implementation of the recursive algorithm in PETSc/TAO, H0(k)gsuperscriptsubscript𝐻0𝑘𝑔H_{0}^{(k)}gitalic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT italic_g is computed as a callback to an arbitrary linear operator. This operator is not only arbitrary but may change from iteration to iteration, which allows this implementation of L-BFGS to support variable-metric methods (Davidon, 1991). For many problems, variable-metric L-BFGS has been shown to converge faster than L-BFGS with a constant base operator H0(k)H0superscriptsubscript𝐻0𝑘subscript𝐻0H_{0}^{(k)}\equiv H_{0}italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ≡ italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT (Dener and Munson, 2019). The recursive algorithm is work optimal, but it includes 2m2𝑚2m2 italic_m synchronizations (one for each dot product) and, as it is based on level-1 BLAS, has little temporal locality that takes advantage of the cache hierarchy.

A compact dense reformulation of L-BFGS was first studied by Byrd et al. (1994), which implements Hkgsubscript𝐻𝑘𝑔H_{k}gitalic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_g by adding to H0(k)gsuperscriptsubscript𝐻0𝑘𝑔H_{0}^{(k)}gitalic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT italic_g the rank-2m2𝑚2m2 italic_m update

[SkRkTWk]Φk[Dk+WkTYkII0][Rk1SkTWkT]ΦkTg,subscriptmatrixsubscript𝑆𝑘superscriptsubscript𝑅𝑘𝑇subscript𝑊𝑘subscriptΦ𝑘matrixsubscript𝐷𝑘superscriptsubscript𝑊𝑘𝑇subscript𝑌𝑘𝐼𝐼0subscriptmatrixsuperscriptsubscript𝑅𝑘1superscriptsubscript𝑆𝑘𝑇superscriptsubscript𝑊𝑘𝑇superscriptsubscriptΦ𝑘𝑇𝑔\underbrace{\begin{bmatrix}-S_{k}R_{k}^{-T}&W_{k}\end{bmatrix}}_{\Phi_{k}}% \begin{bmatrix}D_{k}+W_{k}^{T}Y_{k}&I\\ I&0\end{bmatrix}\underbrace{\begin{bmatrix}-R_{k}^{-1}S_{k}^{T}\\ W_{k}^{T}\end{bmatrix}}_{\Phi_{k}^{T}}g,under⏟ start_ARG [ start_ARG start_ROW start_CELL - italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - italic_T end_POSTSUPERSCRIPT end_CELL start_CELL italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] end_ARG start_POSTSUBSCRIPT roman_Φ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ start_ARG start_ROW start_CELL italic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT + italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_CELL start_CELL italic_I end_CELL end_ROW start_ROW start_CELL italic_I end_CELL start_CELL 0 end_CELL end_ROW end_ARG ] under⏟ start_ARG [ start_ARG start_ROW start_CELL - italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] end_ARG start_POSTSUBSCRIPT roman_Φ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_POSTSUBSCRIPT italic_g ,

where Wk=H0(k)Yksubscript𝑊𝑘superscriptsubscript𝐻0𝑘subscript𝑌𝑘W_{k}=H_{0}^{(k)}Y_{k}italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, Dk=diag(SkTYk)subscript𝐷𝑘diagsuperscriptsubscript𝑆𝑘𝑇subscript𝑌𝑘D_{k}=\mathrm{diag}(S_{k}^{T}Y_{k})italic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = roman_diag ( italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ), and Rk=triu(SkTYk)subscript𝑅𝑘triusuperscriptsubscript𝑆𝑘𝑇subscript𝑌𝑘R_{k}=\mathrm{triu}(S_{k}^{T}Y_{k})italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = roman_triu ( italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ). This compact dense algorithm can be implemented with only one synchronization point (in computing ΦkTgsuperscriptsubscriptΦ𝑘𝑇𝑔\Phi_{k}^{T}groman_Φ start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_g) and all its flops are in dense matrix-vector products and triangular solves. If H0subscript𝐻0H_{0}italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is constant this compact dense algorithm requires only O(m2)𝑂superscript𝑚2O(m^{2})italic_O ( italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) more work than the recursive algorithm, which is negligible when nmmuch-greater-than𝑛𝑚n\gg mitalic_n ≫ italic_m. When H0(k)superscriptsubscript𝐻0𝑘H_{0}^{(k)}italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT varies, however, none of the vectors in Wk1subscript𝑊𝑘1W_{k-1}italic_W start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT can be reused as part of Wksubscript𝑊𝑘W_{k}italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, so m1𝑚1m-1italic_m - 1 additional matrix-vector products with H0(k)superscriptsubscript𝐻0𝑘H_{0}^{(k)}italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT are required to compute Wksubscript𝑊𝑘W_{k}italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and O(nm2)𝑂𝑛superscript𝑚2O(nm^{2})italic_O ( italic_n italic_m start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ) additional work is required to compute WkTYksuperscriptsubscript𝑊𝑘𝑇subscript𝑌𝑘W_{k}^{T}Y_{k}italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT.

We have implemented an intermediate dense formulation of L-BFGS that computes Hkgsubscript𝐻𝑘𝑔H_{k}gitalic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_g as

Hkg=[ISkRkT][I0YkTI][H0(k)Dk][IYk0I][IRk1SkT]g.subscript𝐻𝑘𝑔matrix𝐼subscript𝑆𝑘superscriptsubscript𝑅𝑘𝑇matrix𝐼0superscriptsubscript𝑌𝑘𝑇𝐼matrixsuperscriptsubscript𝐻0𝑘missing-subexpressionmissing-subexpressionsubscript𝐷𝑘matrix𝐼subscript𝑌𝑘0𝐼matrix𝐼superscriptsubscript𝑅𝑘1superscriptsubscript𝑆𝑘𝑇𝑔H_{k}g=\begin{bmatrix}I&-S_{k}R_{k}^{-T}\end{bmatrix}\\ \begin{bmatrix}I&0\\ Y_{k}^{T}&I\end{bmatrix}\begin{bmatrix}H_{0}^{(k)}&\\ &D_{k}\end{bmatrix}\begin{bmatrix}I&Y_{k}\\ 0&I\end{bmatrix}\\ \begin{bmatrix}I\\ -R_{k}^{-1}S_{k}^{T}\end{bmatrix}g.start_ROW start_CELL italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_g = [ start_ARG start_ROW start_CELL italic_I end_CELL start_CELL - italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - italic_T end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] end_CELL end_ROW start_ROW start_CELL [ start_ARG start_ROW start_CELL italic_I end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL start_CELL italic_I end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL italic_D start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL italic_I end_CELL start_CELL italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL italic_I end_CELL end_ROW end_ARG ] end_CELL end_ROW start_ROW start_CELL [ start_ARG start_ROW start_CELL italic_I end_CELL end_ROW start_ROW start_CELL - italic_R start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] italic_g . end_CELL end_ROW

Like the compact dense formulation, the product Hkgsubscript𝐻𝑘𝑔H_{k}gitalic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_g is computed entirely with dense matrix-vector products and triangular solves. This version has only two synchronization points (in the applications of SkTsuperscriptsubscript𝑆𝑘𝑇S_{k}^{T}italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT and YkTsuperscriptsubscript𝑌𝑘𝑇Y_{k}^{T}italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT). Unlike the compact dense formulation, there is no need for the vectors Wksubscript𝑊𝑘W_{k}italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT or the matrix WkTYksuperscriptsubscript𝑊𝑘𝑇subscript𝑌𝑘W_{k}^{T}Y_{k}italic_W start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT, so H0(k)superscriptsubscript𝐻0𝑘H_{0}^{(k)}italic_H start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT can vary arbitrarily without increasing the work to update or apply Hksubscript𝐻𝑘H_{k}italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT.

Refer to caption
Figure 10: Comparison of L-BFGS implementations on one node of Frontier, using host CPUs (top) and GPUs (bottom). Marker size indicates the history size m𝑚mitalic_m of the method, and measurements with the same problem size n𝑛nitalic_n and different history sizes are connected.

Figure 10 presents a comparison of the recursive and intermediate dense L-BFGS methods on one node of Frontier, using 8 MPI processes to drive either CPU-based computations (using 1 MPI rank per L3 cache) or GPU-based computations (using 1 MPI rank per GCD). For different problem sizes n𝑛nitalic_n and different history sizes m𝑚mitalic_m we plot Tupdate+Tsolvesubscript𝑇updatesubscript𝑇solveT_{\text{update}}+T_{\text{solve}}italic_T start_POSTSUBSCRIPT update end_POSTSUBSCRIPT + italic_T start_POSTSUBSCRIPT solve end_POSTSUBSCRIPT, the time it takes to update Hk1subscript𝐻𝑘1H_{k-1}italic_H start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT to Hksubscript𝐻𝑘H_{k}italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT and compute p=Hkg𝑝subscript𝐻𝑘𝑔p=H_{k}gitalic_p = italic_H start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT italic_g.

Any algorithm implementing L-BFGS must, at minimum, read Sk,Yk,n×mS_{k},Y_{k},\in\mathbb{R}^{n\times m}italic_S start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , italic_Y start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , ∈ blackboard_R start_POSTSUPERSCRIPT italic_n × italic_m end_POSTSUPERSCRIPT, read gn𝑔superscript𝑛g\in\mathbb{R}^{n}italic_g ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, and write pn𝑝superscript𝑛p\in\mathbb{R}^{n}italic_p ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT at each iteration, so we can define an effective bandwidth of each data point, Be:=n(2m+2)/(Tupdate+Tsolve)assignsubscript𝐵𝑒𝑛2𝑚2subscript𝑇updatesubscript𝑇solveB_{e}:=n(2m+2)/(T_{\text{update}}+T_{\text{solve}})italic_B start_POSTSUBSCRIPT italic_e end_POSTSUBSCRIPT := italic_n ( 2 italic_m + 2 ) / ( italic_T start_POSTSUBSCRIPT update end_POSTSUBSCRIPT + italic_T start_POSTSUBSCRIPT solve end_POSTSUBSCRIPT ), that we use as the y-axis in Figure 10. This plot format is useful for comparing the tradeoffs between time-to-solution and efficiency for different algorithms and devices.

For small problem sizes (n103𝑛superscript103n\leq 10^{3}italic_n ≤ 10 start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT on the CPUs, n105𝑛superscript105n\leq 10^{5}italic_n ≤ 10 start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT on the GPUs), when the latency of synchronization is the dominant cost, the intermediate dense method exhibits runtimes that are almost independent of m𝑚mitalic_m, unlike the recursive algorithm with its 2m2𝑚2m2 italic_m synchronizations. For the largest problems, when a single vector does not fit into the last level of cache, a rank-1 update cannot cache the vector that is being updated, so the 2m2𝑚2m2 italic_m rank-1 updates of the recursive method will require almost twice the memory traffic of two rank-m𝑚mitalic_m updates of the intermediate dense method. We see this in the effective bandwidth for the largest problem sizes on both CPUs and GPUs, and the same reasoning explains why the intermediate dense method achieves almost twice the bandwidth of the recursive dense method when all of the problem data fits into the last level of cache for CPU computations, such as in Figure 10 (top) when n=105𝑛superscript105n=10^{5}italic_n = 10 start_POSTSUPERSCRIPT 5 end_POSTSUPERSCRIPT and m=50𝑚50m=50italic_m = 50. This data shows that the intermediate dense L-BFGS formulation in PETSc/TAO exhibits almost uniformly superior performance to the recursive formulation while retaining all of the flexibility of the recursive formulation in supporting variable-metric methods. Further dense reformulations of iterative methods that retain flexibility are planned as future improvements to PETSc/TAO.

10 Fully GPU enabled time evolution of Landau collision integral

PETSc provides GPU support for the entire PDE solver stack, from time integrators to nonlinear solvers, batched linear solvers, and COO matrix assembly. A fully GPU enabled Landau collision time advance is implemented entirely in PETSc with “mini-app” drivers as a PETSc example (Hirvijoki and Adams, 2017; Adams et al., 2017, 2022b, 2024). The structure of this code is shown in Figure 11.

Refer to caption
Figure 11: Code architecture of Landau collision application

This method is an example of building an entire HPC PDE solver in PETSc, with a small driver code222 src/ts/utils/dmplexlandau/tutorials/ex2.c in the PETSc repository mimicking an application, with verification tests, used for performance experiments, and a specific PDE operator (“Landau”) that would, in general, be in user code.

10.1 Landau collision integral

Many problems in physics are described with phase space models where density is a function of both space (𝐱𝐱\mathbf{x}bold_x) and velocity space (𝐯𝐯\mathbf{v}bold_v). One such problem of interest to the DOE is that of magnetized plasmas for fusion energy science (FES), which are similar to several problems in astrophysics. The governing equations for magnetized plasmas are the Vlasov-Maxwell-Landau (VML) system, where each species α𝛼\alphaitalic_α (electrons and ions) are evolved according to

dfαdtfαt+xtxfα+vtvfα=βC[fα,fβ]αβ.𝑑subscript𝑓𝛼𝑑𝑡subscript𝑓𝛼𝑡𝑥𝑡subscript𝑥subscript𝑓𝛼𝑣𝑡subscript𝑣subscript𝑓𝛼subscript𝛽𝐶subscriptsubscript𝑓𝛼subscript𝑓𝛽𝛼𝛽\frac{df_{\alpha}}{dt}\equiv\frac{\partial f_{\alpha}}{\partial t}+\frac{% \partial\vec{x}}{\partial t}\cdot\nabla_{x}f_{\alpha}+\frac{\partial\vec{v}}{% \partial t}\cdot\nabla_{v}f_{\alpha}=\sum_{\beta}C\left[f_{\alpha},f_{\beta}% \right]_{\alpha\beta}.divide start_ARG italic_d italic_f start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT end_ARG start_ARG italic_d italic_t end_ARG ≡ divide start_ARG ∂ italic_f start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT end_ARG start_ARG ∂ italic_t end_ARG + divide start_ARG ∂ over→ start_ARG italic_x end_ARG end_ARG start_ARG ∂ italic_t end_ARG ⋅ ∇ start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT + divide start_ARG ∂ over→ start_ARG italic_v end_ARG end_ARG start_ARG ∂ italic_t end_ARG ⋅ ∇ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT italic_f start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT italic_C [ italic_f start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT , italic_f start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT .

This equation is composed of the symplectic Vlasov-Maxwell term dfαdt=0𝑑subscript𝑓𝛼𝑑𝑡0\frac{df_{\alpha}}{dt}=0divide start_ARG italic_d italic_f start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT end_ARG start_ARG italic_d italic_t end_ARG = 0 and a metric, or diffusive, collision operator C𝐶Citalic_C. The Landau form of Fokker-Planck collisions is a velocity space operator and is the gold standard for fusion plasmas. PETSc includes examples, with verification tests, that use our Landau collision operator to evolve dfαdt=βC[fα,fβ]αβ𝑑subscript𝑓𝛼𝑑𝑡subscript𝛽𝐶subscriptsubscript𝑓𝛼subscript𝑓𝛽𝛼𝛽\frac{df_{\alpha}}{dt}=\sum_{\beta}C\left[f_{\alpha},f_{\beta}\right]_{\alpha\beta}divide start_ARG italic_d italic_f start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT end_ARG start_ARG italic_d italic_t end_ARG = ∑ start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT italic_C [ italic_f start_POSTSUBSCRIPT italic_α end_POSTSUBSCRIPT , italic_f start_POSTSUBSCRIPT italic_β end_POSTSUBSCRIPT ] start_POSTSUBSCRIPT italic_α italic_β end_POSTSUBSCRIPT, that runs an entire PDE simulation on the GPU with the PETSc GPU backends (Adams et al., 2024). Figure 12 shows the thermalization of a shifted Maxwellian distribution, with an initial drift velocity of 1.51.51.51.5, with a stationary ion population (not shown). The large mass of ions results in small equilibrium velocity shown in Figure 11(c).

Refer to caption
(a) t=0𝑡0t=0italic_t = 0
Refer to caption
(b) bi-modal
Refer to caption
(c) near equilibrium
Figure 12: Electron distributions of shifted Maxwellian deuterium plasmas: (a) initial condition, (b) penumbra in shift and early Maxwellian population, (c) near full thermalization with ions at origin (ions not shown)

10.2 Validation of Landau collision integral

A common anisotropic plasma verification test has recently been published that initializes a two species plasma (electrons and deuterium) with (4) different temperatures for the parallel and perpendicular temperature of each species that is allowed to evolve toward equilibrium. A driver code runs many of these problems simultaneously to mimic its use in an application where 1,000s of spatial vertices would be processed simultaneously. This problem uses PETc’s adaptive time step** with over 14,000 time steps to achieve near full thermalization, using new high-order simplex finite elements and verification with analytical results of thermalization rates (Adams et al., 2024). Figure 13 shows a temperature history for the P2𝑃2P2italic_P 2 case.

Refer to caption
Figure 13: Anisotropic relaxation test temperature vs. time, with the normalization time t0subscript𝑡0t_{0}italic_t start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT, of the P2𝑃2P2italic_P 2 element case, plotted with an analytical NRL results

Note, the difference with the NRL Plasma Formulary model are due to the NRL rates being derived with a simplifying assumption and others have observed similar differences (Hager et al., 2016).

11 Conclusion and Vision for the Future

We have summarized some of the challenges encountered and the advances made toward providing performance-portable support for GPUs in PETSc over the course of ECP. This effort has required development ranging across the different levels of the PETSc software stack. As accelerated GPU computation continues to shape the HPC field, we will continue to improve PETSc’s capabilities for this, adding new features (e.g., support for the block CSR matrix format on GPUs), and offloading more computations. We hope to consolidate PETSc GPU backends to simplify implementation and maintenance, and we will continue to collaborate with developers of PETSc external libraries to smooth the interface for GPU data passing between libraries. An important direction that PETSc developers are currently pursuing is adding robust support for machine learning computations in PETSc to support a broad range of applications, which demands high GPU performance.

Maintaining a stable interface is important for a numerical library like PETSc/TAO, but our experience has shown the tension between that goal and the goal of portable performance for pre- and early exascale machines. In some cases, as in the dense reformulation of L-BFGS, additional parallelism can be found within existing interfaces that is portable across systems; in other cases, as in the introduction of the ManagedMemory type, something that is technically a change to the library’s interface finds additional parallelism within existing programming patterns used by applications. First in introducing PetscSF and now in MatSetValuesCOO(), PETSc has had success with new interfaces that (a) are declarative, (b) expose as much of the parallelism within the desired computation as possible, and (c) are independent of the computational resources used to execute the operation and the memory resources holding the arguments. Once introduced, interfaces with these properties appear to become stable features of the library, and applications that adopt them should see benefits. In the future, continued evolution of PETSc/TAO along these lines will help applications achieve portable performance for larger and larger percentages of their workflows, both on exascale systems and those that will follow.

{acks}

The authors were supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research under Contract DE-AC02-06CH11357. This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357, resources of the National Energy Research Scientific Computing Center, a Department of Energy Office of Science User Facility, and resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. The work was partially done on a pre-production supercomputer with early versions of the Aurora software development kit. The authors were partially supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. MGK was partially supported by NSF CCSI award 1931524.

References

  • Adams et al. (2022a) Adams M, Balay S, Marin O, McInnes LC, Mills RT, Munson T, Zhang H, Zhang J, Brown J, Eijkhout V, Faibussowitsch J, Knepley M, Kong F, Kruger S, Sanan P, Smith BF and Zhang H (2022a) The PETSc community as infrastructure. IEEE CiSE 24(3): 6–15. 10.1109/MCSE.2022.3169974.
  • Adams et al. (2022b) Adams MF, Brennan DP, Knepley MG and Wang P (2022b) Landau collision operator in the CUDA programming model applied to thermal quench plasmas. In: 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS). pp. 115–123. 10.1109/IPDPS53621.2022.00020.
  • Adams et al. (2017) Adams MF, Hirvijoki E, Knepley MG, Brown J, Isaac T and Mills R (2017) Landau collision integral solver with adaptive mesh refinement on emerging architectures. SIAM Journal on Scientific Computing 39(6): C452–C465. 10.1137/17M1118828.
  • Adams et al. (2024) Adams MF, Wang P, Merson J, Huck K and Knepley MG (2024) A performance portable, fully implicit landau collision operator with batched linear solvers. Summited to SISC.
  • AMD (2024) AMD (2024) HIP programming manual. URL https://rocm.docs.amd.com/projects/HIP/en/latest/user_guide/programming_manual.html.
  • Amestoy et al. (2001) Amestoy PR, Duff IS, L’Excellent JY and Koster J (2001) A fully asynchronous multifrontal solver using distributed dynamic scheduling. SIAM Journal on Matrix Analysis and Applications 23(1): 15–41.
  • Anzt et al. (2022) Anzt H, Cojean T, Flegar G, Göbel F, Grützmacher T, Nayak P, Ribizel T, Tsai YM and Quintana-Ortí ES (2022) Ginkgo: A Modern Linear Operator Algebra Framework for High Performance Computing. ACM Transactions on Mathematical Software 48(1): 2:1–2:33. 10.1145/3480935.
  • Balay et al. (2024) Balay S, Abhyankar S, Adams MF, Brown J, Brune P, Buschelman K, Dalcin L, Dener A, Eijkhout V, Gropp WD, Karpeyev D, Kaushik D, Knepley MG, May DA, McInnes LC, Mills RT, Munson T, Rupp K, Sanan P, Smith BF, Zampini S, Zhang H, Zhang H and Zhang J (2024) PETSc users manual. Technical Report ANL-21/39 - Revision 3.21, Argonne National Laboratory. URL https://petsc.org.
  • Beckingsale et al. (2019) Beckingsale DA, Burmark J, Hornung R, Jones H, Killian W, Kunen AJ, Pearce O, Robinson P, Ryu** BS and Scogland TR (2019) RAJA: Portable performance for large-scale scientific applications. In: 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC). IEEE, pp. 71–81.
  • Brown et al. (2012) Brown J, Knepley MG, May DA, McInnes LC and Smith BF (2012) Composable linear solvers for multiphysics. In: Proceeedings of the 11th International Symposium on Parallel and Distributed Computing (ISPDC 2012). IEEE Computer Society, pp. 55–62.
  • Byrd et al. (1994) Byrd RH, Nocedal J and Schnabel RB (1994) Representations of quasi-newton matrices and their use in limited memory methods. Mathematical Programming 63(1-3): 129–156.
  • Cecka et al. (2011) Cecka C, Lew AJ and Darve E (2011) Assembly of finite element methods on graphics processors. International Journal for Numerical Methods in Engineering 85(5): 640–669.
  • Davidon (1991) Davidon WC (1991) Variable metric method for minimization. SIAM Journal on optimization 1(1): 1–17.
  • Davis and Hu (2011) Davis TA and Hu Y (2011) The university of florida sparse matrix collection. ACM Transactions on Mathematical Software (TOMS) 38(1): 1–25.
  • Demmel et al. (2024) Demmel J, Gilbert J and Li X (2024) SuperLU Github. URL https://github.com/xiaoyeli/superlu.
  • Dener and Munson (2019) Dener A and Munson T (2019) Accelerating Limited-Memory Quasi-Newton Convergence for Large-Scale Optimization. Springer International Publishing. ISBN 9783030227449, p. 495–507. 10.1007/978-3-030-22744-9_39.
  • E4S Team (2024) E4S Team (2024) E4S Web page. https://e4s.io.
  • Faibussowitsch et al. (2023) Faibussowitsch J, Adams MF, Mills RT, Zampini S and Zhang J (2023) Safe, seamless, and scalable integration of asynchronous GPU streams in PETSc. arXiv preprint arXiv:2306.17801 .
  • Falgout (2023) Falgout R (2023) hypre users manual. Technical Report Revision 2.28, Lawrence Livermore National Laboratory. URL https://hypre.readthedocs.io/.
  • Freund (1993) Freund RW (1993) A transpose-free quasi-minimal residual algorithm for non-hermitian linear systems. SIAM Journal on Scientific Computing 14(2): 470–482. 10.1137/0914029.
  • Hager et al. (2016) Hager R, Yoon E, Ku SH, D’Azevedo EF, Worley PH and Chang CS (2016) A fully non-linear multi-species Fokker–Planck–Landau collision operator for simulation of fusion plasma. Journal of Computational Physics 315: 644–660. 10.1016/j.jcp.2016.03.064.
  • Hirvijoki and Adams (2017) Hirvijoki E and Adams MF (2017) Conservative discretization of the Landau collision integral. Physics of Plasmas 24(3): 032121. 10.1063/1.4979122.
  • Khronos SYCL Working Group (2020) Khronos SYCL Working Group (2020) SYCL 2020 specification). URL https://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf.
  • Kothe et al. (2019) Kothe D, Lee S and Qualters I (2019) Exascale computing in the United States. IEEE CiSE 21(1): 17–29. 10.1109/MCSE.2018.2875366.
  • Liegeois et al. (2023) Liegeois K, Rajamanickam S and Berger-Vergiat L (2023) Performance portable batched sparse linear solvers. IEEE Transactions on Parallel and Distributed Systems 34(5): 1524–1535. 10.1109/TPDS.2023.3249110.
  • Liu and Nocedal (1989) Liu DC and Nocedal J (1989) On the limited memory bfgs method for large scale optimization. Mathematical programming 45(1-3): 503–528.
  • Mills et al. (2021) Mills RT, Adams MF, Balay S, Brown J, Dener A, Knepley M, Kruger SE, Morgan H, Munson T, Rupp K et al. (2021) Toward performance-portable PETSc for GPU-based exascale systems. Parallel Computing 108: 102831.
  • NVIDA (2024) NVIDA (2024) CUDA C++ programming guide. URL https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf.
  • NVIDIA (2024) NVIDIA (2024) NVIDIA OpenSHMEM library (NVSHMEM) documentation. URL https://docs.nvidia.com/nvshmem/api/index.html.
  • Open Source Software Solutions, Inc. (2020) Open Source Software Solutions, Inc (2020) OpenSHMEM application programming interface v1.5. URL http://www.openshmem.org/.
  • OpenMP Architecture Review Board (2021) OpenMP Architecture Review Board (2021) OpenMP application programming interface. URL https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf.
  • Panda et al. (2024) Panda D et al. (2024) OSU microbenchmarks v7.3. http://mvapich.cse.ohio-state.edu/benchmarks/ .
  • Pennycook et al. (2013) Pennycook S, Hammond S, Wright S, Herdman J, Miller I and Jarvis S (2013) An investigation of the performance portability of OpenCL. Journal of Parallel and Distributed Computing 73(11): 1439–1450.
  • Rupp et al. (2016) Rupp K, Tillet P, Rudolf F, Weinbub J, Morhammer A, Grasser T, Jungel A and Selberherr S (2016) ViennaCL—linear algebra library for multi-and many-core architectures. SIAM Journal on Scientific Computing 38(5): S412–S439.
  • Trott et al. (2022) Trott CR, Lebrun-Grandié D, Arndt D, Ciesko J, Dang V, Ellingwood N, Gayatri R, Harvey E, Hollman DS, Ibanez D, Liber N, Madsen J, Miles J, Poliakoff D, Powell A, Rajamanickam S, Simberg M, Sunderland D, Turcksin B and Wilke J (2022) Kokkos 3: Programming model extensions for the exascale era. IEEE Transactions on Parallel and Distributed Systems 33(4): 805–817. 10.1109/TPDS.2021.3097283.
  • Trotter et al. (2023) Trotter JD, Langguth J and Cai X (2023) Targeting performance and user-friendliness: Gpu-accelerated finite element computation with automated code generation in fenics. Parallel Computing 118: 103051.
  • xSDK Team (2024) xSDK Team (2024) xSDK Web page. https://xsdk.info.
  • Zhang et al. (2022) Zhang H, Constantinescu EM and Smith BF (2022) PETSc TSAdjoint: a discrete adjoint ODE solver for first-order and second-order sensitivity analysis. SIAM Journal on Scientific Computing 44(1): C1–C24. 10.1137/21M140078X.
  • Zhang et al. (2021) Zhang J, Brown J, Balay S, Faibussowitsch J, Knepley M, Marin O, Mills RT, Munson T, Smith BF and Zampini S (2021) The PetscSF scalable communication layer. IEEE Transactions on Parallel & Distributed Systems 10.1109/TPDS.2021.3084070.
  • Zhang et al. (2019) Zhang W, Almgren A, Beckner V, Bell J, Blaschke J, Chan C, Day M, Friesen B, Gott K, Graves D, Katz M, Myers A, Nguyen T, Nonaka A, Rosso M, Williams S and Zingale M (2019) AMReX: a framework for block-structured adaptive mesh refinement. Journal of Open Source Software 4(37): 1370. 10.21105/joss.01370.
  • Zhou et al. (2022) Zhou H, Raffenetti K, Guo Y and Thakur R (2022) Mpix stream: An explicit solution to hybrid mpi+ x programming. In: Proceedings of the 29th European MPI Users’ Group Meeting. pp. 1–10.