HTML conversions sometimes display errors due to content that did not convert correctly from the source. This paper uses the following packages that are not yet supported by the HTML conversion tool. Feedback on these issues are not necessary; they are known and are being worked on.

  • failed: beramono
  • failed: esvect
  • failed: bold-extra

Authors: achieve the best HTML results from your LaTeX submissions by following these best practices.

License: CC BY-NC-ND 4.0
arXiv:2401.02680v1 [cs.DC] 05 Jan 2024

Preliminary report:Initial evaluation of StdPar implementations on AMD GPUs for HPC

Wei-Chen Lin Department of Computer Science
University of Bristol
Bristol, UK
[email protected], [email protected]
Simon McIntosh-Smith Department of Computer Science
University of Bristol
Bristol, UK
[email protected], [email protected]
Tom Deakin Department of Computer Science
University of Bristol
Bristol, UK
[email protected], [email protected]

1 Introduction

Recently, AMD platforms have not supported offloading C++17 PSTL (StdPar) programs to the GPU. Our previous work[1] highlights how StdPar is able to achieve good performance across NVIDIA and Intel GPU platforms. In that work, we acknowledged AMD’s past effort such as HCC, which unfortunately is deprecated and does not support newer hardware platforms.

Recent developments by AMD, Codeplay, and AdaptiveCpp (previously known as hipSYCL or OpenSYCL) have enabled multiple paths for StdPar programs to run on AMD GPUs. This informal report discusses our experiences and evaluation of currently available StdPar implementations for AMD GPUs. We conduct benchmarks using our suite of HPC mini-apps with ports in many heterogeneous programming models, including StdPar. We then compare the performance of StdPar, using all available StdPar compilers, to contemporary heterogeneous programming models supported on AMD GPUs: HIP, OpenCL, Thrust, Kokkos, OpenMP, SYCL. Where appropriate, we discuss issues encountered and workarounds applied during our evaluation.

Finally, the StdPar model discussed in this report largely depends on Unified Shared Memory (USM) performance and very few AMD GPUs have proper support for this feature. As such, this report demonstrates a proof-of-concept host-side userspace pagefault solution for models that use the HIP API. We discuss performance improvements achieved with our solution using the same set of benchmarks.

2 StdPar implementations

The C++ semantics of StdPar programs dictate that all access to memory is inside a single address space. For StdPar to work on discrete accelerators with a separate pool of memory, the vendor driver must support some form of unified memory (e.g. UVM in CUDA, XNACK in HSA) to adhere to C++ memory model. As such, the performance of StdPar is directly tied to how well the unified memory implementation handles page migration and data residency.

This section introduces our StdPar implementations for AMD GPUs. We discuss how each implementation handles memory access between the host and the device.

2.1 AdaptiveCpp (hipSYCL) StdPar

The AdaptiveCpp (recently renamed from hipSYCL) project is an independent SYCL implementation with support of the following platforms: CUDA, HIP, OpenCL (SPIR-V ingestion required), LevelZero, and OpenMP. The project recently gained experimental support for StdPar111https://github.com/OpenSYCL/OpenSYCL/blob/12f8c24d27c2e33e7357bb1bc44a2d12e60f427b/doc/stdpar.md by reusing major parts of the existing SYCL offloading infrastructure. The changes required for StdPar are mostly contained in the compiler frontend; AdaptiveCpp enables StdPar on the same set of platforms it supports with SYCL, which includes the HIP backend.

To support StdPar’s address space requirement, AdaptiveCpp replaces all memory-related functions in the program with HIP’s unified memory API (hipMallocManaged) calls. This essentially forces all allocations on the host to be controlled by AMD’s KFD kernel driver. A compiler flag (--opensycl-stdpar-system-usm) can be used to disable this replacement behaviour on systems with either 1) coherent host and device memory access or 2) a kernel with HMM support.

2.2 ROCm StdPar

ROCm StdPar222https://github.com/ROCmSoftwarePlatform/roc-stdpar (rocStdPar hereafter) is a new experimental StdPar implementation from AMD. The implementation currently uses LLVM’s HIP support and delegates most algorithms to ROCm’s rocThrust implementation. The compiler can be built without any ROCm components; it currently exists as a patch to upstream LLVM and a single-file glue header to rocThrust.

Like AdaptiveCpp, ROCm StdPar satisfies the single address space requirement by either substituting memory-related functions with HIP ones (enabled with --hipstdpar-interpose-alloc) or expects host-device memory to be coherent.

2.3 Intel DPC++ w/ vendor plugin

Intel’s DPC++ (ICPX) is a fork of LLVM that adds support for the SYCL programming model. The DPC++ runtime is designed in a manner that allows the development of backend plugins, enabling the execution of SYCL programs on various vendor platforms. Presently, Intel’s subsidiary, Codeplay, maintains CUDA and experimental HIP backends for DPC++. Using Codeplay’s vendor plugin, we can run SYCL programs on AMD GPUs333https://developer.codeplay.com/products/oneapi/amd/2023.2.1/guides/.

For StdPar, our past study has shown that Intel’s oneDPL header-only library is able to bridge the gap by implementing StdPar on top of SYCL[1]. For oneDPL to work while kee** the program ISO C++ compliant, we use a small shim header that replaces memory allocation functions with SYCL’s USM allocations (i.e. sycl::malloc_shared). SYCL’s USM allocation is backed by hipMallocManaged from the HIP API. Like AdaptiveCpp, allocations implemented this way allow the host and device to share the same address space.

3 Mini-apps in the evaluation

Table 1: Mini-app benchmark configurations

Mini-app Input deck Grid size Steps Total Memory Requirement BabelStream N/A N/A 100 12.9 GB miniBUDE bm1 N/A 8 271 KB (n=65536) TeaLeaf BM5@4k 4000 2 1.96 GB CloverLeaf BM16 3840 300 2.95 GB

For a comprehensive evaluation, we select a range of HPC mini-apps that cover two main scenarios: compute bound and memory-bandwidth bound applications. This section introduces the mini-apps used in our evaluation.

3.1 BabelStream

BabelStream444https://github.com/UoB-HPC/BabelStream/tree/option_for_vec implements the standard McCalpin STREAM benchmark with an additional Dot product kernel in a wide range of programming models [2, 3]. This memory-bandwidth bound benchmark measures the time taken for each unique kernel and generates memory bandwidth data in MB/s.

As shown in Table 1, we use the default iteration count of 100 and set the array size to 229superscript2292^{29}2 start_POSTSUPERSCRIPT 29 end_POSTSUPERSCRIPT(\approx 4GB) to avoid any unrealistic caching behaviours.

3.2 miniBUDE

MiniBUDE 555https://github.com/UoB-HPC/miniBUDE/tree/v2 is a molecular docking benchmark that is reduced from the full scale Bristol University Docking Enging (BUDE). The mini-app implements the virtual screening process where we evaluate energy values from docking ligand and protein molecules in different poses. This application is compute-bound, primarily because the input decks typically occupy very little memory (i.e. approx𝑎𝑝𝑝𝑟𝑜𝑥approxitalic_a italic_p italic_p italic_r italic_o italic_x 300KB). Structured similarly to BabelStream, miniBUDE is also implemented in a wide range of programming models.

MiniBUDE exposes two tuning variables: wgsize and PPWI. Variable wgsize controls the hierarchical kernel launch’s group size (e.g. the workgroup size of an NDRange launch in OpenCL terminology). StdPar and Thrust does not yet expose a way to express hierarchical parallelism, so wgsize is not tunable for these two models. The PPWI variable controls the number of poses per task (e.g. workitem in OpenCL terminology). PPWI is used in the main kernel’s innermost loop, which is statically unrolled using C++ templates; this variable is supported in all models.

For our evaluation, we use the BM1 input, with iteration count shown in Table 1. BM1 is a small problem size with just 938 proteins and 26 ligands.

3.3 CloverLeaf

CloverLeaf is a hydrodynamics mini-app that solves the compressible Euler Equations using a structured grid. This is a complex mini-app with more than 100 unique kernels. The fluid simulation is done in configurable resolution and timestamps. Each timestep involves two main categories of computation: 2D reductions and 2D grid traversal. Like BabelStream, CloverLeaf has been ported to multiple programming models.

CloverLeaf’s kernel submission frequency is high, and each kernel’s data dependencies are complex: many involve more than a dozen independent buffers.

3.4 TeaLeaf

TeaLeaf is a diffusion mini-app that solves the heat conduction equations. Like CloverLeaf, this is a complex mini-app with a high number of unique kernels. The principal access pattern uses a 5-point stencil to compute the Sparse Matrix Vector Product (SpMV) of our grid.

For our evaluation, we use the Conjugate Gradient solver. This solver, in addition to SpMV, requires multiple 2D reductions at each step which can help identify any performance weaknesses of the underlying programming model or implementation.

4 USM (XNACK) on AMD GPUs

Table 2: Platform details

Name Architecture Abbreviation Attachment XNACK Theoretical Peak Mem. Bandwidth (GB/s) Theoretical Peak FP32 FLOP/s (GFLOP/s) AMD Instinct MI100 CDNA (gfx908) MI100 PCIe 4.0 x16 Disabled (kernel) 1228 23070 AMD Radeon VII Vega 20 (gfx906) RadeonVII PCIe 3.0 x16 Enabled (kernel) 1024 13800 AMD Ryzen 7950X IGP/APU Navi 2 (gfx1036) Raphael PCIe 4.0 x8 Unsupported 96 563.2

To be able to use a host pointer in a device kernel, the GPU hardware must be able to signal a pagefault to the host and retry the access once the page has been migrated to GPU memory in some way. On AMD GPUs, this hardware feature is called XNACK. Currently, only HPC and a handful of consumer GPUs support this feature.

Enabling XNACK on AMD GPUs requires a somewhat recent Linux kernel along with a non-default kernel argument: amdgpu.noretry=0. As of Linux kernel 6.2 (mainline), XNACK still appears to be a moving target in terms of stability; earlier (5.15) kernels exhibit random panics and hangs that require a physical power-cycle to recover. Documentation on XNACK is virtually non-existent beyond kernel source code and third-party discussions666https://niconiconi.neocities.org/tech-notes/xnack-on-amd-gpus/.

Due to the factors discussed above, very few production clusters that have AMD GPUs are configured with XNACK enabled. Currently, only ORNL’s Frontier and the Crusher testbed appear to have any support for XNACK. We are uncertain if LUMI has this feature enabled. Requesting access to these machines has a lead-time of up to 6 months.

4.1 USM without XNACK

Without XNACK, AMD GPU devices cannot handle pagefaults and ROCm degrades the allocation to host-resident memory where all accesses from the device must cross the host-device interconnect (e.g., PCIe). In this degraded mode, no page migration occurs, so performance for memory-bandwidth bound applications will be limited to interconnect performance. For example, a MI100 operating at PCIe 4.0 x16 without XNACK will see application performance capped at 31.5GB/s when device-resident memory should be capable of 1228.8 GB/s, a 40x difference.

To overcome this, we implement a simple LD_PRELOAD program called UTPX (Userspace Transparent Paging Extension)777https://github.com/UoB-HPC/utpx. UTPX is a proof-of-concept shim program that accelerates HIP managed allocations (e.g. hipMallocManaged) on systems without XNACK or with XNACK disabled.

4.2 UTPX design

Refer to caption

Figure 1: UTPX HIP API interposing sequence

UTPX solves USM by shifting pagefault handling to the host where it can be easily accomplished in userspace. The implementation performs userspace page migration using mprotect and signal handlers. UTPX is an LD_PRELOAD program, and as such, can be used on existing programs without any recompilation.

Paging is done at the granularity of the allocation itself using a Mirror-on-Access (mirror) scheme. With mirror, initial allocations are resident on the host, and a separate device-resident allocation is made whenever a kernel is launched that has a dependency on the allocation. A device to host write-back is triggered by an mprotect induced pagefault, this happens if the host copy of the device-resident memory is accessed in any way (e.g., through pointer dereference).

As an alternative to the mirror scheme mentioned above, we also implement two alternative schemes of memory management:

  • The device scheme replaces all allocations with hipMalloc, effectively making all memory allocations device resident. The userspace pagefault handler is not installed here because allocations made with hipMalloc are host-visible.

  • The advise scheme does not replace hipMallocManaged, but inserts extra hipMemAdvise on allocation calls and hipMemPrefetchAsync to kernel launches; on each kernel launch, instead of creating device allocations, we simply prefetch memory from the host to the device.

Per-kernel memory dependency is resolved by introspection of the .note section of the HSA code object (HSACO) ELF image. This ELF image is available if we carefully intercept hsa_code_object_reader_create_from_memory from LD_PRELOAD. For lambda capture objects, we perform a ranged scan of pointer values at a 2-byte increment against all known past allocations. A pointer scan is required because the program binary does not store layout information for structures.

It’s worth noting that our current implementation does not support pointer indirection beyond the first level. This limitation is relevant for complex data structures like std::vector. However, a straightforward approach of chasing pointers (such as treating each argument as roots in a garbage collector) could be employed to implement this.

To associate a kernel launch-site (e.g. very-much-less-thanabsentvery-much-greater-than\lll\ggg⋘ ⋙/hipLaunchKernelGGL) with the correct set of metadata, we intercept __hipRegisterFunction which includes the pointer to the kernel function and the mangled kernel name. If the program loads the kernel image at runtime, as implemented in ICPX’s HIP plugin, we make use of the fact that a hipFunction_t is internally a pointer to the amd::DeviceFunc structure. This structure stores the mangled name of the kernel as a std::string at a 144 byte offset from the base of the pointer. Once the mangled kernel name is obtained, we can find the correct argument information by doing a lookup with metadata obtained from the ELF image.

5 Performance results

Table 3: Software versions and configuration, MI100* is a separate HPE prototype system

Software Version OS RadeonVII: Ubuntu 22.04 LTS MI100: RHEL 8.6 MI100*: SLES 15 SP5 Raphael: Fedora 37 Kernel RadeonVII: 6.4.6+HMM, amdgpu.noretry=0 MI100: 4.18+HMM MI100*: 5.14+HMM, amdgpu.noretry=0 Raphael: 6.4.15+HMM ROCm (w/ rocThrust rocPRIM) RadeonVII: 5.5.1 (LLVM 16) MI100: 5.4.1 (LLVM15) MI100*: 5.5.1 (LLVM 16) Raphael: 5.5.1 (LLVM 16) GCC, libstdc++ 12.3.0 AOMP 18.0.0 (LLVM 18) ICPX, oneDPL 2023.2.1 (LLVM 17) AdaptiveCpp SHA: fd5d1c LLVM 18, SHA: ecb855a5 ROCm StdPar SHA: e900d4 LLVM 18, SHA: ecb855a5

As this report is intended to provide early feedback for emerging StdPar models on AMD GPUs, we have opted to conduct benchmarks on a local RadeonVII test system. RadeonVII is selected for its resemblance to MI50 and the official ROCm support from AMD.

For XNACK-disabled scenarios, we present single-node MI100 results obtained by the GW4 Isambard system. We encountered difficulties in accessing AMD HPC GPUs with XNACK enabled. However, we were able to locate a MI100 system with XNACK enabled in HPE’s Grenoble prototype system. Access to this system is only available for a limited time window, so we ran exclusively XNACK-enabled benchmarks to achieve XNACK-enabled coverage on MI100.

Specific details of our hardware selection are shown in Table 2. and Table 3 lists specific versions of the compilers and software used for benchmarks presented in this section.

The rocStdPar compiler supports two modes of operation. We use rocStdPar@ITP for benchmarks that were compiled with --hipstdpar-interpose-alloc and rocStdPar@HMM for ones without. The --hipstdpar-interpose-alloc flag replaces all memory allocations (e.g. malloc) with hipMallocManaged so that the pagefault behaviour does not require kernel-level HMM support. In theory, AdaptiveCpp also supports this via the -opensycl-stdpar-system-usm but due to time constraints, we did not test this.

5.1 Results: BabelStream

Refer to caption

Figure 2: BabelStream kernel bandwidth as fraction of theoretical peak on RadeonVII, higher is better

Refer to caption

Figure 3: BabelStream kernel bandwidth as fraction of theoretical peak on MI100, higher is better

Refer to caption

Figure 4: BabelStream kernel bandwidth as fraction of theoretical peak on Raphael, higher is better

BabelStream results are mostly inline with expectations for a GPU platform: most explicit models are reaching about 80% of the bandwidth on all five kernels. USM models without XNACK, as shown in the second group from the top of Fig. 2 and Fig. 3, are limited to PCIe bandwidth and match the documented behaviour of host-resident memory. Intel’s ICPX compiler as a whole appears suboptimal with Codeplay’s HIP backend for memory-bandwidth bound kernels.

On RadeonVII, the use of XNACK does not appear to impose significant overhead: StdPar models on all three implementations performed nearly on-par with other more established models. Likewise, on both RadeonVII and MI100, UTPX in mirror mode successfully restored near full performance for all USM models. The effectiveness of UTPX is unexpected; we expected that userspace pagefault handling would have significant overhead, and adding an extra layer of indirection for all kernel launches would also add severe latency to the program.

Results for rocStdPar presented in this section fail validation. Looking at the memory allocation logic of rocStdPar, it employs a fairly complex allocation scheme when the interposing mode is enabled: rocStdPar attempts to allocate a large page-aligned block of memory for both bookkee** and as the primary allocation on top of hipMallocManaged. When UTPX is used together with rocStdPar on large allocations (> 1GB), the host write-back (see Fig. 1) on pagefault seems to copy pages that are corrupted. It’s unclear why this only occurs for large allocations.

UTPX’s advice mode failed to provide a meaningful uplift for rocStdPar, with performance equal to the bandwidth of XNACK and UTPX disabled results. We suspect the complex allocation scheme broke ROCm’s undocumented alignment invariants required for hipMemPrefetchAsync and hipMemAdvise to function correctly. A simple validation was done where we replace the allocation behaviour of rocStdPar with just hipMallocManaged and nothing else. With this change, advice mode was able to match the performance of mirror mode, and BabelStream passes validation.

Device mode was tested with BabelStream reporting bandwidth figures on-par with explicit models. However, it’s important to note that BabelStream does not time the final device-to-host transfer, which is needed for validation. Consequently, while the bandwidth measured during kernel execution accurately reflects the performance, device mode introduces additional data movement time that can dominate the overall benchmark results. In light of this, we have omitted device mode results as we are unsure how this would fit with existing bandwidth measurements. An issue888https://github.com/UoB-HPC/BabelStream/issues/161 has been opened in the BabelStream repository for us to assess how the transfer timing information can be included in the output of the benchmark.

Results on the Raphael APUs, as shown in Fig. 4, stand out. Explicit models show a significant lead over USM ones, despite the GPU sharing the memory with the host. After consulting with AMD and reviewing the relevant hardware topology documentations, we believe the main cause of this was due to the limited attachment bandwidth; the GPU communicates with the CPU using an on-die PCIe 4.0 x8 connection.

Refer to caption

Figure 5: miniBUDE normalised runtime across all models against HIP on RadeonVII, lower is better

Refer to caption

Figure 6: miniBUDE normalised runtime across all models against HIP on MI100, lower is better

Refer to caption

Figure 7: miniBUDE normalised runtime across all models against HIP on Raphael, lower is better

5.2 Results: miniBUDE

Refer to caption

Figure 8: miniBUDE PPWI/wgsize tuning on RadeonVII, fastest combination outlined in red, lower is better

Refer to caption

Figure 9: miniBUDE PPWI/wgsize tuning on MI100, fastest combination outlined in red, lower is better
Figure 10: miniBUDE PPWI/wgsize tuning on Raphael, fastest combination outlined in red, lower is better

Refer to caption

Refer to caption (a) RadeonVII Refer to caption (b) MI100 Refer to caption (c) Raphael
Figure 10: miniBUDE PPWI/wgsize tuning on Raphael, fastest combination outlined in red, lower is better
Figure 11: miniBUDE PPWI/wgsize tuning for models without wgsize parameter, fastest combination outlined in red, lower is better

MiniBUDE results are presented in two parts to show 1) effects of the tuning parameters, or the lack thereof, on specific models, and 2) the effective performance after tuning.

Fig. 8, Fig. 9, Fig. 11 present heatmaps where the X axis scales PPWI and Y axis scales the wgsize parameter. This benchmark highlights a crucial tradeoff that models like StdPar and Thrust made: workgroup size is an implementation detail and not programmer adjustable. Across all tuning results shown here, the best performing PPWI and wgsize combination changes for different platforms and programming models. In certain cases, selecting parameters that are immediately adjacent in any direction to the best performing combination can see an up-to 50% loss (Fig. 9).

Using the best performing PPWI and wgsize combination, we compare results from all models and compilers in Fig. 5, Fig. 6, and Fig. 7. Here, the fastest model is OpenCL, even AMD’s first party model, HIP, trailed behind. This agrees with our past evaluations for miniBUDE[4]. We suspect the added complexity of C++-based abstractions is introducing unnecessary optimisation burdens to the main kernel. MiniBUDE is highly sensitive to missed or poorly optimised code due to its loop structure; the pose count will amplify any suboptimal code path in the ligand and atom inner loop.

StdPar models all performed about the same and matched the performance of HIP. Surprisingly, Thrust (backed by rocThrust) was a lot worse compared with StdPar results.

While miniBUDE is not a memory-bandwidth bound benchmark, we still enable UTPX to gauge the relative overhead of intercepting HIP API calls to perform complex kernel argument manipulations. The overall performance impact across both RadeonVII MI100 appears to be minimal.

Support for the Raphael APU is limited. For ICPX, the compiler is missing an entry from the architecture table, and a bug has been opened to track this issue 999https://github.com/intel/llvm/issues/11203. In general, we find downstream vendors are reluctant to support GPUs that are outside ROCm’s support, we discuss why this is an issue in Section 6. Possibly for the same reason, AMD’s own AOMP compiler was unable to target the APU even though a PR to support this has already been merged 101010https://github.com/ROCm-Developer-Tools/aomp/pull/452. In this instance, AOMP appears to be missing the required math library.

5.3 Results: CloverLeaf

Refer to caption

Figure 12: CloverLeaf normalised runtime across all models against HIP on RadeonVII, higher is better

Refer to caption

Figure 13: CloverLeaf normalised runtime across all models against HIP on MI100, higher is better

Refer to caption

Figure 14: CloverLeaf normalised runtime across all models against HIP on Raphael, higher is better

CloverLeaf results are presented in Fig. 14, Fig. 12, and Fig. 13. In these figures, all StdPar implementations were able to achieve approximately 70% of HIP performance with XNACK enabled. Performance without XNACK was notably poor, as expected due to the default host-resident behaviour of hipMallocManaged. The lower performance of ICPX, as discussed in Section 5.1, was also reflected here.

As with BabelStream, we applied our UTPX program for USM models, which resulted in a consistent and substantial performance improvement across with all UTPX modes. This improvement not only brought performance closer to what was achieved with HIP but, in specific cases, even matched it. Except ICPX, all StdPar models managed to outperform Kokkos and were roughly on par with the non-USM variant of SYCL. Enabling interposing for rocStdPar did not alter the performance characteristic in any meaningful way.

For USM models of CloverLeaf, data access after the initial buffer setup and domain decomposition was almost entirely device-resident, with host access only required for reductions occurring every 20 time steps. In such access patterns, the performance of UTPX in device mode was comparable to mirror mode.

Interestingly, both device and mirror mode outperformed the hardware-assisted XNACK and the less intrusive advise mode. For a device-resident application like CloverLeaf, the overhead of dynamic page fault management, whether done in software or hardware, may be challenging to recover without more advanced (preemptive or compiler-assisted) heuristics. However, we found the performance of XNACK compared to software solutions like UTPX to be underwhelming. This could potentially be attributed to the extra register usage, as the code was compiled without specific targeting of XNACK. Additionally, it’s worth noting that UTPX’s advise mode consistently failed on different systems, as observed from the results on RadeonVII and MI100. We suspect that ROCm failed to create an allocation with the correct alignment requirements at runtime.

On the Raphael APU, USM models, including all implementations of StdPar, achieved nearly 90% of HIP performance. This demonstrates that under ideal conditions with a shared address space between the CPU and the GPU, the model itself did not impose significant performance overheads. This consistent overhead aligns with our previous studies on other APU platforms as discussed in our past work [1]. As discussed in Section 5.2, AOMP and ICPX do not yet support the Raphael APU.

Note that UTPX was developed in a relatively short (< 3 days) amount of time. While benchmarking, we have identified an issue where the mirror mode would occasionally result in deadlocks on larger allocations. Rerunning the benchmark usually succeeds immediately after a failed run. We think the root cause is a stack value corruption that originates from within ROCm.

5.4 Result: TeaLeaf

Refer to caption

Figure 15: TeaLeaf normalised runtime across all models against HIP on RadeonVII, higher is better

Refer to caption

Figure 16: TeaLeaf normalised runtime across all models against HIP on MI100, higher is better

Refer to caption

Figure 17: TeaLeaf normalised runtime across all models against HIP on Raphael, higher is better

TeaLeaf results are presented in Fig. 17, Fig. 15, and Fig. 16. The overall outcome is similar to CloverLeaf discussed in Section 5.3 with a few notable differences.

Unfortunately, even in an application that has more mixed host-device access, XNACK was still unable to outperform software solutions in a significant way. Here, UTPX in mirror and device mode both showed performance that is closer to explicit memory models, whereas XNACK’s performance was on par with UTPX’s mirror mode.

UTPX’s advise mode performed very poorly except for AdaptiveCpp. We believe it’s the same alignment issue discussed in Section 3.1 and Section 5.3.

ICPX with oneDPL performed very poorly. This is due to the suboptimal std::transform_reduce implementation, possibly because it has not been tuned on AMD platforms. CloverLeaf did not have this issue because reductions are only used once per time step, whereas two thirds of the core CG solver in TeaLeaf are implemented with iterative calls to std::transform_reduce.

Results on the Raphael APU are less clear. While the overall performance for USM models is acceptable, only rocStdPar reached performance parity with explicit models like SYCL. As discussed in Section 5.2, AOMP and ICPX do not yet support the Raphael APU.

Results with UTPX enabled exhibited the same transient deadlock issue described in Section 5.3 with a much lower incidence.

6 ROCm experience

The overall stability and user experience of the ROCm software stack is poor. This section details a non-exhaustive list of issues we have encountered while collecting results for this report.

6.1 Documentation

The HIP API documentation is incomplete. For example, the recently launched (accessed September 2023) documentation website contains no description at all for the hipMalloc method 111111https://docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory.html. Even in the legacy documentation 121212https://rocm-developer-tools.github.io/HIP/group__Memory.html, the description failed to show a critical difference from CUDA’s cudaMalloc: hipMalloc allocates host-accessible (conditional on large-BAR support) but device-resident memory. In fact, this behaviour is not documented anywhere from AMD; only documentation from ORNL’s Crusher131313https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html#enabling-gpu-page-migration had details on this.

The lack of detail on critical APIs like these is widespread. Almost none of the method descriptions in the Managed Memory section141414https://docs.amd.com/projects/HIP/en/latest/.doxygen/docBin/html/group___memory_m.html provided any comment on performance, known issues, or expected use cases. The matching CUDA API documentation does not have this issue.

6.2 Hardware support

ROCm currently only officially supports nine GPU SKUs on Linux. Of these, RadeonVII is the only consumer-grade card, and it is no longer available on the market for purchase. As such, it is essentially impossible to validate your code on AMD GPUs without having access to an HPC or cloud service provider. On NVIDIA platforms, one can simply purchase any consumer GPU and expect near identical baseline software support consistent with that on HPC systems.

Anecdotal evidence suggests that GPUs outside the official ROCm support list will still work. However, this is not always the case. For example, AOMP does not currently support first generation Navi GPUs (gfx1012). We have also observed validation failures with hipMallocManaged on both gfx1012 and gfx1036. For comparison, the CUDA SDK works on all NVIDIA GPUs (both consumer and HPC) as old as the Kepler generation.

Finally, ROCm 5.7.0 marks gfx906 (RadeonVII and MI50) as EOL, thus eliminating 30% of the ROCm supported GPUs and also leaves us with no supported consumer GPUs.

6.3 Validation errors

Enabling XNACK on ROCm 5.6.1 gives incorrect results for BabelStream implemented in HIP, rocStdPar, SYCL, and possibly all other models. This issue is reproducible on RadeonVII. We had brief access to a MI100 with XNACK enabled and were able to verify that this issue is also present on MI100s. It appears to be a bug in the kernel driver and rolling back to ROCm 5.5.1 or disabling XNACK resolves the issue.

We have confirmation from AMD that this has also been resolved for ROCm 5.7 and newer.

6.4 Driver quality

The KFD kernel driver or the device firmware appears to be quite brittle. Throughout benchmark runs, we’ve encountered random kernel panics originating from the amdgpu module, system hangs, and GPU resets due to IB timeout. On our test system that hosts the RadeonVII (see Table 3 for the kernel version), when an IB timeout occurs, the kernel initiated BACO reset frequently fails to revive the GPU. Furthermore, a user initiated BACO (Bus-Active, Chip-Off) reset sometimes causes the GPU to go offline for the entire session.

On MI100, we’ve encountered unexplainable HSA queue hangs which cause the application to deadlock. The issue only seems to appear during high-load scenarios and is hard to pinpoint.

We have confirmation from AMD that these issues are being investigated, and the situation may improve on Linux kernel 6.7 or newer when paired with an up-to-date ROCm release.

6.5 Build difficulties

The creation of UTPX is in part motivated by the fact that ROCm is incredibly challenging to build from source. The difficulty is reflected by the existence of multiple projects 151515https://github.com/PawseySC/rocm-from-source 161616https://github.com/xuhuisheng/rocm-build that attempts to document and provide more automated ways of building ROCm. To add to this, none of the README pages on any of the ROCm’s subproject are consistent: many contain outdated build and usage information.

The ROCm binary repositories for package managers do not appear to host source or debug packages. These are required for gdb to resolve symbols for libraries in the ROCm stack. Without these packages, gdb is unable to give precise backtraces and shows ?? next the stack pointer value. Interestingly, the official documentation that covers HIP debugging 171717https://rocm.docs.amd.com/projects/HIP/en/latest/how_to_guides/debugging.html provided a sample gdb session that displayed this very issue.

6.6 Missing tools

There is a distinct lack of coherent tooling for AMD GPUs.

  • ROCm’s rocgdb does not work on newer kernel versions such as the one used in our RadeonVII test system, rocgdb reports KFD_IOC_DBG_TRAP_GET_VERSION.

  • ROCm (>= 5.5.1) is released with two profilers, rocprof and rocprofv2, and it’s unclear which one is the preferred profiler. ROCm appears to have shipped a work-in-progress rocprofv2, as it contains a non-functional plugin system (e.g. the att plugin is incomplete).

  • Newer profiling tools such as OmniTrace do not support all ROCm platforms: only MI100 and newer CDNA cards are supported. Beyond this, there are no working graphical profiling or tracing solutions on Linux as CodeXL is deprecated.

7 Conclusion

This report evaluates three emerging StdPar implementations: ROCm StdPar, AdaptiveCpp StdPar, and Intel DPC++ with plugins. Overall, for memory-bandwidth bound applications, the performance of all StdPar implementations is tied to the availability and quality of USM, while for compute-bound applications, the performance is highly dependent on compiler optimisations and the programming model.

We find it encouraging that StdPar is now supported on all major GPU vendors with both first-party and third-party implementations. Specifically, AdaptiveCpp and rocStdPar both performed comparably on AMD GPUs, with similar performance to other C++ models like Kokkos or SYCL. However, we stress that this performance is only achievable with XNACK enabled, or alternatively, with software workarounds such as UTPX. Unfortunately, DPC++ with oneDPL showed poor performance for memory-bandwidth bound applications, and we suspect the DPC++ vendor plugin for AMD GPU was never tested in production or validated beyond basic smoke tests.

To provide the USM performance that StdPar requires, we’ve demonstrated a way to implement USM in userspace without needing compiler/runtime modifications or recompilation of the user program. Our UTPX design showed that the fix can be done transparently at the HIP API level; no changes are needed for programming models that use the HIP API, be it SYCL or StdPar. We hope that UTPX can motivate AMD to consider adding software-based USM support to improve ROCm’s usability and applicability.

On hardware-accelerated USM, even on GPUs with native support via XNACK, specific, non-default, kernel and software configuration are required. The performance of XNACK, in many cases, showed lower performance compared with UTPX. We find the overall state of USM support on AMD platforms alarming. It’s unclear how USM will be supported for newer GPUs such as Navi2/3 and onwards due to missing hardware feature. Fragmentation of the consumer and HPC space based on feature omissions like this only impacts the applicability of AMD GPUs.

Finally, our experience compiling this report and past effort porting our mini-apps to AMD GPUs[4] has highlighted weaknesses in the ROCm software stack. These shortcomings encompass documentation, tooling, stability, compatibility, and correctness. We hope that AMD continues to make efforts to improve the status quo, allowing programming models like StdPar, and many others, to fully exploit the potential of AMD hardware.

8 Future work

StdPar support for AMD platforms is a fast-moving area; as such, the intent of this report is to provide timely feedback for all StdPar implementations. To save time, we have omitted experiments with using USM for the HIP model (i.e. replacing hipMalloc with hipMallocManaged) as not all mini-apps have this implemented and validated. However, enabling this is straightforward and a future study should investigate this.

Rerunning this study on newer AMD HPC GPUs, such as the MI250X on LUMI, would provide valuable insight on whether XNACK can provide better performance than software-based solutions. Specifically, we would like to experiment with AOMP’s OpenMP USM support to see how it compares with our results from this report.

9 Acknowledgement

We would like to thank Aksel Alpay for his work on AdaptiveCpp and his continuous support on any issues we’ve raised while compiling this report.

We would like to thank Tim Dykes’ assistance from the HPE HPC/AI EMEA Research Lab, this work is carried out as part of the GW4 Isambard collaboration.

We would like to thank Alexandru Voicu for his feedback on many of the technical details concerning XNACK, the design of roc-stdpar, and the AMD platform in general.

This work used results and software developed as part of Intel oneAPI Centre of Excellence.

This work used the Isambard UK National Tier-2 HPC Service (https://gw4.ac.uk/isambard) operated by GW4 and the UK Met Office, and funded by EPSRC (EP/P020224/1). This work used the DiRAC@Durham facility managed by the Institute for Computational Cosmology on behalf of the STFC DiRAC HPC Facility (www.dirac.ac.uk). The equipment was funded by BEIS capital funding via STFC capital grants ST/P002293/1, ST/R002371/1 and ST/S002502/1, Durham University and STFC operations grant ST/R000832/1. DiRAC is part of the National e-Infrastructure. This work used the HPC Zoo, a multi-platform research cluster managed by the High-Performance Computing Group at the University of Bristol (https://uob-hpc.github.io/zoo).

References

  • [1] W.-C. Lin, T. Deakin, and S. McIntosh-Smith, “Evaluating ISO C++ Parallel Algorithms on Heterogeneous HPC Systems,” in International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems held in conjunction with Supercomputing (PMBS), IEEE, 2022.
  • [2] J. D. McCalpin et al., “Memory bandwidth and machine balance in current high performance computers,” IEEE computer society technical committee on computer architecture (TCCA) newsletter, vol. 2, no. 19-25, 1995.
  • [3] T. Deakin, J. Price, M. Martineau, and S. McIntosh-Smith, “Evaluating attainable memory bandwidth of parallel programming models via BabelStream,” International Journal of Computational Science and Engineering, vol. 17, no. 3, pp. 247–262, 2018.
  • [4] W.-C. Lin and S. McIntosh-Smith, “Comparing julia to performance portable parallel programming models for hpc,” in 2021 International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp. 94–105, 2021.

Appendix A Artefact description

Source code for UTPX is available at https://github.com/UoB-HPC/utpx. We’ve used commit 2a38257b1800e4ac2a2c937ae08e26fd49960ddf for all experiments involving UTPX. UTPX is built with the default GCC compiler on each platform. Building UTPX is straightforward and documented in the README of the source code repository.

Source code for BabelStream, with all the models presented in this study, is available at https://github.com/UoB-HPC/BabelStream. We’ve used the option_for_vec branch at commit 87a38e949df2894a7d25ef8782dd96e3978f31ff. No modifications were made to BabelStream for any of the benchmarks.

Likewise, source code for miniBUDE is available at https://github.com/UoB-HPC/miniBUDE. We’ve used the v2 branch at commit bea30762acaefee54ebaf3c68713b66414345e12.

Finally, source code for CloverLeaf and TeaLeaf is available at https://github.com/UoB-HPC/cloverleaf and https://github.com/UoB-HPC/tealeaf respectively. We’ve used commit 4306b008eb21b0dbdad7bd241dfb6a5a337609ca for CloverLeaf and 5ee7d753bbb8e8d60b945c0359e00c07aafbab81 for TeaLeaf.

Build flags for each mini-app and compiler combination are recorded at https://github.com/UoB-HPC/performance-portability/tree/2023-benchmarking-amd-stdpar. For example, the build flags for miniBUDE on RadeonVII is available under benchmarking/2023/bude/radeonvii-local/benchmark.sh.