Search | arXiv e-print repository

Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper

Authors: Gabin Schieffer, Jacob Wahlgren, Jie Ren, Jennifer Faj, Ivy Peng

Abstract: Memory management across discrete CPU and GPU physical memory is traditionally achieved through explicit GPU allocations and data copy or unified virtual memory. The Grace Hopper Superchip, for the first time, supports an integrated CPU-GPU system page table, hardware-level addressing of system allocated memory, and cache-coherent NVLink-C2C interconnect, bringing an alternative solution for enabl… ▽ More Memory management across discrete CPU and GPU physical memory is traditionally achieved through explicit GPU allocations and data copy or unified virtual memory. The Grace Hopper Superchip, for the first time, supports an integrated CPU-GPU system page table, hardware-level addressing of system allocated memory, and cache-coherent NVLink-C2C interconnect, bringing an alternative solution for enabling a Unified Memory system. In this work, we provide the first in-depth study of the system memory management on the Grace Hopper Superchip, in both in-memory and memory oversubscription scenarios. We provide a suite of six representative applications, including the Qiskit quantum computing simulator, using system memory and managed memory. Using our memory utilization profiler and hardware counters, we quantify and characterize the impact of the integrated CPU-GPU system page table on GPU applications. Our study focuses on first-touch policy, page table entry initialization, page sizes, and page migration. We identify practical optimization strategies for different access patterns. Our results show that as a new solution for unified memory, the system-allocated memory can benefit most use cases with minimal porting efforts. △ Less

Submitted 10 July, 2024; originally announced July 2024.

Comments: Accepted to ICPP '24 (The 53rd International Conference on Parallel Processing)

arXiv:2406.11760 [pdf, other]

Understanding Layered Portability from HPC to Cloud in Containerized Environments

Authors: Daniel Medeiros, Gabin Schieffer, Jacob Wahlgren, Ivy Peng

Abstract: Recent development in lightweight OS-level virtualization, containers, provides a potential solution for running HPC applications on the cloud platform. In this work, we focus on the impact of different layers in a containerized environment when migrating HPC containers from a dedicated HPC system to a cloud platform. On three ARM-based platforms, including the latest Nvidia Grace CPU, we use six… ▽ More Recent development in lightweight OS-level virtualization, containers, provides a potential solution for running HPC applications on the cloud platform. In this work, we focus on the impact of different layers in a containerized environment when migrating HPC containers from a dedicated HPC system to a cloud platform. On three ARM-based platforms, including the latest Nvidia Grace CPU, we use six representative HPC applications to characterize the impact of container virtualization, host OS and kernel, and rootless and privileged container execution. Our results indicate less than 4\% container overhead in DGEMM, miniMD, and XSBench, but 8\%-10\% overhead in FFT, HPCG, and Hypre. We also show that changing between the container execution modes results in negligible performance differences in the six applications. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: Submitted to ISC Workshop - Workshop on Converged Computing '24, preprint

arXiv:2308.14780 [pdf, other]

doi 10.1145/3581784.3607108

A Quantitative Approach for Adopting Disaggregated Memory in HPC Systems

Authors: Jacob Wahlgren, Gabin Schieffer, Maya Gokhale, Ivy Peng

Abstract: Memory disaggregation has recently been adopted in data centers to improve resource utilization, motivated by cost and sustainability. Recent studies on large-scale HPC facilities have also highlighted memory underutilization. A promising and non-disruptive option for memory disaggregation is rack-scale memory pooling, where shared memory pools supplement node-local memory. This work outlines the… ▽ More Memory disaggregation has recently been adopted in data centers to improve resource utilization, motivated by cost and sustainability. Recent studies on large-scale HPC facilities have also highlighted memory underutilization. A promising and non-disruptive option for memory disaggregation is rack-scale memory pooling, where shared memory pools supplement node-local memory. This work outlines the prospects and requirements for adoption and clarifies several misconceptions. We propose a quantitative method for dissecting application requirements on the memory system from the top down in three levels, moving from general, to multi-tier memory systems, and then to memory pooling. We provide a multi-level profiling tool and LBench to facilitate the quantitative approach. We evaluate a set of representative HPC workloads on an emulated platform. Our results show that prefetching activities can significantly influence memory traffic profiles. Interference in memory pooling has varied impacts on applications, depending on their access ratios to memory tiers and arithmetic intensities. Finally, in two case studies, we show the benefits of our findings at the application and system levels, achieving 50% reduction in remote access and 13% speedup in BFS, and reducing performance variation of co-located workloads in interference-aware job scheduling. △ Less

Submitted 28 August, 2023; originally announced August 2023.

Comments: Accepted to SC23 (The International Conference for High Performance Computing, Networking, Storage, and Analysis 2023)

arXiv:2308.00763 [pdf, other]

Boosting the Performance of Object Tracking with a Half-Precision Particle Filter on GPU

Authors: Gabin Schieffer, Nattawat Pornthisan, Daniel Araújo de Medeiros, Stefano Markidis, Jacob Wahlgren, Ivy Peng

Abstract: High-performance GPU-accelerated particle filter methods are critical for object detection applications, ranging from autonomous driving, robot localization, to time-series prediction. In this work, we investigate the design, development and optimization of particle-filter using half-precision on CUDA cores and compare their performance and accuracy with single- and double-precision baselines on N… ▽ More High-performance GPU-accelerated particle filter methods are critical for object detection applications, ranging from autonomous driving, robot localization, to time-series prediction. In this work, we investigate the design, development and optimization of particle-filter using half-precision on CUDA cores and compare their performance and accuracy with single- and double-precision baselines on Nvidia V100, A100, A40 and T4 GPUs. To mitigate numerical instability and precision losses, we introduce algorithmic changes in the particle filters. Using half-precision leads to a performance improvement of 1.5-2x and 2.5-4.6x with respect to single- and double-precision baselines respectively, at the cost of a relatively small loss of accuracy. △ Less

Submitted 1 August, 2023; originally announced August 2023.

Comments: 12 pages, 8 figures, conference. To be published in The 21st International Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms (HeteroPar2023)

arXiv:2307.14860 [pdf, other]

Quantum Computer Simulations at Warp Speed: Assessing the Impact of GPU Acceleration

Authors: Jennifer Faj, Ivy Peng, Jacob Wahlgren, Stefano Markidis

Abstract: Quantum computer simulators are crucial for the development of quantum computing. In this work, we investigate the suitability and performance impact of GPU and multi-GPU systems on a widely used simulation tool - the state vector simulator Qiskit Aer. In particular, we evaluate the performance of both Qiskit's default Nvidia Thrust backend and the recent Nvidia cuQuantum backend on Nvidia A100 GP… ▽ More Quantum computer simulators are crucial for the development of quantum computing. In this work, we investigate the suitability and performance impact of GPU and multi-GPU systems on a widely used simulation tool - the state vector simulator Qiskit Aer. In particular, we evaluate the performance of both Qiskit's default Nvidia Thrust backend and the recent Nvidia cuQuantum backend on Nvidia A100 GPUs. We provide a benchmark suite of representative quantum applications for characterization. For simulations with a large number of qubits, the two GPU backends can provide up to 14x speedup over the CPU backend, with Nvidia cuQuantum providing further 1.5-3x speedup over the default Thrust backend. Our evaluation on a single GPU identifies the most important functions in Nvidia Thrust and cuQuantum for different quantum applications and their compute and memory bottlenecks. We also evaluate the gate fusion and cache-blocking optimizations on different quantum applications. Finally, we evaluate large-number qubit quantum applications on multi-GPU and identify data movement between host and GPU as the limiting factor for the performance. △ Less

Submitted 27 July, 2023; originally announced July 2023.

arXiv:2211.02682 [pdf, other]

doi 10.1109/MCHPC56545.2022.00007

Evaluating Emerging CXL-enabled Memory Pooling for HPC Systems

Authors: Jacob Wahlgren, Maya Gokhale, Ivy B. Peng

Abstract: Current HPC systems provide memory resources that are statically configured and tightly coupled with compute nodes. However, workloads on HPC systems are evolving. Diverse workloads lead to a need for configurable memory resources to achieve high performance and utilization. In this study, we evaluate a memory subsystem design leveraging CXL-enabled memory pooling. Two promising use cases of compo… ▽ More Current HPC systems provide memory resources that are statically configured and tightly coupled with compute nodes. However, workloads on HPC systems are evolving. Diverse workloads lead to a need for configurable memory resources to achieve high performance and utilization. In this study, we evaluate a memory subsystem design leveraging CXL-enabled memory pooling. Two promising use cases of composable memory subsystems are studied -- fine-grained capacity provisioning and scalable bandwidth provisioning. We developed an emulator to explore the performance impact of various memory compositions. We also provide a profiler to identify the memory usage patterns in applications and their optimization opportunities. Seven scientific and six graph applications are evaluated on various emulated memory configurations. Three out of seven scientific applications had less than 10% performance impact when the pooled memory backed 75% of their memory footprint. The results also show that a dynamically configured high-bandwidth system can effectively support bandwidth-intensive unstructured mesh-based applications like OpenFOAM. Finally, we identify interference through shared memory pools as a practical challenge for adoption on HPC systems. △ Less

Submitted 4 November, 2022; originally announced November 2022.

Comments: 10 pages, 13 figures. Accepted for publication in Workshop on Memory Centric High Performance Computing (MCHPC'22) at SC22

arXiv:2207.07098 [pdf, other]

Large-Scale Direct Numerical Simulations of Turbulence Using GPUs and Modern Fortran

Authors: Martin Karp, Daniele Massaro, Niclas Jansson, Alistair Hart, Jacob Wahlgren, Philipp Schlatter, Stefano Markidis

Abstract: We present our approach to making direct numerical simulations of turbulence with applications in sustainable ship**. We use modern Fortran and the spectral element method to leverage and scale on supercomputers powered by the Nvidia A100 and the recent AMD Instinct MI250X GPUs, while still providing support for user software developed in Fortran. We demonstrate the efficiency of our approach by… ▽ More We present our approach to making direct numerical simulations of turbulence with applications in sustainable ship**. We use modern Fortran and the spectral element method to leverage and scale on supercomputers powered by the Nvidia A100 and the recent AMD Instinct MI250X GPUs, while still providing support for user software developed in Fortran. We demonstrate the efficiency of our approach by performing the world's first direct numerical simulation of the flow around a Flettner rotor at Re=30'000 and its interaction with a turbulent boundary layer. We present one of the first performance comparisons between the AMD Instinct MI250X and Nvidia A100 GPUs for scalable computational fluid dynamics. Our results show that one MI250X offers performance on par with two A100 GPUs and has a similar power efficiency. △ Less

Submitted 23 June, 2022; originally announced July 2022.

Comments: 13 pages, 7 figures

ACM Class: G.4; J.2

Showing 1–7 of 7 results for author: Wahlgren, J