-
Analyzing Resource Utilization in an HPC System: A Case Study of NERSC Perlmutter
Authors:
Jie Li,
George Michelogiannakis,
Brandon Cook,
Dulanya Cooray,
Yong Chen
Abstract:
Resource demands of HPC applications vary significantly. However, it is common for HPC systems to primarily assign resources on a per-node basis to prevent interference from co-located workloads. This gap between the coarse-grained resource allocation and the varying resource demands can lead to HPC resources being not fully utilized. In this study, we analyze the resource usage and application be…
▽ More
Resource demands of HPC applications vary significantly. However, it is common for HPC systems to primarily assign resources on a per-node basis to prevent interference from co-located workloads. This gap between the coarse-grained resource allocation and the varying resource demands can lead to HPC resources being not fully utilized. In this study, we analyze the resource usage and application behavior of NERSC's Perlmutter, a state-of-the-art open-science HPC system with both CPU-only and GPU-accelerated nodes. Our one-month usage analysis reveals that CPUs are commonly not fully utilized, especially for GPU-enabled jobs. Also, around 64% of both CPU and GPU-enabled jobs used 50% or less of the available host memory capacity. Additionally, about 50% of GPU-enabled jobs used up to 25% of the GPU memory, and the memory capacity was not fully utilized in some ways for all jobs. While our study comes early in Perlmutter's lifetime thus policies and application workload may change, it provides valuable insights on performance characterization, application behavior, and motivates systems with more fine-grain resource allocation.
△ Less
Submitted 12 March, 2023; v1 submitted 12 January, 2023;
originally announced January 2023.
-
Efficient Intra-Rack Resource Disaggregation for HPC Using Co-Packaged DWDM Photonics
Authors:
George Michelogiannakis,
Yehia Arafa,
Brandon Cook,
Liang Yuan Dai,
Abdel Hameed Badawy,
Madeleine Glick,
Yuyang Wang,
Keren Bergman,
John Shalf
Abstract:
The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Resource disaggregation allows compute and memory resources to be allocated individually as required to each workload. However, it is unclear how to efficiently realize this capability and cost-effectively meet the stringent bandwidth a…
▽ More
The diversity of workload requirements and increasing hardware heterogeneity in emerging high performance computing (HPC) systems motivate resource disaggregation. Resource disaggregation allows compute and memory resources to be allocated individually as required to each workload. However, it is unclear how to efficiently realize this capability and cost-effectively meet the stringent bandwidth and latency requirements of HPC applications. To that end, we describe how modern photonics can be co-designed with modern HPC racks to implement flexible intra-rack resource disaggregation and fully meet the bit error rate (BER) and high escape bandwidth of all chip types in modern HPC racks. Our photonic-based disaggregated rack provides an average application speedup of 11% (46% maximum) for 25 CPU and 61% for 24 GPU benchmarks compared to a similar system that instead uses modern electronic switches for disaggregation. Using observed resource usage from a production system, we estimate that an iso-performance intra-rack disaggregated HPC system using photonics would require 4x fewer memory modules and 2x fewer NICs than a non-disaggregated baseline.
△ Less
Submitted 17 July, 2023; v1 submitted 9 January, 2023;
originally announced January 2023.
-
PaST-NoC: A Packet-Switched Superconducting Temporal NoC
Authors:
Darren Lyles,
Patricia Gonzalez-Guerrero,
Meriam Gay Bautista,
George Michelogiannakis
Abstract:
Temporal computing promises to mitigate the stringent area constraints and clock distribution overheads of traditional superconducting digital computing. To design a scalable, area- and power-efficient superconducting network on chip (NoC), we propose packet-switched superconducting temporal NoC (PaST-NoC). PaST-NoC operates its control path in the temporal domain using race logic (RL), combined w…
▽ More
Temporal computing promises to mitigate the stringent area constraints and clock distribution overheads of traditional superconducting digital computing. To design a scalable, area- and power-efficient superconducting network on chip (NoC), we propose packet-switched superconducting temporal NoC (PaST-NoC). PaST-NoC operates its control path in the temporal domain using race logic (RL), combined with bufferless deflection flow control to minimize area. Packets encode their destination using RL and carry a collection of data pulses that the receiver can interpret as pulse trains, RL, serialized binary, or other formats. We demonstrate how to scale up PaST-NoC to arbitrary topologies based on 2x2 routers and 4x4 butterflies as building blocks. As we show, if data pulses are interpreted using RL, PaST-NoC outperforms state-of-the-art superconducting binary NoCs in throughput per area by as much as 5x for long packets.
△ Less
Submitted 9 January, 2023; v1 submitted 17 October, 2022;
originally announced October 2022.
-
TIGER: Topology-aware Assignment using Ising machines Application to Classical Algorithm Tasks and Quantum Circuit Gates
Authors:
Anastasiia Butko,
Ilyas Turimbetov,
George Michelogiannakis,
David Donofrio,
Didem Unat,
John Shalf
Abstract:
Optimally map** a parallel application to compute and communication resources is increasingly important as both system size and heterogeneity increase. A similar map** problem exists in gate-based quantum computing where the objective is to map tasks to gates in a topology-aware fashion. This is an NP-complete graph isomorphism problem, and existing task assignment approaches are either heuris…
▽ More
Optimally map** a parallel application to compute and communication resources is increasingly important as both system size and heterogeneity increase. A similar map** problem exists in gate-based quantum computing where the objective is to map tasks to gates in a topology-aware fashion. This is an NP-complete graph isomorphism problem, and existing task assignment approaches are either heuristic or based on physical optimization algorithms, providing different speed and solution quality trade-offs. Ising machines such as quantum and digital annealers have recently become available and offer an alternative hardware solution to solve this type of optimization problems. In this paper, we propose an algorithm that allows solving the topology-aware assignment problem using Ising machines. We demonstrate the algorithm on two use cases, i.e. classical task scheduling and quantum circuit gate scheduling. TIGER---topology-aware task/gate assignment mapper tool---implements our proposed algorithms and automatically integrates them into the quantum software environment. To address the limitations of physical solver, we propose and implement a domain-specific partition strategy that allows solving larger-scale problems and a weight optimization algorithm that allows tuning Ising model parameters to achieve better restuls. We use D-Wave's quantum annealer to demonstrate our algorithm and evaluate the proposed tool flow in terms of performance, partition efficiency, and solution quality. Results show significant speed-up compared to classical solutions, better scalability, and higher solution quality when using TIGER together with the proposed partition method. It reduces the data movement cost by 68\% in average for quantum circuit assignment compared to the IBM QX optimizer.
△ Less
Submitted 21 September, 2020;
originally announced September 2020.
-
Understanding Quantum Control Processor Capabilities and Limitations through Circuit Characterization
Authors:
Anastasiia Butko,
George Michelogiannakis,
Samuel Williams,
Costin Iancu,
David Donofrio,
John Shalf,
Jonathan Carter,
Irfan Siddiqi
Abstract:
Continuing the scaling of quantum computers hinges on building classical control hardware pipelines that are scalable, extensible, and provide real time response. The instruction set architecture (ISA) of the control processor provides functional abstractions that map high-level semantics of quantum programming languages to low-level pulse generation by hardware. In this paper, we provide a method…
▽ More
Continuing the scaling of quantum computers hinges on building classical control hardware pipelines that are scalable, extensible, and provide real time response. The instruction set architecture (ISA) of the control processor provides functional abstractions that map high-level semantics of quantum programming languages to low-level pulse generation by hardware. In this paper, we provide a methodology to quantitatively assess the effectiveness of the ISA to encode quantum circuits for intermediate-scale quantum devices with O($10^2$) qubits. The characterization model that we define reflects performance, the ability to meet timing constraint implications, scalability for future quantum chips, and other important considerations making them useful guides for future designs. Using our methodology, we propose scalar (QUASAR) and vector (qV) quantum ISAs as extensions and compare them with other ISAs in metrics such as circuit encoding efficiency, the ability to meet real-time gate cycle requirements of quantum chips, and the ability to scale to more qubits.
△ Less
Submitted 3 December, 2020; v1 submitted 25 September, 2019;
originally announced September 2019.