Early Performance Results on 4th Gen Intel(R) Xeon (R) Scalable Processors with DDR and Intel(R) Xeon(R) processors, codenamed Sapphire Rapids with HBM
Authors:
Galen M. Shipman,
Sriram Swaminarayan,
Gary Grider,
Jim Lujan,
R. Joseph Zerr
Abstract:
The Crossroads supercomputer was designed to simulate some of the most complex physical devices in the world. These simulations routinely require half a petabyte or more of system memory and run on thousands of compute nodes for months at a time on the most powerful supercomputers. Improvements in time to solution for these workloads can have a major impact on our mission capabilities. In this paper we present early results of representative application workloads on 4th Gen Intel Xeon and Intel Xeon Processors codenamed Sapphire Rapids with HBM. These results demonstrate an extremely promising 8.57x improvement (node to node) over our prior generation Intel Broadwell (BDW) based HPC systems. No code modifications were required to achieve this speedup, providing a compelling path forward toward major reductions in time to solution and toward increases in the complexity of physical systems that can be simulated in the future.
Submitted 10 November, 2022;
originally announced November 2022.
Parthenon -- a performance portable block-structured adaptive mesh refinement framework
Authors:
Philipp Grete,
Joshua C. Dolence,
Jonah M. Miller,
Joshua Brown,
Ben Ryan,
Andrew Gaspar,
Forrest Glines,
Sriram Swaminarayan,
Jonas Lippuner,
Clell J. Solomon,
Galen Shipman,
Christoph Junghans,
Daniel Holladay,
James M. Stone,
Luke F. Roberts
Abstract:
On the path to exascale, the landscape of computer device architectures and corresponding programming models has become much more diverse. While various low-level performance portable programming models are available, support at the application level lags behind. To address this issue, we present the performance portable block-structured adaptive mesh refinement (AMR) framework Parthenon, derived from the well-tested and widely used Athena++ astrophysical magnetohydrodynamics code, but generalized to serve as the foundation for a variety of downstream multi-physics codes. Parthenon adopts the Kokkos programming model and provides various levels of abstraction, from multi-dimensional variables, to packages defining and separating components, to the launching of parallel compute kernels. Parthenon allocates all data in device memory to reduce data movement, supports the logical packing of variables and mesh blocks to reduce kernel launch overhead, and employs one-sided, asynchronous MPI calls to reduce communication overhead in multi-node simulations. Using a hydrodynamics miniapp, we demonstrate weak and strong scaling on various architectures including AMD and NVIDIA GPUs, Intel and AMD x86 CPUs, IBM Power9 CPUs, as well as Fujitsu A64FX CPUs. At the largest scale on Frontier (the first TOP500 exascale machine), the miniapp reaches a total of $1.7\times10^{13}$ zone-cycles/s on 9,216 nodes (73,728 logical GPUs) at ~92% weak scaling parallel efficiency (starting from a single node). In combination with being an open, collaborative project, this makes Parthenon an ideal framework to target exascale simulations in which the downstream developers can focus on their specific application rather than on the complexity of handling massively-parallel, device-accelerated AMR.
Submitted 21 November, 2022; v1 submitted 24 February, 2022;
originally announced February 2022.
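The Frontier figures quoted in the Parthenon abstract can be unpacked with a little arithmetic. The sketch below is illustrative only: it assumes the usual definition of weak-scaling parallel efficiency (throughput on N nodes divided by N times the single-node throughput), treats the quoted "~92%" as exactly 0.92, and uses variable and function names that are hypothetical rather than taken from the paper. The implied single-node rate is therefore a rough back-of-the-envelope estimate, not a reported measurement.

```python
# Figures quoted in the abstract for the largest Frontier run.
TOTAL_ZONE_CYCLES_PER_S = 1.7e13   # aggregate miniapp throughput
NODES = 9216
LOGICAL_GPUS = 73728
QUOTED_EFFICIENCY = 0.92           # assumption: "~92%" taken as exactly 0.92


def weak_scaling_efficiency(rate_n, rate_1, n):
    """Throughput on n nodes relative to n times the single-node throughput."""
    return rate_n / (n * rate_1)


# Quantities that follow directly from the quoted numbers.
gpus_per_node = LOGICAL_GPUS // NODES
rate_per_node = TOTAL_ZONE_CYCLES_PER_S / NODES
rate_per_gpu = TOTAL_ZONE_CYCLES_PER_S / LOGICAL_GPUS

# Estimate only: the single-node rate implied by the quoted efficiency.
implied_single_node_rate = rate_per_node / QUOTED_EFFICIENCY

if __name__ == "__main__":
    print(f"logical GPUs per node:        {gpus_per_node}")
    print(f"zone-cycles/s per node:       {rate_per_node:.3e}")
    print(f"zone-cycles/s per GPU:        {rate_per_gpu:.3e}")
    print(f"implied single-node rate:     {implied_single_node_rate:.3e}")
```

Under these assumptions each node hosts 8 logical GPUs, and the per-node throughput at full scale is on the order of $10^{9}$ zone-cycles/s.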