-
CloverLeaf on Intel Multi-Core CPUs: A Case Study in Write-Allocate Evasion
Authors:
Jan Laukemann,
Thomas Gruber,
Georg Hager,
Dossay Oryspayev,
Gerhard Wellein
Abstract:
In this paper we analyze the MPI-only version of the CloverLeaf code from the SPEChpc 2021 benchmark suite on recent Intel Xeon "Ice Lake" and "Sapphire Rapids" server CPUs. We observe peculiar breakdowns in performance when the number of processes is prime. Investigating this effect, we create first-principles data traffic models for each of the stencil-like hotspot loops. With application measur…
▽ More
In this paper we analyze the MPI-only version of the CloverLeaf code from the SPEChpc 2021 benchmark suite on recent Intel Xeon "Ice Lake" and "Sapphire Rapids" server CPUs. We observe peculiar breakdowns in performance when the number of processes is prime. Investigating this effect, we create first-principles data traffic models for each of the stencil-like hotspot loops. With application measurements and microbenchmarks to study memory data traffic behavior, we can connect the breakdowns to SpecI2M, a new write-allocate evasion feature in current Intel CPUs. For serial and full-node cases we are able to predict the memory data volume analytically with an error of a few percent. We find that if the number of processes is prime, SpecI2M fails to work properly, which we can attribute to short inner loops emerging from the one-dimensional domain decomposition in this case. We can also rule out other possible causes of the prime number effect, such as breaking layer conditions, MPI communication overhead, and load imbalance.
△ Less
Submitted 17 May, 2024; v1 submitted 8 November, 2023;
originally announced November 2023.
-
Accelerating quantum many-body configuration interaction with directives
Authors:
Brandon Cook,
Patrick J. Fasano,
Pieter Maris,
Chao Yang,
Dossay Oryspayev
Abstract:
Many-Fermion Dynamics-nuclear, or MFDn, is a configuration interaction (CI) code for nuclear structure calculations. It is a platform-independent Fortran 90 code using a hybrid MPI+X programming model. For CPU platforms the application has a robust and optimized OpenMP implementation for shared memory parallelism. As part of the NESAP application readiness program for NERSC's latest Perlmutter sys…
▽ More
Many-Fermion Dynamics-nuclear, or MFDn, is a configuration interaction (CI) code for nuclear structure calculations. It is a platform-independent Fortran 90 code using a hybrid MPI+X programming model. For CPU platforms the application has a robust and optimized OpenMP implementation for shared memory parallelism. As part of the NESAP application readiness program for NERSC's latest Perlmutter system, MFDn has been updated to take advantage of accelerators. The current mainline GPU port is based on OpenACC. In this work we describe some of the key challenges of creating an efficient GPU implementation. Additionally, we compare the support of OpenMP and OpenACC on AMD and NVIDIA GPUs.
△ Less
Submitted 20 October, 2021;
originally announced October 2021.
-
Accelerating an Iterative Eigensolver for Nuclear Structure Configuration Interaction Calculations on GPUs using OpenACC
Authors:
Pieter Maris,
Chao Yang,
Dossay Oryspayev,
Brandon Cook
Abstract:
To accelerate the solution of large eigenvalue problems arising from many-body calculations in nuclear physics on distributed-memory parallel systems equipped with general-purpose Graphic Processing Units (GPUs), we modified a previously developed hybrid MPI/OpenMP implementation of an eigensolver written in FORTRAN 90 by using an OpenACC directives based programming model. Such an approach requir…
▽ More
To accelerate the solution of large eigenvalue problems arising from many-body calculations in nuclear physics on distributed-memory parallel systems equipped with general-purpose Graphic Processing Units (GPUs), we modified a previously developed hybrid MPI/OpenMP implementation of an eigensolver written in FORTRAN 90 by using an OpenACC directives based programming model. Such an approach requires making minimal changes to the original code and enables a smooth migration of large-scale nuclear structure simulations from a distributed-memory many-core CPU system to a distributed GPU system. However, in order to make the OpenACC based eigensolver run efficiently on GPUs, we need to take into account the architectural differences between a many-core CPU and a GPU device. Consequently, the optimal way to insert OpenACC directives may be different from the original way of inserting OpenMP directives. We point out these differences in the implementation of sparse matrix-matrix multiplications (SpMM), which constitutes the main cost of the eigensolver, as well as other differences in the preconditioning step and dense linear algebra operations. We compare the performance of the OpenACC based implementation executed on multiple GPUs with the performance on distributed-memory many-core CPUs, and demonstrate significant speedup achieved on GPUs compared to the on-node performance of a many-core CPU. We also show that the overall performance improvement of the eigensolver on multiple GPUs is more modest due to the communication overhead among different MPI ranks.
△ Less
Submitted 1 September, 2021;
originally announced September 2021.
-
Comparing OpenMP Implementations With Applications Across A64FX Platforms
Authors:
Benjamin Michalowicz,
Eric Raut,
Yan Kang,
Tony Curtis,
Barbara Chapman,
Dossay Oryspayev
Abstract:
The development of the A64FX processor by Fujitsu has created a massive innovation in High-Performance Computing and the birth of Fugaku: the current world's fastest supercomputer. A variety of tools are used to analyze the run-times and performances of several applications, and in particular, how these applications scale on the A64FX processor. We examine the performance and behavior of applicati…
▽ More
The development of the A64FX processor by Fujitsu has created a massive innovation in High-Performance Computing and the birth of Fugaku: the current world's fastest supercomputer. A variety of tools are used to analyze the run-times and performances of several applications, and in particular, how these applications scale on the A64FX processor. We examine the performance and behavior of applications through OpenMP scaling and how their performance differs across different compilers on the new Ookami cluster at Stony Brook University as well as the Fugaku supercomputer at RIKEN in Japan.
△ Less
Submitted 21 July, 2021;
originally announced July 2021.
-
Comparing the behavior of OpenMP Implementations with various Applications on two different Fujitsu A64FX platforms
Authors:
Benjamin Michalowicz,
Eric Raut,
Yan Kang,
Tony Curtis,
Barbara Chapman,
Dossay Oryspayev
Abstract:
The development of the A64FX processor by Fujitsu has been a massive innovation in vectorized processors and led to Fugaku: the current world's fastest supercomputer. We use a variety of tools to analyze the behavior and performance of several OpenMP applications with different compilers, and how these applications scale on the different A64FX processors on clusters at Stony Brook University and R…
▽ More
The development of the A64FX processor by Fujitsu has been a massive innovation in vectorized processors and led to Fugaku: the current world's fastest supercomputer. We use a variety of tools to analyze the behavior and performance of several OpenMP applications with different compilers, and how these applications scale on the different A64FX processors on clusters at Stony Brook University and RIKEN.
△ Less
Submitted 17 June, 2021;
originally announced June 2021.
-
Ookami: Deployment and Initial Experiences
Authors:
Andrew Burford,
Alan C. Calder,
David Carlson,
Barbara Chapman,
Firat CoŞKun,
Tony Curtis,
Catherine Feldman,
Robert J. Harrison,
Yan Kang,
Benjamin Michalow-Icz,
Eric Raut,
Eva Siegmann,
Daniel G. Wood,
Robert L. Deleon,
Mathew Jones,
Nikolay A. Simakov,
Joseph P. White,
Dossay Oryspayev
Abstract:
Ookami is a computer technology testbed supported by the United States National Science Foundation. It provides researchers with access to the A64FX processor developed by Fujitsu in collaboration with RIKΞN for the Japanese path to exascale computing, as deployed in Fugaku, the fastest computer in the world. By focusing on crucial architectural details, the ARM-based, multi-core, 512-bit SIMD-vec…
▽ More
Ookami is a computer technology testbed supported by the United States National Science Foundation. It provides researchers with access to the A64FX processor developed by Fujitsu in collaboration with RIKΞN for the Japanese path to exascale computing, as deployed in Fugaku, the fastest computer in the world. By focusing on crucial architectural details, the ARM-based, multi-core, 512-bit SIMD-vector processor with ultrahigh-bandwidth memory promises to retain familiar and successful programming models while achieving very high performance for a wide range of applications. We review relevant technology and system details, and the main body of the paper focuses on initial experiences with the hardware and software ecosystem for micro-benchmarks, mini-apps, and full applications, and starts to answer questions about where such technologies fit into the NSF ecosystem.
△ Less
Submitted 16 June, 2021;
originally announced June 2021.