-
Optimized thread-block arrangement in a GPU implementation of a linear solver for atmospheric chemistry mechanisms
Authors:
Christian Guzman Ruiz,
Mario Acosta,
Oriol Jorba,
Eduardo Cesar Galobardes,
Matthew Dawson,
Guillermo Oyarzun,
Carlos Pérez García-Pando,
Kim Serradell
Abstract:
Earth system models (ESM) demand significant hardware resources and energy consumption to solve atmospheric chemistry processes. Recent studies have shown improved performance from running these models on GPU accelerators. Nonetheless, there is room for improvement in exploiting even more GPU resources.
This study proposes an optimized distribution of the chemical solver's computational load on the GPU, named Block-cells. Additionally, we evaluate different configurations for distributing the computational load in an NVIDIA GPU.
We use the linear solver from the Chemistry Across Multiple Phases (CAMP) framework as our test bed, with an intermediate-complexity chemical mechanism under typical atmospheric conditions. Results demonstrate a 35x speedup compared to the single-CPU-thread reference case. Even when the reference case uses the full resources of the node (40 physical cores), the Block-cells version outperforms it by 50%. The Block-cells approach shows promise in alleviating the computational burden of chemical solvers on GPU architectures.
Submitted 27 May, 2024;
originally announced May 2024.
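The abstract does not say how Block-cells assigns work to threads, so the following is only an illustrative sketch of one possible arrangement: one thread per chemical species, with as many whole grid cells as fit packed into each thread block. The function names, the 1024-thread block limit, and the species count used in the example are assumptions for illustration, not values from the paper.

```python
def block_cells_layout(n_cells, n_species, max_threads_per_block=1024):
    """Pack whole cells into each thread block, one thread per species.

    Hypothetical layout for illustration; not taken from the paper.
    Returns (cells_per_block, threads_per_block, n_blocks).
    """
    if n_species > max_threads_per_block:
        raise ValueError("a single cell does not fit in one thread block")
    cells_per_block = max_threads_per_block // n_species
    threads_per_block = cells_per_block * n_species
    n_blocks = -(-n_cells // cells_per_block)  # ceiling division
    return cells_per_block, threads_per_block, n_blocks


def thread_to_work(block_idx, thread_idx, n_cells, n_species, cells_per_block):
    """Map a (block, thread) pair to (cell, species).

    Returns None for threads in the last block that fall past the final cell.
    """
    cell = block_idx * cells_per_block + thread_idx // n_species
    if cell >= n_cells:
        return None
    return cell, thread_idx % n_species


if __name__ == "__main__":
    # Assumed problem size: 10000 cells, 156 species per cell.
    cpb, tpb, nb = block_cells_layout(10000, 156)
    print(cpb, tpb, nb)
    print(thread_to_work(1, 157, 10000, 156, cpb))
```

Under these assumptions, a block holds several cells instead of one, so fewer threads sit idle when the species count is well below the block size limit.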
-
A portable coding strategy to exploit vectorization on combustion simulations
Authors:
Fabio Banchelli,
Guillermo Oyarzun,
Marta Garcia-Gasulla,
Filippo Mantovani,
Ambrus Both,
Guillaume Houzeaux,
Daniel Mira
Abstract:
The complexity of combustion simulations demands the latest high-performance computing tools to accelerate their time to solution. A current trend in HPC systems is the use of CPUs with SIMD or vector extensions to exploit data parallelism. Our work proposes a strategy to improve the automatic vectorization of finite element-based scientific codes. The approach applies a parametric configuration to the data structures to help the compiler detect the blocks of code that can take advantage of vector computation while keeping the code portable. The computational impact of this methodology on the different stages of a CFD solver is analyzed in detail on the PRECCINSTA burner simulation. Our parametric implementation has proven to help the compiler generate more vector instructions in the assembly operation: this reduces the total number of executed instructions by up to 9.3 times while keeping the Instructions Per Cycle and the CPU frequency constant. The proposed strategy improves the performance of the CFD case under study by up to 4.67 times on the MareNostrum 4 supercomputer.
Submitted 21 October, 2022;
originally announced October 2022.
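As a rough illustration of what a parametric data layout can look like, the sketch below groups per-element values into fixed-size packs whose length is a tunable parameter. The names `pack_elements`, `scale_packs`, and `unpack`, and the padding scheme, are assumptions made for this example and are not taken from the paper; Python itself does not vectorize these loops, so the sketch only shows the layout idea.

```python
def pack_elements(values, vector_size, pad=0.0):
    """Group values into packs of exactly vector_size entries.

    The last pack is padded so that every pack has the same length.
    """
    packs = []
    for start in range(0, len(values), vector_size):
        pack = list(values[start:start + vector_size])
        pack.extend([pad] * (vector_size - len(pack)))
        packs.append(pack)
    return packs


def scale_packs(packs, factor):
    """Apply the same operation to every lane of every pack.

    The inner loop has a fixed trip count (the pack length), which is the
    kind of loop shape that compilers of compiled languages can vectorize.
    """
    return [[factor * v for v in pack] for pack in packs]


def unpack(packs, n_values):
    """Flatten the packs and drop the padding entries."""
    flat = [v for pack in packs for v in pack]
    return flat[:n_values]


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    packs = pack_elements(data, vector_size=4)
    print(packs)
    print(unpack(scale_packs(packs, 2.0), len(data)))
```

Changing `vector_size` adapts the same code to a different vector width, which is the portability aspect the abstract emphasizes.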
-
An FPGA cached sparse matrix vector product (SpMV) for unstructured computational fluid dynamics simulations
Authors:
Guillermo Oyarzun,
Daniel Peyrolon,
Carlos Alvarez,
Xavier Martorell
Abstract:
Field Programmable Gate Arrays generate algorithm-specific architectures that improve the code's FLOP-per-watt ratio. Such devices are regaining interest due to the rise of new tools that facilitate their programming, such as OmpSs. The computational fluid dynamics community is always investigating new architectures that can improve the performance of its algorithms. Commonly, those algorithms have a low arithmetic intensity and only reach a small percentage of the peak performance. The sparse matrix-vector multiplication is one of the most time-consuming operations in unstructured simulations. The matrix's sparsity pattern determines the indirect memory accesses to the multiplying vector. This data path is hard to predict, making traditional implementations fail. In this work, we present an FPGA architecture that maximizes the vector's reuse by introducing a cache-like architecture. The cache is implemented as a circular list that keeps the vector components in BRAM while they are needed. Following this strategy, a speedup of up to 16 times is obtained compared to a naive implementation of the algorithm.
Submitted 24 July, 2021;
originally announced July 2021.
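The indirect access that the abstract refers to is easiest to see in a plain sparse matrix-vector product. The sketch below uses the standard compressed sparse row (CSR) format, which is an assumption here since the abstract does not name the storage format; it shows the baseline operation only, not the FPGA cache design.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR format.

    row_ptr[i]:row_ptr[i + 1] delimits the nonzeros of row i,
    col_idx holds their column indices and vals their values.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is read through col_idx, so the access pattern into x
            # depends entirely on the sparsity pattern of the matrix.
            y[row] += vals[k] * x[col_idx[k]]
    return y


if __name__ == "__main__":
    # A = [[4, 1, 0],
    #      [0, 3, 0],
    #      [2, 0, 5]]
    row_ptr = [0, 2, 3, 5]
    col_idx = [0, 1, 1, 0, 2]
    vals = [4.0, 1.0, 3.0, 2.0, 5.0]
    print(spmv_csr(row_ptr, col_idx, vals, [1.0, 2.0, 3.0]))
```

The matrix values and indices are read sequentially, while the entries of `x` are not, which is why reuse of `x` is the part worth caching.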
-
Performance assessment of CUDA and OpenACC in large scale combustion simulations
Authors:
Guillermo Oyarzun,
Daniel Mira,
Guillaume Houzeaux
Abstract:
GPUs have climbed to the top of supercomputer systems, making life harder for many legacy scientific codes. Nowadays, many approaches are used to port such codes, without any clarity about which is the best option. We present a comparative analysis of the two most common approaches, CUDA and OpenACC, in the multi-physics CFD code Alya. Our focus is on combustion problems, which are among the most computationally demanding CFD simulations. The most computing-intensive parts of the code were analyzed in detail. New data structures for the matrix assembly step have been created to facilitate a SIMD execution that benefits vectorization on the CPU and stream processing on the GPU. As a result, the CPU code has improved its performance by up to 25%. In GPU execution, CUDA has proven to be up to 2 times faster than OpenACC for the assembly of the matrix. In contrast, similar performance has been obtained in the kernels related to vector operations used in the linear solver, where there is minimal memory reuse.
Submitted 31 July, 2021; v1 submitted 24 July, 2021;
originally announced July 2021.
-
Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: Application to airplane aerodynamics
Authors:
R. Borrell,
D. Dosimont,
M. Garcia-Gasulla,
G. Houzeaux,
O. Lehmkuhl,
V. Mehta,
H. Owen,
M. Vazquez,
G. Oyarzun
Abstract:
High-fidelity Computational Fluid Dynamics simulations are generally associated with large computing requirements, which become progressively more acute with each new generation of supercomputers. However, significant research efforts are required to unlock the computing power of leading-edge systems, currently referred to as pre-Exascale systems, based on increasingly complex architectures. In this paper, we present the approach implemented in the computational mechanics code Alya. We describe in detail the parallelization strategy implemented to fully exploit the different levels of parallelism, together with a novel co-execution method for the efficient utilization of heterogeneous CPU/GPU architectures. The latter is based on a multi-code co-execution approach with a dynamic load balancing mechanism. The performance of all the proposed strategies has been assessed for airplane simulations on the POWER9 architecture accelerated with NVIDIA Volta V100 GPUs.
Submitted 6 July, 2020; v1 submitted 12 May, 2020;
originally announced May 2020.
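The abstract does not describe how the dynamic load balancing works. One common scheme, shown here purely as an assumed example, is to measure how long each device took on its share of the work in the last step and then reassign shares in proportion to the observed throughput. The function name `rebalance_shares` and the two-device example are hypothetical.

```python
def rebalance_shares(shares, times):
    """Return new work fractions proportional to measured throughput.

    shares: fraction of the total work each device processed last step.
    times:  wall-clock time each device needed for its share.
    Illustrative scheme only; not the mechanism described in the paper.
    """
    if len(shares) != len(times):
        raise ValueError("shares and times must have the same length")
    if any(t <= 0 for t in times):
        raise ValueError("measured times must be positive")
    rates = [s / t for s, t in zip(shares, times)]  # work per unit time
    total = sum(rates)
    return [r / total for r in rates]


if __name__ == "__main__":
    # Assumed measurement: equal split, the first device took 4x longer.
    print(rebalance_shares([0.5, 0.5], [4.0, 1.0]))
```

If the measured throughput is stable, the new split makes both devices finish at about the same time, and repeating the update each step lets the split follow changes in the workload.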