Alya towards Exascale: Optimal OpenACC Performance of the Navier-Stokes Finite Element Assembly on GPUs

Authors: Herbert Owen, Dominik Ernst, Thomas Gruber, Oriol Lehmkuhl, Guillaume Houzeaux, Lucas Gasparino, Gerhard Wellein

Abstract: This paper addresses the challenge of providing portable and highly efficient code structures for CPU and GPU architectures. We choose the assembly of the right-hand term in the incompressible flow module of the High-Performance Computational Mechanics code Alya, which is one of the two CFD codes in the Unified European Benchmark Suite. Starting from an efficient CPU-code and a related OpenACC-port for GPUs, we successively investigate performance potentials arising from code specialization, algorithmic restructuring and low-level optimizations. We demonstrate that only the combination of these different dimensions of runtime optimization unveils the full performance potential on the GPU and CPU. Roofline-based performance modelling is applied in this process, and we demonstrate the need to investigate new optimization strategies if a classical roofline limit such as memory bandwidth utilization is achieved, rather than stopping the process. The final unified OpenACC-based implementation boosts performance by more than 50x on an NVIDIA A100 GPU (achieving approximately 2.5 TF/s FP64) and a further factor of 5x for an Intel Icelake based CPU-node (achieving approximately 1.0 TF/s FP64). The insights gained in our manual approach lay the groundwork for implementing unified but still highly efficient code structures for related kernels in Alya and other applications. These can be realized by manual coding or automatic code generation frameworks.

Submitted 22 January, 2024; originally announced March 2024.

arXiv:2401.08447

Monitoring the development of CFD applications on unstable HPC platforms

Authors: Damien Dosimont, Guillaume Houzeaux

Abstract: We tackle the challenging task of monitoring the performance of CFD applications throughout their development on unstable HPC platforms. We have designed and implemented a monitoring framework, integrated at the end of a CI-CD pipeline. Measurements retrieved during the automatic execution of production simulations are analyzed within a visual analytics interface we developed, providing advanced visualizations and interaction. We have validated this approach by monitoring the CFD code Alya over two years, detecting and resolving issues related to the platform, and highlighting performance improvements.

Submitted 16 January, 2024; originally announced January 2024.

Comments: ParCFD2023 34th International Conference on Parallel Computational Fluid Dynamics, May 29-31 2023, Cuenca, Ecuador

arXiv:2210.11917

A portable coding strategy to exploit vectorization on combustion simulations

Authors: Fabio Banchelli, Guillermo Oyarzun, Marta Garcia-Gasulla, Filippo Mantovani, Ambrus Both, Guillaume Houzeaux, Daniel Mira

Abstract: The complexity of combustion simulations demands the latest high-performance computing tools to accelerate their time-to-solution. A current trend on HPC systems is the utilization of CPUs with SIMD or vector extensions to exploit data parallelism. Our work proposes a strategy to improve the automatic vectorization of finite element-based scientific codes. The approach applies a parametric configuration to the data structures to help the compiler detect the blocks of code that can take advantage of vector computation while keeping the code portable. The computational impact of this methodology on the different stages of a CFD solver is analyzed in detail on the PRECCINSTA burner simulation. Our parametric implementation has proven to help the compiler generate more vector instructions in the assembly operation: this results in a reduction of up to 9.3 times in the total number of executed instructions while keeping the Instructions Per Cycle and the CPU frequency constant. The proposed strategy improves the performance of the CFD case under study by up to 4.67 times on the MareNostrum 4 supercomputer.

Submitted 21 October, 2022; originally announced October 2022.

arXiv:2207.14292

doi 10.1016/j.compstruc.2022.106862

A parallel algorithm for unilateral contact problems

Authors: G. Guillamet, M. Rivero, M. Zavala-Aké, M. Vázquez, G. Houzeaux, S. Oller

Abstract: In this paper, we introduce a novel parallel contact algorithm designed to run efficiently on High-Performance Computing supercomputers. Particular emphasis is put on its computational implementation in a multiphysics finite element code. The algorithm is based on the method of partial Dirichlet-Neumann boundary conditions and is capable of numerically solving a nonlinear contact problem between rigid and deformable bodies in a fully parallel framework. Its distinctive characteristic is that the contact is tackled as a coupled problem, in which the contacting bodies are treated separately, in a staggered way. Then, the coupling is performed through the exchange of boundary conditions at the contact interface following a Gauss-Seidel strategy. To validate this algorithm, we conducted several benchmark tests, comparing the proposed solution against theoretical and other numerical solutions. Finally, we evaluated the parallel performance of the proposed algorithm in a real impact test to show its capabilities for large-scale applications.

Submitted 1 August, 2022; v1 submitted 27 July, 2022; originally announced July 2022.

Comments: 26 pages, 23 figures

MSC Class: 74M15; 74M20; 74F99; 74S05; 65Y05; ACM Class: J.2; I.6.3

Journal ref: Computers & Structures, Volume 271 (2022) 106862

arXiv:2112.09560

doi 10.1016/j.compfluid.2022.105577

Dynamic resource allocation for efficient parallel CFD simulations

Authors: G. Houzeaux, R. M. Badia, R. Borrell, D. Dosimont, J. Ejarque, M. Garcia-Gasulla, V. López

Abstract: CFD users of supercomputers usually resort to rule-of-thumb methods to select the number of subdomains (partitions) when relying on MPI-based parallelization. One common approach is to set a minimum number of elements or cells per subdomain, below which the parallel efficiency of the code is "known" to fall below a subjective level, say 80%. The situation is even worse when the user is not aware of the "good" practices for the given code, and a huge amount of resources can thus be wasted. This work presents an elastic computing methodology to automatically adapt at runtime the resources allocated to a simulation. The criterion to control the required resources is based on a runtime measure of the communication efficiency of the execution. According to some analytical estimates, the resources are then expanded or reduced to fulfil this criterion and eventually execute an efficient simulation.

Submitted 29 June, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

Comments: 27 pages, 15 figures

MSC Class: 35-04 ACM Class: D.1; D.2; J.2; J.6

arXiv:2107.11541

Performance assessment of CUDA and OpenACC in large scale combustion simulations

Authors: Guillermo Oyarzun, Daniel Mira, Guillaume Houzeaux

Abstract: GPUs have climbed to the top of supercomputer systems, making life harder for many legacy scientific codes. Nowadays, many recipes are used to port such codes, with no clarity about which is the best option. We present a comparative analysis of the two most common approaches, CUDA and OpenACC, in the multi-physics CFD code Alya. Our focus is on combustion problems, which are among the most computationally demanding CFD simulations. The most computing-intensive parts of the code were analyzed in detail. New data structures for the matrix assembly step have been created to facilitate a SIMD execution that benefits vectorization in the CPU and stream processing in the GPU. As a result, the CPU code has improved its performance by up to 25%. In GPU execution, CUDA has proven to be up to 2 times faster than OpenACC for the assembly of the matrix. In contrast, similar performance has been obtained in the kernels related to vector operations used in the linear solver, where there is minimal memory reuse.

Submitted 31 July, 2021; v1 submitted 24 July, 2021; originally announced July 2021.

arXiv:2005.13439

doi 10.1016/j.jfluidstructs.2020.103009

HPC compact quasi-Newton algorithm for interface problems

Authors: A. Santiago, M. Zavala-Aké, R. Borrell, G. Houzeaux

Abstract: In this work we present a robust interface coupling algorithm called Compact Interface quasi-Newton (CIQN). It is designed for computationally intensive applications using an MPI multi-code partitioned scheme. The algorithm allows information from previous time steps to be reused, a feature that has previously been proposed to accelerate convergence. Through algebraic manipulation, an efficient usage of the computational resources is achieved by avoiding the construction of dense matrices, reducing every multiplication to a matrix-vector product, and reusing the computationally expensive loops. This leads to a compact version of the original quasi-Newton algorithm. Together with efficient communication, this yields efficient scalability up to 4800 cores, as we show in this paper. Three examples with qualitatively different dynamics are shown to prove that the algorithm can efficiently deal with added mass instability and two-field coupled problems. We also show how reusing histories and filtering does not necessarily make a more robust scheme and, finally, we prove the necessity of this HPC version of the algorithm. The novelty of this article lies in the HPC-focused implementation of the algorithm, detailing how to fuse and combine the composing blocks to obtain a scalable MPI implementation. Such an implementation is mandatory in large-scale cases, for which the contact surface cannot be stored in a single computational node, or the number of contact nodes is not negligible compared with the size of the domain. © 2020 Elsevier. This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/

Submitted 1 June, 2020; v1 submitted 22 May, 2020; originally announced May 2020.

Comments: 33 pages: 23 manuscript, 10 appendix. 16 figures: 4 manuscript, 12 appendix. 10 Tables: 3 manuscript, 7 appendix

MSC Class: 68U20; 00A72; 68Q85; 65-04; 74F10

Journal ref: Journal of Fluids and Structures (2020) 103009

arXiv:2005.05899

doi 10.1016/j.future.2020.01.045

Heterogeneous CPU/GPU co-execution of CFD simulations on the POWER9 architecture: Application to airplane aerodynamics

Authors: R. Borrell, D. Dosimont, M. Garcia-Gasulla, G. Houzeaux, O. Lehmkuhl, V. Mehta, H. Owen, M. Vazquez, G. Oyarzun

Abstract: High fidelity Computational Fluid Dynamics simulations are generally associated with large computing requirements, which are progressively acute with each new generation of supercomputers. However, significant research efforts are required to unlock the computing power of leading-edge systems, currently referred to as pre-Exascale systems, based on increasingly complex architectures. In this paper, we present the approach implemented in the computational mechanics code Alya. We describe in detail the parallelization strategy implemented to fully exploit the different levels of parallelism, together with a novel co-execution method for the efficient utilization of heterogeneous CPU/GPU architectures. The latter is based on a multi-code co-execution approach with a dynamic load balancing mechanism. The assessment of the performance of all the proposed strategies has been carried out for airplane simulations on the POWER9 architecture accelerated with NVIDIA Volta V100 GPUs.

Submitted 6 July, 2020; v1 submitted 12 May, 2020; originally announced May 2020.

Journal ref: Future Generation Computer Systems, Volume 107, 2020, Pages 31-48

arXiv:1805.03949

doi 10.1080/10618562.2019.1617856

MPI+X: task-based parallelization and dynamic load balance of finite element assembly

Authors: Marta Garcia-Gasulla, Guillaume Houzeaux, Roger Ferrer, Antoni Artigues, Victor López, Jesús Labarta, Mariano Vázquez

Abstract: The main computing tasks of a finite element (FE) code for solving partial differential equations (PDEs) are the algebraic system assembly and the iterative solver. This work focuses on the first task, in the context of a hybrid MPI+X paradigm. Although we will describe algorithms in the FE context, a similar strategy can be straightforwardly applied to other discretization methods, like the finite volume method. The matrix assembly consists of a loop over the elements of the MPI partition to compute element matrices and right-hand sides and their assembly into the system local to each MPI partition. In an MPI+X hybrid parallelism context, X has consisted traditionally of loop parallelism using OpenMP. Several strategies have been proposed in the literature to implement this loop parallelism, like coloring or substructuring techniques to circumvent the race condition that appears when assembling the element system into the local system. The main drawback of the first technique is the decrease of the IPC due to bad spatial locality. The second technique avoids this issue but requires extensive changes in the implementation, which can be cumbersome when several element loops should be treated. We propose an alternative, based on the task parallelism of the element loop using some extensions to the OpenMP programming model. The taskification of the assembly solves both aforementioned problems. In addition, dynamic load balance will be applied using the DLB library, especially efficient in the presence of hybrid meshes, where the relative costs of the different elements are impossible to estimate a priori. This paper presents the proposed methodology, its implementation and its validation through the solution of large computational mechanics problems up to 16k cores.

Submitted 9 May, 2018; originally announced May 2018.

Showing 1–9 of 9 results for author: Houzeaux, G