Search | arXiv e-print repository

arXiv:2306.14011

Machine Learning-driven Autotuning of Graphics Processing Unit Accelerated Computational Fluid Dynamics for Enhanced Performance

Authors: Weicheng Xue, Christopher John Roy

Abstract: Optimizing the performance of computational fluid dynamics (CFD) applications accelerated by graphics processing units (GPUs) is crucial for efficient simulations. In this study, we employed a machine learning-based autotuning technique to optimize 14 key parameters related to GPU kernel scheduling, including the number of thread blocks and threads within a block. Our approach utilizes fully connected neural networks as the underlying machine learning model, with the tuning parameters as inputs to the neural networks and the actual execution time of a simulation as the outputs. To assess the effectiveness of our autotuning approach, we conducted experiments on three different types of GPUs, with computational speeds ranging from low to high. We performed independent training for each GPU model and also explored combined training across multiple GPU models. By leveraging artificial neural networks, our autotuning technique achieved remarkable results in tuning a wide range of parameters, leading to enhanced performance for a CFD code. Importantly, our approach demonstrated its efficacy while requiring only a small fraction of samples from the large parameter search space. This efficiency is attributed to the effectiveness of the fully connected neural networks in capturing the complex relationships between the parameter settings and the resulting performance. Overall, our study showcases the potential of machine learning, specifically fully connected neural networks, in autotuning GPU-accelerated CFD codes. By leveraging this approach, researchers and practitioners can achieve high performance in scientific simulations with optimized parameter configurations.

Submitted 20 February, 2024; v1 submitted 24 June, 2023; originally announced June 2023.

arXiv:2305.18057

CPU-GPU Heterogeneous Code Acceleration of a Finite Volume Computational Fluid Dynamics Solver

Authors: Weicheng Xue, Hongyu Wang, Christopher J. Roy

Abstract: This work deals with the CPU-GPU heterogeneous code acceleration of a finite-volume CFD solver utilizing multiple CPUs and GPUs at the same time. First, a high-level description of the CFD solver called SENSEI, the discretization of SENSEI, and the CPU-GPU heterogeneous computing workflow in SENSEI leveraging MPI and OpenACC are given. Then, a performance model for CPU-GPU heterogeneous computing requiring ghost cell exchange is proposed to help estimate the performance of the heterogeneous implementation. The scaling performance of the CPU-GPU heterogeneous computing and its comparison with the pure multi-CPU/GPU performance for a supersonic inlet test case are presented to display the advantages of leveraging the computational power of both the CPU and the GPU. Using CPUs and GPUs together as workers, the performance can be improved further compared to using only CPUs or only GPUs, and the advantages can be fairly estimated by the performance model proposed in this work. Finally, conclusions are drawn to provide 1) suggestions for application users who are interested in leveraging the computational power of the CPU and GPU to accelerate their own scientific computing simulations and 2) feedback for hardware architects who are interested in designing a better CPU-GPU heterogeneous system for heterogeneous computing.

Submitted 29 May, 2023; originally announced May 2023.

arXiv:2012.02925

doi 10.1016/j.jpdc.2021.05.010

An Improved Framework of GPU Computing for CFD Applications on Structured Grids using OpenACC

Authors: Weicheng Xue, Charles W. Jackson, Christopher J. Roy

Abstract: This paper is focused on improving multi-GPU performance of a research CFD code on structured grids. MPI and OpenACC directives are used to scale the code up to 16 GPUs. This paper shows that using 16 P100 GPUs and 16 V100 GPUs can be 30× and 70× faster than 16 Xeon CPU E5-2680v4 cores for three different test cases, respectively. A series of performance issues related to the scaling for the multi-block CFD code are addressed by applying various optimizations. Performance optimizations such as the pack/unpack message method, removing temporary arrays as arguments to procedure calls, allocating global memory for limiters and connected boundary data, reordering non-blocking MPI_Isend/MPI_Irecv and Wait calls, reducing unnecessary implicit derived type member data movement between the host and the device, and the use of GPUDirect can improve the compute utilization, memory throughput, and asynchronous progression in the multi-block CFD code using modern programming features.

Submitted 4 December, 2020; originally announced December 2020.

Comments: 43 pages, 27 figures

arXiv:2006.02602

doi 10.1002/cpe.6036

Multi-GPU Performance Optimization of a CFD Code using OpenACC on Different Platforms

Authors: Weicheng Xue, Christopher J. Roy

Abstract: This paper investigates the multi-GPU performance of a 3D buoyancy-driven cavity solver using MPI and OpenACC directives on different platforms. The paper shows that decomposing the total problem in different dimensions affects the strong scaling performance significantly for the GPU. Without proper performance optimizations, 1D domain decomposition is shown to scale poorly on multiple GPUs due to noncontiguous memory access. Whatever the decomposition, performance benefits from the series of optimizations presented in the paper. Since the buoyancy-driven cavity code is latency-bound on the clusters examined, a series of optimizations, both platform-agnostic and platform-specific, is designed to reduce the latency cost and improve memory throughput between hosts and devices. First, the parallel message packing/unpacking strategy developed for noncontiguous data movement between hosts and devices improves the overall performance by about a factor of 2. Second, transferring different data based on the stencil sizes for different variables further reduces the communication overhead. These two optimizations are general enough to benefit stencil computations having ghost changes on all of the clusters tested. Third, GPUDirect is used to improve the communication on clusters that have the hardware and software support for direct communication between GPUs without staging through CPU memory. Finally, overlapping communication and computation is shown to be inefficient on multiple GPUs when using only MPI or MPI+OpenACC. Although we believe our implementation exposes enough overlap, the actual runs do not exploit it well due to a lack of asynchronous progression.

Submitted 3 June, 2020; originally announced June 2020.

arXiv:1607.06834

doi 10.1016/j.compfluid.2017.09.014

A Numerical Investigation of Matrix-Free Implicit Time-Stepping Methods for Large CFD Simulations

Authors: Arash Sarshar, Paul Tranquilli, Brent Pickering, Andrew McCall, Adrian Sandu, Christopher J. Roy

Abstract: This paper is concerned with the development and testing of advanced time-stepping methods suited for the integration of time-accurate, real-world applications of computational fluid dynamics (CFD). The performance of several time discretization methods is studied numerically with regard to computational efficiency, order of accuracy, and stability, as well as the ability to treat stiff problems effectively. We consider matrix-free implementations, a popular approach for time-stepping methods applied to large CFD applications due to its adherence to scalable matrix-vector operations and a small memory footprint. We compare explicit methods with matrix-free implementations of implicit, linearly-implicit, as well as Rosenbrock-Krylov methods. We show that Rosenbrock-Krylov methods are competitive with existing techniques, excelling for a number of problem types and settings.

Submitted 30 September, 2017; v1 submitted 22 July, 2016; originally announced July 2016.

Report number: Computational Science Lab CSL-TR-16-6

MSC Class: 65L05; 65L06; 65L20

Journal ref: Computers & Fluids, Volume 159, 15 Dec. 2017, pp. 53-63

Showing 1–5 of 5 results for author: Roy, C J