-
Tensor-Train WENO Scheme for Compressible Flows
Authors:
Mustafa Engin Danis,
Duc Truong,
Ismael Boureima,
Oleg Korobkin,
Kim Rasmussen,
Boian Alexandrov
Abstract:
In this study, we introduce a tensor-train (TT) finite difference WENO method for solving compressible Euler equations. In a step-by-step manner, the tensorization of the governing equations is demonstrated. We also introduce \emph{LF-cross} and \emph{WENO-cross} methods to compute numerical fluxes and the WENO reconstruction using the cross interpolation technique. A tensor-train approach is developed for boundary condition types commonly encountered in Computational Fluid Dynamics (CFD). The performance of the proposed WENO-TT solver is investigated in a rich set of numerical experiments. We demonstrate that the WENO-TT method achieves the theoretical $\text{5}^{\text{th}}$-order accuracy of the classical WENO scheme in smooth problems while successfully capturing complicated shock structures. In an effort to avoid the growth of TT ranks, we propose a dynamic method to estimate the TT approximation error that governs the ranks and overall truncation error of the WENO-TT scheme. Finally, we show that the traditional WENO scheme can be accelerated up to 1000 times in the TT format, and the memory requirements can be significantly decreased for low-rank problems, demonstrating the potential of the tensor-train approach for future CFD applications. This paper is the first study that develops a finite difference WENO scheme using the tensor-train approach for compressible flows. It is also the first comprehensive work that provides a detailed perspective into the relationship between rank, truncation error, and the TT approximation error for compressible WENO solvers.
Submitted 20 May, 2024;
originally announced May 2024.
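The link between TT ranks, approximation error, and memory mentioned in the abstract above can be illustrated with a small sketch. It uses the standard TT-SVD algorithm with a relative tolerance, which is an assumption made for illustration; it is not the LF-cross or WENO-cross procedure of the paper. The names `tt_svd`, `tt_to_full`, and `rel_eps` are hypothetical.

```python
import numpy as np

def tt_svd(tensor, rel_eps=1e-10):
    """Standard TT-SVD sketch: returns cores of shape (r_prev, n_k, r_next)."""
    dims = tensor.shape
    d = len(dims)
    # Per-unfolding threshold so the total relative error stays near rel_eps.
    delta = rel_eps * np.linalg.norm(tensor) / np.sqrt(max(d - 1, 1))
    cores = []
    rank = 1
    mat = tensor.reshape(rank * dims[0], -1)
    for k in range(d - 1):
        U, s, Vt = np.linalg.svd(mat, full_matrices=False)
        # tail[i] is the norm of the singular values s[i:].
        tail = np.sqrt(np.cumsum(s[::-1] ** 2))[::-1]
        # Keep the smallest rank whose discarded tail is at most delta.
        new_rank = max(1, int(np.sum(tail > delta)))
        cores.append(U[:, :new_rank].reshape(rank, dims[k], new_rank))
        mat = (s[:new_rank, None] * Vt[:new_rank]).reshape(new_rank * dims[k + 1], -1)
        rank = new_rank
    cores.append(mat.reshape(rank, dims[-1], 1))
    return cores

def tt_to_full(cores):
    """Contract the TT cores back into the full tensor."""
    full = cores[0]
    for core in cores[1:]:
        full = np.tensordot(full, core, axes=([-1], [0]))
    return full.reshape([c.shape[1] for c in cores])

# Smooth test field sin(x + y + z) on an 8 x 8 x 8 grid (illustrative data).
x = np.linspace(0.0, 1.0, 8)
field = np.sin(x[:, None, None] + x[None, :, None] + x[None, None, :])
cores = tt_svd(field)
ranks = [c.shape[2] for c in cores[:-1]]
tt_size = sum(c.size for c in cores)
rel_err = np.linalg.norm(tt_to_full(cores) - field) / np.linalg.norm(field)
```

Here `ranks` holds the internal TT ranks, and comparing `tt_size` with `field.size` shows the storage saving for this low-rank field. Loosening `rel_eps` trades accuracy for smaller ranks.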
-
Tensor Networks for Solving Realistic Time-independent Boltzmann Neutron Transport Equation
Authors:
Duc P. Truong,
Mario I. Ortega,
Ismael Boureima,
Gianmarco Manzini,
Kim Ø. Rasmussen,
Boian S. Alexandrov
Abstract:
Tensor network techniques, known for their low-rank approximation ability that breaks the curse of dimensionality, are emerging as a foundation of new mathematical methods for ultra-fast numerical solutions of high-dimensional Partial Differential Equations (PDEs). Here, we present a mixed Tensor Train (TT)/Quantized Tensor Train (QTT) approach for the numerical solution of time-independent Boltzmann Neutron Transport equations (BNTEs) in Cartesian geometry. Discretizing a realistic three-dimensional (3D) BNTE by (i) diamond differencing, (ii) multigroup-in-energy, and (iii) discrete ordinate collocation leads to huge generalized eigenvalue problems that generally require a matrix-free approach and large computer clusters. Starting from this discretization, we construct a TT representation of the PDE fields and discrete operators, then build a QTT representation of the TT cores, and solve the tensorized generalized eigenvalue problem in a fixed-point scheme with tensor network optimization techniques. We validate our approach by applying it to two realistic examples of 3D neutron transport problems, currently solved by the PARallel TIme-dependent SN (PARTISN) solver. We demonstrate that our TT/QTT method, executed on a standard desktop computer, leads to a yottabyte compression of the memory storage, and more than 7500 times speedup with a discrepancy of less than 1e-5 when compared to the PARTISN solution.
Submitted 13 September, 2023; v1 submitted 6 September, 2023;
originally announced September 2023.
-
Distributed Out-of-Memory SVD on CPU/GPU Architectures
Authors:
Ismael Boureima,
Manish Bhattarai,
Maksim E. Eren,
Nick Solovyev,
Hristo Djidjev,
Boian S. Alexandrov
Abstract:
We propose an efficient, distributed, out-of-memory implementation of the truncated singular value decomposition (t-SVD) for heterogeneous (CPU+GPU) high performance computing (HPC) systems. Various implementations of SVD have been proposed, but most only estimate the singular values, as the estimation of the singular vectors can significantly increase the time and memory complexity of the algorithm. In this work, we propose an implementation of SVD based on the power method, which estimates a truncated set of both singular values and singular vectors. Memory utilization bottlenecks seen in the power method are typically associated with the computation of the Gram matrix $\mat{A}^T\mat{A}$, which can be significant when $\mat{A}$ is large and dense, or when $\mat{A}$ is super-large and sparse. The proposed implementation is optimized for out-of-memory problems where the memory required to factorize a given matrix is greater than the available GPU memory. We reduce the memory complexity of $\mat{A}^T\mat{A}$ by using a batching strategy where the intermediate factors are computed block by block. We also suppress I/O latency associated with both host-to-device (H2D) and device-to-host (D2H) batch copies by overlapping each batch copy with compute using CUDA streams. Furthermore, we use optimized \textit{NCCL} based communicators to reduce the latency associated with collective communications (both intra-node and inter-node). In addition, sparse and dense matrix multiplications are significantly accelerated with GPU cores (or tensor cores when available), resulting in an implementation with good scaling. We demonstrate the scalability of our distributed out-of-core SVD algorithm by successfully decomposing a dense matrix of size 1TB and a sparse matrix of size 128PB with 1e-6 sparsity.
Submitted 17 August, 2022;
originally announced August 2022.
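The block-by-block treatment of the Gram matrix described in the abstract above can be sketched on a single CPU with NumPy. This is a minimal illustration under stated assumptions, not the authors' distributed GPU implementation: it recovers only the leading singular triplet, and the names `batched_gram_matvec`, `top_singular_triplet`, and `batch_rows` are hypothetical.

```python
import numpy as np

def batched_gram_matvec(A, v, batch_rows):
    """Return (A^T A) v by accumulating over row blocks; A^T A is never formed."""
    out = np.zeros(A.shape[1])
    for start in range(0, A.shape[0], batch_rows):
        block = A[start:start + batch_rows]   # one batch of rows
        out += block.T @ (block @ v)          # this block's contribution
    return out

def top_singular_triplet(A, batch_rows=2, iters=500, tol=1e-12, seed=0):
    """Power iteration on A^T A for the leading singular triplet (u, sigma, v)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = batched_gram_matvec(A, v, batch_rows)
        w_norm = np.linalg.norm(w)
        if w_norm == 0.0:
            break                             # A is the zero matrix
        w /= w_norm
        converged = np.linalg.norm(w - v) < tol
        v = w
        if converged:
            break
    Av = A @ v
    sigma = np.linalg.norm(Av)
    u = Av / sigma if sigma > 0 else Av
    return u, sigma, v

# Small illustrative matrix with singular values 3 and 1.
A = np.array([[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
u, sigma, v = top_singular_triplet(A)
```

Each batch needs only one block of rows in memory at a time, which is the property an out-of-memory implementation relies on. Further singular triplets would require deflation or a block variant, which this sketch omits.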
-
Distributed Out-of-Memory NMF on CPU/GPU Architectures
Authors:
Ismael Boureima,
Manish Bhattarai,
Maksim Eren,
Erik Skau,
Philip Romero,
Stephan Eidenbenz,
Boian Alexandrov
Abstract:
We propose an efficient distributed out-of-memory implementation of the Non-negative Matrix Factorization (NMF) algorithm for heterogeneous high-performance-computing (HPC) systems. The proposed implementation is based on prior work on NMFk, which can perform automatic model selection and extract latent variables and patterns from data. In this work, we extend NMFk by adding support for dense and sparse matrix operations on multi-node, multi-GPU systems. The resulting algorithm is optimized for out-of-memory (OOM) problems where the memory required to factorize a given matrix is greater than the available GPU memory. Memory complexity is reduced by batching/tiling strategies, and sparse and dense matrix operations are significantly accelerated with GPU cores (or tensor cores when available). Input/Output (I/O) latency associated with batch copies between host and device is hidden using CUDA streams to overlap data transfers and compute asynchronously, and latency associated with collective communications (both intra-node and inter-node) is reduced using optimized NVIDIA Collective Communication Library (NCCL)-based communicators. Benchmark results show significant improvement, from 32x to 76x speedup, with the new implementation using GPUs over the CPU-based NMFk. Good weak scaling was demonstrated on up to 4096 multi-GPU cluster nodes with approximately 25,000 GPUs when decomposing a dense 340 Terabyte-size matrix and an 11 Exabyte-size sparse matrix of density 10e-6.
Submitted 12 September, 2023; v1 submitted 18 February, 2022;
originally announced February 2022.
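The batching idea in the abstract above can be sketched with row batches on a single CPU. The update rule used here is the classical multiplicative update for the Frobenius objective, which is an assumption chosen for illustration; the abstract does not state which update rule the implementation uses. The names `nmf_row_batched` and `batch_rows` are hypothetical, and the model-selection part of NMFk is not shown.

```python
import numpy as np

def nmf_row_batched(X, rank, batch_rows, iters=200, eps=1e-9, seed=0):
    """Multiplicative-update NMF that touches X one row batch at a time."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        HHt = H @ H.T                      # small (rank x rank) matrix
        WtX = np.zeros((rank, n))          # accumulated W^T X
        WtW = np.zeros((rank, rank))       # accumulated W^T W
        for start in range(0, m, batch_rows):
            rows = slice(start, start + batch_rows)
            Xb, Wb = X[rows], W[rows]      # Wb is a view into W
            Wb *= (Xb @ H.T) / (Wb @ HHt + eps)   # update this batch of W in place
            WtX += Wb.T @ Xb
            WtW += Wb.T @ Wb
        H *= WtX / (WtW @ H + eps)         # update H from the accumulated sums
    return W, H

def relative_error(X, W, H):
    """Relative Frobenius reconstruction error."""
    return np.linalg.norm(X - W @ H) / np.linalg.norm(X)

# Small non-negative matrix of rank 2 (illustrative data).
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 3.0, 3.0],
              [2.0, 5.0, 3.0]])
W, H = nmf_row_batched(X, rank=2, batch_rows=2, iters=300)
err = relative_error(X, W, H)
```

Only one batch of `X` and the matching rows of `W` are needed at a time, while `H` and the small accumulators stay resident. In a GPU setting each batch would be the unit copied between host and device.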
-
Local Wave Number Model for Inhomogeneous Two-Fluid Mixing
Authors:
Nairita Pal,
Ismael Boureima,
Noah Braun,
Susan Kurien,
Praveen Ramaprabhu,
Andrew Lawrie
Abstract:
We present a study of a two-point spectral turbulence model (Local Wave-Number model or LWN model) for the Rayleigh-Taylor (RT) instability. The model outcomes are compared with statistical quantities extracted from three-dimensional simulations of the RT problem. These simulations are initialized with high wavenumber perturbations at the interface of a heavy fluid placed on top of a light fluid so that the density gradient is in the direction opposite to acceleration due to gravity. We consider flows of low to medium density contrast and compare the LWN model against simulation data using the mix-width evolution as the primary metric. The original model specified physically reasonable but largely \emph{ad hoc} terms to account for the inhomogeneous mechanisms involved in growing the mixing layer. We systematically assess the role of each of the terms in the LWN model equations by comparison with simulation. Two of these, the kinematic source term, introduced to maintain a finite covariance between density and specific volume, and a spectral distortion term, introduced as spectral modifications of the density-specific-volume covariance, both result in severely over-predicting the mix layer growth. A simplified model eliminating those two terms is shown to improve the capture of both the mix-width evolution and the turbulent mass flux velocity profiles across the mix layer at different times. However, this simplification reveals that fidelity to other metrics, such as the density-specific-volume covariance and the turbulent kinetic energy, is somewhat compromised. The implications of this outcome are discussed with respect to the physics of the RT problem, and we provide this study as a guide for the practical use of such a model.
Submitted 16 August, 2021; v1 submitted 18 March, 2020;
originally announced March 2020.
-
Turbulent transport and mixing in the multimode narrowband Richtmyer-Meshkov instability
Authors:
B. Thornber,
J. Griffond,
P. Bigdelou,
I. Boureima,
P. Ramaprabhu,
O. Schilling,
R. J. R. Williams
Abstract:
The mean momentum and heavy mass fraction, turbulent kinetic energy, and heavy mass fraction variance fields, as well as the budgets of their transport equations, are examined at several times during the evolution of a narrowband Richtmyer-Meshkov instability initiated by a Mach 1.84 shock traversing a perturbed interface separating gases with a density ratio of 3. The results are computed using the `quarter scale' data from four algorithms presented in the θ-group study of Thornber et al. [Phys. Fluids 29, 105107 (2017)]. The present study is inspired by a previous similar study of Rayleigh-Taylor instability and mixing using direct numerical simulation data by Schilling and Mueschke [Phys. Fluids 22, 105102 (2010)]. In addition to comparing the predictions of the data from four implicit large-eddy simulation codes, the budgets are used to quantify the relative importance of the terms in the transport equations, and the balance of the terms is employed to infer the numerical dissipation. Terms arising from the compressibility of the flow are examined in particular, i.e., the pressure-dilatation. The results are useful for validation of large-eddy simulation and Reynolds-averaged modeling of Richtmyer-Meshkov instability.
Submitted 31 October, 2019;
originally announced October 2019.