-
Two-Stage Block Orthogonalization to Improve Performance of $s$-step GMRES
Authors:
Ichitaro Yamazaki,
Andrew J. Higgins,
Erik G. Boman,
Daniel B. Szyld
Abstract:
On current computer architectures, GMRES' performance can be limited by its communication cost to generate orthonormal basis vectors of the Krylov subspace. To address this performance bottleneck, its $s$-step variant orthogonalizes a block of $s$ basis vectors at a time, potentially reducing the communication cost by a factor of $s$. Unfortunately, for a large step size $s$, the solver can genera…
▽ More
On current computer architectures, GMRES' performance can be limited by its communication cost to generate orthonormal basis vectors of the Krylov subspace. To address this performance bottleneck, its $s$-step variant orthogonalizes a block of $s$ basis vectors at a time, potentially reducing the communication cost by a factor of $s$. Unfortunately, for a large step size $s$, the solver can generate extremely ill-conditioned basis vectors, and to maintain stability in practice, a conservatively small step size is used, which limits the performance of the $s$-step solver. To enhance the performance using a small step size, in this paper, we introduce a two-stage block orthogonalization scheme. Similar to the original scheme, the first stage of the proposed method operates on a block of $s$ basis vectors at a time, but its objective is to maintain the well-conditioning of the generated basis vectors with a lower cost. The orthogonalization of the basis vectors is delayed until the second stage when enough basis vectors are generated to obtain higher performance.
Our analysis shows the stability of the proposed two-stage scheme. The performance is improved because while the same amount of computation as the original scheme is required, most of the communication is done at the second stage of the proposed scheme, reducing the overall communication requirements. Our performance results with up to 192 NVIDIA V100 GPUs on the Summit supercomputer demonstrate that when solving a 2D Laplace problem, the two-stage approach can reduce the orthogonalization time and the total time-to-solution by the respective factors of up to $2.6\times$ and $1.6\times$ over the original $s$-step GMRES, which had already obtained the respective speedups of $2.1\times$ and $1.8\times$ over the standard GMRES. Similar speedups were obtained for 3D problems and for matrices from the SuiteSparse Matrix Collection.
△ Less
Submitted 22 February, 2024;
originally announced February 2024.
-
Analysis of Randomized Householder-Cholesky QR Factorization with Multisketching
Authors:
Andrew J. Higgins,
Daniel B. Szyld,
Erik G. Boman,
Ichitaro Yamazaki
Abstract:
CholeskyQR2 and shifted CholeskyQR3 are two state-of-the-art algorithms for computing tall-and-skinny QR factorizations since they attain high performance on current computer architectures. However, to guarantee stability, for some applications, CholeskyQR2 faces a prohibitive restriction on the condition number of the underlying matrix to factorize. Shifted CholeskyQR3 is stable but has $50\%$ mo…
▽ More
CholeskyQR2 and shifted CholeskyQR3 are two state-of-the-art algorithms for computing tall-and-skinny QR factorizations since they attain high performance on current computer architectures. However, to guarantee stability, for some applications, CholeskyQR2 faces a prohibitive restriction on the condition number of the underlying matrix to factorize. Shifted CholeskyQR3 is stable but has $50\%$ more computational and communication costs than CholeskyQR2. In this paper, a randomized QR algorithm called Randomized Householder-Cholesky (\texttt{rand\_cholQR}) is proposed and analyzed. Using one or two random sketch matrices, it is proved that with high probability, its orthogonality error is bounded by a constant of the order of unit roundoff for any numerically full-rank matrix, and hence it is as stable as shifted CholeskyQR3. An evaluation of the performance of \texttt{rand\_cholQR} on a NVIDIA A100 GPU demonstrates that for tall-and-skinny matrices, \texttt{rand\_cholQR} with multiple sketch matrices is nearly as fast as, or in some cases faster than, CholeskyQR2. Hence, compared to CholeskyQR2, \texttt{rand\_cholQR} is more stable with almost no extra computational or memory cost, and therefore a superior algorithm both in theory and practice.
△ Less
Submitted 11 September, 2023;
originally announced September 2023.
-
An Experimental Study of Two-Level Schwarz Domain Decomposition Preconditioners on GPUs
Authors:
Ichitaro Yamazaki,
Alexander Heinlein,
Sivasankaran Rajamanickam
Abstract:
The generalized Dryja--Smith--Widlund (GDSW) preconditioner is a two-level overlap** Schwarz domain decomposition (DD) preconditioner that couples a classical one-level overlap** Schwarz preconditioner with an energy-minimizing coarse space. When used to accelerate the convergence rate of Krylov subspace iterative methods, the GDSW preconditioner provides robustness and scalability for the sol…
▽ More
The generalized Dryja--Smith--Widlund (GDSW) preconditioner is a two-level overlap** Schwarz domain decomposition (DD) preconditioner that couples a classical one-level overlap** Schwarz preconditioner with an energy-minimizing coarse space. When used to accelerate the convergence rate of Krylov subspace iterative methods, the GDSW preconditioner provides robustness and scalability for the solution of sparse linear systems arising from the discretization of a wide range of partial different equations. In this paper, we present FROSch (Fast and Robust Schwarz), a domain decomposition solver package which implements GDSW-type preconditioners for both CPU and GPU clusters. To improve the solver performance on GPUs, we use a novel decomposition to run multiple MPI processes on each GPU, reducing both solver's computational and storage costs and potentially improving the convergence rate. This allowed us to obtain competitive or faster performance using GPUs compared to using CPUs alone. We demonstrate the performance of FROSch on the Summit supercomputer with NVIDIA V100 GPUs, where we used NVIDIA Multi-Process Service (MPS) to implement our decomposition strategy.
The solver has a wide variety of algorithmic and implementation choices, which poses both opportunities and challenges for its GPU implementation. We conduct a thorough experimental study with different solver options including the exact or inexact solution of the local overlap** subdomain problems on a GPU. We also discuss the effect of using the iterative variant of the incomplete LU factorization and sparse-triangular solve as the approximate local solver, and using lower precision for computing the whole FROSch preconditioner. Overall, the solve time was reduced by factors of about $2\times$ using GPUs, while the GPU acceleration of the numerical setup time depend on the solver options and the local matrix sizes.
△ Less
Submitted 10 April, 2023;
originally announced April 2023.
-
A Study of Mixed Precision Strategies for GMRES on GPUs
Authors:
Jennifer A. Loe,
Christian A. Glusa,
Ichitaro Yamazaki,
Erik G. Boman,
Sivasankaran Rajamanickam
Abstract:
Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for mixed precision stra…
▽ More
Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for mixed precision strategies at the linear algebra algorithms level if we want to exploit the hardware to its full potential while meeting the accuracy requirements. In this paper, we focus on preconditioned sparse iterative linear solvers, a key kernel in several CSE applications. We present a study of mixed precision strategies for accelerating this kernel on an NVIDIA V$100$ GPU with a Power 9 CPU. We seek the best methods for incorporating multiple precisions into the GMRES linear solver; these include iterative refinement and parallelizable preconditioners. Our work presents strategies to determine when mixed precision GMRES will be effective and to choose parameters for a mixed precision iterative refinement solver to achieve better performance. We use an implementation that is based on the Trilinos library and employs Kokkos Kernels for performance portability of linear algebra kernels. Performance results demonstrate the promise of mixed precision approaches and demonstrate even further improvements are possible by optimizing low-level kernels.
△ Less
Submitted 2 September, 2021;
originally announced September 2021.
-
Experimental Evaluation of Multiprecision Strategies for GMRES on GPUs
Authors:
Jennifer A. Loe,
Christian A. Glusa,
Ichitaro Yamazaki,
Erik G. Boman,
Sivasankaran Rajamanickam
Abstract:
Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for multiprecision strat…
▽ More
Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for multiprecision strategies at the linear algebra algorithms level if we want to exploit the hardware to its full potential while meeting the accuracy requirements. In this paper, we focus on preconditioned sparse iterative linear solvers, a key kernel in several CSE applications. We present a study of multiprecision strategies for accelerating this kernel on GPUs. We seek the best methods for incorporating multiple precisions into the GMRES linear solver; these include iterative refinement and parallelizable preconditioners. Our work presents strategies to determine when multiprecision GMRES will be effective and to choose parameters for a multiprecision iterative refinement solver to achieve better performance. We use an implementation that is based on the Trilinos library and employs Kokkos Kernels for performance portability of linear algebra kernels. Performance results demonstrate the promise of multiprecision approaches and demonstrate even further improvements are possible by optimizing low-level kernels.
△ Less
Submitted 16 May, 2021;
originally announced May 2021.
-
Low-Synch Gram-Schmidt with Delayed Reorthogonalization for Krylov Solvers
Authors:
Daniel Bielich,
Julien Langou,
Stephen Thomas,
Kasia Swirydowicz,
Ichitaro Yamazaki,
Erik G. Boman
Abstract:
The parallel strong-scaling of Krylov iterative methods is largely determined by the number of global reductions required at each iteration. The GMRES and Krylov-Schur algorithms employ the Arnoldi algorithm for nonsymmetric matrices. The underlying orthogonalization scheme is left-looking and processes one column at a time. Thus, at least one global reduction is required per iteration. The tradit…
▽ More
The parallel strong-scaling of Krylov iterative methods is largely determined by the number of global reductions required at each iteration. The GMRES and Krylov-Schur algorithms employ the Arnoldi algorithm for nonsymmetric matrices. The underlying orthogonalization scheme is left-looking and processes one column at a time. Thus, at least one global reduction is required per iteration. The traditional algorithm for generating the orthogonal Krylov basis vectors for the Krylov-Schur algorithm is classical Gram Schmidt applied twice with reorthogonalization (CGS2), requiring three global reductions per step. A new variant of CGS2 that requires only one reduction per iteration is applied to the Arnoldi-QR iteration. Strong-scaling results are presented for finding eigenvalue-pairs of nonsymmetric matrices. A preliminary attempt to derive a similar algorithm (one reduction per Arnoldi iteration with a robust orthogonalization scheme) was presented by Hernandez et al.(2007). Unlike our approach, their method is not forward stable for eigenvalues.
△ Less
Submitted 15 May, 2021; v1 submitted 2 April, 2021;
originally announced April 2021.
-
Two-Stage Gauss--Seidel Preconditioners and Smoothers for Krylov Solvers on a GPU cluster
Authors:
Luc Berger-Vergiat,
Brian Kelley,
Sivasankaran Rajamanickam,
Jonathan Hu,
Katarzyna Swirydowicz,
Paul Mullowney,
Stephen Thomas,
Ichitaro Yamazaki
Abstract:
Gauss-Seidel (GS) relaxation is often employed as a preconditioner for a Krylov solver or as a smoother for Algebraic Multigrid (AMG). However, the requisite sparse triangular solve is difficult to parallelize on many-core architectures such as graphics processing units (GPUs). In the present study, the performance of the traditional GS relaxation based on a triangular solve is compared with two-s…
▽ More
Gauss-Seidel (GS) relaxation is often employed as a preconditioner for a Krylov solver or as a smoother for Algebraic Multigrid (AMG). However, the requisite sparse triangular solve is difficult to parallelize on many-core architectures such as graphics processing units (GPUs). In the present study, the performance of the traditional GS relaxation based on a triangular solve is compared with two-stage variants, replacing the direct triangular solve with a fixed number of inner Jacobi-Richardson (JR) iterations. When a small number of inner iterations is sufficient to maintain the Krylov convergence rate, the two-stage GS (GS2) often outperforms the traditional algorithm on many-core architectures. We also compare GS2 with JR. When they perform the same number of flops for SpMV (e.g. three JR sweeps compared to two GS sweeps with one inner JR sweep), the GS2 iterations, and the Krylov solver preconditioned with GS2, may converge faster than the JR iterations. Moreover, for some problems (e.g. elasticity), it was found that JR may diverge with a dam** factor of one, whereas two-stage GS may improve the convergence with more inner iterations. Finally, to study the performance of the two-stage smoother and preconditioner for a practical problem, %(e.g. using tuned dam** factors), these were applied to incompressible fluid flow simulations on GPUs.
△ Less
Submitted 24 April, 2021; v1 submitted 2 April, 2021;
originally announced April 2021.
-
A Survey of Numerical Methods Utilizing Mixed Precision Arithmetic
Authors:
Ahmad Abdelfattah,
Hartwig Anzt,
Erik G. Boman,
Erin Carson,
Terry Cojean,
Jack Dongarra,
Mark Gates,
Thomas Grützmacher,
Nicholas J. Higham,
Sherry Li,
Neil Lindquist,
Yang Liu,
Jennifer Loe,
Piotr Luszczek,
Pratik Nayak,
Sri Pranesh,
Siva Rajamanickam,
Tobias Ribizel,
Barry Smith,
Kasia Swirydowicz,
Stephen Thomas,
Stanimire Tomov,
Yaohung M. Tsai,
Ichitaro Yamazaki,
Urike Meier Yang
Abstract:
Within the past years, hardware vendors have started designing low precision special function units in response to the demand of the Machine Learning community and their demand for high compute power in low precision formats. Also the server-line products are increasingly featuring low-precision special function units, such as the NVIDIA tensor cores in ORNL's Summit supercomputer providing more t…
▽ More
Within the past years, hardware vendors have started designing low precision special function units in response to the demand of the Machine Learning community and their demand for high compute power in low precision formats. Also the server-line products are increasingly featuring low-precision special function units, such as the NVIDIA tensor cores in ORNL's Summit supercomputer providing more than an order of magnitude higher performance than what is available in IEEE double precision. At the same time, the gap between the compute power on the one hand and the memory bandwidth on the other hand keeps increasing, making data access and communication prohibitively expensive compared to arithmetic operations. To start the multiprecision focus effort, we survey the numerical linear algebra community and summarize all existing multiprecision knowledge, expertise, and software capabilities in this landscape analysis report. We also include current efforts and preliminary results that may not yet be considered "mature technology," but have the potential to grow into production quality within the multiprecision focus effort. As we expect the reader to be familiar with the basics of numerical linear algebra, we refrain from providing a detailed background on the algorithms themselves but focus on how mixed- and multiprecision technology can help improving the performance of these methods and present highlights of application significantly outperforming the traditional fixed precision methods.
△ Less
Submitted 13 July, 2020;
originally announced July 2020.
-
A hybrid Hermitian general eigenvalue solver
Authors:
Raffaele Solcà,
Thomas C. Schulthess,
Azzam Haidar,
Stanimire Tomov,
Ichitaro Yamazaki,
Jack Dongarra
Abstract:
The adoption of hybrid GPU-CPU nodes in traditional supercomputing platforms opens acceleration opportunities for electronic structure calculations in materials science and chemistry applications, where medium sized Hermitian generalized eigenvalue problems must be solved many times. The small size of the problems limits the scalability on a distributed memory system, hence they can benefit from t…
▽ More
The adoption of hybrid GPU-CPU nodes in traditional supercomputing platforms opens acceleration opportunities for electronic structure calculations in materials science and chemistry applications, where medium sized Hermitian generalized eigenvalue problems must be solved many times. The small size of the problems limits the scalability on a distributed memory system, hence they can benefit from the massive computational performance concentrated on a single node, hybrid GPU-CPU system. However, new algorithms that efficiently exploit heterogeneity and massive parallelism of not just GPUs, but of multi/many-core CPUs as well are required. Addressing these demands, we implemented a novel Hermitian general eigensolver algorithm. This algorithm is based on a standard eigenvalue solver, and existing algorithms can be used. The resulting eigensolvers are state-of-the-art in HPC, significantly outperforming existing libraries. We analyze their performance impact on applications of interest, when different fractions of eigenvectors are needed by the host electronic structure code.
△ Less
Submitted 7 July, 2012;
originally announced July 2012.