Search | arXiv e-print repository

Two-Stage Block Orthogonalization to Improve Performance of $s$-step GMRES

Authors: Ichitaro Yamazaki, Andrew J. Higgins, Erik G. Boman, Daniel B. Szyld

Abstract: On current computer architectures, GMRES' performance can be limited by its communication cost to generate orthonormal basis vectors of the Krylov subspace. To address this performance bottleneck, its $s$-step variant orthogonalizes a block of $s$ basis vectors at a time, potentially reducing the communication cost by a factor of $s$. Unfortunately, for a large step size $s$, the solver can genera… ▽ More On current computer architectures, GMRES' performance can be limited by its communication cost to generate orthonormal basis vectors of the Krylov subspace. To address this performance bottleneck, its $s$-step variant orthogonalizes a block of $s$ basis vectors at a time, potentially reducing the communication cost by a factor of $s$. Unfortunately, for a large step size $s$, the solver can generate extremely ill-conditioned basis vectors, and to maintain stability in practice, a conservatively small step size is used, which limits the performance of the $s$-step solver. To enhance the performance using a small step size, in this paper, we introduce a two-stage block orthogonalization scheme. Similar to the original scheme, the first stage of the proposed method operates on a block of $s$ basis vectors at a time, but its objective is to maintain the well-conditioning of the generated basis vectors with a lower cost. The orthogonalization of the basis vectors is delayed until the second stage when enough basis vectors are generated to obtain higher performance. Our analysis shows the stability of the proposed two-stage scheme. The performance is improved because while the same amount of computation as the original scheme is required, most of the communication is done at the second stage of the proposed scheme, reducing the overall communication requirements. Our performance results with up to 192 NVIDIA V100 GPUs on the Summit supercomputer demonstrate that when solving a 2D Laplace problem, the two-stage approach can reduce the orthogonalization time and the total time-to-solution by the respective factors of up to $2.6\times$ and $1.6\times$ over the original $s$-step GMRES, which had already obtained the respective speedups of $2.1\times$ and $1.8\times$ over the standard GMRES. Similar speedups were obtained for 3D problems and for matrices from the SuiteSparse Matrix Collection. △ Less

Submitted 22 February, 2024; originally announced February 2024.

Comments: Accepted for publication in IPDPS'24

arXiv:1301.2650 [pdf, ps, other]

doi 10.1016/j.apnum.2014.02.006

Krylov Subspace Recycling for Sequences of Shifted Linear Systems

Authors: Kirk M. Soodhalter, Daniel B. Szyld, Fei Xue

Abstract: We study the use of Krylov subspace recycling for the solution of a sequence of slowly-changing families of linear systems, where each family consists of shifted linear systems that differ in the coefficient matrix only by multiples of the identity. Our aim is to explore the simultaneous solution of each family of shifted systems within the framework of subspace recycling, using one augmented subs… ▽ More We study the use of Krylov subspace recycling for the solution of a sequence of slowly-changing families of linear systems, where each family consists of shifted linear systems that differ in the coefficient matrix only by multiples of the identity. Our aim is to explore the simultaneous solution of each family of shifted systems within the framework of subspace recycling, using one augmented subspace to extract candidate solutions for all the shifted systems. The ideal method would use the same augmented subspace for all systems and have fixed storage requirements, independent of the number of shifted systems per family. We show that a method satisfying both requirements cannot exist in this framework. As an alternative, we introduce two schemes. One constructs a separate deflation space for each shifted system but solves each family of shifted systems simultaneously. The other builds only one recycled subspace and constructs approximate corrections to the solutions of the shifted systems at each cycle of the iterative linear solver while only minimizing the base system residual. At convergence of the base system solution, we apply the method recursively to the remaining unconverged systems. We present numerical examples involving systems arising in lattice quantum chromodynamics. △ Less

Submitted 29 August, 2013; v1 submitted 12 January, 2013; originally announced January 2013.

Comments: 5 figures, 20 pages (main paper 18 pages + refs 2 pages)

MSC Class: 65F10 ACM Class: F.2.1; G.1.3

Journal ref: Applied Numerical Mathematics 81C (2014), pp. 105-118

arXiv:cs/0606047 [pdf, ps, other]

Asynchronous iterative computations with Web information retrieval structures: The PageRank case

Authors: Giorgos Kollias, Efstratios Gallopoulos, Daniel B. Szyld

Abstract: There are several ideas being used today for Web information retrieval, and specifically in Web search engines. The PageRank algorithm is one of those that introduce a content-neutral ranking function over Web pages. This ranking is applied to the set of pages returned by the Google search engine in response to posting a search query. PageRank is based in part on two simple common sense concepts… ▽ More There are several ideas being used today for Web information retrieval, and specifically in Web search engines. The PageRank algorithm is one of those that introduce a content-neutral ranking function over Web pages. This ranking is applied to the set of pages returned by the Google search engine in response to posting a search query. PageRank is based in part on two simple common sense concepts: (i)A page is important if many important pages include links to it. (ii)A page containing many links has reduced impact on the importance of the pages it links to. In this paper we focus on asynchronous iterative schemes to compute PageRank over large sets of Web pages. The elimination of the synchronizing phases is expected to be advantageous on heterogeneous platforms. The motivation for a possible move to such large scale distributed platforms lies in the size of matrices representing Web structure. In orders of magnitude: $10^{10}$ pages with $10^{11}$ nonzero elements and $10^{12}$ bytes just to store a small percentage of the Web (the already crawled); distributed memory machines are necessary for such computations. The present research is part of our general objective, to explore the potential of asynchronous computational models as an underlying framework for very large scale computations over the Grid. The area of ``internet algorithmics'' appears to offer many occasions for computations of unprecedent dimensionality that would be good candidates for this framework. △ Less

Submitted 11 June, 2006; originally announced June 2006.

Comments: 8 pages to appear at ParCo2005 Conference Proceedings

Report number: TR HPCLAB-SCG 5/08-05

Showing 1–3 of 3 results for author: Szyld, D B