Asynchronous iterations of HSS method for non-Hermitian linear systems

Guillaume Gbikpi-Benissan Université Paris-Saclay, CentraleSupélec, Gif-sur-Yvette, France ([email protected]). Qinmeng Zou Université Paris-Saclay, CentraleSupélec, Gif-sur-Yvette, France ([email protected]). Frédéric Magoulès Université Paris-Saclay, CentraleSupélec, Gif-sur-Yvette, France (correspondence, [email protected]).

Abstract

A general asynchronous alternating iterative model is designed, for which convergence is theoretically ensured both under a classical spectral radius bound and for a classical class of matrix splittings for $\mathsf{H}$ -matrices. The computational model can be thought of as a two-stage alternating iterative method, which is well suited to the well-known Hermitian and skew-Hermitian splitting (HSS) approach, with the particularity here of considering only one inner iteration. An experimental parallel performance comparison is conducted between the generalized minimal residual (GMRES) algorithm, the standard HSS method and our asynchronous variant, on both real and complex non-Hermitian linear systems arising from convection-diffusion and structural dynamics problems, respectively. A significant gain in execution time is observed in both cases.

Keywords: Asynchronous iterations; alternating iterations; Hermitian and skew-Hermitian splitting; non-Hermitian problems; parallel computing

1 Introduction

Many applications in scientific computing and engineering lead to the following system of linear equations,

Ax=b,\quad A\in\mathbb{C}^{n\times n},\quad b\in\mathbb{C}^{n}.

(1)

Let $A=M-N$ and $A=F-G$ be two splittings of $A$ with $M$ and $F$ being nonsingular. The alternating iterative scheme for solving (1) is defined as follows,

\left\{\begin{array}[]{lcl}Mx^{k+\frac{1}{2}}&=&Nx^{k}+b,\\ Fx^{k+1}&=&Gx^{k+\frac{1}{2}}+b,\end{array}\right.

(2)

which can be viewed as a stationary iterative scheme with an iteration matrix $F^{-1}GM^{-1}N$ . Well-known early examples include the symmetric successive over-relaxation (SSOR) method [43, 17] and the alternating direction implicit (ADI) methods [40, 19, 38]. In [12] the convergence of some alternating iterations was analyzed by eliminating the intermediate solution term $x^{k+1/2}$ from (2); see also [1]. Recently, there has been growing interest in studies of the Hermitian and skew-Hermitian splitting (HSS) method [5] for solving (1) when $A$ is non-Hermitian. Let $\alpha>0$ be a given constant. The HSS method can be written in the form

\left\{\begin{array}[]{lcl}(\alpha I+H)x^{k+\frac{1}{2}}&=&(\alpha I-S)x^{k}+b,\\ (\alpha I+S)x^{k+1}&=&(\alpha I-H)x^{k+\frac{1}{2}}+b,\end{array}\right.

(3)

where $H=(A+A^{\mathsf{H}})/2$ and $S=(A-A^{\mathsf{H}})/2$ are the Hermitian and skew-Hermitian parts of $A$ , respectively, and $I$ is the identity matrix. Here, $A^{\mathsf{H}}$ denotes the conjugate transpose of $A$ . This method can be obtained from (2) by defining

\begin{array}[]{lcl}M&:=&\alpha I+H,\\ F&:=&\alpha I+S.\end{array}

(4)

It was proved in [5] that when $H$ is positive definite, namely, $A$ is non-Hermitian positive definite, HSS converges unconditionally to the unique solution $x^{*}$ for any initial guess $x^{0}$ . The linear subsystems, however, especially the one involving $\alpha I+S$ , may still be difficult to solve, therefore much attention has been devoted to the inexact implementation. More precisely, the tolerances for the inner iterative solvers may be relatively relaxed, while good convergence properties can still be retained according to numerical experiments; see [5, 11, 9, 6]. The HSS iterative scheme has been generalized to other splitting methods, as well as their preconditioned variants, for handling various problems in scientific computing; see, e.g., [13, 30, 9, 3, 44, 29, 2]. There is also a number of studies on the optimal selection of $\alpha$ ; see [5, 4, 28, 46]. The iterative scheme (3) can be equivalently written in a residual-updating form, which achieves a higher accuracy at the cost of more computational effort; see [6] for a detailed discussion.
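As a minimal illustrative sketch, not taken from the experiments of this paper, the exact HSS iteration (3) can be written with dense direct solves for both shifted subsystems. The function name `hss_solve`, the small test matrix and the value of $\alpha$ are assumptions chosen only for illustration; the matrix has a positive definite Hermitian part, so the unconditional convergence result applies.

```python
import numpy as np

def hss_solve(A, b, alpha, tol=1e-10, max_iter=500):
    """Exact HSS iteration (3): two shifted direct solves per outer iteration."""
    n = A.shape[0]
    I = np.eye(n)
    H = (A + A.conj().T) / 2   # Hermitian part of A
    S = (A - A.conj().T) / 2   # skew-Hermitian part of A
    x = np.zeros(n, dtype=complex)
    for k in range(max_iter):
        # Stop on the relative residual, as in the algorithms of Section 4.
        if np.linalg.norm(b - A @ x) <= tol * np.linalg.norm(b):
            return x, k
        x_half = np.linalg.solve(alpha * I + H, (alpha * I - S) @ x + b)
        x = np.linalg.solve(alpha * I + S, (alpha * I - H) @ x_half + b)
    return x, max_iter

# Hypothetical test problem: H = diag(4, 3, 5) is positive definite.
A = np.array([[4.0, 1.0, 0.0],
              [-1.0, 3.0, 2.0],
              [0.0, -2.0, 5.0]])
b = np.array([1.0, 2.0, 3.0])
x, iters = hss_solve(A, b, alpha=3.0)
```

In practice the two dense solves would be replaced by inner iterative solvers, which is precisely the inexact setting discussed above.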

Parallel computing could be extremely useful when $A$ has large dimension. In practice, the high cost of synchronization relative to that of computation is currently the major bottleneck in high-performance distributed computing systems, which motivates the redesign of parallel iterative algorithms. One of the most interesting approaches, arising from basic relaxation methods, is that of the so-called asynchronous iterations [16, 15]. An asynchronous iterative scheme allows a full overlapping of communication and computation. Every process has the flexibility to work at its own pace without waiting for data acquisition. A major difference between synchronous and asynchronous iterations lies in their predictability properties. The former produce a deterministic sequence of iterates, while the latter allow nondeterministic behavior. In [16] the first convergence result was established for the solution of linear systems, which was followed by the investigation of general fixed-point iterative models; see [39, 7, 21, 14]. In recent years, with the advent of very high-performance computing environments, asynchronous iterative schemes have gained much popularity. The study of asynchronous domain decomposition methods, in both time and space domains, has become an increasingly active area of research; see, e.g., [36, 35, 37, 32, 45, 20]. Another area that has seen growth in the last decades is asynchronous convergence detection; see [33, 26] and the references therein.

In this paper we focus on the asynchronous formulation of alternating iterations. In Section 2, we recall some general tools and the asynchronous iterations theory used for the formulation and the convergence analysis of our asynchronous alternating scheme. Section 3 presents the main contribution, where we formulate our asynchronous alternating scheme and give sufficient conditions for its convergence. Section 4 discusses implementation aspects. Section 5 is devoted to numerical experiments on a parallel computing platform, featuring both a real three-dimensional convection-diffusion problem and a complex two-dimensional structural dynamics problem. Finally, Section 6 gives our conclusions.

2 Generalities

2.1 $\mathsf{H}$ -matrix and $\mathsf{H}$ -splitting

In a general manner, let $\mathcal{A}_{i,j}$ denote the entry of a matrix $\mathcal{A}$ on its $i$ -th row and $j$ -th column, and let $x_{i}$ denote the $i$ -th entry of a vector $x$ . Comparisons $<$ , $\leq$ , $>$ , $\geq$ and $=$ between two matrices or vectors (of the same shape) are entrywise. The absolute value (or modulus) $|\mathcal{A}|$ of a matrix or a vector $\mathcal{A}$ is entrywise. The spectral radius of a matrix $\mathcal{A}$ is designated by $\rho(\mathcal{A})$ . In expressions like $\mathcal{A}<0$ and like $x<0$ with $\mathcal{A}$ and $x$ being a matrix and a vector, respectively, $0$ indicates a matrix and a vector, respectively, with all entries being $0$ . $I$ stands for the identity matrix.

We now recall a few general tools used later for the convergence analysis of the proposed asynchronous iterative method.

Definition 1.

A square matrix $\mathcal{A}$ is an $\mathsf{M}$ -matrix if and only if

\exists\ \alpha\in\mathbb{R}:\quad\alpha I-\mathcal{A}\geq 0,\quad\alpha>\rho(\alpha I-\mathcal{A}).

Definition 2.

The comparison matrix $\langle\mathcal{A}\rangle$ of a matrix $\mathcal{A}$ is defined as

\langle\mathcal{A}\rangle_{i,i}:=|\mathcal{A}_{i,i}|,\qquad\langle\mathcal{A}\rangle_{i,j}:=-|\mathcal{A}_{i,j}|,\quad i\neq j.

Definition 3.

A square matrix $\mathcal{A}$ is an $\mathsf{H}$ -matrix if and only if its comparison matrix $\langle\mathcal{A}\rangle$ is an $\mathsf{M}$ -matrix.

Lemma 1.

A square matrix $\mathcal{A}$ is an $\mathsf{H}$ -matrix if and only if

\exists\ u>0:\quad\forall i,\ |\mathcal{A}_{i,i}|u_{i}>\sum_{j\neq i}|\mathcal{A}_{i,j}|u_{j}.

Proof.

This is directly implied by Theorem 5’ in [22]. ∎
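The following sketch illustrates Definitions 1 to 3 and Lemma 1 numerically. It is only an illustration under stated assumptions: the helper names are hypothetical, Definition 1 is tested with the particular choice $\alpha=\max_{i}\mathcal{A}_{i,i}$ (for which $\alpha I-\mathcal{A}\geq 0$ holds as soon as the off-diagonal entries are nonpositive), and the vector $u$ of Lemma 1 is obtained as $u=\langle\mathcal{A}\rangle^{-1}e$ with $e$ the vector of ones, which yields $\langle\mathcal{A}\rangle u=e>0$.

```python
import numpy as np

def comparison_matrix(A):
    """Comparison matrix <A> of Definition 2."""
    C = -np.abs(A)
    np.fill_diagonal(C, np.abs(np.diag(A)))
    return C

def is_m_matrix(C):
    """Definition 1, tested with alpha = largest diagonal entry of C."""
    alpha = np.max(np.diag(C))
    B = alpha * np.eye(C.shape[0]) - C
    if np.any(B < 0):
        return False
    return bool(alpha > np.max(np.abs(np.linalg.eigvals(B))))

def is_h_matrix(A):
    """Definition 3: A is an H-matrix iff <A> is an M-matrix."""
    return is_m_matrix(comparison_matrix(A))

# Hypothetical example: not diagonally dominant by rows, yet an H-matrix.
A = np.array([[1.0, 2.0],
              [0.1, 1.0]])
h_flag = is_h_matrix(A)

# Positive vector u of Lemma 1, from <A> u = e.
u = np.linalg.solve(comparison_matrix(A), np.ones(2))
```

The example shows that the weights $u$ matter: the first row of this matrix violates plain diagonal dominance, while the weighted inequality of Lemma 1 holds.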

A splitting $\mathcal{A}=\mathcal{M}-\mathcal{N}$ of a matrix $\mathcal{A}$ consists of identifying a nonsingular matrix $\mathcal{M}$ and the resulting matrix $\mathcal{N}=\mathcal{M}-\mathcal{A}$ , so as to define a relaxation operator $\mathcal{M}^{-1}\mathcal{N}=I-\mathcal{M}^{-1}\mathcal{A}.$

Definition 4.

A splitting $\mathcal{A}=\mathcal{M}-\mathcal{N}$ is an $\mathsf{H}$ -splitting if and only if $\langle\mathcal{M}\rangle-|\mathcal{N}|$ is an $\mathsf{M}$ -matrix.

Lemma 2.

Let $\mathcal{A}=\mathcal{M}-\mathcal{N}$ be an $\mathsf{H}$ -splitting. Then, we have $\rho(|I-\mathcal{M}^{-1}\mathcal{A}|)<1.$

Proof.

This directly follows from the proof of Theorem 3.4 (c) in [23]. ∎

Lemma 3 (refer to, e.g., Corollary 6.1 in [15]).

Let $\mathcal{A}$ be a square matrix. Then, we have

\rho(\left|\mathcal{A}\right|)<1\quad\iff\quad\exists\ w>0:\ \left\|\mathcal{A}\right\|_{\infty}^{w}<1,\qquad\quad\|\mathcal{A}\|_{\infty}^{w}:=\max_{i}\frac{1}{w_{i}}\sum_{j}|\mathcal{A}_{i,j}|w_{j}.
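One direction of Lemma 3 can be illustrated constructively. The sketch below is an assumption-laden illustration rather than the argument of [15]: when $\rho(|\mathcal{A}|)<1$, the vector $w=(I-|\mathcal{A}|)^{-1}e$, with $e$ the vector of ones, is positive and satisfies $|\mathcal{A}|w=w-e<w$, hence $\|\mathcal{A}\|_{\infty}^{w}<1$. The function names and the test matrix are hypothetical.

```python
import numpy as np

def weighted_max_norm(A, w):
    """Weighted maximum norm ||A||_inf^w of Lemma 3."""
    return np.max((np.abs(A) @ w) / w)

def contraction_weights(A):
    """Return w > 0 with ||A||_inf^w < 1, assuming rho(|A|) < 1."""
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) - np.abs(A), np.ones(n))

# Hypothetical example: the plain infinity norm exceeds 1, but rho(|A|) < 1.
A = np.array([[0.2, 1.5],
              [0.1, 0.3]])
rho_abs = np.max(np.abs(np.linalg.eigvals(np.abs(A))))
w = contraction_weights(A)
plain_norm = weighted_max_norm(A, np.ones(2))
scaled_norm = weighted_max_norm(A, w)
```

Here the unweighted norm is larger than one while the weighted norm is not, which is why the convergence analysis below works with weight vectors rather than with the plain maximum norm.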

2.2 Asynchronous iterations

Consider, again, the linear system (1), a splitting $A=M-N$ of the matrix $A$ and the resulting iterative scheme

x^{k+1}=\left(I-M^{-1}A\right)x^{k}+M^{-1}b=x^{k}+M^{-1}\left(b-Ax^{k}\right).

Assume a distribution

A=\begin{bmatrix}A^{(1)}\\ A^{(2)}\\ \vdots\\ A^{(m)}\end{bmatrix},\ \ b=\begin{bmatrix}b^{(1)}\\ b^{(2)}\\ \vdots\\ b^{(m)}\end{bmatrix},\ \ M=\begin{bmatrix}M^{(1)}&0&\cdots&0\\ 0&M^{(2)}&\ddots&\vdots\\ \vdots&\ddots&\ddots&0\\ 0&\cdots&0&M^{(m)}\end{bmatrix}

of both the system and the splitting of $A$ . Note that the problem (1) can also correspond to an augmented system resulting from a domain decomposition with overlapping subdomains, i.e., some rows in a submatrix $A^{(s_{1})}$ are possibly replicated in another submatrix $A^{(s_{2})}$ , $s_{1},s_{2}\in\{1,\ldots,m\}$ . A classical parallel relaxation is then given by

\begin{aligned}x^{(s),k+1}&=x^{(s),k}+{M^{(s)}}^{-1}\left(b^{(s)}-A^{(s)}\begin{bmatrix}x^{(1),k}&\cdots&x^{(m),k}\end{bmatrix}^{\mathsf{T}}\right)\\ &=x^{(s),k}+{M^{(s)}}^{-1}\left(b^{(s)}-\sum_{q=1}^{m}A^{(s,q)}x^{(q),k}\right)\qquad\forall s\in\{1,\ldots,m\}\end{aligned}

with $A^{(s)}=\begin{bmatrix}A^{(s,1)}&\cdots&A^{(s,m)}\end{bmatrix}.$ The first feature of asynchronous iterations is the free steering (see, e.g., [42]), where, at each iteration $k$ , a random subset $\Omega_{k}\subset\{1,\ldots,m\}$ of block-components can be updated. It is convenient to state a natural assumption,

\operatorname{card}\left\{k\in\mathbb{N}:s\in\Omega_{k}\right\}=\infty\qquad\forall s\in\{1,\ldots,m\},

which is implemented by the fact that no block-component stops being updated until convergence is globally reached. The second feature consists of modeling communication delays implying that at an iteration $k+1$ , a block-component $s_{1}\in\Omega_{k}$ is possibly updated using a block-component $s_{2}\in\{1,\ldots,m\}$ computed at a random previous iteration $\delta_{s_{1}}(s_{2},k)\leq k$ . It yields the parallel iterative scheme

x^{(s),k+1}=\left\{\begin{array}[]{ll}x^{(s),\delta_{s}(s,k)}+{M^{(s)}}^{-1}\left(b^{(s)}-\displaystyle\sum_{q=1}^{m}A^{(s,q)}x^{(q),\delta_{s}(q,k)}\right)&\forall s\in\Omega_{k},\\ x^{(s),k}&\forall s\notin\Omega_{k},\end{array}\right.

(5)

where, as well, another natural assumption is made, stating that

\lim_{k\to\infty}\delta_{s_{1}}(s_{2},k)=\infty\qquad\forall s_{1},s_{2}\in\{1,\ldots,m\}.

Theorem 5 (Chazan and Miranker (1969) [16]).

An asynchronous iterative method (5) converges from any initial guess $x^{0}$ , with any sequence $\{\Omega_{k}\}_{k\in\mathbb{N}}$ and any functions $\delta_{1}$ to $\delta_{m}$ if and only if $\rho(|I-M^{-1}A|)<1.$
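To make the model (5) concrete, the following serial simulation draws a random subset $\Omega_{k}$ and random bounded delays at each iteration. It is a sketch under simplifying assumptions that are not part of the theory: scalar blocks ($m=n$), a Jacobi splitting $M=\mathcal{D}(A)$, delays bounded by a hypothetical parameter `max_delay`, and a fixed random seed. The test matrix satisfies $\rho(|I-M^{-1}A|)\leq 1/2$, so Theorem 5 applies.

```python
import numpy as np

def async_relaxation(A, b, M_diag, n_iter=600, max_delay=3, seed=0):
    """Serial simulation of scheme (5) with scalar blocks (m = n)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    history = [np.zeros(n)]              # history[k] stores the iterate x^k
    for k in range(n_iter):
        x_new = history[-1].copy()       # components outside Omega_k are kept
        omega = rng.random(n) < 0.5      # random subset Omega_k
        for s in np.flatnonzero(omega):
            # Version indices delta_s(q, k), each in [k - max_delay, k].
            delays = rng.integers(0, max_delay + 1, size=n)
            idx = np.maximum(k - delays, 0)
            x_stale = np.array([history[idx[q]][q] for q in range(n)])
            x_new[s] = x_stale[s] + (b[s] - A[s] @ x_stale) / M_diag[s]
        history.append(x_new)
    return history[-1]

# Hypothetical strictly diagonally dominant test problem.
A = np.array([[4.0, -1.0, 0.0, -1.0],
              [-1.0, 4.0, -1.0, 0.0],
              [0.0, -1.0, 4.0, -1.0],
              [-1.0, 0.0, -1.0, 4.0]])
b = np.ones(4)
x = async_relaxation(A, b, M_diag=np.diag(A))
```

Bounded delays and a positive update probability for every component are one simple way to satisfy the two natural assumptions stated above; the theorem itself allows far more general sequences.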

The model (5) was later generalized by Baudet [7] to arbitrary fixed-point iterations

x^{(s),k+1}=\left\{\begin{array}[]{ll}f^{(s)}\left(x^{(1),\delta_{s,1}(1,k)},\ldots,x^{(m),\delta_{s,1}(m,k)},\right.&\\ \left.\qquad\quad\ldots,x^{(1),\delta_{s,p}(1,k)},\ldots,x^{(m),\delta_{s,p}(m,k)}\right)&\forall s\in\Omega_{k},\\ x^{(s),k}&\forall s\notin\Omega_{k},\end{array}\right.

(6)

where the update of a block-component $s\in\Omega_{k}$ at an iteration $k$ depends on $p\in\mathbb{N}$ versions, $\delta_{s,1}(q,k)$ to $\delta_{s,p}(q,k)$ , of each block-component $q\in\{1,\ldots,m\}$ . Let us denote by $\max(x,y)$ the vector given by

(\max(x,y))_{i}:=\max\{x_{i},y_{i}\}

with $x$ and $y$ being two vectors of same size. Let $X:=(X_{1},\ldots,X_{p})$ and $Y:=(Y_{1},\ldots,Y_{p})$ denote collections of $p$ vectors, i.e.,

X_{t}=\begin{bmatrix}X_{t}^{(1)}&\cdots&X_{t}^{(m)}\end{bmatrix}^{\mathsf{T}},\quad Y_{t}=\begin{bmatrix}Y_{t}^{(1)}&\cdots&Y_{t}^{(m)}\end{bmatrix}^{\mathsf{T}},\qquad t\in\{1,\ldots,p\}.

Theorem 6 (Baudet (1978) [7]).

An asynchronous iterative method (6) converges from any initial guess $x^{0}$ , with any sequence $\{\Omega_{k}\}_{k\in\mathbb{N}}$ and any functions $\delta_{1,1}$ to $\delta_{m,p}$ if there exists a square matrix $\mathcal{P}$ such that $\mathcal{P}\geq 0$ , $\rho(\mathcal{P})<1$ and

\forall X,Y,\quad\left|f(X)-f(Y)\right|\leq\mathcal{P}\max\left(\left|X_{1}-Y_{1}\right|,\ldots,\left|X_{p}-Y_{p}\right|\right).

3 Asynchronous alternating iterations

3.1 Computational scheme

Consider, now, the alternating scheme (2) which results in

\begin{aligned}x^{k+1}&=\left(I-F^{-1}A\right)x^{k+\frac{1}{2}}+F^{-1}b\\ &=\left(I-F^{-1}A\right)\left(I-M^{-1}A\right)x^{k}+\left(I-F^{-1}A\right)M^{-1}b+F^{-1}b\\ &=\left(I-F^{-1}\left(M+F-A\right)M^{-1}A\right)x^{k}+F^{-1}\left(M+F-A\right)M^{-1}b.\end{aligned}

Then, according to Theorem 5, such an induced parallel scheme is asynchronously convergent if $\rho\left(\left|I-F^{-1}\left(M+F-A\right)M^{-1}A\right|\right)<1,$ which is shown, in the next section, to be achieved under usual convergence conditions on the splittings $A=M-N$ and $A=F-G$ . Nevertheless, asynchronous relaxation based on such an operator cannot be implemented using the alternating form (2), since the said operator is induced by strictly synchronizing $x^{k+\frac{1}{2}}$ and $x^{k+1}$ .
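The elimination of $x^{k+\frac{1}{2}}$ can be checked numerically. The sketch below uses a random test matrix and two arbitrary diagonal splitting matrices, all of which are assumptions made only for this check; it compares one alternating sweep of (2) with the single stationary update derived above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)
M = 1.5 * np.diag(np.diag(A))    # arbitrary nonsingular choice for M
F = 2.0 * np.diag(np.diag(A))    # arbitrary nonsingular choice for F
I = np.eye(n)
Minv, Finv = np.linalg.inv(M), np.linalg.inv(F)

# Product form and eliminated form of the iteration matrix.
T_product = (I - Finv @ A) @ (I - Minv @ A)
T_single = I - Finv @ (M + F - A) @ Minv @ A

# One alternating sweep of (2) versus the equivalent stationary update.
b = rng.standard_normal(n)
x0 = rng.standard_normal(n)
x_half = x0 + Minv @ (b - A @ x0)
x_sweep = x_half + Finv @ (b - A @ x_half)
x_stationary = T_single @ x0 + Finv @ (M + F - A) @ Minv @ b
```

Both forms agree up to rounding, which only confirms the algebra; it says nothing about the asynchronous case, where the two half-steps no longer use synchronized data.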

Consider, then, an equivalent formulation of the alternating scheme (2),

\left\{\begin{array}[]{lcl}y^{k}&:=&x^{k}+M^{-1}\left(b-Ax^{k}\right),\\ x^{k+1}&=&y^{k}+F^{-1}\left(b-Ay^{k}\right),\end{array}\right.

and assume that $F$ is distributed as $M$ , i.e.,

F=\begin{bmatrix}F^{(1)}&0&\cdots&0\\ 0&F^{(2)}&\ddots&\vdots\\ \vdots&\ddots&\ddots&0\\ 0&\cdots&0&F^{(m)}\end{bmatrix}.

Parallel asynchronous alternating methods are thus given by the computational scheme

\left\{\begin{array}[]{lcl}y^{(s),k}&:=&x^{(s),\delta_{s}(s,k)}\\ &&\quad+\ {M^{(s)}}^{-1}\left(b^{(s)}-\displaystyle\sum_{q=1}^{m}A^{(s,q)}x^{(q),\delta_{s}(q,k)}\right)\quad\forall s\in\{1,\ldots,m\},\\ x^{(s),k+1}&=&\left\{\begin{array}[]{ll}y^{(s),\delta_{s}(s,k)}&\\ \quad+\ {F^{(s)}}^{-1}\left(b^{(s)}-\displaystyle\sum_{q=1}^{m}A^{(s,q)}y^{(q),\delta_{s}(q,k)}\right)&\forall s\in\Omega_{k},\\ x^{(s),k}&\forall s\notin\Omega_{k}.\end{array}\right.\end{array}\right.

(7)

Assuming that the identity matrix $I$ is distributed as $A$ , i.e.,

I=\begin{bmatrix}I^{(1,1)}&\cdots&I^{(1,m)}\\ \vdots&\ddots&\vdots\\ I^{(m,1)}&\cdots&I^{(m,m)}\end{bmatrix},

it yields

\begin{aligned}x^{(s),k+1}&=\sum_{q=1}^{m}\left(I^{(s,q)}-{F^{(s)}}^{-1}A^{(s,q)}\right)y^{(q),\delta_{s}(q,k)}+{F^{(s)}}^{-1}b^{(s)}\\ &=\sum_{q=1}^{m}\left(I^{(s,q)}-{F^{(s)}}^{-1}A^{(s,q)}\right)\left(\sum_{r=1}^{m}\left(I^{(q,r)}-{M^{(q)}}^{-1}A^{(q,r)}\right)x^{(r),\delta_{q}(r,\delta_{s}(q,k))}+{M^{(q)}}^{-1}b^{(q)}\right)+{F^{(s)}}^{-1}b^{(s)},\end{aligned}

which actually lies in the framework of the generalized model (6) with, here, $p=m$ , since each update of a block-component depends on $m$ versions of the other block-components. Considering, then, a collection $X=\left(X_{1},\ldots,X_{m}\right)$ of $m$ vectors, the corresponding mapping $f$ is given by

\begin{aligned}f^{(s)}(X)&:=\sum_{q=1}^{m}\left(I^{(s,q)}-{F^{(s)}}^{-1}A^{(s,q)}\right)\left(\sum_{r=1}^{m}\left(I^{(q,r)}-{M^{(q)}}^{-1}A^{(q,r)}\right)X_{q}^{(r)}+{M^{(q)}}^{-1}b^{(q)}\right)+{F^{(s)}}^{-1}b^{(s)}\\ &=\sum_{q=1}^{m}P_{q}^{(s)}X_{q}+\left(I^{(s)}-{F^{(s)}}^{-1}A^{(s)}\right)M^{-1}b+{F^{(s)}}^{-1}b^{(s)},\\ f(X)&:=\sum_{q=1}^{m}P_{q}X_{q}+\left(I-F^{-1}A\right)M^{-1}b+F^{-1}b\end{aligned}

with $P_{q}^{(s)}:=\left(I^{(s,q)}-{F^{(s)}}^{-1}A^{(s,q)}\right)\left(I^{(q)}-{M^{(q)}}^{-1}A^{(q)}\right),\ q,s\in\{1,\ldots,m\},$ and $P_{q}:=\begin{bmatrix}P_{q}^{(1)}&\cdots&P_{q}^{(m)}\end{bmatrix}^{\mathsf{T}},\ q\in\{1,\ldots,m\}.$

3.2 Convergence conditions

We analyze, now, sufficient conditions for the convergence of our asynchronous alternating iterative scheme (7). To the best of our knowledge, Lemma 4, Proposition 1 and Corollary 1 are new. Proposition 1 and Corollary 1 highlight how combining properties of the operators $I-F^{-1}A$ and $I-M^{-1}A$ implies a resulting contracting operator $\left(I-F^{-1}A\right)\left(I-M^{-1}A\right)$ . Our main results consist of Theorem 7 and Corollary 2 where the same combined conditions are shown to be sufficient for the convergence of asynchronous alternating methods (7), despite the induced, slightly different, iteration operator.

Let, first, $\mathcal{A}$ be a matrix with arbitrary shape, let $w$ be a vector with as many entries as the number of columns in $\mathcal{A}$ , and let $v$ be a vector with as many entries as the number of rows in $\mathcal{A}$ , and with no $0$ entry. Let $\tau(\mathcal{A},w,v)$ denote the vector given by the row-sums

\tau_{i}(\mathcal{A},w,v):=\left(\tau(\mathcal{A},w,v)\right)_{i}:=\frac{1}{v_{i}}\sum_{j}\left|\mathcal{A}_{i,j}\right|w_{j}\qquad\forall i.

Note, then, that, for a square matrix $\mathcal{A}$ ,

\|\mathcal{A}\|_{\infty}^{w}=\max_{i}\tau_{i}(\mathcal{A},w,w),\qquad w>0.

Lemma 4.

Let $\mathcal{A}$ and $\mathcal{B}$ be matrices with shapes such that $\mathcal{A}\mathcal{B}$ is calculable. Let $u>0$ , $v>0$ and $w$ be vectors with dimensions such that $\tau(\mathcal{A},u,v)$ and $\tau(\mathcal{B},w,u)$ are calculable. Then, we have

\tau(\mathcal{B},w,u)<\begin{bmatrix}1&1&\cdots&1\end{bmatrix}^{\mathsf{T}}\quad\implies\quad\tau(\mathcal{A}\mathcal{B},w,v)<\tau(\mathcal{A},u,v).

Proof.

Let us index rows and columns of $\mathcal{A}$ by $i$ and $j$ , respectively, and columns of $\mathcal{B}$ by $l$ . We have

\begin{aligned}\tau_{i}(\mathcal{A}\mathcal{B},w,v):=\frac{1}{v_{i}}\sum_{l}\left|(\mathcal{A}\mathcal{B})_{i,l}\right|w_{l}&=\frac{1}{v_{i}}\sum_{l}\left|\sum_{j}\mathcal{A}_{i,j}\mathcal{B}_{j,l}\right|w_{l}\\ &\leq\frac{1}{v_{i}}\sum_{l}\sum_{j}\left|\mathcal{A}_{i,j}\mathcal{B}_{j,l}\right|w_{l}\\ &=\frac{1}{v_{i}}\sum_{l}\sum_{j}\frac{1}{u_{j}}\left|\mathcal{A}_{i,j}\right|\left|\mathcal{B}_{j,l}\right|u_{j}w_{l}\\ &=\frac{1}{v_{i}}\sum_{j}\left(\frac{1}{u_{j}}\sum_{l}\left|\mathcal{B}_{j,l}\right|w_{l}\right)\left|\mathcal{A}_{i,j}\right|u_{j}\\ &=\frac{1}{v_{i}}\sum_{j}\tau_{j}(\mathcal{B},w,u)\left|\mathcal{A}_{i,j}\right|u_{j}.\end{aligned}

It yields that if $\tau_{j}(\mathcal{B},w,u)<1$ for all $j$ , then

\begin{aligned}\tau_{j}(\mathcal{B},w,u)\left|\mathcal{A}_{i,j}\right|u_{j}&<\left|\mathcal{A}_{i,j}\right|u_{j}\quad\forall j\ \forall i,\\ \frac{1}{v_{i}}\sum_{j}\tau_{j}(\mathcal{B},w,u)\left|\mathcal{A}_{i,j}\right|u_{j}&<\frac{1}{v_{i}}\sum_{j}\left|\mathcal{A}_{i,j}\right|u_{j}\quad\forall i,\\ \frac{1}{v_{i}}\sum_{l}\left|(\mathcal{A}\mathcal{B})_{i,l}\right|w_{l}\leq\frac{1}{v_{i}}\sum_{j}\tau_{j}(\mathcal{B},w,u)\left|\mathcal{A}_{i,j}\right|u_{j}&<\frac{1}{v_{i}}\sum_{j}\left|\mathcal{A}_{i,j}\right|u_{j}\quad\forall i,\\ \tau_{i}(\mathcal{A}\mathcal{B},w,v)&<\tau_{i}(\mathcal{A},u,v)\quad\forall i,\end{aligned}

which concludes the proof. ∎
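Lemma 4 can be illustrated on random data. In the sketch below, the rectangular shapes, the random seed and the construction $u:=2|\mathcal{B}|w$ (which forces $\tau(\mathcal{B},w,u)$ to equal $1/2$ in every entry) are assumptions made for illustration; the matrices are dense, so every row of $\mathcal{A}$ is nonzero.

```python
import numpy as np

def tau(A, w, v):
    """Weighted row sums tau(A, w, v)_i = (1 / v_i) * sum_j |A_ij| w_j."""
    return (np.abs(A) @ w) / v

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
w = rng.random(5) + 0.1          # weights for the columns of B
v = rng.random(3) + 0.1          # weights for the rows of A
u = 2.0 * (np.abs(B) @ w)        # forces tau(B, w, u) = 1/2 < 1

lhs = tau(A @ B, w, v)           # tau(AB, w, v)
rhs = tau(A, u, v)               # tau(A, u, v)
```

With this choice of $u$, the chain of inequalities in the proof gives $\tau(\mathcal{A}\mathcal{B},w,v)\leq\frac{1}{2}\tau(\mathcal{A},u,v)$, so the strict inequality is visible with a comfortable margin.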

Proposition 1.

Let

Q:=\begin{bmatrix}0&I-M^{-1}A\\ I-F^{-1}A&0\end{bmatrix}.

We have

\rho(|Q|)<1\quad\implies\quad\rho\left(\left|I-F^{-1}\left(M+F-A\right)M^{-1}A\right|\right)<1.

Proof.

According to Lemma 3,

\rho(|Q|)<1\quad\iff\quad\exists\ W>0:\ \|Q\|_{\infty}^{W}<1.

According to the two blocks of $Q$ , take $W=\begin{bmatrix}W_{1}&W_{2}\end{bmatrix}^{\mathsf{T}}.$ Then, we have both

\left\{\begin{array}[]{lcl}\tau\left(I-M^{-1}A,W_{2},W_{1}\right)&<&\begin{bmatrix}1&1&\cdots&1\end{bmatrix}^{\mathsf{T}},\\ \tau\left(I-F^{-1}A,W_{1},W_{2}\right)&<&\begin{bmatrix}1&1&\cdots&1\end{bmatrix}^{\mathsf{T}}.\end{array}\right.

Lemma 4 therefore ensures

\begin{aligned}\tau\left(\left(I-F^{-1}A\right)\left(I-M^{-1}A\right),W_{2},W_{2}\right)&<\tau\left(I-F^{-1}A,W_{1},W_{2}\right)\\ &<\begin{bmatrix}1&1&\cdots&1\end{bmatrix}^{\mathsf{T}},\end{aligned}

which leads to $\left\|\left(I-F^{-1}A\right)\left(I-M^{-1}A\right)\right\|_{\infty}^{W_{2}}<1.$ Recall that

\left(I-F^{-1}A\right)\left(I-M^{-1}A\right)=I-F^{-1}\left(M+F-A\right)M^{-1}A.

Lemma 3 finally ensures $\rho\left(\left|I-F^{-1}\left(M+F-A\right)M^{-1}A\right|\right)<1,$ which concludes the proof. ∎

Corollary 1.

If $A$ is an $\mathsf{H}$ -matrix, then

\left\{\begin{array}[]{lcl}\langle M\rangle-|M-A|&=&\langle A\rangle,\\ \langle F\rangle-|F-A|&=&\langle A\rangle\end{array}\right.\quad\implies\quad\rho\left(\left|I-F^{-1}\left(M+F-A\right)M^{-1}A\right|\right)<1.

Proof.

Considering that $A$ is an $\mathsf{H}$ -matrix, take $u>0$ as in Lemma 1, so as to have

|A_{i,i}|u_{i}>\sum_{j\neq i}|A_{i,j}|u_{j}\quad\forall i.

We also have

\langle M\rangle-|M-A|=\langle A\rangle\quad\implies\quad\forall i,\ \left\{\begin{array}[]{lcl}|M_{i,i}|-|M_{i,i}-A_{i,i}|&=&|A_{i,i}|,\\ -|M_{i,j}|-|M_{i,j}-A_{i,j}|&=&-|A_{i,j}|\quad\forall j\neq i,\end{array}\right.

and, then,

\left\{\begin{array}[]{lcl}|M_{i,i}|u_{i}-|M_{i,i}-A_{i,i}|u_{i}&=&|A_{i,i}|u_{i},\\ -|M_{i,j}|u_{j}-|M_{i,j}-A_{i,j}|u_{j}&=&-|A_{i,j}|u_{j}\quad\forall j\neq i.\end{array}\right.

It yields that, $\forall i$ ,

\begin{aligned}|M_{i,i}|u_{i}-\sum_{j\neq i}|M_{i,j}|u_{j}-|M_{i,i}-A_{i,i}|u_{i}-\sum_{j\neq i}|M_{i,j}-A_{i,j}|u_{j}&=|A_{i,i}|u_{i}-\sum_{j\neq i}|A_{i,j}|u_{j}\\ &>0,\end{aligned}

which implies, with $F$ also satisfying $\langle F\rangle-|F-A|=\langle A\rangle$ , that the matrix

\widehat{A}:=\begin{bmatrix}M&A-M\\ A-F&F\end{bmatrix}

is an $\mathsf{H}$ -matrix, according to Lemma 1. Define, then,

\widehat{M}:=\begin{bmatrix}M&0\\ 0&F\end{bmatrix},

and note that $\left\langle\widehat{M}\right\rangle-\left|\widehat{M}-\widehat{A}\right|=\left\langle\widehat{A}\right\rangle$ , which implies, by Definition 3, that $\left\langle\widehat{M}\right\rangle-\left|\widehat{M}-\widehat{A}\right|$ is an $\mathsf{M}$ -matrix, hence, by Definition 4, $\widehat{A}=\widehat{M}-\left(\widehat{M}-\widehat{A}\right)$ is an $\mathsf{H}$ -splitting. Lemma 2 therefore ensures that $\rho\left(\left|\widehat{M}^{-1}\left(\widehat{M}-\widehat{A}\right)\right|\right)<1,$ and one can verify that

\widehat{M}^{-1}\left(\widehat{M}-\widehat{A}\right)=\begin{bmatrix}0&I-M^{-1}A\\ I-F^{-1}A&0\end{bmatrix}.

Proposition 1 therefore finally applies, which concludes the proof. ∎

Theorem 7.

Let

Q:=\begin{bmatrix}0&I-M^{-1}A\\ I-F^{-1}A&0\end{bmatrix}.

An asynchronous alternating method (7) converges from any initial guess $x^{0}$ , with any sequence $\{\Omega_{k}\}_{k\in\mathbb{N}}$ and any functions $\delta_{1}$ to $\delta_{m}$ if $\rho(|Q|)<1$ .

Proof.

Consider two collections, $X=\left(X_{1},\ldots,X_{m}\right)$ and $Y=\left(Y_{1},\ldots,Y_{m}\right)$ , of $m$ vectors. We have

\begin{aligned}|f(X)-f(Y)|&=\left|\sum_{q=1}^{m}P_{q}\left(X_{q}-Y_{q}\right)\right|\\ &\leq\sum_{q=1}^{m}\left|P_{q}\right|\max\left(\left|X_{1}-Y_{1}\right|,\ldots,\left|X_{m}-Y_{m}\right|\right).\end{aligned}

Consequently, according to Theorem 6, an asynchronous alternating method (7) is convergent if $\rho\left(\sum_{q=1}^{m}\left|P_{q}\right|\right)<1.$ Recall, then, that according to Lemma 3,

\rho(|Q|)<1\quad\iff\quad\exists\ W>0:\ \|Q\|_{\infty}^{W}<1.

According to the two blocks of $Q$ , take $W=\begin{bmatrix}W_{1}&W_{2}\end{bmatrix}^{\mathsf{T}}.$ Then, we have both

\left\{\begin{array}[]{lcl}\tau\left(I-M^{-1}A,W_{2},W_{1}\right)&<&\begin{bmatrix}1&1&\cdots&1\end{bmatrix}^{\mathsf{T}},\\ \tau\left(I-F^{-1}A,W_{1},W_{2}\right)&<&\begin{bmatrix}1&1&\cdots&1\end{bmatrix}^{\mathsf{T}},\end{array}\right.

implying, as well,

\tau\left(I^{(q)}-{M^{(q)}}^{-1}A^{(q)},W_{2},W_{1}^{(q)}\right)<\begin{bmatrix}1&1&\cdots&1\end{bmatrix}^{\mathsf{T}}\quad\forall q\in\{1,\ldots,m\}.

Lemma 4 therefore ensures, with $s\in\{1,\ldots,m\}$ ,

\tau\left(\left(I^{(s,q)}-{F^{(s)}}^{-1}A^{(s,q)}\right)\left(I^{(q)}-{M^{(q)}}^{-1}A^{(q)}\right),W_{2},W_{2}^{(s)}\right)<\tau\left(I^{(s,q)}-{F^{(s)}}^{-1}A^{(s,q)},W_{1}^{(q)},W_{2}^{(s)}\right).

Recall that $P_{q}^{(s)}:=\left(I^{(s,q)}-{F^{(s)}}^{-1}A^{(s,q)}\right)\left(I^{(q)}-{M^{(q)}}^{-1}A^{(q)}\right),\ q,s\in\{1,\ldots,m\}.$ Then, we have

\begin{aligned}\tau\left(P^{(s)}_{q},W_{2},W_{2}^{(s)}\right)&<\tau\left(I^{(s,q)}-{F^{(s)}}^{-1}A^{(s,q)},W_{1}^{(q)},W_{2}^{(s)}\right),\\ \tau\left(\left|P^{(s)}_{q}\right|,W_{2},W_{2}^{(s)}\right)&<\tau\left(I^{(s,q)}-{F^{(s)}}^{-1}A^{(s,q)},W_{1}^{(q)},W_{2}^{(s)}\right),\\ \sum_{q=1}^{m}\tau\left(\left|P^{(s)}_{q}\right|,W_{2},W_{2}^{(s)}\right)&<\sum_{q=1}^{m}\tau\left(I^{(s,q)}-{F^{(s)}}^{-1}A^{(s,q)},W_{1}^{(q)},W_{2}^{(s)}\right),\\ \tau\left(\sum_{q=1}^{m}\left|P^{(s)}_{q}\right|,W_{2},W_{2}^{(s)}\right)&<\tau\left(I^{(s)}-{F^{(s)}}^{-1}A^{(s)},W_{1},W_{2}^{(s)}\right),\\ \tau\left(\sum_{q=1}^{m}\left|P_{q}\right|,W_{2},W_{2}\right)&<\tau\left(I-F^{-1}A,W_{1},W_{2}\right)\\ &<\begin{bmatrix}1&1&\cdots&1\end{bmatrix}^{\mathsf{T}},\end{aligned}

which leads to $\left\|\sum_{q=1}^{m}\left|P_{q}\right|\right\|_{\infty}^{W_{2}}<1.$ By Lemma 3, we therefore satisfy $\rho\left(\sum_{q=1}^{m}\left|P_{q}\right|\right)<1,$ which concludes the proof. ∎

Corollary 2.

An asynchronous alternating method (7) converges from any initial guess $x^{0}$ , with any sequence $\{\Omega_{k}\}_{k\in\mathbb{N}}$ and any functions $\delta_{1}$ to $\delta_{m}$ if $A$ is an $\mathsf{H}$ -matrix and

\left\{\begin{array}[]{lcl}\langle M\rangle-|M-A|&=&\langle A\rangle,\\ \langle F\rangle-|F-A|&=&\langle A\rangle.\end{array}\right.

Proof.

This follows in the same way as Corollary 1. ∎

Let $\mathcal{D}(\mathcal{A})$ denote the diagonal matrix obtained from the diagonal of a matrix $\mathcal{A}$ .

Remark.

For practical applications of Corollary 2, let $\Lambda$ be a diagonal real matrix such that $\Lambda_{i,i}\geq 1\ \forall i.$ We straightforwardly have

\mathcal{M}=\Lambda\mathcal{D}(\mathcal{A})\quad\implies\quad\langle\mathcal{M}\rangle-|\mathcal{M}-\mathcal{A}|=\langle\mathcal{A}\rangle.

Remark.

In regard to the HSS splitting, if $A$ is a real matrix with $\mathcal{D}(A)\geq 0$ , and splitting matrices $M$ and $F$ are given by

M:=\mathcal{D}(\alpha I+H),\qquad F:=\mathcal{D}(\alpha I+S),\qquad\alpha\geq\max_{i}A_{i,i},

then we have both

M=\alpha I+\mathcal{D}(A)\geq\mathcal{D}(A),\qquad F=\alpha I\geq\mathcal{D}(A),

which satisfy $M=\Lambda_{M}\mathcal{D}(A),\ F=\Lambda_{F}\mathcal{D}(A),$ where $\Lambda_{M}$ and $\Lambda_{F}$ are two diagonal real matrices with entries greater than or equal to $1$ .
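The diagonal HSS-type splitting of this remark can be checked on a small example. The sketch below is illustrative only: the real, strictly diagonally dominant test matrix and the choice $\alpha=\max_{i}A_{i,i}$ are assumptions, and the helper `comparison_matrix` is a hypothetical name for Definition 2.

```python
import numpy as np

def comparison_matrix(A):
    """Comparison matrix <A> of Definition 2."""
    C = -np.abs(A)
    np.fill_diagonal(C, np.abs(np.diag(A)))
    return C

# Hypothetical real non-symmetric H-matrix with positive diagonal.
A = np.array([[4.0, -1.0, 0.0],
              [-2.0, 5.0, -1.0],
              [0.0, -1.0, 3.0]])
n = A.shape[0]
I = np.eye(n)
H = (A + A.T) / 2
S = (A - A.T) / 2
alpha = np.max(np.diag(A))

M = np.diag(np.diag(alpha * I + H))   # equals alpha*I + D(A) for real A
F = np.diag(np.diag(alpha * I + S))   # equals alpha*I, since diag(S) = 0

# Splitting conditions of Corollary 2.
cond_M = np.allclose(comparison_matrix(M) - np.abs(M - A), comparison_matrix(A))
cond_F = np.allclose(comparison_matrix(F) - np.abs(F - A), comparison_matrix(A))

# Spectral radius of |Q| from Theorem 7.
Z = np.zeros((n, n))
Q = np.block([[Z, I - np.linalg.solve(M, A)],
              [I - np.linalg.solve(F, A), Z]])
rho_abs_Q = np.max(np.abs(np.linalg.eigvals(np.abs(Q))))
```

For this example both splitting conditions hold and $\rho(|Q|)<1$, in agreement with Corollary 2 and Theorem 7.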

4 Implementation aspects

The two alternating iterations of the HSS method require the solution of two secondary problems involving the coefficient matrices $\alpha I+H$ and $\alpha I+S$ , respectively. In practice, as pointed out in, e.g., [5, 44], these problems are inexactly solved by means of iterative algorithms. A general description for both HSS and inexact HSS (IHSS) can be given by Algorithm 1.

Algorithm 1 HSS(solverH, solverS)

$x:=x^{0}$
$r:=b-Ax$
$k:=0$
while $\|r\|>\varepsilon\|b\|$ and $k<k_{\text{max}}$ do
    $y$ := solverH.solve($\alpha I+H$, $r$)
    $x:=x+y$
    $r:=b-Ax$
    $y$ := solverS.solve($\alpha I+S$, $r$)
    $x:=x+y$
    $r:=b-Ax$
    $k:=k+1$
end while

We can then designate by, e.g., HSS(CG, GMRES) an IHSS algorithm with the conjugate gradient (CG) method [27] for solving the shifted Hermitian problem and the generalized minimal residual (GMRES) method [41] for solving the shifted skew-Hermitian one.

Asynchronous HSS iterations necessarily belong to the class of IHSS algorithms since they obviously require the inner solvers to be asynchronous too, which further reduces such an approach to the subclass of IHSS with inner splittings. Taking, then, e.g., a splitting $\alpha I+H=M-N,$ the solution, at each outer iteration $k$ , of

(\alpha I+H)y^{k}=b-Ax^{k}

can be given by several inner iterations

y^{k,l+1}=y^{k,l}+M^{-1}(b-Ax^{k}-(\alpha I+H)y^{k,l}),

(8)

where $l$ is the inner iteration variable. Furthermore, when dealing with two-stage asynchronous iterations, one should particularly take advantage of the possibility to use the inner solution vector $y^{k,l+1}$ with any value of $l$ , given that asynchronous relaxation is very likely to benefit from each newly updated data. We refer the reader to, e.g., [8, 25] for more insights into the so-called “asynchronous iterations with flexible communication”. Moreover, analysis of matrix splittings for two-stage asynchronous iterations reveals that convergence of such methods can be guaranteed for any number of inner iterations (see, e.g., [24]). In view of the efficiency aspects related to flexible communication, it is therefore of interest to simply consider only one iteration of (8). If, in particular, we also take the initial guess $y^{k,0}:=0$ , then we can define

y^{k}:=y^{k,1}=M^{-1}(b-Ax^{k}),

so as to finally have

x^{k+\frac{1}{2}}=x^{k}+M^{-1}(b-Ax^{k}),

which falls under the general alternating scheme (2) that has been considered in our theoretical analysis. Such a specialization of Algorithm 1 is given by Algorithm 2, where $M^{-1}$ and $F^{-1}$ are preconditioners of $\alpha I+H$ and $\alpha I+S$ , respectively.

Algorithm 2 HSS($M^{-1}$, $F^{-1}$)

$x:=x^{0}$
$r:=b-Ax$
$k:=0$
while $\|r\|>\varepsilon\|b\|$ and $k<k_{\text{max}}$ do
    $x:=x+M^{-1}r$
    $r:=b-Ax$
    $x:=x+F^{-1}r$
    $r:=b-Ax$
    $k:=k+1$
end while

Note that Algorithm 2 needs to be specifically implemented instead of just using Algorithm 1 with calls of relaxation-based inner solvers with maximum number of iterations set to $1$ . Indeed, on pure computer science aspects, avoiding inner function calls and loops can result in a very significant execution time saving, which even makes HSS( $M^{-1}$ , $F^{-1}$ ) possibly competitive, in practice, with, e.g., HSS(CG, GMRES), as we shall see in Section 5.
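A direct transcription of Algorithm 2 is sketched below for the special case of diagonal preconditioners, as in the remark closing Section 3.2. The function name, the dense NumPy storage and the small test problem are assumptions made for illustration; this is not the implementation used in Section 5.

```python
import numpy as np

def hss_one_inner(A, b, M_diag, F_diag, eps=1e-10, k_max=1000):
    """Algorithm 2 with diagonal M and F stored as vectors."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    k = 0
    b_norm = np.linalg.norm(b)
    while np.linalg.norm(r) > eps * b_norm and k < k_max:
        x = x + r / M_diag       # x := x + M^{-1} r
        r = b - A @ x
        x = x + r / F_diag       # x := x + F^{-1} r
        r = b - A @ x
        k += 1
    return x, k

# Hypothetical real non-symmetric H-matrix with positive diagonal.
A = np.array([[4.0, -1.0, 0.0],
              [-2.0, 5.0, -1.0],
              [0.0, -1.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])
alpha = np.max(np.diag(A))
M_diag = alpha + np.diag(A)      # diagonal of alpha*I + H
F_diag = np.full(3, alpha)       # diagonal of alpha*I + S
x, k = hss_one_inner(A, b, M_diag, F_diag)
```

Each outer iteration costs two matrix-vector products and two diagonal scalings, with no inner solver call, which reflects the design choice motivated in the preceding paragraph.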

From Algorithm 2, iterative scheme (7), programming models [31, 34] and convergence detection approach [26], asynchronous parallel implementation of HSS iterations is obtained as described by Algorithm 3, where the communication routines start with “Com” and are blocking by default. Their non-blocking counterparts are designated by “ICom” with the letter “I” standing for “immediate”, similarly to the Message Passing Interface (MPI) standard.

Algorithm 3 Asynchronous parallel HSS(${M^{(s)}}^{-1}$, ${F^{(s)}}^{-1}$) on process $s\in\{1,\ldots,m\}$

$x^{(s)}:=x^{(s),0}$
$x$ := IComSendRecvInit($x^{(s)}$)
$r^{(s)}:=b^{(s)}-A^{(s)}x$
${rr}^{(s)}:={r^{(s)}}^{\mathsf{H}}r^{(s)}$
$rr$ := ComSum(${rr}^{(s)}$)
$\|r\|:=\sqrt{rr}$
$\tau$ := False
$k:=0$
while $\|r\|>\varepsilon\|b\|$ and $k<k_{\text{max}}$ do
    $x^{(s)}:=x^{(s)}+{M^{(s)}}^{-1}r^{(s)}$
    $x$ := IComSendRecv($x^{(s)}$)
    $r^{(s)}:=b^{(s)}-A^{(s)}x$
    $x^{(s)}:=x^{(s)}+{F^{(s)}}^{-1}r^{(s)}$
    $x$ := IComSendRecv($x^{(s)}$)
    $r^{(s)}:=b^{(s)}-A^{(s)}x$
    if not $\tau$ then
        ${rr}^{(s)}:={r^{(s)}}^{\mathsf{H}}r^{(s)}$
        ComRequest := IComSum(${rr}^{(s)}$, $rr$)
        $\tau$ := True
    end if
    $\sigma$ := ComTest(ComRequest)
    if $\sigma$ then
        $\|r\|:=\sqrt{rr}$
        $\tau$ := False
        $k:=k+1$
    end if
end while

The routines ComSum and IComSum are used to compute the dot product $r^{\mathsf{H}}r$, with $r=b-Ax$, by the global reduction operation

\sum_{q=1}^{m}{r^{(q)}}^{\mathsf{H}}r^{(q)},\qquad r^{(q)}=b^{(q)}-A^{(q)}x.

They can readily be replaced by the MPI routines MPI_Allreduce and MPI_Iallreduce, respectively. The object ComRequest and the routine ComTest are therefore analogous to MPI_Request and MPI_Test. Such a simple way to reliably use the classical loop condition $\|r\|>\varepsilon\|b\|$ as stopping criterion in case of asynchronous iterations is due to [26]. It also allows for considering a counter, $k$, of the number of global convergence tests. On the other hand, the data exchange routine IComSendRecv has to be built on top of, e.g., the MPI routines MPI_Isend and MPI_Irecv. Briefly, the routine IComSendRecvInit triggers non-blocking requests for message sending ($x^{(s)}$) and reception ($x^{(q)}$, $q\neq s$), and fills up the components $x^{(q)}$, $q\neq s$, of the vector $x$ with arbitrary values. Note that both storage and communication of components $x^{(q)}$, $q\neq s$, should actually be limited to values which are necessary for computing the product $A^{(s)}x$, according to the nonzero entries in $A^{(s)}$. The subsequent calls to the routine IComSendRecv then check completion of previous requests, update $x$ with received data and trigger new instances of the completed requests. Further details can be found in, e.g., [34].
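The update pattern of Algorithm 3 can be mimicked without MPI by a serial emulation, which may help in following how outdated data enter the local residuals. The sketch below is not the implementation used in this paper: the function `async_hss_emulation`, the random scheduling of processes, the reception probability `recv_prob` and the toy system are all assumptions of ours, and the global residual is evaluated directly rather than through a non-blocking reduction.

```python
import numpy as np


def async_hss_emulation(A, b, m_inv, f_inv, n_proc=4, eps=1e-6,
                        max_updates=100_000, recv_prob=0.5, seed=0):
    """Serial emulation of the update pattern of Algorithm 3 (dense A)."""
    rng = np.random.default_rng(seed)
    n = b.size
    blocks = np.array_split(np.arange(n), n_proc)   # rows owned by each process
    dtype = np.result_type(A, b, m_inv, f_inv)
    views = np.zeros((n_proc, n), dtype=dtype)      # local, possibly outdated, copies of x
    x = np.zeros(n, dtype=dtype)                    # latest published blocks
    counts = np.zeros(n_proc, dtype=int)            # local iteration counters
    b_norm = np.linalg.norm(b)

    def exchange(s):
        # Publish the own block; receive each remote block only with
        # probability recv_prob, which mimics messages still in flight.
        x[blocks[s]] = views[s, blocks[s]]
        for q in range(n_proc):
            if q != s and rng.random() < recv_prob:
                views[s, blocks[q]] = x[blocks[q]]

    def local_residual(s):
        return b[blocks[s]] - A[blocks[s], :] @ views[s]

    for _ in range(max_updates):
        # Direct evaluation of the global residual (emulation shortcut).
        if np.linalg.norm(b - A @ x) <= eps * b_norm:
            break
        s = int(rng.integers(n_proc))               # process acting at this tick
        rows = blocks[s]
        views[s, rows] += m_inv[rows] * local_residual(s)   # lines 10-11
        exchange(s)
        views[s, rows] += f_inv[rows] * local_residual(s)   # lines 12-14
        exchange(s)
        counts[s] += 1
    rel_res = np.linalg.norm(b - A @ x) / b_norm
    return x, counts, rel_res


# Toy nonsymmetric, strictly diagonally dominant system (assumption of this sketch).
n = 50
A = (4.0 * np.eye(n)
     + np.diag(-1.5 * np.ones(n - 1), -1)
     + np.diag(-0.5 * np.ones(n - 1), 1))
b = A @ np.ones(n)
alpha = 5.0
x, counts, rel_res = async_hss_emulation(
    A, b, np.full(n, 1.0 / (alpha + 4.0)), np.full(n, 1.0 / alpha))
```

The array `counts` plays the role of the local iteration counters, whose minimum and maximum differ from one process to another, as in actual asynchronous executions.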

5 Numerical experiments

5.1 Problems and overall settings

Numerical experiments have been conducted on two kinds of problem. The first one consists of a three-dimensional (3D) convection-diffusion equation,

-\Delta u+c\cdot\nabla u=f\mbox{ in $\Omega$}

(9)

with $\Omega=[0,1]\times[0,1]\times[0,1]$ and Dirichlet boundary conditions. Discretization has been achieved using seven-point centered differences for both convection and diffusion terms. A fixed value, $20$ , has been used for all elements in the three-dimensional vector $c$ as convection parameter. The entries of the exact discrete solution, $x^{*}$ , have been taken randomly in $[0,1)$ and the right-hand side has then been constructed as $b=Ax^{*}$ .
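For illustration, such a discrete problem can be assembled with SciPy Sparse through Kronecker products. The sketch below relies on assumptions of ours that are not stated above: a uniform grid with $N$ interior points per direction, mesh size $h=1/(N+1)$, eliminated homogeneous Dirichlet values and lexicographic ordering; the function name `convection_diffusion_3d` is hypothetical.

```python
import numpy as np
import scipy.sparse as sp


def convection_diffusion_3d(N, c=20.0):
    """Seven-point centered-difference matrix of -Laplace(u) + c . grad(u)."""
    h = 1.0 / (N + 1)
    e = np.ones(N)
    # 1D operators: -u'' -> (-1, 2, -1) / h^2 and c u' -> c (-1, 0, 1) / (2h)
    diff = sp.diags([-e[:-1], 2.0 * e, -e[:-1]], [-1, 0, 1]) / h**2
    conv = sp.diags([-e[:-1], e[:-1]], [-1, 1]) * (c / (2.0 * h))
    A1 = diff + conv
    I = sp.identity(N)
    # Same convection coefficient c in the three directions
    A = (sp.kron(sp.kron(I, I), A1)
         + sp.kron(sp.kron(I, A1), I)
         + sp.kron(sp.kron(A1, I), I))
    return A.tocsr()


# Small instance; the experiments reported below use N = 100.
rng = np.random.default_rng(0)
A = convection_diffusion_3d(10)
x_star = rng.random(A.shape[0])   # entries taken randomly in [0, 1)
b = A @ x_star                    # right-hand side built from the exact solution
```

The symmetric part of `A` carries the diffusion term and its skew-symmetric part carries the convection term, which is what makes the problem non-Hermitian.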

The second kind of problem consists of a 2D structural dynamics equation (see, e.g., [10, 3]),

\left[\left(-\omega^{2}L+K\right)+\operatorname{i}\left(\omega C_{v}+C_{h}% \right)\right]x=b,

(10)

where $L$ and $K$ denote the mass and stiffness matrices, respectively; $C_{v}$ and $C_{h}$ denote the viscous and hysteretic damping matrices, respectively; $\omega$ denotes the circular frequency. The values of the matrices and the parameters have been taken from [3]. The matrix $K$ is the five-point finite difference discretization of a diffusion term on the unit square $[0,1]\times[0,1]$ with Dirichlet boundary conditions. The other matrices have been set as $L=I$, $C_{v}=10I$, $C_{h}=\mu K$, where $\mu=0.02$, and $I$ denotes the $n\times n$ identity matrix. The circular frequency $\omega$ has been set to $\pi$. The right-hand side has been taken as $b=(1+\operatorname{i})Aq$, with $q$ being the vector of all ones, to ensure that all entries of $x^{*}$ equal $1+\operatorname{i}$.
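A possible assembly of system (10) with these settings is sketched below. The scaling of $K$ by $1/h^{2}$ with $h=1/(N+1)$, where $N$ is the number of interior grid points per direction, is an assumption of ours, as is the hypothetical function name `structural_dynamics_2d`.

```python
import numpy as np
import scipy.sparse as sp


def structural_dynamics_2d(N, omega=np.pi, mu=0.02):
    """Assemble A = (-omega^2 L + K) + i (omega C_v + C_h) and b = (1 + i) A q."""
    h = 1.0 / (N + 1)
    e = np.ones(N)
    V = sp.diags([-e[:-1], 2.0 * e, -e[:-1]], [-1, 0, 1]) / h**2
    I1 = sp.identity(N)
    K = sp.kron(I1, V) + sp.kron(V, I1)      # five-point stencil on the unit square
    n = N * N
    I = sp.identity(n)
    L, C_v, C_h = I, 10.0 * I, mu * K        # mass, viscous and hysteretic damping
    A = (-omega**2 * L + K) + 1j * (omega * C_v + C_h)
    b = (1.0 + 1.0j) * (A @ np.ones(n))      # so that every entry of x* is 1 + i
    return A.tocsr(), b


# Small instance; the experiments reported below use N = 350.
A, b = structural_dynamics_2d(8)
```

The resulting matrix is complex symmetric but not Hermitian, which is the setting targeted by the methods compared in this section.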

In the following, parallel execution times (wall-clock), numbers of iterations, $k$, and final residual errors, $r$, are reported for the GMRES [41], the IHSS [5] (Algorithms 1 and 2) and the asynchronous IHSS methods (Algorithm 3), with a stopping criterion set so as to have

r=\frac{\|b-Ax^{*}\|}{\|b\|}<10^{-6}.

In case of asynchronous execution, minimum and maximum numbers of local iterations, $k_{\text{min}}$ and $k_{\text{max}}$, respectively, are considered since there is no global iteration count $k$. Both for synchronous and asynchronous HSS(${M}^{-1}$, ${F}^{-1}$) (respectively, Algorithms 2 and 3), we took

M:=\mathcal{D}(\alpha I+H),\qquad F:=\mathcal{D}(\alpha I+S).

All of the tests have been entirely implemented in the Python language, using NumPy, SciPy Sparse and MPI4Py [18] modules.
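As a hedged illustration of the choice of $M$ and $F$ above, the sketch below computes their inverse diagonals with NumPy, assuming that $\mathcal{D}(\cdot)$ denotes the diagonal part of a matrix and that $H$ and $S$ are the Hermitian and skew-Hermitian parts of $A$. The function name `diagonal_hss_preconditioners` and the $2\times 2$ matrix are ours and purely illustrative.

```python
import numpy as np


def diagonal_hss_preconditioners(A, alpha):
    """Inverse diagonals of M = D(alpha*I + H) and F = D(alpha*I + S)."""
    A_h = A.conj().T             # conjugate transpose of A
    H = (A + A_h) / 2            # Hermitian part
    S = (A - A_h) / 2            # skew-Hermitian part
    m_diag = alpha + H.diagonal()
    f_diag = alpha + S.diagonal()
    return 1.0 / m_diag, 1.0 / f_diag


# Illustrative complex matrix (not taken from the experiments).
A_small = np.array([[4.0 + 1.0j, -1.5],
                    [-0.5, 4.0 - 2.0j]])
m_inv, f_inv = diagonal_hss_preconditioners(A_small, alpha=3.0)
```

For a real matrix $A$, the diagonal of $S$ vanishes, so that $F$ reduces to $\alpha I$ under this assumption.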

A comparison with some results in [3] about the problem (10) (Example 4.2 in [3]) is reported in Table 1 for single-process execution of full GMRES, GMRES(restart), and HSS(CG, GMRES(restart)) with inner residual threshold set to $10^{-10}$ in order to compare with an “exact” HSS.

Table 1: Comparison with Ref. [3] for the test case (10), number of processes $p=1$

Results from Ref. [3] (MATLAB, 2.66 GHz CPU, 1.97 GB RAM):

$n$	$64\times 64$		$128\times 128$
Method	Clock (sec)	$k$	Clock (sec)	$k$
HSS	4.81	284	60	540
GMRES(10)	1.08	973	20	3096
GMRES(20)	1.50	632	22	1704
GMRES	2.98	161	45	308

Results from our experiment (Python, 2.40 GHz CPU, 174 GB RAM):

$n$	$64\times 64$		$128\times 128$
Method	Clock (sec)	$k$	Clock (sec)	$k$
HSS(CG,GMRES(10))	4.80	284	44	540
GMRES(10)	0.36	1072	3.56	3346
GMRES(20)	0.33	672	2.70	1790
GMRES	0.44	161	5.19	308

The experimentally optimal value of $\alpha$, according to [3], was considered for each problem size $n$ ($\alpha=0.12$ for $n=64^{2}$, and $\alpha=0.07$ for $n=128^{2}$). We recall that the experiments in [3] were run in MATLAB on a personal computer consisting of a 2.66 GHz Intel Core Duo central processing unit (CPU) and 1.97 GB of random access memory (RAM). Our single-process tests, here, have been performed on a computational cluster node consisting of a 2.40 GHz Intel Xeon Skylake CPU and 174 GB of RAM. The same numbers of iterations are obtained for our implementation of HSS(CG, GMRES(10)), where the tolerances of both CG and GMRES were set to $10^{-10}$, and for the HSS tested in [3] with direct inner solvers. The same holds for full GMRES, while very slight differences appear for the restarted GMRES.

The remaining tests, which involve multi-process execution, have been performed on cluster nodes consisting of two 12-core 2.30 GHz Intel Xeon Haswell CPUs (24 cores per node) and 48 GB of RAM (2 GB per core). The nodes are interconnected through a 56 Gb/s fourteen data rate (FDR) InfiniBand network, on which the SGI MPT library is used as implementation of the MPI standard.

5.2 Results on the 3D convection-diffusion problem

5.2.1 Optimal parameters

The 3D convection-diffusion test case (9) was run on a discrete problem with $n=100^{3}$ unknowns, using from $p=48$ to $p=192$ processor cores (one MPI process per core).

Table 2 shows execution times for various values of the restart parameter of GMRES.

Table 2: Varying the restart parameter of GMRES for the 3D convection-diffusion test case (9), problem size $n=100^{3}$

$p$	$48$			$192$
Restart	Clock (sec)	$k$	$r$	Clock (sec)	$k$	$r$
5	344	917	9.98E-07	187	917	9.98E-07
10	251	489	9.70E-07	149	489	9.70E-07
20	274	318	9.44E-07	161	318	9.44E-07
30	427	349	9.77E-07	247	349	9.77E-07
40	614	385	9.65E-07	349	385	9.65E-07
50	748	393	9.59E-07	440	393	9.59E-07
100	1765	457	9.80E-07	969	457	9.80E-07
(Full)	2695	281	8.56E-07	1677	281	8.56E-07

This allows us to choose the value 10 as the experimentally optimal one; performances for a restart value of 20 were, however, quite similar.

We therefore looked for performance variation of HSS(CG, GMRES(10)) according to its parameter $\alpha$ and the inner residual threshold $\varepsilon_{\text{in}}$ set for both CG and GMRES(10). Convergence was obtained from $\varepsilon_{\text{in}}=10^{-2}$ , which also demonstrated more efficiency than lower thresholds, as shown in Table 3.

Table 3: Varying the parameter $\alpha$ and the inner residual threshold $\varepsilon_{\text{in}}$ of HSS(CG,GMRES(10)) for the 3D convection-diffusion test case (9), problem size $n=100^{3}$, number of processes $p=192$

$\varepsilon_{\text{in}}$ = 1.00E-02					$\varepsilon_{\text{in}}$ = 1.00E-06
$\alpha$	Clock (sec)	$k$	$k_{\text{in}}$	$r$	$\alpha$	Clock (sec)	$k$	$k_{\text{in}}$	$r$
0.7	718	213	2182	9.84E-07	0.9	2431	270	7331	9.85E-07
0.6	712	186	2124	9.57E-07	0.8	2395	240	7129	9.85E-07
0.5	665	162	1949	9.94E-07	0.7	2398	210	6986	9.84E-07
0.4	844	164	2148	9.76E-07	0.6	2450	180	6916	9.84E-07

Quite surprisingly, the number of outer iterations even slightly increased when switching from $10^{-2}$ to $10^{-6}$ .

While a restart value of 10 resulted in the most efficient executions of the GMRES solver, it does not necessarily prove to be the best choice for HSS(CG, GMRES(restart)) as well. Handling a combination of three parameters, $\alpha$, $\varepsilon_{\text{in}}$ and the GMRES restart value, is clearly a major drawback of HSS(CG, GMRES(restart)), especially if, additionally, the number of processes (and so, possibly, the load per process) might have an impact too. Our two-stage-splitting-based HSS(${M}^{-1}$, ${F}^{-1}$) with a single inner iteration reduces the set of parameters back to $\alpha$ alone, as in the case of exact HSS. Moreover, as mentioned in Section 4, avoiding inner solver function calls and loops might constitute an attractive feature from a purely implementation standpoint. This is shown here by comparing Tables 3 and 4.

Table 4: Varying the parameter $\alpha$ of HSS(${M}^{-1}$, ${F}^{-1}$) for the 3D convection-diffusion test case (9), problem size $n=100^{3}$

$p$	$48$			$192$
$\alpha$	Clock (sec)	$k$	$r$	Clock (sec)	$k$	$r$
6.0	566	2348	9.98E-07	252	2307	9.98E-07
5.0	485	2008	9.99E-07	214	1965	9.94E-07
4.0	399	1657	9.94E-07	177	1611	9.98E-07
3.0	311	1288	9.90E-07	136	1239	9.70E-07

For $p=192$ processes, the best execution times of HSS(CG, GMRES(10)) and HSS(${M}^{-1}$, ${F}^{-1}$) are, respectively, 665 and 136 seconds. Note that the former performed 1949 inner iterations while the latter converged in 2478 inner iterations (2 $\times$ 1239 outer iterations, since there is one inner iteration using ${M}^{-1}$ and another one using ${F}^{-1}$). Such a surprisingly small gap in convergence speed confirms the possibility to achieve a faster solver in execution time by avoiding inner function calls and loops. Still, an important drawback of HSS(${M}^{-1}$, ${F}^{-1}$) is that it diverged for $\alpha\leq 2.0$.

Finally, Table 5 shows that $\alpha=3.0$ was experimentally optimal for the asynchronous HSS( ${M}^{-1}$ , ${F}^{-1}$ ) too. And here as well, divergence has been observed for $\alpha\leq 2.0$ .

Table 5: Varying the parameter $\alpha$ of asynchronous HSS(${M}^{-1}$, ${F}^{-1}$) for the 3D convection-diffusion test case (9), problem size $n=100^{3}$

$p$	$48$				$192$
$\alpha$	Clock (sec)	$k_{\text{min}}$	$k_{\text{max}}$	$r$	Clock (sec)	$k_{\text{min}}$	$k_{\text{max}}$	$r$
6.0	24	3134	4609	4.32E-07	7.46	7299	9491	4.83E-07
5.0	22	2812	3969	4.31E-07	7.04	6832	9175	6.57E-07
4.0	20	2573	3695	4.21E-07	6.82	6668	8846	5.12E-07
3.0	17	2278	3080	5.49E-07	6.24	5950	7996	9.78E-07

5.2.2 Performance comparison

Using experimentally obtained optimal parameters, a performance comparison on $p=48$ to $p=192$ cores is summarized here in Table 6, where HSS(CG, GMRES(10)) has been left out since it exceeded memory limits for $p\leq 120$.

Table 6: Performances from the 3D convection-diffusion test case (9), problem size $n=100^{3}$

	GMRES(10)			HSS( ${M}^{-1}$ , ${F}^{-1}$ , 3.0)			Async. HSS( ${M}^{-1}$ , ${F}^{-1}$ , 3.0)
$p$	Clock	$k$	$r$	Clock	$k$	$r$	Clock	$k_{\text{min}}$	$k_{\text{max}}$	$r$
	(sec)			(sec)			(sec)
48	251	489	9.70E-07	311	1288	9.90E-07	17	2278	3080	5.49E-07
72	197	489	9.70E-07	222	1222	9.92E-07	12	3401	3912	8.44E-07
96	239	489	9.70E-07	203	1177	9.92E-07	14	5682	6678	9.21E-07
120	151	489	9.70E-07	193	1228	9.97E-07	12	6541	8233	8.79E-07
144	169	489	9.70E-07	179	1229	9.93E-07	10	7176	9394	9.50E-07
168	150	489	9.70E-07	133	1240	9.89E-07	6.20	5526	7562	8.59E-07
192	149	489	9.70E-07	136	1239	9.70E-07	6.24	5950	7996	9.78E-07

One can see a significant gain by asynchronous HSS( ${M}^{-1}$ , ${F}^{-1}$ , 3.0), which was, e.g., at $p=192$ processor cores, about 20 times faster (in execution time) than both GMRES(10) and synchronous HSS( ${M}^{-1}$ , ${F}^{-1}$ , 3.0). While the second-stage splittings using preconditioners ${M}^{-1}$ and ${F}^{-1}$ were introduced here to achieve a fully asynchronous version of HSS, such a gap between the performances of synchronous and asynchronous HSS( ${M}^{-1}$ , ${F}^{-1}$ , 3.0) in a homogeneous high-speed computational environment shows that there is a true advantage in resorting to asynchronous iterations, which is not due to possible programming biases introduced by this particular implementation of HSS.
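As a quick check of this factor from the $p=192$ row of Table 6, the ratios of execution times read

\frac{149}{6.24}\approx 23.9,\qquad\frac{136}{6.24}\approx 21.8,

for GMRES(10) and synchronous HSS(${M}^{-1}$, ${F}^{-1}$, 3.0), respectively.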

5.3 Results on the 2D structural dynamics problem

5.3.1 Optimal parameters

The complex 2D structural dynamics test case (10) was run on a discrete problem with $n=350^{2}$ unknowns, using from $p=24$ to $p=54$ processor cores (one MPI process per core).

Table 7 shows execution times for various values of the restart parameter of GMRES.

Table 7: Varying the restart parameter of GMRES for the 2D structural dynamics test case (10), problem size $n=350^{2}$, number of processes $p=48$

Restart	Clock (sec)	$k$	$r$
5	5405	36594	1.00E-06
10	3960	19679	1.00E-06
20	3068	9072	1.01E-06
30	3053	6386	1.02E-06
40	3158	5125	1.04E-06
50	3084	4080	9.84E-07
100	3433	2727	7.89E-07
(Full)	7898	789	9.63E-07

This allows us to choose the value 30 as the experimentally optimal one; performances for restart values of 20 to 50 were, however, quite similar.

Both HSS(CG, GMRES(30)) and HSS(${M}^{-1}$, ${F}^{-1}$) failed to converge within two hours of execution on $p=48$ cores for various values of their parameters, which made them impractical for the current test case.

Nevertheless, asynchronous HSS( ${M}^{-1}$ , ${F}^{-1}$ ) took reasonable times to converge, and Table 8 shows an experimentally optimal $\alpha=2.0$ . Divergence was observed for $\alpha\leq 1.0$ .

Table 8: Varying the parameter $\alpha$ of asynchronous HSS(${M}^{-1}$, ${F}^{-1}$) for the 2D structural dynamics test case (10), problem size $n=350^{2}$, number of processes $p=48$

$\alpha$	Clock (sec)	$k_{\text{min}}$	$k_{\text{max}}$	$r$
5.0	273	398754	493820	7.19E-07
4.0	235	349111	425328	8.71E-07
3.0	198	293439	357005	1.04E-06
2.0	156	231787	281838	9.50E-07

5.3.2 Performance comparison

Using experimentally obtained optimal parameters, a performance comparison on $p=24$ to $p=54$ cores is summarized in Table 9.

Table 9: Performances from the complex 2D structural dynamics test case (10), problem size $n=350^{2}$

	GMRES(30)			Async. HSS( ${M}^{-1}$ , ${F}^{-1}$ , 2.0)
$p$	Clock (sec)	$k$	$r$	Clock (sec)	$k_{\text{min}}$	$k_{\text{max}}$	$r$
24	2941	6486	9.99E-07	308	183861	203002	8.50E-07
30	2722	6419	9.99E-07	253	212597	249716	8.81E-07
36	2967	6510	1.02E-06	241	236977	277301	9.86E-07
42	2656	6479	1.02E-06	154	211052	257389	1.01E-06
48	3053	6386	1.02E-06	156	231787	281838	9.50E-07
54	2829	6479	1.01E-06	159	251221	310456	9.13E-07

Again, a significant gain is obtained by asynchronous HSS(${M}^{-1}$, ${F}^{-1}$, 2.0), which was, e.g., at $p=48$ processor cores, about 20 times faster than GMRES(30), similarly to the real 3D convection-diffusion test case. Here, an even larger performance gap is observed between asynchronous and synchronous HSS(${M}^{-1}$, ${F}^{-1}$, 2.0), since the latter did not terminate within 7200 seconds. This confirms, for the complex test case as well, the benefit brought purely by asynchronous iterations.

6 Conclusion

Asynchronous alternating iterations are revealed here as a practical breakthrough in improving the computational time of the parallel solution of non-Hermitian problems, compared to the well-known GMRES and HSS methods. Classical asynchronous convergence conditions are investigated for a general practical parallel scheme of alternating iterations. In particular, this scheme can result in a two-stage variant of the HSS method with one inner iteration for each of the outer alternating ones. Performance experiments have been conducted for such an asynchronous variant, which has significantly outperformed both the GMRES and the classical HSS methods, on both a real convection-diffusion problem and a complex structural dynamics problem.

Acknowledgement

The paper has been prepared with the support of the “RUDN University Program 5-100”, the French national program LEFE/INSU, the project ADOM (Méthodes de décomposition de domaine asynchrones) of the French National Research Agency (ANR), and using HPC resources from the “Mésocentre” computing center of CentraleSupélec and École Normale Supérieure Paris-Saclay supported by CNRS and Région Île-de-France.

References

[1] Z.-Z. Bai. On the convergence of additive and multiplicative splitting iterations for systems of linear equations. J. Comput. Appl. Math., 154(1):195–214, 2003.
[2] Z.-Z. Bai. Regularized HSS iteration methods for stabilized saddle-point problems. IMA J. Numer. Anal., 39(4):1888–1923, 2019.
[3] Z.-Z. Bai, M. Benzi, and F. Chen. Modified HSS iteration methods for a class of complex symmetric linear systems. Computing, 87(3):93–111, 2010.
[4] Z.-Z. Bai, G. H. Golub, and C.-K. Li. Optimal parameter in Hermitian and skew-Hermitian splitting method for certain two-by-two block matrices. SIAM J. Sci. Comput., 28(2):583–603, 2006.
[5] Z.-Z. Bai, G. H. Golub, and M. K. Ng. Hermitian and skew-Hermitian splitting methods for non-Hermitian positive definite linear systems. SIAM J. Matrix Anal. Appl., 24(3):603–626, 2003.
[6] Z.-Z. Bai and M. Rozložník. On the numerical behavior of matrix splitting iteration methods for solving linear systems. SIAM J. Numer. Anal., 53(4):1716–1737, 2015.
[7] G. M. Baudet. Asynchronous iterative methods for multiprocessors. J. ACM, 25(2):226–244, 1978.
[8] D. E. Baz, P. Spiteri, J. C. Miellou, and D. Gazen. Asynchronous iterative algorithms with flexible communication for nonlinear network flow problems. J. Parallel Distrib. Comput., 38(1):1–15, 1996.
[9] M. Benzi. A generalization of the Hermitian and skew-Hermitian splitting iteration. SIAM J. Matrix Anal. Appl., 31(2):360–374, 2009.
[10] M. Benzi and D. Bertaccini. Block preconditioning of real-valued iterative algorithms for complex linear systems. IMA J. Numer. Anal., 28(3):598–618, 2008.
[11] M. Benzi and J. Liu. An efficient solver for the incompressible Navier-Stokes equations in rotation form. SIAM J. Sci. Comput., 29(5):1959–1981, 2007.
[12] M. Benzi and D. B. Szyld. Existence and uniqueness of splittings for stationary iterative methods with applications to alternating methods. Numer. Math., 76(3):309–321, 1997.
[13] D. Bertaccini, G. H. Golub, S. S. Capizzano, and C. T. Possio. Preconditioned HSS methods for the solution of non-Hermitian positive definite linear systems and applications to the discrete convection-diffusion equation. Numer. Math., 99(3):441–484, 2005.
[14] D. P. Bertsekas. Distributed asynchronous computation of fixed points. Math. Program., 27(1):107–120, 1983.
[15] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1989.
[16] D. Chazan and W. Miranker. Chaotic relaxation. Linear Algebra Appl., 2(2):199–222, 1969.
[17] V. Conrad and Y. Wallach. Alternating methods for sets of linear equations. Numer. Math., 32(1):105–108, 1979.
[18] L. D. Dalcín, R. R. Paz, and M. A. Storti. MPI for Python. J. Parallel Distrib. Comput., 65(9):1108–1115, 2005.
[19] J. Douglas. On the numerical integration of $\frac{\partial^{2}u}{\partial x^{2}}+\frac{\partial^{2}u}{\partial y^{2}}=\frac{\partial u}{\partial t}$ by implicit methods. J. Soc. Ind. Appl. Math., 3(1):42–65, 1955.
[20] M. El Haddad, J. C. Garay, F. Magoulès, and D. B. Szyld. Synchronous and asynchronous optimized Schwarz methods for one-way subdivision of bounded domains. Numer. Linear Algebra Appl., 27(2):e2227, 2020.
[21] M. N. El Tarazi. Some convergence results for asynchronous algorithms. Numer. Math., 39(3):325–340, 1982. (in French).
[22] K. Fan. Topological proofs for certain theorems on matrices with non-negative elements. Monatshefte für Mathematik, 62:219–237, 1958.
[23] A. Frommer and D. B. Szyld. H-splittings and two-stage iterative methods. Numer. Math., 63(1):345–356, 1992.
[24] A. Frommer and D. B. Szyld. Asynchronous two-stage iterative methods. Numer. Math., 69(2):141–153, 1994.
[25] A. Frommer and D. B. Szyld. Asynchronous iterations with flexible communication for linear systems. Calculateurs Parallèles, 10:421–429, 1998.
[26] G. Gbikpi-Benissan and F. Magoulès. Protocol-free asynchronous iterations termination. Adv. Eng. Softw., 146:102827, 2020.
[27] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952.
[28] Y.-M. Huang. A practical formula for computing optimal parameters in the HSS iteration methods. J. Comput. Appl. Math., 255:142–149, 2014.
[29] C.-X. Li and S.-L. Wu. A single-step HSS method for non-Hermitian positive definite linear systems. Appl. Math. Lett., 44:26–29, 2015.
[30] L. Li, T.-Z. Huang, and X.-P. Liu. Modified Hermitian and skew-Hermitian splitting methods for non-Hermitian positive-definite linear systems. Numer. Linear Algebra Appl., 14(3):217–235, 2007.
[31] F. Magoulès and G. Gbikpi-Benissan. JACK: An asynchronous communication kernel library for iterative algorithms. J. Supercomput., 73(8):3468–3487, 2017.
[32] F. Magoulès and G. Gbikpi-Benissan. Asynchronous Parareal time discretization for partial differential equations. SIAM J. Sci. Comput., 40(6):C704–C725, 2018.
[33] F. Magoulès and G. Gbikpi-Benissan. Distributed convergence detection based on global residual error under asynchronous iterations. IEEE Trans. Parallel Distrib. Syst., 29(4):819–829, 2018.
[34] F. Magoulès and G. Gbikpi-Benissan. JACK2: An MPI-based communication library with non-blocking synchronization for asynchronous iterations. Adv. Eng. Softw., 119:116–133, 2018.
[35] F. Magoulès, G. Gbikpi-Benissan, and Q. Zou. Asynchronous iterations of Parareal algorithm for option pricing models. Mathematics, 6(4):1–18, 2018.
[36] F. Magoulès, D. B. Szyld, and C. Venet. Asynchronous optimized Schwarz methods with and without overlap. Numer. Math., 137(1):199–227, 2017.
[37] F. Magoulès and C. Venet. Asynchronous iterative sub-structuring methods. Math. Comput. Simul., 145:34–49, 2018.
[38] G. I. Marchuk. Splitting and alternating direction methods. In Handbook of Numerical Analysis, volume 1, pages 197–462. Elsevier, 1990.
[39] J.-C. Miellou. Algorithmes de relaxation chaotique à retards. ESAIM: M2AN, 9(R1):55–82, 1975. (in French).
[40] D. W. Peaceman and H. H. Rachford. The numerical solution of parabolic and elliptic differential equations. J. Soc. Indust. Appl. Math., 3(1):28–41, 1955.
[41] Y. Saad and M. H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 7(3):856–869, 1986.
[42] S. Schechter. Relaxation methods for linear equations. Comm. Pure Appl. Math., 12(2):313–335, 1959.
[43] J. W. Sheldon. On the numerical solution of elliptic difference equations. MTAC, 9(51):101–112, 1955.
[44] S.-L. Wu. Several variants of the Hermitian and skew-Hermitian splitting method for a class of complex symmetric linear systems. Numer. Linear Algebra Appl., 22(2):338–356, 2015.
[45] I. Yamazaki, E. Chow, A. Bouteiller, and J. J. Dongarra. Performance of asynchronous optimized Schwarz with one-sided communication. Parallel Comput., 86:66–81, 2019.
[46] Q. Zou and F. Magoulès. Parameter estimation in the Hermitian and skew-Hermitian splitting method using gradient iterations. Numer. Linear Algebra Appl., 27:e2304, 2020.

\begin{aligned}
\tau_{i}(\mathcal{A}\mathcal{B},w,v):=\frac{1}{v_{i}}\sum_{l}\left\|(\mathcal{A}\mathcal{B})_{i,l}\right\|w_{l}&=\frac{1}{v_{i}}\sum_{l}\left\|\sum_{j}\mathcal{A}_{i,j}\mathcal{B}_{j,l}\right\|w_{l}\\
&\leq\frac{1}{v_{i}}\sum_{l}\sum_{j}\left\|\mathcal{A}_{i,j}\mathcal{B}_{j,l}\right\|w_{l}\\
&=\frac{1}{v_{i}}\sum_{l}\sum_{j}\frac{1}{u_{j}}\left\|\mathcal{A}_{i,j}\right\|\left\|\mathcal{B}_{j,l}\right\|u_{j}w_{l}\\
&=\frac{1}{v_{i}}\sum_{j}\left(\frac{1}{u_{j}}\sum_{l}\left\|\mathcal{B}_{j,l}\right\|w_{l}\right)\left\|\mathcal{A}_{i,j}\right\|u_{j}\\
&=\frac{1}{v_{i}}\sum_{j}\tau_{j}(\mathcal{B},w,u)\left\|\mathcal{A}_{i,j}\right\|u_{j}.
\end{aligned}

\begin{aligned}
\tau_{j}(\mathcal{B},w,u)\left\|\mathcal{A}_{i,j}\right\|u_{j}&<\left\|\mathcal{A}_{i,j}\right\|u_{j}\quad\forall j\ \forall i,\\
\frac{1}{v_{i}}\sum_{j}\tau_{j}(\mathcal{B},w,u)\left\|\mathcal{A}_{i,j}\right\|u_{j}&<\frac{1}{v_{i}}\sum_{j}\left\|\mathcal{A}_{i,j}\right\|u_{j}\quad\forall i,\\
\frac{1}{v_{i}}\sum_{l}\left\|(\mathcal{A}\mathcal{B})_{i,l}\right\|w_{l}\ \leq\ \frac{1}{v_{i}}\sum_{j}\tau_{j}(\mathcal{B},w,u)\left\|\mathcal{A}_{i,j}\right\|u_{j}&<\frac{1}{v_{i}}\sum_{j}\left\|\mathcal{A}_{i,j}\right\|u_{j}\quad\forall i,\\
\tau_{i}(\mathcal{A}\mathcal{B},w,v)&<\tau_{i}(\mathcal{A},u,v)\quad\forall i,
\end{aligned}
