Asynchronous iterations of HSS method for non-Hermitian linear systems
Abstract
A general asynchronous alternating iterative model is designed, for which convergence is theoretically ensured both under a classical spectral radius bound and for a classical class of matrix splittings for $H$-matrices. The computational model can be thought of as a two-stage alternating iterative method, which is well suited to the well-known Hermitian and skew-Hermitian splitting (HSS) approach, with the particularity here of considering only one inner iteration. An experimental parallel performance comparison is conducted between the generalized minimal residual (GMRES) algorithm, the standard HSS and our asynchronous variant, on both real and complex non-Hermitian linear systems arising, respectively, from convection-diffusion and structural dynamics problems. A significant gain in execution time is observed in both cases.
Keywords: Asynchronous iterations; alternating iterations; Hermitian and skew-Hermitian splitting; non-Hermitian problems; parallel computing
1 Introduction
Many applications in scientific computing and engineering lead to the following system of linear equations,
$$A x = b, \qquad A \in \mathbb{C}^{n \times n}, \quad x, b \in \mathbb{C}^{n}. \tag{1}$$
Let $A = M_1 - N_1$ and $A = M_2 - N_2$ be two splittings of $A$ with $M_1$ and $M_2$ being nonsingular. The alternating iterative scheme for solving (1) is defined as follows,
$$\begin{cases} M_1 x^{k+1/2} = N_1 x^{k} + b, \\ M_2 x^{k+1} = N_2 x^{k+1/2} + b, \end{cases} \tag{2}$$
which can be viewed as a stationary iterative scheme with an iteration matrix $M_2^{-1} N_2 M_1^{-1} N_1$. Well-known early examples include the symmetric successive over-relaxation (SSOR) method [43, 17] and the alternating direction implicit (ADI) methods [40, 19, 38]. In [12] the convergence of some alternating iterations was analyzed by eliminating the intermediate solution term $x^{k+1/2}$ from (2); see also [1]. Recently, there has been growing interest in studies of the Hermitian and skew-Hermitian splitting (HSS) method [5] for solving (1) when $A$ is non-Hermitian. Let $\alpha > 0$ be a given constant. The HSS method can be written in the form
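$$\begin{cases} (\alpha I + H)\, x^{k+1/2} = (\alpha I - S)\, x^{k} + b, \\ (\alpha I + S)\, x^{k+1} = (\alpha I - H)\, x^{k+1/2} + b, \end{cases} \tag{3}$$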
where $H = \frac{1}{2}(A + A^{*})$ and $S = \frac{1}{2}(A - A^{*})$ are the Hermitian and skew-Hermitian parts of $A$, respectively, and $I$ is the identity matrix. Here, $A^{*}$ denotes the conjugate transpose of $A$. This method can be obtained from (2) by defining
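$$M_1 = \alpha I + H, \quad N_1 = \alpha I - S, \quad M_2 = \alpha I + S, \quad N_2 = \alpha I - H. \tag{4}$$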
It was proved in [5] that when $H$ is positive definite, namely, $A$ is non-Hermitian positive definite, HSS converges unconditionally to the unique solution for any initial guess $x^{0}$. The linear subsystems, however, especially the one involving $\alpha I + S$, may still be difficult to solve; therefore, much attention has been devoted to inexact implementations. More precisely, the tolerances for the inner iterative solvers may be relatively relaxed, while good convergence properties can still be retained according to numerical experiments; see [5, 11, 9, 6]. The HSS iterative scheme has been generalized to other splitting methods, as well as their preconditioned variants, for handling various problems in scientific computing; see, e.g., [13, 30, 9, 3, 44, 29, 2]. There are also a number of studies on the optimal selection of $\alpha$; see [5, 4, 28, 46]. The iterative scheme (3) can be equivalently written in a residual-updating form, which achieves a higher accuracy at the cost of more computational effort; see [6] for a detailed discussion.
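As an illustration of the HSS sweep described above, the following is a minimal sketch in plain Python, assuming small dense matrices and exact inner solves by Gaussian elimination. The helper names (`matvec`, `solve`, `shifted`, `hss`) and the 2-by-2 test matrix are hypothetical choices made for this example only; they are not taken from [5].

```python
def matvec(mat, vec):
    """Dense matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vec)) for row in mat]


def solve(mat, rhs):
    """Solve mat x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[i]] for i, row in enumerate(mat)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0j] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def shifted(mat, alpha, sign):
    """Return alpha * I + sign * mat."""
    n = len(mat)
    return [[(alpha if i == j else 0) + sign * mat[i][j] for j in range(n)]
            for i in range(n)]


def hss(A, b, alpha, iters):
    """Run `iters` HSS sweeps from the zero initial guess."""
    n = len(b)

    def conj(z):
        return complex(z).conjugate()

    # Hermitian part H = (A + A^*)/2 and skew-Hermitian part S = (A - A^*)/2.
    H = [[(A[i][j] + conj(A[j][i])) / 2 for j in range(n)] for i in range(n)]
    S = [[(A[i][j] - conj(A[j][i])) / 2 for j in range(n)] for i in range(n)]
    x = [0j] * n
    for _ in range(iters):
        # First half-step: (alpha I + H) x_half = (alpha I - S) x + b.
        rhs = [u + v for u, v in zip(matvec(shifted(S, alpha, -1), x), b)]
        x_half = solve(shifted(H, alpha, +1), rhs)
        # Second half-step: (alpha I + S) x = (alpha I - H) x_half + b.
        rhs = [u + v for u, v in zip(matvec(shifted(H, alpha, -1), x_half), b)]
        x = solve(shifted(S, alpha, +1), rhs)
    return x


A = [[4.0, 2.0], [0.0, 3.0]]   # its Hermitian part [[4, 1], [1, 3]] is positive definite
b = [6.0, 3.0]                 # chosen so that the exact solution is (1, 1)
x = hss(A, b, alpha=3.0, iters=60)
```

The iterate is kept complex so that the same sketch applies to complex systems; for large sparse problems the two exact solves would be replaced by inner iterative solvers.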
Parallel computing can be extremely useful when $A$ has large dimension. In practice, the high cost of synchronization relative to that of computation is currently the major bottleneck in high-performance distributed computing systems, which motivates the redesign of parallel iterative algorithms. One of the most interesting approaches, arising from basic relaxation methods, is that of so-called asynchronous iterations [16, 15]. An asynchronous iterative scheme allows a full overlapping of communication and computation. Every process has the flexibility to work at its own pace without waiting for data acquisition. A major difference between synchronous and asynchronous iterations lies in their predictability properties. The former produce a deterministic sequence of iterates, while the latter enable nondeterministic behaviors. In [16] the first convergence result was established for the solution of linear systems, which was followed by the investigation of general fixed-point iterative models; see [39, 7, 21, 14]. In recent years, with the advent of very high-performance computing environments, asynchronous iterative schemes have gained much popularity. The study of asynchronous domain decomposition methods, in both time and space domains, has become an increasingly active area of research; see, e.g., [36, 35, 37, 32, 45, 20]. Another area that has seen growth in the last decades is asynchronous convergence detection; see [33, 26] and the references therein.
In this paper we focus on the asynchronous formulation of alternating iterations. In Section 2, we recall some general tools and the asynchronous iterations theory used for the formulation and the convergence analysis of our asynchronous alternating scheme. Section 3 presents the main contribution, where we formulate our asynchronous alternating scheme and sufficient conditions for its convergence. Section 4 discusses implementation aspects. Section 5 is devoted to numerical experiments on a parallel computing platform, featuring both a real three-dimensional convection-diffusion problem and a complex two-dimensional structural dynamics problem. Finally, Section 6 gives our conclusions.
2 Generalities
2.1 $H$-matrix and $H$-splitting
In a general manner, let $a_{i,j}$ denote the entry of a matrix $A$ on its $i$-th row and $j$-th column, and let $x_i$ denote the $i$-th entry of a vector $x$. Comparisons $<$, $\le$, $>$, and $\ge$ between two matrices or vectors (of the same shape) are entrywise. The absolute value (or modulus) $|A|$ of a matrix or $|x|$ of a vector is entrywise. The spectral radius of a matrix $A$ is designated by $\rho(A)$. In expressions like $A > 0$ and $x > 0$, with $A$ and $x$ being a matrix and a vector, respectively, $0$ indicates a matrix and a vector, respectively, with all entries being $0$. $I$ stands for the identity matrix.
We now recall a few general tools used later for the convergence analysis of the proposed asynchronous iterative method.
Definition 1.
A square matrix is an -matrix if and only if
Definition 2.
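The comparison matrix $\langle A \rangle$ of a matrix $A$ is defined as
$$\langle A \rangle_{i,i} = |a_{i,i}|, \qquad \langle A \rangle_{i,j} = -|a_{i,j}|, \quad i \neq j.$$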
Definition 3.
A square matrix $A$ is an $H$-matrix if and only if its comparison matrix $\langle A \rangle$ is an $M$-matrix.
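The $H$-matrix property can be probed numerically. The sketch below builds the comparison matrix (diagonal entries $|a_{i,i}|$, off-diagonal entries $-|a_{i,j}|$) and searches for a positive vector $u$ with $\langle A \rangle u > 0$, which is a standard sufficient certificate that $\langle A \rangle$ is a nonsingular $M$-matrix. The function names and the Jacobi-type search are assumptions made for this example; a `None` result only means that no certificate was found within the given number of sweeps.

```python
def comparison_matrix(A):
    """Comparison matrix: |a_ii| on the diagonal, -|a_ij| elsewhere."""
    n = len(A)
    return [[abs(A[i][j]) if i == j else -abs(A[i][j]) for j in range(n)]
            for i in range(n)]


def h_matrix_certificate(A, sweeps=200):
    """Search for u > 0 with <A> u > 0; return u, or None if none was found."""
    n = len(A)
    C = comparison_matrix(A)
    if any(C[i][i] == 0 for i in range(n)):
        return None
    u = [1.0] * n
    for _ in range(sweeps):
        # Jacobi sweep on <A> u = diag(<A>) 1, i.e. u <- J u + 1 with J >= 0.
        u = [1.0 + sum(-C[i][j] * u[j] for j in range(n) if j != i) / C[i][i]
             for i in range(n)]
        if max(u) > 1e100:  # the sweep is blowing up: give up
            return None
    Cu = [sum(C[i][j] * u[j] for j in range(n)) for i in range(n)]
    if all(v > 0 for v in u) and all(v > 0 for v in Cu):
        return u
    return None


dominant = [[4, -1, 1], [2, 5, -2], [1, 1, -3]]   # strictly diagonally dominant
u = h_matrix_certificate(dominant)
no_certificate = h_matrix_certificate([[1, 2], [2, 1]])
```

The final entrywise check is what makes a returned `u` trustworthy: whatever the sweeps did, the function only returns a vector that actually satisfies both positivity conditions.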
Lemma 1.
A square matrix is an -matrix if and only if
Proof.
This is directly implied by Theorem 5’ in [22]. ∎
A splitting of a matrix consists of identifying a nonsingular matrix and the resulting matrix , so as to define a relaxation operator
Definition 4.
A splitting $A = M - N$ is an $H$-splitting if and only if $\langle M \rangle - |N|$ is an $M$-matrix.
Lemma 2.
Let be an -splitting. Then, we have
Proof.
This directly follows from Proof of Theorem 3.4 (c) in [23]. ∎
Lemma 3 (refer to, e.g., Corollary 6.1 in [15]).
Let be a square matrix. Then, we have
2.2 Asynchronous iterations
Consider, again, the linear system (1), a splitting of the matrix and the resulting iterative scheme
Assume a distribution
of both the system and the splitting of $A$. Note that the problem (1) can also correspond to an augmented system resulting from a domain decomposition with overlapping subdomains, i.e., some rows in a submatrix are possibly replicated in another submatrix , . A classical parallel relaxation is then given by
with The first feature of asynchronous iterations is the free steering (see, e.g., [42]), where, at each iteration , a random subset of block-components can be updated. It is convenient to state a natural assumption,
which is implemented by the fact that no block-component stops being updated until convergence is globally reached. The second feature consists of modeling communication delays implying that at an iteration , a block-component is possibly updated using a block-component computed at a random previous iteration . It yields the parallel iterative scheme
(5) |
where, as well, another natural assumption is made, stating that
Theorem 5 (Chazan and Miranker (1969) [16]).
An asynchronous iterative method (5) converges from any initial guess , with any sequence and any functions to if and only if
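The two features of the asynchronous model, free steering and delays, can be mimicked sequentially. The sketch below is a toy simulation, assuming a Jacobi splitting of a small diagonally dominant matrix (so that the classical Chazan and Miranker condition $\rho(|M^{-1}N|) < 1$ holds), scalar components instead of blocks, random update subsets and bounded random delays. The function name `async_jacobi` and all parameter values are hypothetical.

```python
import random


def async_jacobi(A, b, steps, max_delay, seed=0):
    """Simulate asynchronous Jacobi relaxation with random subsets and delays."""
    rng = random.Random(seed)
    n = len(b)
    history = [[0.0] * n]  # history[k] stores the iterate x^k
    for k in range(steps):
        x_new = list(history[-1])
        # Free steering: a random nonempty subset of components is updated.
        subset = [i for i in range(n) if rng.random() < 0.5] or [rng.randrange(n)]
        for i in subset:
            acc = b[i]
            for j in range(n):
                if j != i:
                    # Delay: component j is read from a possibly outdated iterate.
                    tau = max(0, k - rng.randint(0, max_delay))
                    acc -= A[i][j] * history[tau][j]
            x_new[i] = acc / A[i][i]
        history.append(x_new)
    return history[-1]


A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [3.0, 2.0, 3.0]  # chosen so that the exact solution is (1, 1, 1)
x = async_jacobi(A, b, steps=400, max_delay=3)
```

Since each component is selected with probability one half at every step and delays are bounded, both natural assumptions of the model (no component is abandoned, and outdated data are eventually refreshed) hold in this simulation with overwhelming probability.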
The model (5) was later generalized by Baudet [7] to arbitrary fixed-point iterations
(6) |
where the update of a block-component at an iteration depends on versions, to , of each block-component . Let us denote by the vector given by
with and being two vectors of same size. Let and denote collections of vectors, i.e.,
3 Asynchronous alternating iterations
3.1 Computational scheme
Consider, now, the alternating scheme (2) which results in
Then, according to Theorem 5, such an induced parallel scheme is asynchronously convergent if which is shown, in the next section, to be achieved under usual convergence conditions on the splittings and . Nevertheless, asynchronous relaxation based on such an operator cannot be implemented using the alternating form (2), since the said operator is induced by strictly synchronizing and .
Consider, then, an equivalent formulation of the alternating scheme (2),
and assume that is distributed as , i.e.,
Parallel asynchronous alternating methods are thus given by the computational scheme
(7) |
Assuming that the identity matrix is distributed as , i.e.,
it yields
which actually lies in the framework of the generalized model (6) with, here, , since each update of a block-component depends on versions of the other block-components. Considering, then, a collection of vectors, the corresponding mapping is given by
with and
3.2 Convergence conditions
We analyze, now, sufficient conditions for the convergence of our asynchronous alternating iterative scheme (7). To the best of our knowledge, Lemma 4, Proposition 1 and Corollary 1 are new. Proposition 1 and Corollary 1 highlight how combining properties of the operators and imply a resulting contracting operator . Our main results consist of Theorem 7 and Corollary 2 where the same combined conditions are shown to be sufficient for the convergence of asynchronous alternating methods (7), despite the induced, slightly different, iterations operator.
Let, first, be a matrix with arbitrary shape, let be a vector with as many entries as the number of columns in , and let be a vector with as many entries as the number of rows in , and with no entry. Let denote the vector given by the row-sums
Note, then, that, for a square matrix ,
Lemma 4.
Let and be matrices with shapes such that is calculable. Let , and be vectors with dimensions such that and are calculable. Then, we have
Proof.
Let us index rows and columns of by and , respectively, and columns of by . We have
It yields that if for all , then
which concludes the proof. ∎
Proposition 1.
Let
We have
Proof.
Corollary 1.
if is an -matrix, then
Proof.
Considering that is an -matrix, take like in Lemma 1, so as to have
We also have
and, then,
It yields that, ,
which implies, with also satisfying , that the matrix
is an -matrix, according to Lemma 1. Define, then,
and note that , which implies, by Definition 3, that is an -matrix, hence, by Definition 4, is an -splitting. Lemma 2 therefore ensures that and one can verify that
Proposition 1 therefore finally applies, which concludes the proof. ∎
Theorem 7.
Let
An asynchronous alternating method (7) converges from any initial guess , with any sequence and any functions to if .
Proof.
Consider two collections, and , of vectors. We have
Consequently, according to Theorem 6, an asynchronous alternating method (7) is convergent if Recall, then, that according to Lemma 3,
According to the two blocks of , take Then, we have both
implying, as well,
Lemma 4 therefore ensures, with ,
Recall that Then, we have
which leads to By Lemma 3, we therefore satisfy which concludes the proof. ∎
Corollary 2.
An asynchronous alternating method (7) converges from any initial guess , with any sequence and any functions to if is an -matrix and
Proof.
This follows in the same way as Corollary 1. ∎
Let denote the diagonal matrix obtained from the diagonal of a matrix .
Remark.
For practical applications of Corollary 2, let be a diagonal real matrix such that We straightforwardly have
Remark.
In regard to the HSS splitting, if is a real matrix with , and splitting matrices and are given by
then we have both
which satisfy where and are two diagonal real matrices with entries greater than or equal to .
4 Implementation aspects
The two alternating iterations of the HSS method require the solution of two secondary problems involving the coefficient matrices and , respectively. In practice, as pointed out in, e.g., [5, 44], these problems are inexactly solved by means of iterative algorithms. A general description for both HSS and inexact HSS (IHSS) can be given by Algorithm 1.
We can then designate by, e.g., HSS(CG, GMRES) an IHSS algorithm with the conjugate gradient (CG) method [27] for solving the shifted Hermitian problem and the generalized minimal residual (GMRES) method [41] for solving the shifted skew-Hermitian one.
Asynchronous HSS iterations necessarily belong to the class of IHSS algorithms since they obviously require the inner solvers to be asynchronous too, which further reduces such an approach to the subclass of IHSS with inner splittings. Taking, then, e.g., a splitting the solution, at each outer iteration , of
can be given by several inner iterations
(8) |
where is the inner iteration variable. Furthermore, when dealing with two-stage asynchronous iterations, one should particularly take advantage of the possibility to use the inner solution vector with any value of , given that asynchronous relaxation is very likely to benefit from each newly updated data. We refer the reader to, e.g., [8, 25] for more insights into the so called “asynchronous iterations with flexible communication”. Moreover, analysis of matrix splittings for two-stage asynchronous iterations reveals that convergence of such methods can be guaranteed for any number of inner iterations (see, e.g., [24]). According, therefore, to efficiency aspects related to flexible communication ideas, it is of some interest, in the end, to simply consider only one iteration of (8). If, in particular, we also consider as initial guess , then we can define
so as to finally have
which falls under the general alternating scheme (2) that has been considered in our theoretical analysis. Such a specialization of Algorithm 1 is given by Algorithm 2, where and are preconditioners of and , respectively.
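To make the single-inner-iteration idea concrete, here is a minimal sketch under stated assumptions. If $\alpha I + H = P_1 - Q_1$ is an inner splitting and one inner iteration is started from $x^{k}$, then $P_1 x^{k+1/2} = Q_1 x^{k} + (\alpha I - S) x^{k} + b = (P_1 - A) x^{k} + b$, i.e., $x^{k+1/2} = x^{k} + P_1^{-1}(b - A x^{k})$, and similarly for the second half-step with a splitting matrix $P_2$ of $\alpha I + S$. The symbols $P_1$, $Q_1$, $P_2$ and the choice of diagonal (Jacobi-type) matrices $P_1 = \mathrm{diag}(\alpha I + H)$, $P_2 = \mathrm{diag}(\alpha I + S)$ are assumptions of this example, restricted here to real matrices.

```python
def matvec(mat, vec):
    """Dense matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vec)) for row in mat]


def hss_one_inner(A, b, alpha, iters):
    """Alternating sweeps with one inner iteration per half-step (real A only)."""
    n = len(b)
    # For a real matrix A, H has the diagonal of A and S has a zero diagonal,
    # so the assumed diagonal splitting matrices are:
    p1 = [alpha + A[i][i] for i in range(n)]  # diag(alpha I + H)
    p2 = [alpha] * n                          # diag(alpha I + S)
    x = [0.0] * n
    for _ in range(iters):
        for p in (p1, p2):
            # Half-step: x <- x + P^{-1} (b - A x).
            r = [bi - ai for bi, ai in zip(b, matvec(A, x))]
            x = [xi + ri / pi for xi, ri, pi in zip(x, r, p)]
    return x


A = [[4.0, 2.0], [0.0, 3.0]]
b = [6.0, 3.0]  # chosen so that the exact solution is (1, 1)
x = hss_one_inner(A, b, alpha=3.0, iters=60)
```

With these choices the parameter $\alpha$ only enters through the splitting matrices, and no inner solver call or inner loop is needed; convergence is not unconditional and depends on $\alpha$ and on the chosen splittings.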
Note that Algorithm 2 needs to be specifically implemented instead of just using Algorithm 1 with calls of relaxation-based inner solvers with maximum number of iterations set to . Indeed, on pure computer science aspects, avoiding inner function calls and loops can result in a very significant execution time saving, which even makes HSS(, ) possibly competitive, in practice, with, e.g., HSS(CG, GMRES), as we shall see in Section 5.
From Algorithm 2, iterative scheme (7), programming models [31, 34] and convergence detection approach [26], asynchronous parallel implementation of HSS iterations is obtained as described by Algorithm 3, where the communication routines start with “Com” and are blocking by default. Their non-blocking counterparts are designated by “ICom” with the letter “I” standing for “immediate”, similarly to the Message Passing Interface (MPI) standard.
The routines ComSum and IComSum are used to compute dot product with by global reduction operation
They can readily be replaced by MPI routines MPI_Allreduce and MPI_Iallreduce, respectively. The object ComRequest and the routine ComTest are therefore analogous to MPI_Request and MPI_Test. Such a simple way to reliably use the classical loop stopping criterion in the case of asynchronous iterations is due to [26]. It also allows for considering a counter, , of the number of global convergence tests. On the other hand, the data exchange routine IComSendRecv has to be constructed using, e.g., the MPI routines MPI_Isend and MPI_Irecv. Briefly, the routine IComSendRecvInit triggers non-blocking requests for message sending () and reception (, ), and fills up the components , , of the vector with any arbitrary values. Note that both storage and communication of components , , should actually be limited to values which are necessary for computing the product , according to the nonzero entries in . The subsequent calls to the routine IComSendRecv then check completion of previous requests, update with received data and trigger new instances of the completed requests. Further details can be found in, e.g., [34].
5 Numerical experiments
5.1 Problems and overall settings
Numerical experiments have been conducted on two kinds of problem. The first one consists of a three-dimensional (3D) convection-diffusion equation,
(9) |
with and Dirichlet boundary conditions. Discretization has been achieved using seven-point centered differences for both convection and diffusion terms. A fixed value, , has been used for all elements in the three-dimensional vector as convection parameter. The entries of the exact discrete solution, , have been taken randomly in and the right-hand side has then been constructed as .
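As a sketch of such a discretization, the code below assembles a seven-point centered-difference matrix on the unit cube. It assumes the model form $-\Delta u + a\,(u_x + u_y + u_z) = f$ with a scalar convection value $a$ shared by the three directions, a uniform grid with $n$ interior points per direction, and homogeneous treatment of the Dirichlet boundary (boundary values moved to the right-hand side). The function name, the dictionary storage and the parameter values are hypothetical and are not the settings used in the experiments.

```python
def convection_diffusion_3d(n, a):
    """Assemble the 7-point matrix as a dict {(row, col): value}."""
    h = 1.0 / (n + 1)
    diff = 1.0 / h ** 2   # diffusion weight of each neighbour
    conv = a / (2.0 * h)  # centered convection weight

    def index(i, j, k):
        return (i * n + j) * n + k

    A = {}
    for i in range(n):
        for j in range(n):
            for k in range(n):
                row = index(i, j, k)
                A[(row, row)] = 6.0 * diff
                for axis in range(3):
                    # Backward neighbour gets -1/h^2 - a/(2h), forward one -1/h^2 + a/(2h).
                    for step, weight in ((-1, -diff - conv), (+1, -diff + conv)):
                        nb = [i, j, k]
                        nb[axis] += step
                        if 0 <= nb[axis] < n:
                            A[(row, index(*nb))] = weight
    return A


A = convection_diffusion_3d(3, 1.0)
```

The centered convection term contributes only to the skew-symmetric part of the matrix, while the diffusion term contributes only to the symmetric part, which is what makes this problem a natural test case for HSS-type splittings.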
The second kind of problem consists of a 2D structural dynamics equation (see, e.g., [10, 3]),
(10) |
where and denote the mass and stiffness matrices, respectively; and denote the viscous and hysteretic damping matrices, respectively; denotes the circular frequency. The values of the matrices and the parameters have been taken from [3]. The matrix is the five-point finite difference discretization of a diffusion term on the unit square with Dirichlet boundary conditions. The other matrices have been set as , , , where , and denotes the identity matrix. The circular frequency has been set to . The right-hand side has been taken as with being a vector of , to ensure that all entries of equal .
In the following, parallel execution times (wall-clock), numbers of iterations, , and final residual errors, , are reported for the GMRES [41], the IHSS [5] (Algorithms 1 and 2) and the asynchronous IHSS methods (Algorithm 3), with a stopping criterion set so as to have
In the case of asynchronous execution, minimum and maximum numbers of local iterations, and , respectively, are considered since there is no global iteration count . Both for synchronous and asynchronous HSS(, ) (respectively, Algorithms 2 and 3), we took
All of the tests have been entirely implemented in the Python language, using NumPy, SciPy Sparse and MPI4Py [18] modules.
A comparison with some results in [3] about the problem (10) (Example 4.2 in [3]) is reported in Table 1 for single-process execution of full GMRES, GMRES(restart), and HSS(CG, GMRES(restart)) with inner residual threshold set to in order to compare with an “exact” HSS.
[Table 1: comparison with the results of [3] on problem (10), single-process execution. Table body not recovered.]
The experimentally optimal value of , according to [3], was considered for each problem size ( for , and for ). We recall that the experiments in [3] were run in MATLAB on a personal computer consisting of a 2.66 GHz Intel Core Duo central processing unit (CPU) and 1.97 GB of random access memory (RAM). Our single-process tests, here, have been performed on a computational cluster node consisting of a 2.40 GHz Intel Xeon Skylake CPU and 174 GB of RAM. The same numbers of iterations are obtained for our implementation of HSS(CG, GMRES(10)), where both CG and GMRES’s tolerances were set to , and the HSS experimented in [3] with direct inner solvers. The same result is observed for full GMRES too, while very slight differences appear for the restarted GMRES.
The remaining tests, which involve multi-process execution, have been performed on cluster nodes each consisting of two 12-core 2.30 GHz Intel Xeon Haswell CPUs (24 cores per node) and 48 GB of RAM (2 GB per core). The nodes are interconnected through a 56 Gb/s fourteen data rate (FDR) InfiniBand network, on which the SGI MPT library is used as the implementation of the MPI standard.
5.2 Results on the 3D convection-diffusion problem
5.2.1 Optimal parameters
The 3D convection-diffusion test case (9) was run on an obtained discrete problem with unknowns, using from to processor cores (one MPI process per core).
Table 2 shows execution times for various values of the restart parameter of GMRES.
Restart | Clock (sec), 48 cores | Iterations | Residual | Clock (sec), 192 cores | Iterations | Residual |
---|---|---|---|---|---|---|
5 | 344 | 917 | 9.98E-07 | 187 | 917 | 9.98E-07 |
10 | 251 | 489 | 9.70E-07 | 149 | 489 | 9.70E-07 |
20 | 274 | 318 | 9.44E-07 | 161 | 318 | 9.44E-07 |
30 | 427 | 349 | 9.77E-07 | 247 | 349 | 9.77E-07 |
40 | 614 | 385 | 9.65E-07 | 349 | 385 | 9.65E-07 |
50 | 748 | 393 | 9.59E-07 | 440 | 393 | 9.59E-07 |
100 | 1765 | 457 | 9.80E-07 | 969 | 457 | 9.80E-07 |
(Full) | 2695 | 281 | 8.56E-07 | 1677 | 281 | 8.56E-07 |
This allows us to choose the value 10 as the experimentally optimal one; however, performance for a restart value of 20 was quite similar.
We therefore looked for performance variation of HSS(CG, GMRES(10)) according to its parameter and the inner residual threshold set for both CG and GMRES(10). Convergence was obtained from , which also demonstrated more efficiency than lower thresholds, as shown in Table 3.
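$\alpha$ (inner threshold 1.00E-02) | Clock (sec) | Outer iterations | Inner iterations | Residual | $\alpha$ (inner threshold 1.00E-06) | Clock (sec) | Outer iterations | Inner iterations | Residual |
---|---|---|---|---|---|---|---|---|---|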
0.7 | 718 | 213 | 2182 | 9.84E-07 | 0.9 | 2431 | 270 | 7331 | 9.85E-07 |
0.6 | 712 | 186 | 2124 | 9.57E-07 | 0.8 | 2395 | 240 | 7129 | 9.85E-07 |
0.5 | 665 | 162 | 1949 | 9.94E-07 | 0.7 | 2398 | 210 | 6986 | 9.84E-07 |
0.4 | 844 | 164 | 2148 | 9.76E-07 | 0.6 | 2450 | 180 | 6916 | 9.84E-07 |
Quite surprisingly, the number of outer iterations even slightly increased when switching from to .
While a restart value of 10 resulted in the most efficient executions of the GMRES solver, it does not necessarily prove to be the best choice for HSS(CG, GMRES(restart)) as well. Handling a combination of three parameters, , and GMRES’ restart, is clearly a major drawback of HSS(CG, GMRES(restart)), especially if, additionally, the number of processes (and so, possibly, the load per process) might have an impact too. Our two-stage-splitting-based HSS(, ) with single inner iteration takes the set of parameters back to , as in the case of exact HSS. Moreover, as mentioned in Section 4, avoiding inner solver function calls and loops might constitute an attractive feature, considering pure computer science aspects. This is shown here by comparing Tables 3 and 4.
$\alpha$ | Clock (sec), 48 cores | Iterations | Residual | Clock (sec), 192 cores | Iterations | Residual |
---|---|---|---|---|---|---|
6.0 | 566 | 2348 | 9.98E-07 | 252 | 2307 | 9.98E-07 |
5.0 | 485 | 2008 | 9.99E-07 | 214 | 1965 | 9.94E-07 |
4.0 | 399 | 1657 | 9.94E-07 | 177 | 1611 | 9.98E-07 |
3.0 | 311 | 1288 | 9.90E-07 | 136 | 1239 | 9.70E-07 |
For processes, best execution times of HSS(CG, GMRES(10)) and HSS(, ) are, respectively, 665 and 136 seconds. Note that the former performed 1949 inner iterations while the latter converged in 2576 inner iterations (2 × 1288 outer iterations, since there is one inner iteration using and another one using ). Such a surprisingly small gap in convergence speed confirms the possibility to achieve a faster solver in execution time by avoiding inner function calls and loops. Still, an important drawback for HSS(, ) is that it turned divergent for .
Finally, Table 5 shows that was experimentally optimal for the asynchronous HSS(, ) too. And here as well, divergence has been observed for .
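$\alpha$ | Clock (sec), 48 cores | Min. local iterations | Max. local iterations | Residual | Clock (sec), 192 cores | Min. local iterations | Max. local iterations | Residual |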
---|---|---|---|---|---|---|---|---|
6.0 | 24 | 3134 | 4609 | 4.32E-07 | 7.46 | 7299 | 9491 | 4.83E-07 |
5.0 | 22 | 2812 | 3969 | 4.31E-07 | 7.04 | 6832 | 9175 | 6.57E-07 |
4.0 | 20 | 2573 | 3695 | 4.21E-07 | 6.82 | 6668 | 8846 | 5.12E-07 |
3.0 | 17 | 2278 | 3080 | 5.49E-07 | 6.24 | 5950 | 7996 | 9.78E-07 |
5.2.2 Performance comparison
Using experimentally obtained optimal parameters, a performance comparison on to cores is summarized here in Table 6, where we dropped off the HSS(CG, GMRES(10)) due to memory limits exceeded for .
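Cores | GMRES(10): clock (sec) | Iterations | Residual | HSS(, , 3.0): clock (sec) | Iterations | Residual | Async. HSS(, , 3.0): clock (sec) | Min. local iterations | Max. local iterations | Residual |
---|---|---|---|---|---|---|---|---|---|---|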
48 | 251 | 489 | 9.70E-07 | 311 | 1288 | 9.90E-07 | 17 | 2278 | 3080 | 5.49E-07 |
72 | 197 | 489 | 9.70E-07 | 222 | 1222 | 9.92E-07 | 12 | 3401 | 3912 | 8.44E-07 |
96 | 239 | 489 | 9.70E-07 | 203 | 1177 | 9.92E-07 | 14 | 5682 | 6678 | 9.21E-07 |
120 | 151 | 489 | 9.70E-07 | 193 | 1228 | 9.97E-07 | 12 | 6541 | 8233 | 8.79E-07 |
144 | 169 | 489 | 9.70E-07 | 179 | 1229 | 9.93E-07 | 10 | 7176 | 9394 | 9.50E-07 |
168 | 150 | 489 | 9.70E-07 | 133 | 1240 | 9.89E-07 | 6.20 | 5526 | 7562 | 8.59E-07 |
192 | 149 | 489 | 9.70E-07 | 136 | 1239 | 9.70E-07 | 6.24 | 5950 | 7996 | 9.78E-07 |
One can see a significant gain by asynchronous HSS(, , 3.0), which was, e.g., at processor cores, about 20 times faster (in execution time) than both GMRES(10) and synchronous HSS(, , 3.0). While the second-stage splittings using preconditioners and were introduced here to achieve a fully asynchronous version of HSS, such a gap between the performances of synchronous and asynchronous HSS(, , 3.0) in a homogeneous high-speed computational environment shows that there is a true advantage in resorting to asynchronous iterations, which is not due to possible programming biases introduced by this particular implementation of HSS.
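The speedup figures can be recomputed directly from the wall-clock columns of Table 6. The short script below does so, assuming that the first column of Table 6 is the number of processor cores; the dictionary layout and variable names are hypothetical.

```python
# Wall-clock times (sec) copied from Table 6:
# cores -> (GMRES(10), synchronous HSS, asynchronous HSS).
table6 = {
    48: (251, 311, 17),
    72: (197, 222, 12),
    96: (239, 203, 14),
    120: (151, 193, 12),
    144: (169, 179, 10),
    168: (150, 133, 6.20),
    192: (149, 136, 6.24),
}

# Speedup of the asynchronous variant over GMRES(10) and over synchronous HSS.
speedups = {
    cores: (gmres / asyn, sync / asyn)
    for cores, (gmres, sync, asyn) in table6.items()
}

for cores, (over_gmres, over_sync) in sorted(speedups.items()):
    print(f"{cores:4d} cores: x{over_gmres:5.1f} vs GMRES(10), x{over_sync:5.1f} vs sync. HSS")
```

Over the whole range of core counts, the ratios stay between roughly 12 and 25, so the gain is not specific to one configuration.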
5.3 Results on the 2D structural dynamics problem
5.3.1 Optimal parameters
The complex 2D structural dynamics test case (10) was run on an obtained discrete problem with unknowns, using from to processor cores (one MPI process per core).
Table 7 shows execution times for various values of the restart parameter of GMRES.
Restart | Clock (sec) | Iterations | Residual |
---|---|---|---|
5 | 5405 | 36594 | 1.00E-06 |
10 | 3960 | 19679 | 1.00E-06 |
20 | 3068 | 9072 | 1.01E-06 |
30 | 3053 | 6386 | 1.02E-06 |
40 | 3158 | 5125 | 1.04E-06 |
50 | 3084 | 4080 | 9.84E-07 |
100 | 3433 | 2727 | 7.89E-07 |
(Full) | 7898 | 789 | 9.63E-07 |
This allows us to choose the value 30 as the experimentally optimal one; however, performance for restart values of 20 to 50 was quite similar.
Both HSS(CG, GMRES(30)) and HSS(, ) failed to converge within two hours of execution on cores for various values of their parameters, which made them impractical for the current test case.
Nevertheless, asynchronous HSS(, ) took reasonable times to converge, and Table 8 shows an experimentally optimal . Divergence was observed for .
$\alpha$ | Clock (sec) | Min. local iterations | Max. local iterations | Residual |
---|---|---|---|---|
5.0 | 273 | 398754 | 493820 | 7.19E-07 |
4.0 | 235 | 349111 | 425328 | 8.71E-07 |
3.0 | 198 | 293439 | 357005 | 1.04E-06 |
2.0 | 156 | 231787 | 281838 | 9.50E-07 |
5.3.2 Performance comparison
Using experimentally obtained optimal parameters, a performance comparison on to cores is summarized in Table 9.
Cores | GMRES(30): clock (sec) | Iterations | Residual | Async. HSS(, , 2.0): clock (sec) | Min. local iterations | Max. local iterations | Residual |
---|---|---|---|---|---|---|---|
24 | 2941 | 6486 | 9.99E-07 | 308 | 183861 | 203002 | 8.50E-07 |
30 | 2722 | 6419 | 9.99E-07 | 253 | 212597 | 249716 | 8.81E-07 |
36 | 2967 | 6510 | 1.02E-06 | 241 | 236977 | 277301 | 9.86E-07 |
42 | 2656 | 6479 | 1.02E-06 | 154 | 211052 | 257389 | 1.01E-06 |
48 | 3053 | 6386 | 1.02E-06 | 156 | 231787 | 281838 | 9.50E-07 |
54 | 2829 | 6479 | 1.01E-06 | 159 | 251221 | 310456 | 9.13E-07 |
Again, a significant gain is obtained by asynchronous HSS(, , 2.0), which was, e.g., at processor cores, about 20 times faster than GMRES(30), similarly to the real 3D convection-diffusion test case. Here as well an even more important performance gap is observed between asynchronous and synchronous HSS(, , 2.0) which did not terminate within 7200 seconds. This confirms, for the complex test case as well, the benefit purely from asynchronous iterations.
6 Conclusion
Asynchronous alternating iterations are revealed here as a practical breakthrough in improving computational time of parallel solution of non-Hermitian problems, compared to the well-known GMRES and HSS methods. Classical asynchronous convergence conditions are investigated for a general practical parallel scheme of alternating iterations. In particular, it can result in a two-stage variant of the HSS method with one inner iteration for each of the outer alternating ones. Performance experiments have been conducted for such an asynchronous variant which has significantly outperformed both the GMRES and the classical HSS methods, both on a real convection-diffusion and a complex structural dynamics problem.
Acknowledgement
The paper has been prepared with the support of the “RUDN University Program 5-100”, the French national program LEFE/INSU, the project ADOM (Méthodes de décomposition de domaine asynchrones) of the French National Research Agency (ANR), and using HPC resources from the “Mésocentre” computing center of CentraleSupélec and École Normale Supérieure Paris-Saclay supported by CNRS and Région Île-de-France.
References
- [1] Z.-Z. Bai. On the convergence of additive and multiplicative splitting iterations for systems of linear equations. J. Comput. Appl. Math., 154(1):195–214, 2003.
- [2] Z.-Z. Bai. Regularized HSS iteration methods for stabilized saddle-point problems. IMA J. Numer. Anal., 39(4):1888–1923, 2019.
- [3] Z.-Z. Bai, M. Benzi, and F. Chen. Modified HSS iteration methods for a class of complex symmetric linear systems. Computing, 87(3):93–111, 2010.
- [4] Z.-Z. Bai, G. H. Golub, and C.-K. Li. Optimal parameter in Hermitian and skew-Hermitian splitting method for certain two-by-two block matrices. SIAM J. Sci. Comput., 28(2):583–603, 2006.
- [5] Z.-Z. Bai, G. H. Golub, and M. K. Ng. Hermitian and skew-Hermitian splitting methods for non-Hermitian positive definite linear systems. SIAM J. Matrix Anal. Appl., 24(3):603–626, 2003.
- [6] Z.-Z. Bai and M. Rozložník. On the numerical behavior of matrix splitting iteration methods for solving linear systems. SIAM J. Numer. Anal., 53(4):1716–1737, 2015.
- [7] G. M. Baudet. Asynchronous iterative methods for multiprocessors. J. ACM, 25(2):226–244, 1978.
- [8] D. El Baz, P. Spiteri, J. C. Miellou, and D. Gazen. Asynchronous iterative algorithms with flexible communication for nonlinear network flow problems. J. Parallel Distrib. Comput., 38(1):1–15, 1996.
- [9] M. Benzi. A generalization of the Hermitian and skew-Hermitian splitting iteration. SIAM J. Matrix Anal. Appl., 31(2):360–374, 2009.
- [10] M. Benzi and D. Bertaccini. Block preconditioning of real-valued iterative algorithms for complex linear systems. IMA J. Numer. Anal., 28(3):598–618, 2008.
- [11] M. Benzi and J. Liu. An efficient solver for the incompressible Navier-Stokes equations in rotation form. SIAM J. Sci. Comput., 29(5):1959–1981, 2007.
- [12] M. Benzi and D. B. Szyld. Existence and uniqueness of splittings for stationary iterative methods with applications to alternating methods. Numer. Math., 76(3):309–321, 1997.
- [13] D. Bertaccini, G. H. Golub, S. S. Capizzano, and C. T. Possio. Preconditioned HSS methods for the solution of non-Hermitian positive definite linear systems and applications to the discrete convection-diffusion equation. Numer. Math., 99(3):441–484, 2005.
- [14] D. P. Bertsekas. Distributed asynchronous computation of fixed points. Math. Program., 27(1):107–120, 1983.
- [15] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1989.
- [16] D. Chazan and W. Miranker. Chaotic relaxation. Linear Algebra Appl., 2(2):199–222, 1969.
- [17] V. Conrad and Y. Wallach. Alternating methods for sets of linear equations. Numer. Math., 32(1):105–108, 1979.
- [18] L. D. Dalcín, R. R. Paz, and M. A. Storti. MPI for Python. J. Parallel Distrib. Comput., 65(9):1108–1115, 2005.
- [19] J. Douglas. On the numerical integration of $\partial^2 u/\partial x^2 + \partial^2 u/\partial y^2 = \partial u/\partial t$ by implicit methods. J. Soc. Ind. Appl. Math., 3(1):42–65, 1955.
- [20] M. El Haddad, J. C. Garay, F. Magoulès, and D. B. Szyld. Synchronous and asynchronous optimized Schwarz methods for one-way subdivision of bounded domains. Numer. Linear Algebra Appl., 27(2):e2227, 2020.
- [21] M. N. El Tarazi. Some convergence results for asynchronous algorithms. Numer. Math., 39(3):325–340, 1982. (in French).
- [22] K. Fan. Topological proofs for certain theorems on matrices with non-negative elements. Monatshefte für Mathematik, 62:219–237, 1958.
- [23] A. Frommer and D. B. Szyld. H-splittings and two-stage iterative methods. Numer. Math., 63(1):345–356, 1992.
- [24] A. Frommer and D. B. Szyld. Asynchronous two-stage iterative methods. Numer. Math., 69(2):141–153, 1994.
- [25] A. Frommer and D. B. Szyld. Asynchronous iterations with flexible communication for linear systems. Calculateurs Parallèles, 10:421–429, 1998.
- [26] G. Gbikpi-Benissan and F. Magoulès. Protocol-free asynchronous iterations termination. Adv. Eng. Softw., 146:102827, 2020.
- [27] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. Journal of Research of the National Bureau of Standards, 49(6):409–436, 1952.
- [28] Y.-M. Huang. A practical formula for computing optimal parameters in the HSS iteration methods. J. Comput. Appl. Math., 255:142–149, 2014.
- [29] C.-X. Li and S.-L. Wu. A single-step HSS method for non-Hermitian positive definite linear systems. Appl. Math. Lett., 44:26–29, 2015.
- [30] L. Li, T.-Z. Huang, and X.-P. Liu. Modified Hermitian and skew-Hermitian splitting methods for non-Hermitian positive-definite linear systems. Numer. Linear Algebra Appl., 14(3):217–235, 2007.
- [31] F. Magoulès and G. Gbikpi-Benissan. JACK: An asynchronous communication kernel library for iterative algorithms. J. Supercomput., 73(8):3468–3487, 2017.
- [32] F. Magoulès and G. Gbikpi-Benissan. Asynchronous Parareal time discretization for partial differential equations. SIAM J. Sci. Comput., 40(6):C704–C725, 2018.
- [33] F. Magoulès and G. Gbikpi-Benissan. Distributed convergence detection based on global residual error under asynchronous iterations. IEEE Trans. Parallel Distrib. Syst., 29(4):819–829, 2018.
- [34] F. Magoulès and G. Gbikpi-Benissan. JACK2: An MPI-based communication library with non-blocking synchronization for asynchronous iterations. Adv. Eng. Softw., 119:116–133, 2018.
- [35] F. Magoulès, G. Gbikpi-Benissan, and Q. Zou. Asynchronous iterations of Parareal algorithm for option pricing models. Mathematics, 6(4):1–18, 2018.
- [36] F. Magoulès, D. B. Szyld, and C. Venet. Asynchronous optimized Schwarz methods with and without overlap. Numer. Math., 137(1):199–227, 2017.
- [37] F. Magoulès and C. Venet. Asynchronous iterative sub-structuring methods. Math. Comput. Simul., 145:34–49, 2018.
- [38] G. I. Marchuk. Splitting and alternating direction methods. In Handbook of Numerical Analysis, volume 1, pages 197–462. Elsevier, 1990.
- [39] J.-C. Miellou. Algorithmes de relaxation chaotique à retards. ESAIM: M2AN, 9(R1):55–82, 1975. (in French).
- [40] D. W. Peaceman and H. H. Rachford. The numerical solution of parabolic and elliptic differential equations. J. Soc. Indust. Appl. Math., 3(1):28–41, 1955.
- [41] Y. Saad and M. H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Statist. Comput., 7(3):856–869, 1986.
- [42] S. Schechter. Relaxation methods for linear equations. Comm. Pure Appl. Math., 12(2):313–335, 1959.
- [43] J. W. Sheldon. On the numerical solution of elliptic difference equations. MTAC, 9(51):101–112, 1955.
- [44] S.-L. Wu. Several variants of the Hermitian and skew-Hermitian splitting method for a class of complex symmetric linear systems. Numer. Linear Algebra Appl., 22(2):338–356, 2015.
- [45] I. Yamazaki, E. Chow, A. Bouteiller, and J. J. Dongarra. Performance of asynchronous optimized Schwarz with one-sided communication. Parallel Comput., 86:66–81, 2019.
- [46] Q. Zou and F. Magoulès. Parameter estimation in the Hermitian and skew-Hermitian splitting method using gradient iterations. Numer. Linear Algebra Appl., 27:e2304, 2020.