C. T. Kelley
North Carolina State University,
Department of Mathematics,
Box 8205, Raleigh, NC 27695-8205, USA
This work was partially supported by
the Center for Exascale Monte-Carlo Neutron Transport (CEMeNT), a
PSAAP-III project funded by the Department of Energy,
grant number DE-NA003967.
Abstract
We make the interprecision transfers explicit in an algorithmic
description of iterative refinement and obtain new insights into
the algorithm. One example is
the classic variant of iterative refinement where the
matrix and the factorization are stored in a working precision and
the residual is evaluated in a higher precision. In that case
we make the observation
that this algorithm will solve a promoted form of the original problem
and thereby characterize the limiting behavior in a novel way and obtain
a different version of the classic convergence
analysis. We also discuss two approaches
for interprecision transfer in the triangular solves.
Keywords:
Iterative refinement, Interprecision transfers,
Mixed-precision arithmetic, Linear systems
AMS subject classifications:
65F05, 65F10
1 Introduction
Iterative refinement (IR) is a way to lower factorization costs
in the numerical solution of a linear system
by performing the factorization in a lower precision.
Algorithm IR-V0
is a simple formulation using Gaussian elimination. In this
formulation all computations are done in a high precision
except for the factorization.
$x = 0$
$r = b$
Factor $A = LU$ in a lower precision
while $\|r\|$ too large do
  $d = U^{-1} L^{-1} r$
  $x \leftarrow x + d$
  $r = b - Ax$
endwhile
Algorithm IR-V0 leaves out many implementation details. Some
recent papers
[1, 6, 7, 9] have made the
algorithmic details explicit. The purpose of this paper is to build
upon that work by explicitly including the interprecision transfers
in the algorithmic description. One consequence of this, which
we discuss in § 2.2 and
§ 3.1, is a novel
interpretation of the classic form of the method [22]
where the residual is evaluated in an extended precision.
1.1 Notation
We use the terminology from [9, 1]
and consider several precisions. We will use Julia-like notation for
data types.
•
The matrix $A$ and right side $b$ are stored in the
working precision TW.
•
$A$ is factored in the factorization precision TF.
•
The residual $r = b - Ax$ is computed in the residual precision TR.
•
The triangular solves for $LUd = r$ are done in the
solver precision TS.
If TH is a higher precision than TL we will write $TL < TH$.
$fl_X$ will be the rounding operation to precision TX. We will let
$\mathcal{F}_X$ be the floating point numbers with precision TX.
We will assume that
a low precision number can be exactly represented in any higher
precision. So, if $TL < TH$,
$\mathcal{F}_L \subset \mathcal{F}_H$. (1)
We will let $u_X$ denote the unit roundoff for precision TX and let
$I_X^Y$ denote the interprecision transfer from precision TX to
precision TY. Interprecision transfer is more than rounding and can,
in some cases, include data allocation. If $TX < TY$ ($TX > TY$)
then we will call the transfer upcasting
(downcasting).
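Upcasting is simpler than downcasting because if $TL < TH$ and
$x \in \mathcal{F}_L$, then (1) implies that
$I_L^H(x) = x$. (2)
However, upcasting is not linear. In fact if $A \in \mathcal{F}_L^{N \times N}$ and $x \in \mathcal{F}_L^N$ there is
no reason to expect that $I_L^H(Ax) = (I_L^H A)(I_L^H x)$
because the multiplication on the left is done in a lower precision than
the one on the right.
Downcasting is more subtle because not only is it nonlinear but also
if $TL < TH$ and $x \in \mathcal{F}_H$,
then the downcasting analog of (2), $I_H^L(x) = x$, holds only if $x \in \mathcal{F}_L$.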
The nonlinearity of interprecision transfers should be made explicit in
analysis especially for the triangular solves (see § 2.3).
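The properties of upcasting and downcasting described above can be checked with a few scalars. The following is a minimal sketch that assumes TL is IEEE single and TH is IEEE double; the helper `fl32` is a hypothetical name introduced here, and it emulates rounding to single with Python's standard `struct` module.

```python
import struct

def fl32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value (fl_L, TL = single)."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Downcasting (double -> single) changes a value that is not in F_L.
x = 0.1
x_low = fl32(x)
print(x_low == x)              # False: the double 0.1 is not a single

# Upcasting (single -> double) is exact: x_low is held exactly in a
# Python float, and rounding it to single again returns the same number.
print(fl32(x_low) == x_low)    # True

# Transfers do not commute with arithmetic. The product of two singles,
# computed in single and then promoted, differs from the product of the
# promoted operands computed in double.
prod_low_then_up = fl32(x_low * x_low)   # multiply in single, then upcast
prod_up_then_mul = x_low * x_low         # upcast, then multiply in double
print(prod_low_then_up == prod_up_then_mul)   # False
```

The product of two singles is exact in double, so rounding that product once to single reproduces what single precision hardware would return.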
2 Interprecision transfers in IR
Algorithm IR-V0 includes many implicit interprecision transfers. We
will make the assumption, which holds in IEEE [14] arithmetic,
that when a binary operation is performed between floating
point numbers with different precisions, then the lower precision number
is promoted before the operation is performed.
So, if $TL < TH$, $u \in \mathcal{F}_L$, $v \in \mathcal{F}_H$, and $\circ$ is a binary operation, then the computed value of $u \circ v$ is
$fl_H\left( (I_L^H u) \circ v \right)$. (3)
We will use (3) throughout this paper when we need to
make the implicit interprecision transfers explicit.
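As an illustration of the promotion rule (3), the sketch below compares a mixed-precision sum done with promotion against the alternative of demoting the high precision operand. It assumes TL is single and TH is double; `fl32` is a hypothetical helper that emulates rounding to single, and the variable names are chosen for this example only.

```python
import struct

def fl32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

u = fl32(1.0 / 3.0)   # u is a single; a Python float holds it exactly
v = 1.0 / 3.0         # v is a double

# Rule (3): promote u (exact), then add in double.
promoted = u + v

# The alternative that (3) rules out: demote v, then add in single.
demoted = fl32(u + fl32(v))

exact = 2.0 / 3.0
print(abs(promoted - exact), abs(demoted - exact))
```

With promotion only the rounding error already present in `u` remains, while demoting `v` first introduces a second single precision error.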
2.1 The low precision factorization
We will begin with the low precision factorization. The
line in Algorithm IR-V0 for this is
•
Factor $A = LU$ in a lower precision.
However, $A$ is stored in precision TW
and the factorization is in
precision TF. Hence one must make a copy of $A$. So to make these
interprecision transfers explicit we should express this as
•
Make a low precision copy of $A$: $A_F = I_W^F(A)$.
•
Compute an $LU$ factorization of $A_F$, overwriting
$A_F$.
So in this way the storage costs of IR become clear and we can
see the time/storage tradeoff. If, for example, TW is double precision and
TF is single precision, one must allocate storage for $A_F$, which is a
50% increase in matrix storage.
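The storage arithmetic behind the 50% figure is short enough to write out. This sketch assumes 8 bytes per entry in the working precision and 4 bytes per entry in the factorization precision; the function name and its parameters are hypothetical.

```python
def ir_matrix_storage(n: int, bytes_working: int = 8, bytes_factor: int = 4):
    """Return (bytes for A in TW, bytes for the low precision copy in TF)."""
    working = n * n * bytes_working
    low_copy = n * n * bytes_factor
    return working, low_copy

working, low_copy = ir_matrix_storage(4096)
print(low_copy / working)        # 0.5: the copy adds 50% to the matrix storage

# With a 2-byte factorization precision the copy would add 25% instead.
working16, low_copy16 = ir_matrix_storage(4096, bytes_factor=2)
print(low_copy16 / working16)    # 0.25
```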
2.2 The residual precision
For the remainder of this paper we use the $\ell^\infty$ norm, so
$\| \cdot \| = \| \cdot \|_\infty$.
The algorithm IR-V0 does not make it clear how the residual precision
affects the iteration. The description in [9] carefully
explains what one must do.
•
Store $x$ in precision TR.
•
Solve the triangular system $LUd = r$ in precision TS.
•
Store $r$ in precision TR.
The most important consequence of this and (3) is
that the residual is
$r = (I_W^R b) - (I_W^R A) x$.
While $I_W^R A = A$ by our assumption that $\mathcal{F}_W \subset \mathcal{F}_R$, the
residual is computed in precision TR and is therefore the residual
of a promoted problem and the iteration is approximating
the solution of that promoted problem
$(I_W^R A) x = I_W^R b$, (4)
which is posed in precision TR. We make a distinction between
$A$ and $I_W^R A$ only in those cases, such as
(4), where we are talking about the computed solution
in the residual precision. In exact arithmetic, of course,
$(I_W^R A)^{-1} (I_W^R b) = A^{-1} b$.
All of these interprecision transfers are implicit and need not be done
within the iteration. For example, when one computes $Ax$ in precision
TR, then (3) implies that
the matrix-vector product automatically promotes the
elements of $A$ to precision TR because $x$ is stored in precision
TR.
We can use the fact that convergence is to the solution of the
promoted problem to get a simple error estimate for
the classic special case [22]. Here
. If the iteration terminates with
In (5)
we use the fact that in the
norm and use the exact value of , which
is the same as by (1).
The estimate (5) is also
true if TR = TW, but then there is no need to consider a promoted
problem.
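The reasoning behind a residual-based error estimate of this type is the standard one and can be sketched as follows. The symbols $x^*$ (the exact solution), $x$ (the final iterate), and $\kappa(A) = \|A\| \, \|A^{-1}\|$ are notation assumed here for illustration.

```latex
% Assumed notation: x^* solves A x^* = b, x is the final iterate,
% r = b - A x is its residual, and \kappa(A) = \|A\| \|A^{-1}\|.
\begin{align*}
  \|x^* - x\| &= \|A^{-1} r\| \le \|A^{-1}\| \, \|r\|, \\
  \|b\| &= \|A x^*\| \le \|A\| \, \|x^*\|, \\
  \frac{\|x^* - x\|}{\|x^*\|}
    &\le \|A\| \, \|A^{-1}\| \, \frac{\|r\|}{\|b\|}
     = \kappa(A) \, \frac{\|r\|}{\|b\|}.
\end{align*}
```

For the promoted problem one reads $A$ and $b$ as their promoted copies. By (1) the promoted entries are equal to the stored ones, so the norms and the condition number are unchanged.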
2.3 The triangular solve
The choice of TS affects the number of interprecision transfers
and the storage cost in the
triangular solve. The line in Algorithm IR-V0 for this is
•
$d = U^{-1} L^{-1} r$.
If TS = TR, then LAPACK’s default behavior
is to do the interprecision transfers as needed using (3). The
subtle consequence of this is that
$d = (I_F^R U)^{-1} (I_F^R L)^{-1} r$.
Hence, one is implicitly doing the triangular solves with the factors
promoted to the residual precision. We will refer to this approach
as on-the-fly interprecision transfers.
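Combining the results with § 2.2 we can expose all the
interprecision transfers in the transition
from a current iteration $x_c$ to a new one $x_+$. In the case
TS = TR we have a linear stationary iterative method.
The computation is done entirely in TR.
$x_+ = M x_c + (I_F^R U)^{-1} (I_F^R L)^{-1} (I_W^R b)$, (6)
where the iteration matrix is
$M = I - (I_F^R U)^{-1} (I_F^R L)^{-1} (I_W^R A)$. (7)
The residual update is
$r_+ = M_r r_c$, (8)
where
$M_r = I - (I_W^R A) (I_F^R U)^{-1} (I_F^R L)^{-1}$. (9)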
One must remember that if and
then because
the matrices are in different precisions and matrix-vector products produce
different results.
All the interprecision transfers
in (7) and (9)
are implicit and the promoted matrices
are not actually stored. However, the
promotions matter because they can help avoid underflows and overflows
and influence the limit of the iteration.
If TS = TF, then the interprecision transfer is done before
the triangular solves and the number of interprecision transfers is
$O(N)$ rather than $O(N^2)$. We will refer to this as interprecision transfer
in-place to distinguish it from on-the-fly.
For in-place interprecision transfers we copy $r$ from the residual
precision to the factorization precision before the solve and
then upcast
the output of the solve back to the residual precision.
So, one must store the low precision copy of $r$.
In this case one should scale $r$ before the downcasting transfer
[13]. One reason for this is that the
absolute size of $r$ could be very small, as would be the case in the
terminal phase of IR, and one could underflow before the iteration is
complete.
So the iteration in this case is
$x_+ = x_c + \|r\| \, I_F^R \left( U^{-1} L^{-1} I_R^F ( r / \|r\| ) \right)$. (10)
This is not a stationary linear iterative method because the map
$r \mapsto I_R^F ( r / \|r\| )$ is nonlinear.
Even though downcasting reduces the interprecision transfer cost,
one should do interprecision transfers on-the-fly if TF is half precision,
if $TR > TW$, or if one is using the low-precision factorization as a
preconditioner [1, 7].
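The two approaches to the triangular solves can be compared on a tiny example. The sketch below assumes TF is single and TR is double, emulates single precision arithmetic by rounding after every operation with the hypothetical helper `fl32`, and uses 2-by-2 factors chosen so that their entries are exact in single. The names `lu_solve`, `d_otf`, `d_ip`, and `d_unscaled` are introduced only for this illustration.

```python
import struct

def fl32(x: float) -> float:
    """Round a binary64 value to binary32 (tiny values underflow to 0.0)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_solve(L, U, r, rnd=lambda t: t):
    """Solve L U d = r with L unit lower triangular.

    rnd is applied after every arithmetic operation, so rnd=fl32 emulates
    a solve in single and the default emulates a solve in double.
    """
    n = len(r)
    d = list(r)
    for i in range(n):                        # forward substitution with L
        for j in range(i):
            d[i] = rnd(d[i] - rnd(L[i][j] * d[j]))
    for i in reversed(range(n)):              # back substitution with U
        for j in range(i + 1, n):
            d[i] = rnd(d[i] - rnd(U[i][j] * d[j]))
        d[i] = rnd(d[i] / U[i][i])
    return d

# Factors whose entries are exactly representable in single (TF).
L = [[1.0, 0.0], [0.5, 1.0]]
U = [[4.0, 1.0], [0.0, 2.5]]
r = [3e-50, 1e-50]            # a tiny residual held in double (TR)

# On-the-fly: factor entries are promoted inside each double operation.
d_otf = lu_solve(L, U, r)

# In-place: scale, downcast r (N numbers), solve in single, upcast, unscale.
r_norm = max(abs(t) for t in r)
r_low = [fl32(t / r_norm) for t in r]
d_ip = [r_norm * t for t in lu_solve(L, U, r_low, rnd=fl32)]

# Without scaling the downcast underflows and the correction is lost.
d_unscaled = lu_solve(L, U, [fl32(t) for t in r], rnd=fl32)

print(d_otf, d_ip, d_unscaled)
```

The in-place correction agrees with the on-the-fly one only to roughly single precision, while the unscaled downcast returns a zero correction because the residual entries are below the smallest positive single.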
3 Explicit Interprecision Transfers
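We apply the results from § 2 to
Algorithm IR-V0 and obtain Algorithm IR-V1, where
all the interprecision transfers are explicit.
$x = 0$, $r = b$, $A_F = I_W^F(A)$.
Factor $A_F = LU$ in precision TF
while $\|r\|$ too large do
  $d = U^{-1} L^{-1} r$, with the transfers of § 2.3 for the chosen TS
  $x \leftarrow x + d$
  $r = (I_W^R b) - (I_W^R A) x$ in precision TR
endwhile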
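To make the structure of Algorithm IR-V1 concrete, here is a minimal sketch under simplifying assumptions that are not taken from the text: TW = TR = TS is double, TF is single (emulated with the hypothetical helper `fl32`), the factorization uses no pivoting, and the 3-by-3 test matrix is diagonally dominant so that omitting pivoting is safe. The function names and the relative residual tolerance are illustrative only.

```python
import struct

def fl32(x: float) -> float:
    """Round a binary64 value to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_single(A):
    """LU without pivoting; every operation is rounded to single (TF)."""
    n = len(A)
    AF = [[fl32(a) for a in row] for row in A]   # low precision copy of A
    for k in range(n):
        for i in range(k + 1, n):
            AF[i][k] = fl32(AF[i][k] / AF[k][k])
            for j in range(k + 1, n):
                AF[i][j] = fl32(AF[i][j] - fl32(AF[i][k] * AF[k][j]))
    return AF   # L below the diagonal (unit diagonal implied), U on and above

def lu_solve(LU, r):
    """Triangular solves in double; factor entries are promoted on the fly."""
    n = len(r)
    d = list(r)
    for i in range(n):
        for j in range(i):
            d[i] -= LU[i][j] * d[j]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            d[i] -= LU[i][j] * d[j]
        d[i] /= LU[i][i]
    return d

def residual(A, x, b):
    """r = b - A x computed in double."""
    return [bi - sum(a * t for a, t in zip(row, x)) for row, bi in zip(A, b)]

def norm_inf(v):
    return max(abs(t) for t in v)

def ir_v1(A, b, tol=1e-14, maxit=20):
    LU = lu_single(A)
    x = [0.0] * len(b)
    r = list(b)
    history = [norm_inf(r)]
    while history[-1] > tol * norm_inf(b) and len(history) <= maxit:
        d = lu_solve(LU, r)
        x = [xi + di for xi, di in zip(x, d)]
        r = residual(A, x, b)
        history.append(norm_inf(r))
    return x, history

A = [[4.0, 1.0, 0.1], [1.0, 3.0, 0.3], [0.2, 0.7, 5.0]]
x_true = [1.0, -2.0, 0.5]
b = [sum(a * t for a, t in zip(row, x_true)) for row in A]
x, history = ir_v1(A, b)
print(history)   # residual norms, one per iteration
```

The residual history shows the expected pattern: each pass through the loop reduces the residual by a factor governed by the single precision factorization until the double precision residual floor is reached.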
In the remainder of this section we look at some consequences of
this formulation of IR.
3.1 The case TS = TR
In this section we will assume that the triangular solves
are done in the residual precision (TS = TR). In this case, one can
see from Algorithm IR-V1 that no computations are done in the
working precision at all. The working precision is only used to store
$A$ and $b$, but residual computations are done in the residual precision
with promotion on-the-fly. We state this observation as a theorem.
Theorem 3.1.
If TS = TR, then the three precision algorithm (with precisions TW, TF, and TR)
produces the same computed results as the two precision algorithm (with precisions TF and TR)
applied to the promoted problem (4).
The theorem makes it clear that the iteration is reducing the residual of the
promoted problem. So we can apply the classical ideas for IR-V0
[11, 12]
and understand the case in that way. For example,
if is not highly
ill-conditioned, the LU factorization is stable, and the norm of
iteration matrix for IR , then
we can use equation (4.9) from [12]
to obtain
(11)
where . Hence we will be able to
reduce until the iteration saturates with
.
Since one has no a priori knowledge of
, one must manage the iteration in a way to detect stagnation.
The recommendation from
[12] is to terminate the iteration when
1.
,
2.
, or
3.
too many iterations have been performed.
The first item in the list is successful convergence where we approximate
the norms of the promoted objects $I_W^R A$ and $I_W^R b$ with the
ones in the working precision, which we have stored.
The two failure modes are insufficient decrease in the residual
and slow convergence in the terminal phase.
The recommendation in [12] was to set
and , and to limit IR to five iterations.
Our Julia code [18] uses a variation of this approach.
We use in our solver and do not
put a limit on the iterations. The reason for these choices is to give
the IR iteration a better chance to terminate successfully.
So, if we couple the termination strategy with (5) we see
that if the iteration terminates successfully then
When one can also attempt to estimate the
convergence rate from (11) and then estimate
the error . This is a common strategy in the
nonlinear solver literature, especially for stiff initial
value problems [3, 4, 5, 20, 15].
The idea is that as the iteration progresses
is a very good estimate if is small enough. In that case
and one can terminate the iteration when one predicts that
is sufficiently small. The algorithm in [9]
does this and terminates when the predicted error is less than .
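The termination logic described in this section can be sketched as two small functions. Everything specific in the sketch is an assumption made for illustration: the form of the convergence test (a residual bound scaled by the stored norms of $A$ and $b$), the default constants `c_e`, `c_r`, and `maxit`, the unit roundoff `u`, and the function names. The error prediction uses the geometric series argument: if successive corrections shrink by a factor `sigma`, the corrections still to come sum to roughly the last correction times `sigma / (1 - sigma)`.

```python
def ir_status(r_norm, r_prev_norm, x_norm, a_norm, b_norm, it,
              u=2.0**-53, c_e=1.0, c_r=0.5, maxit=5):
    """Classify an IR iterate; the constants are illustrative placeholders."""
    if r_norm <= c_e * u * (b_norm + a_norm * x_norm):
        return "converged"
    if r_norm >= c_r * r_prev_norm:
        return "stagnated"          # insufficient decrease in the residual
    if it >= maxit:
        return "iteration limit"
    return "continue"

def predicted_error(d_prev_norm, d_norm):
    """Geometric series estimate of the remaining error from two corrections."""
    sigma = d_norm / d_prev_norm    # estimated convergence rate
    if sigma >= 1.0:
        return float("inf")         # no contraction observed
    return d_norm * sigma / (1.0 - sigma)

print(ir_status(1e-17, 1e-9, 1.0, 1.0, 1.0, it=2))
print(ir_status(6e-10, 1e-9, 1.0, 1.0, 1.0, it=2))
print(predicted_error(1e-3, 1e-6))
```

Checking for convergence before checking for stagnation means that an iterate that has reached the residual floor is reported as a success even if the last step gained little.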
3.2 Cost of Interprecision Transfers
One case where setting TS = TF may be useful is if
TW = TR is double precision and TF is single precision. In this case, unlike the case where TF is half precision,
tools such as LAPACK and BLAS have been compiled to work efficiently.
The cases of interest in this section are medium sized
problems where the cost of the LU factorization is a few times
more than the cost of the triangular solves, but not orders of magnitude more.
In these cases the cost of interprecision transfers on the fly
is noticeable and could make a difference in cases where many triangular
solves are done for each factorization. One example is for nonlinear
solvers [15, 17] where the factorization of the
Jacobian can be reused for many Newton iterations (or even time
steps when solving stiff initial value problems [20, 21]).
We illustrate this with some CPU timings. The computations in this section
were done on an Apple Macintosh
Mini Pro with an M2 processor and eight performance
cores. We used OpenBLAS, which satisfies (3), rather than the
AppleAccelerate Framework, which does not. We used
Julia [2] v1.11.0-beta2 with the author’s
MultiPrecisionArrays.jl [18, 19]
Julia package. We made this choice because
Julia v1.11.0 has faster matrix-vector products
than the current version v1.10.4.
We used the Julia package BenchmarkTools.jl
[8] to get the timings we report in
Table 1. This is the standard way to obtain timings in Julia.
BenchmarkTools repeats computations and can obtain accurate results even if
the compute time per run is very small.
We have put the Julia codes that generate Table 1
from § 3.2.1 in a
GitHub repository
We will use a concrete example rather than generating random problems. For a
given dimension $N$ let $G$ be the matrix corresponding to the composite
trapezoid rule discretization of the Green’s operator
for $-d^2/dx^2$ on $[0,1]$.
The eigenvalues of $G$ are $1/(n^2 \pi^2)$ for $n = 1, 2, \dots$.
We use in this example. The conditioning of
is somewhat poor with an condition number of roughly
.
for the dimensions we consider in this section.
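A matrix of this kind can be built in a few lines. The sketch below assumes homogeneous Dirichlet boundary conditions, for which the Green's function of $-d^2/dx^2$ on $[0,1]$ is $g(x,y) = y(1-x)$ for $y < x$ and $x(1-y)$ otherwise; that kernel, the uniform grid that includes both endpoints, and the function name `greens_matrix` are assumptions introduced here. It then checks that $\sin(\pi x)$ is nearly an eigenvector with eigenvalue $1/\pi^2$.

```python
import math

def greens_matrix(n: int):
    """Trapezoid rule discretization of the Green's operator for -u'' on [0, 1].

    Assumes homogeneous Dirichlet conditions. The kernel vanishes at y = 0
    and y = 1, so the half weights of the trapezoid rule at the endpoints
    multiply zeros and a uniform weight h can be used for every column.
    """
    h = 1.0 / (n - 1)
    x = [i * h for i in range(n)]
    G = [[h * (x[j] * (1.0 - x[i]) if x[j] < x[i] else x[i] * (1.0 - x[j]))
          for j in range(n)] for i in range(n)]
    return x, G

n = 101
x, G = greens_matrix(n)
v = [math.sin(math.pi * t) for t in x]            # first eigenfunction
Gv = [sum(g * t for g, t in zip(row, v)) for row in G]
err = max(abs(a - t / math.pi**2) for a, t in zip(Gv, v))
print(err)   # small: the discretization error is second order in h
```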
We terminate the IR iteration with the residual condition from
§ 3.1 and tabulate the dimension, the time (LU) for
copying from TW to TF and performing the
LU factorization in TF (column 2 of Table 1),
the timings for the two variants of the
triangular solves (OTF = on-the-fly: column 3, IP = in-place: column 4),
and the times
and iteration counts for the IR loop with the two variations of the
triangular solves (columns 5–8).
Table 1 shows that the
factorization time is between 6 and 10 times that of the on-the-fly triangular
solve indicating that the triangular solves could be significant
if the matrix-vector products were fast or one had to solve for many
right hand sides. We saw that effect in the nonlinear
examples in [16, 17] where the nonlinear residual
could be evaluated in work.
One can also see that
the CPU time for the in-place
triangular solves is 2–5 times less than for the on-the-fly version.
In the final four columns we see that,
while the version with in-place triangular solves is somewhat faster, the
difference is not compelling. This is no surprise because the
matrix-vector product in the residual precision (double) takes $O(N^2)$ work
and is therefore a
significant part of the cost for each IR iteration. The number of
iterations is the same in all but one case, with in-place triangular
solves taking one more iteration in that case. That is consistent with
the prediction in [7].
4 Conclusions
We expose the interprecision transfers in iterative refinement and obtain
new insights into this classical algorithm. In particular we show that
the version in which the residual is evaluated in an extended precision is
equivalent to solving a promoted problem and show how interprecision transfers
affect the triangular solves.
5 Acknowledgments
The author is very grateful to Ilse Ipsen for listening to him
as he worked through the ideas for this paper.
References
[1]P. Amestoy, A. Buttari, N. J. Higham, J.-Y. L’Excellent, T. Mary, and
B. Vieublé, Five-precision GMRES-based iterative refinement, SIAM
Journal on Matrix Analysis and Applications, 45 (2024), pp. 529–552,
https://doi.org/10.1137/23M1549079.
[2]J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, Julia: A
fresh approach to numerical computing, SIAM Review, 59 (2017), pp. 65–98.
[3]K. E. Brenan, S. L. Campbell, and L. R. Petzold, The Numerical
Solution of Initial Value Problems in Differential-Algebraic Equations,
no. 14 in Classics in Applied Mathematics, SIAM, Philadelphia, 1996.
[4]P. N. Brown, G. D. Byrne, and A. C. Hindmarsh, VODE: A variable
coefficient ODE solver, SIAM J. Sci. Statist. Comput., 10 (1989),
pp. 1038–1051.
[5]P. N. Brown, A. C. Hindmarsh, and L. R. Petzold, Using Krylov
methods in the solution of large-scale differential-algebraic systems, SIAM
J. Sci. Comput., 15 (1994), pp. 1467–1488.
[6]E. Carson and N. J. Higham, A new analysis of iterative refinement
and its application to accurate solution of ill-conditioned sparse linear
systems, SIAM Journal on Scientific Computing, 39 (2017), pp. A2834–A2856,
https://doi.org/10.1137/17M112291.
[7]E. Carson and N. J. Higham, Accelerating the solution of linear
systems by iterative refinement in three precisions, SIAM Journal on
Scientific Computing, 40 (2018), pp. A817–A847,
https://doi.org/10.1137/17M1140819.
[9]J. Demmel, Y. Hida, W. Kahan, X. S. Li, S. Mukherjee, and E. J. Riedy,
Error bounds from extra-precise iterative refinement, ACM Trans. Math.
Soft., (2006), pp. 325–351.
[10]J. W. Demmel, Applied Numerical Linear Algebra, SIAM, Philadelphia,
1997.
[11]N. J. Higham, Accuracy and Stability of Numerical Algorithms,
Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1996,
http://www.ma.man.ac.uk/~higham/asna.html.
[12]N. J. Higham, Iterative refinement for linear systems and LAPACK,
IMA J. Numer. Anal., 17 (1997), pp. 495–509.
[13]N. J. Higham, S. Pranesh, and M. Zounon, Squeezing a matrix into
half precision, with an application to solving linear systems, SIAM J. Sci.
Comp., 41 (2019), pp. A2536–A2551.
[14]IEEE Computer Society, IEEE standard for floating-point
arithmetic, IEEE Std 754–2019, July 2019.
[15]C. T. Kelley, Iterative Methods for Linear and Nonlinear
Equations, no. 16 in Frontiers in Applied Mathematics, SIAM, Philadelphia,
1995.
[17]C. T. Kelley, Solving Nonlinear Equations with Iterative Methods:
Solvers and Examples in Julia, no. 20 in Fundamentals of Algorithms, SIAM,
Philadelphia, 2022.
[20]L. R. Petzold, A description of DASSL: a differential/algebraic
system solver, in Scientific Computing, R. S. Stepleman et al., ed., North
Holland, Amsterdam, 1983, pp. 65–68.
[21]L. F. Shampine, Numerical Solution of Ordinary Differential
Equations, Chapman and Hall, New York, 1994.
[22]J. H. Wilkinson, Progress report on the automatic computing engine,
Tech. Report MA/17/1024, Mathematics Division, Department of Scientific and
Industrial Research, National Physical Laboratory, Teddington, UK, 1948,
http://www.alanturing.net/turing_archive/archive/l/l10/l10.php.