Interprecision transfers in iterative refinement

C. T. Kelley, North Carolina State University, Department of Mathematics, Box 8205, Raleigh, NC 27695-8205, USA. This work was partially supported by the Center for Exascale Monte-Carlo Neutron Transport (CEMeNT), a PSAAP-III project funded by Department of Energy grant number DE-NA003967.
Abstract

We make the interprecision transfers explicit in an algorithmic description of iterative refinement and obtain new insights into the algorithm. One example is the classic variant of iterative refinement where the matrix and the factorization are stored in a working precision and the residual is evaluated in a higher precision. In that case we make the observation that this algorithm will solve a promoted form of the original problem and thereby characterize the limiting behavior in a novel way and obtain a different version of the classic convergence analysis. We also discuss two approaches for interprecision transfer in the triangular solves.

Keywords: Iterative refinement, Interprecision transfers, Mixed-precision arithmetic, Linear systems

AMS subject classifications: 65F05, 65F10

1 Introduction

Iterative refinement (IR) is a way to lower factorization costs in the numerical solution of a linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$ by performing the factorization in a lower precision. Algorithm IR-V0 is a simple formulation using Gaussian elimination. In this formulation all computations are done in a high precision except for the $LU$ factorization.

IR-V0$(\mathbf{A}, \mathbf{b})$

  $\mathbf{x} = 0$
  $\mathbf{r} = \mathbf{b}$
  Factor $\mathbf{A} = \mathbf{L}\mathbf{U}$ in a lower precision
  while $\|\mathbf{r}\|$ too large do
     $\mathbf{d} = \mathbf{U}^{-1}\mathbf{L}^{-1}\mathbf{r}$
     $\mathbf{x} \leftarrow \mathbf{x} + \mathbf{d}$
     $\mathbf{r} = \mathbf{b} - \mathbf{A}\mathbf{x}$
  end while

Algorithm IR-V0 leaves out many implementation details. Some recent papers [1, 6, 7, 9] have made the algorithmic details explicit. The purpose of this paper is to build upon that work by explicitly including the interprecision transfers in the algorithmic description. One consequence of this, which we discuss in § 2.2 and § 3.1, is a novel interpretation of the classic form of the method [22] where the residual is evaluated in an extended precision.

1.1 Notation

We use the terminology from [9, 1] and consider several precisions. We will use Julia-like notation for data types.

  • The matrix $\mathbf{A}$ and right side $\mathbf{b}$ are stored in the working precision $TW$.

  • $\mathbf{A}$ is factored in the factorization precision $TF$.

  • The residual is computed in the residual precision $TR$.

  • The triangular solves for $\mathbf{L}\mathbf{U}\mathbf{d} = \mathbf{r}$ are done in the solver precision $TS$.

If $TH$ is a higher precision than $TL$ we will write $TL < TH$. $fl_X$ will be the rounding operation to precision $TX$. We will let $\mathcal{F}_X$ be the floating point numbers with precision $TX$. We will assume that a low precision number can be exactly represented in any higher precision. So

$$ x \in \mathcal{F}_X \subset \mathcal{F}_Y \text{ if } TX \leq TY. \tag{1} $$

We will let $u_X$ denote the unit roundoff for precision $TX$ and let $I_A^B$ denote the interprecision transfer from precision $TA$ to precision $TB$. Interprecision transfer is more than rounding and can, in some cases, include data allocation. If $TA < TB$ ($TA > TB$) then we will call the transfer $I_A^B$ upcasting (downcasting).

Upcasting is simpler than downcasting because if $TA < TB$ and $x \in \mathcal{F}_A$, then (1) implies that

$$ I_A^B(x) = x. \tag{2} $$

However, upcasting is not linear. In fact if $y \in \mathcal{F}_A$ there is no reason to expect that

$$ I_A^B(xy) = I_A^B(x) \, I_A^B(y) $$

because the multiplication on the left is done in a lower precision than the one on the right. Downcasting is more subtle because not only is it nonlinear but also if $TA > TB$ and $x \in \mathcal{F}_A$, then (2) holds only if $x \in \mathcal{F}_B \subset \mathcal{F}_A$.

The nonlinearity of interprecision transfers should be made explicit in analysis especially for the triangular solves (see § 2.3).
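The properties above can be checked numerically. The following is a minimal sketch, assuming $TA$ = binary32 and $TB$ = binary64 and emulating binary32 with Python's standard `struct` module; the helper name `fl32` is hypothetical and not part of any library discussed in this paper.

```python
import struct

def fl32(x: float) -> float:
    """Round a binary64 Python float to the nearest binary32 value.

    The result is returned as a Python float (binary64), so holding it
    there also models the exact upcast of a binary32 number.
    """
    return struct.unpack("<f", struct.pack("<f", x))[0]

# x and y are binary32 numbers (elements of F_A with TA = binary32).
x = fl32(0.1)
y = fl32(0.3)

# Equation (2): x already lies in binary32, so its upcast is exact and
# rounding it back to binary32 returns the same value.
upcast_is_exact = fl32(x) == x

# Product formed in binary32 and then upcast: I(xy).
prod_low = fl32(x * y)
# Operands upcast first, product formed in binary64: I(x) I(y).
prod_high = x * y
products_differ = prod_low != prod_high

# Downcasting changes a value unless it already lies in the lower precision.
z = 0.1                                  # binary64 value that is not in binary32
downcast_changes_z = fl32(z) != z
downcast_keeps_half = fl32(0.5) == 0.5   # 0.5 is representable in both formats

print(upcast_is_exact, products_differ, downcast_changes_z, downcast_keeps_half)
```

The sketch relies on the fact that the product of two binary32 numbers is exactly representable in binary64, so `fl32(x * y)` is the correctly rounded binary32 product.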

2 Interprecision transfers in IR

Algorithm IR-V0 includes many implicit interprecision transfers. We will make the assumption, which holds in IEEE [14] arithmetic, that when a binary operation $\circ$ is performed between floating point numbers with different precisions, then the lower precision number is promoted before the operation is performed.

So, if $TH > TL$, $u \in \mathcal{F}_H$, $v \in \mathcal{F}_L$, then

$$ fl_H(u \circ v) = fl_H(u \circ (I_L^H v)) \text{ and } fl_H(v \circ u) = fl_H((I_L^H v) \circ u). \tag{3} $$

We will use (3) throughout this paper when we need to make the implicit interprecision transfers explicit.

2.1 The low precision factorization

We will begin with the low precision factorization. The line in Algorithm IR-V0 for this is

  • Factor $\mathbf{A} = \mathbf{L}\mathbf{U}$ in a lower precision.

However, $\mathbf{A}$ is stored in precision $TW$ and the factorization is in precision $TF$. Hence one must make a copy of $\mathbf{A}$. So to make this interprecision transfer explicit we should express this as

  • Make a low precision copy of $\mathbf{A}$: $\mathbf{A}_F = I_W^F \mathbf{A}$.

  • Compute an $LU$ factorization $\mathbf{L}\mathbf{U}$ of $\mathbf{A}_F$, overwriting $\mathbf{A}_F$.

So in this way the storage costs of IR become clear and we can see the time/storage tradeoff. If, for example, $TW = Float64$ and $TF = Float32$, one must allocate storage for $\mathbf{A}_F$, which is a 50% increase in matrix storage.

2.2 The residual precision

For the remainder of this paper we use the $\ell^\infty$ norm, so $\|\cdot\| = \|\cdot\|_\infty$.

The algorithm IR-V0 does not make it clear how the residual precision affects the iteration. The description in [9] carefully explains what one must do.

  • Store $\mathbf{x}$ in precision $TR$.

  • Solve the triangular system in precision $TS = TR$.

  • Store $\mathbf{d}$ in precision $TR$.

The most important consequence of this and (3) is that the residual is

$$ \mathbf{r} = I_W^R(\mathbf{b} - \mathbf{A}\mathbf{x}) = I_W^R \mathbf{b} - (I_W^R \mathbf{A})\mathbf{x}. $$

While $I_W^R \mathbf{A} = \mathbf{A}$ by our assumption that $TR \geq TW$, the residual is computed in precision $TR$ and is therefore the residual of a promoted problem, and the iteration is approximating the solution of that promoted problem

$$ \mathbf{x}_P^* = (I_W^R \mathbf{A})^{-1}\mathbf{b} = (I_W^R \mathbf{A})^{-1} I_W^R \mathbf{b}, \tag{4} $$

which is posed in precision $TR$. We make a distinction between $\mathbf{x}_P^*$ and $\mathbf{x}^* = \mathbf{A}^{-1}\mathbf{b}$ only in those cases, such as (4), where we are talking about the computed solution in the residual precision. In exact arithmetic, of course, $\mathbf{x}_P^* = \mathbf{x}^*$.

All of these interprecision transfers are implicit and need not be done within the iteration. For example, when one computes $\mathbf{r}$ in precision $TR$, then (3) implies that the matrix-vector product $\mathbf{A}\mathbf{x}$ automatically promotes the elements of $\mathbf{A}$ to precision $TR$ because $\mathbf{x}$ is stored in precision $TR$.

We can use the fact that convergence is to the solution of the promoted problem to get a simple error estimate for the classic special case [22]. Here $TR = TS > TW \geq TF$. If the iteration terminates with

$$ \|\mathbf{r}\| / \|\mathbf{b}\| \leq \tau $$

then the standard estimates [10] imply that

$$ \frac{\|\mathbf{x} - \mathbf{x}^*\|}{\|\mathbf{x}^*\|} = \frac{\|\mathbf{x} - \mathbf{x}_P^*\|}{\|\mathbf{x}_P^*\|} \leq \frac{\kappa(I_W^R \mathbf{A}) \|\mathbf{r}\|}{\|I_W^R(\mathbf{b})\|} = \frac{\kappa(\mathbf{A}) \|\mathbf{r}\|}{\|\mathbf{b}\|} \leq \tau \kappa(\mathbf{A}). \tag{5} $$

In (5) we use the fact that $\|I_W^R(\mathbf{b})\| = \|\mathbf{b}\|$ in the $\ell^\infty$ norm and use the exact value of $\kappa(\mathbf{A})$, which is the same as $\kappa(I_W^R \mathbf{A})$ by (1). The estimate (5) is also true if $TR = TW$, but then there is no need to consider a promoted problem.

2.3 The triangular solve

The choice of $TS$ affects the number of interprecision transfers and the storage cost in the triangular solve. The line in Algorithm IR-V0 for this is

  • $\mathbf{d} = \mathbf{U}^{-1}\mathbf{L}^{-1}\mathbf{r}$

If $TS = TR$, then LAPACK's default behavior is to do the interprecision transfers as needed using (3). The subtle consequence of this is that

$$ (\mathbf{L}\mathbf{U})^{-1}\mathbf{r} = ((I_F^R \mathbf{L})(I_F^R \mathbf{U}))^{-1}\mathbf{r}. $$

Hence, one is implicitly doing the triangular solves with the factors promoted to the residual precision. We will refer to this approach as on-the-fly interprecision transfers.

Combining the results with § 2.2 we can expose all the interprecision transfers in the transition from a current iteration $\mathbf{x}_c$ to a new one $\mathbf{x}_+$. In the case $TS = TR$ we have a linear stationary iterative method. The computation is done entirely in $TR$.

$$ \mathbf{x}_+ = \mathbf{x}_c + \mathbf{d} = \mathbf{x}_c + (\mathbf{L}\mathbf{U})^{-1}(I_W^R \mathbf{b} - \mathbf{A}\mathbf{x}_c) = \mathbf{M}\mathbf{x}_c + (\mathbf{L}\mathbf{U})^{-1} I_W^R \mathbf{b}, \tag{6} $$

where the iteration matrix is

$$ \mathbf{M} = \mathbf{I} - ((I_F^R \mathbf{L})(I_F^R \mathbf{U}))^{-1}(I_W^R \mathbf{A}). \tag{7} $$

The residual update is

$$ \mathbf{r}_+ = \mathbf{M}_r \mathbf{r}_c \tag{8} $$

where

$$ \mathbf{M}_r = \mathbf{I} - (I_W^R \mathbf{A})((I_F^R \mathbf{L})(I_F^R \mathbf{U}))^{-1}. \tag{9} $$

One must remember that if $TR > TW$ and $\mathbf{x} \in \mathcal{F}_W^N$ then $(I_W^R \mathbf{A})\mathbf{x} \neq \mathbf{A}\mathbf{x}$ because the matrices are in different precisions and matrix-vector products produce different results.

All the interprecision transfers in (7) and (9) are implicit and the promoted matrices are not actually stored. However, the promotions matter because they can help avoid underflows and overflows and influence the limit of the iteration.

If $TS = TF < TW$, then the interprecision transfer is done before the triangular solves and the number of interprecision transfers is $N$ rather than $N^2$. We will refer to this as interprecision transfer in-place to distinguish it from on-the-fly. For in-place interprecision transfers we copy $\mathbf{r}$ from the residual precision $TR$ to the factorization precision before the solve and then upcast the output of the solve back to the residual precision. So, one must store the low precision copy of $\mathbf{r}$. In this case one should scale $\mathbf{r}$ before the downcasting transfer $I_R^F$ [13]. One reason for this is that the absolute size of $\mathbf{r}$ could be very small, as would be the case in the terminal phase of IR, and one could underflow before the iteration is complete. So the iteration in this case is

$$ \mathbf{x}_+ = \mathbf{x}_c + \mathbf{d} = \mathbf{x}_c + \|\mathbf{r}\| \, I_F^R \left[ (\mathbf{L}\mathbf{U})^{-1} \frac{I_R^F \mathbf{r}}{\|\mathbf{r}\|} \right]. \tag{10} $$

This is not a stationary linear iterative method because the map $\mathbf{r} \rightarrow \frac{I_R^F \mathbf{r}}{\|\mathbf{r}\|}$ is nonlinear.
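The effect of scaling before the downcast can be seen in a small experiment. This is only a sketch under stated assumptions: the low precision is binary16, emulated with the standard `struct` module, chosen solely because underflow then appears at modest magnitudes (for half precision the recommendation is on-the-fly transfers). The triangular solve is omitted, and the names `fl16`, `downcast_plain`, and `downcast_scaled` are hypothetical.

```python
import struct

def fl16(x: float) -> float:
    """Round a binary64 value to binary16; tiny values become subnormal or zero."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def downcast_plain(r):
    """Apply the downcast directly to each entry of r."""
    return [fl16(v) for v in r]

def downcast_scaled(r):
    """Scale r by its infinity norm in the high precision, then downcast.

    Returns the low precision scaled vector and the norm needed to undo the scaling.
    """
    norm_r = max(abs(v) for v in r)
    if norm_r == 0.0:
        return [0.0] * len(r), 0.0
    return [fl16(v / norm_r) for v in r], norm_r

# A residual typical of the terminal phase of the iteration.
r = [3.0e-9, -1.0e-9, 2.0e-9]

plain = downcast_plain(r)                 # every entry underflows to zero
scaled, norm_r = downcast_scaled(r)       # entries of size at most one survive
recovered = [norm_r * v for v in scaled]  # upcast and undo the scaling in binary64

print(plain)
print(recovered)
```

Without scaling the whole residual is lost in the transfer, so the correction would be zero; with scaling the entries are recovered to roughly the relative accuracy of binary16.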

Even though downcasting $\mathbf{r}$ reduces the interprecision transfer cost, one should do interprecision transfers on-the-fly if $TF$ is half precision, if $TR > TW$, or if one is using the low-precision factorization as a preconditioner [1, 7].

3 Explicit Interprecision Transfers

We apply the results from § 2 to Algorithm IR-V0 and obtain Algorithm IR-V1, where all the interprecision transfers are explicit.

IR-V1$(\mathbf{A}, \mathbf{b}, TF, TW, TR)$

  $\mathbf{x} = 0 \in \mathcal{F}_R^N$
  $\mathbf{r} = I_W^R(\mathbf{b})$
  $\mathbf{A}_F = I_W^F(\mathbf{A})$
  Factor $\mathbf{A}_F = \mathbf{L}\mathbf{U}$ in precision $TF$
  while $\|\mathbf{r}\|$ too large do
     $\mathbf{d} = ((I_F^R \mathbf{L})(I_F^R \mathbf{U}))^{-1}\mathbf{r}$ in precision $TR$
     $\mathbf{x} \leftarrow \mathbf{x} + \mathbf{d}$
     $\mathbf{r} = (I_W^R \mathbf{b}) - (I_W^R \mathbf{A})\mathbf{x}$
  end while
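A small runnable sketch of IR-V1 may help fix ideas. It is an illustration, not the implementation in [18]: it assumes $TW = TR = TS$ = binary64 and $TF$ = binary32, emulates binary32 by rounding every stored value and operation result with the standard `struct` module, uses dense Python lists, and stops on the relative residual test $\|\mathbf{r}\| \leq \tau \|\mathbf{b}\|$ with an iteration cap added as a safeguard. All function names and the test matrix are hypothetical.

```python
import struct

def fl32(x):
    """Round a binary64 value to binary32 (returned as a Python float)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def lu_float32(A):
    """LU with partial pivoting; stored values and operation results are binary32."""
    n = len(A)
    LU = [[fl32(v) for v in row] for row in A]   # explicit low precision copy A_F
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        if LU[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular in binary32")
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = fl32(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = fl32(LU[i][j] - fl32(LU[i][k] * LU[k][j]))
    return LU, perm

def lu_solve(LU, perm, r):
    """Triangular solves in binary64; the binary32 factors are promoted on the fly."""
    n = len(r)
    y = [r[p] for p in perm]
    for i in range(n):               # forward substitution with unit lower triangle
        y[i] -= sum(LU[i][j] * y[j] for j in range(i))
    for i in reversed(range(n)):     # back substitution
        y[i] = (y[i] - sum(LU[i][j] * y[j] for j in range(i + 1, n))) / LU[i][i]
    return y

def norm_inf(v):
    return max(abs(t) for t in v)

def residual(A, b, x):
    """r = b - A x evaluated in binary64."""
    return [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]

def ir_v1(A, b, tau=1e-13, max_iter=10):
    """Iterative refinement; returns the iterate and the residual norm history."""
    LU, perm = lu_float32(A)
    x = [0.0] * len(b)
    r = list(b)
    history = [norm_inf(r)]
    while history[-1] > tau * norm_inf(b) and len(history) <= max_iter:
        d = lu_solve(LU, perm, r)
        x = [xi + di for xi, di in zip(x, d)]
        r = residual(A, b, x)
        history.append(norm_inf(r))
    return x, history

A = [[4.1, 1.0, 0.2], [1.0, 3.9, 0.3], [0.2, 0.3, 5.1]]
b = [1.0, 2.0, 3.0]
x, history = ir_v1(A, b)
print(history)
```

For this well conditioned example the residual norms in `history` should fall rapidly over a few iterations, even though the factorization itself is only accurate to binary32.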

In the remainder of this section we look at some consequences of this formulation of IR.

3.1 The case $TS = TR > TW$

In this section we will assume that the triangular solves are done in the residual precision ($TS = TR$). In this case, one can see from Algorithm IR-V1 that no computations are done in the working precision at all. The working precision is only used to store $\mathbf{A}$ and $\mathbf{b}$, but residual computations are done in the residual precision with promotion on-the-fly. We state this observation as a theorem.

Theorem 3.1.

If $TS = TR > TW$, then the three precision algorithm

$$ \textbf{IR-V1}(\mathbf{A}, \mathbf{b}, TF, TW, TR) $$

produces the same computed results as the two precision algorithm

$$ \textbf{IR-V1}(I_W^R \mathbf{A}, I_W^R \mathbf{b}, TF, TR, TR). $$

The theorem makes it clear that the iteration is reducing the residual of the promoted problem. So we can apply the classical ideas for IR-V0 [11, 12] and understand the case $TR > TW$ in that way. For example, if $\mathbf{A}$ is not highly ill-conditioned, the LU factorization is stable, and the norm of the iteration matrix for IR satisfies $\|\mathbf{M}\| < 1$, then we can use equation (4.9) from [12] to obtain

$$ \|\mathbf{r}_+\| \leq G \|\mathbf{r}_c\| + g, \text{ where } g = O(u_R[\|I_W^R \mathbf{A}\| \|\mathbf{x}_P^*\| + \|I_W^R \mathbf{b}\|]), \tag{11} $$

and $G < 1$. Hence we will be able to reduce $\|\mathbf{r}\|$ until the iteration saturates with $\|\mathbf{r}\| \approx g$.

Since one has no a priori knowledge of $g$, one must manage the iteration in a way to detect stagnation. The recommendation from [12] is to terminate the iteration when

  1. $\|\mathbf{r}\| \leq u_R(\|\mathbf{A}\| \|\mathbf{x}\| + \|\mathbf{b}\|)$,

  2. $\|\mathbf{r}_+\| \geq \alpha \|\mathbf{r}_c\|$, or

  3. too many iterations have been performed.

The first item in the list is successful convergence, where we approximate the norms of the promoted objects $I_W^R \mathbf{A}$ and $I_W^R \mathbf{b}$ with the ones in the working precision, which we have stored. The two failure modes are insufficient decrease in the residual and slow convergence in the terminal phase. The recommendation in [12] was to set $C = 1$, where $C$ is a constant factor on the right side of the convergence test in item 1, and $\alpha = 0.5$, and to limit IR to five iterations. Our Julia code [18] uses a variation of this approach. We use $\alpha = 0.9$ in our solver and do not put a limit on the iterations. The reason for these choices is to give the IR iteration a better chance to terminate successfully.
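The three termination tests can be collected in one small function. The sketch below is an assumption about how such a test might be organized, not the logic of the code in [18]; the function name `ir_status`, its argument names, and the returned strings are all hypothetical. The default $\alpha = 0.9$ with no iteration limit mirrors the choices described above, and passing `alpha=0.5, max_iter=5` gives the recommendation from [12].

```python
def ir_status(norm_r_new, norm_r_old, norm_A, norm_x, norm_b, u_R,
              iteration, alpha=0.9, max_iter=None):
    """Classify the state of the iteration after a new residual is available.

    The norms of A and b are those of the stored working precision data,
    used as stand-ins for the norms of the promoted objects.
    """
    # Test 1: successful convergence to the level of the residual precision.
    if norm_r_new <= u_R * (norm_A * norm_x + norm_b):
        return "converged"
    # Test 2: insufficient decrease in the residual.
    if norm_r_new >= alpha * norm_r_old:
        return "stagnated"
    # Test 3: optional cap on the number of iterations.
    if max_iter is not None and iteration >= max_iter:
        return "iteration limit"
    return "continue"

u_R = 2.0 ** -53  # unit roundoff of binary64

# Residual at the roundoff level: success.
s1 = ir_status(5e-16, 1e-9, 5.0, 1.0, 3.0, u_R, iteration=3)
# Healthy decrease, not yet converged: keep iterating.
s2 = ir_status(1e-8, 1e-4, 5.0, 1.0, 3.0, u_R, iteration=1)
# Residual barely decreases: stop and report stagnation.
s3 = ir_status(9.5e-9, 1e-8, 5.0, 1.0, 3.0, u_R, iteration=4)
# Settings recommended in [12]: alpha = 0.5 and at most five iterations.
s4 = ir_status(1e-9, 1e-8, 5.0, 1.0, 3.0, u_R, iteration=5, alpha=0.5, max_iter=5)
print(s1, s2, s3, s4)
```

Checking convergence before stagnation matters in the terminal phase, where a residual that has already reached the roundoff level may no longer decrease.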

So, if we couple the termination strategy with (5) we see that if the iteration terminates successfully then

𝐱𝐱P𝐱PuRC(𝐀𝐱+𝐛)κ(𝐀)𝐛IWR(𝐛).norm𝐱subscriptsuperscript𝐱𝑃normsubscriptsuperscript𝐱𝑃subscript𝑢𝑅𝐶norm𝐀norm𝐱norm𝐛𝜅𝐀norm𝐛normsuperscriptsubscript𝐼𝑊𝑅𝐛\frac{\|{\bf x}-{\bf x}^{*}_{P}\|}{\|{\bf x}^{*}_{P}\|}\leq u_{R}\frac{C(\|{% \bf A}\|\|{\bf x}\|+\|{\bf b}\|)\kappa({\bf A})\|{\bf b}\|}{\|I_{W}^{R}({\bf b% })\|}.divide start_ARG ∥ bold_x - bold_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT ∥ end_ARG start_ARG ∥ bold_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT ∥ end_ARG ≤ italic_u start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT divide start_ARG italic_C ( ∥ bold_A ∥ ∥ bold_x ∥ + ∥ bold_b ∥ ) italic_κ ( bold_A ) ∥ bold_b ∥ end_ARG start_ARG ∥ italic_I start_POSTSUBSCRIPT italic_W end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_R end_POSTSUPERSCRIPT ( bold_b ) ∥ end_ARG .

When TR=TS>TW𝑇𝑅𝑇𝑆𝑇𝑊TR=TS>TWitalic_T italic_R = italic_T italic_S > italic_T italic_W one can also attempt to estimate the convergence rate G𝐺Gitalic_G from (11) and then estimate the error 𝐱𝐱Pnorm𝐱subscriptsuperscript𝐱𝑃\|{\bf x}-{\bf x}^{*}_{P}\|∥ bold_x - bold_x start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT ∥. This is a common strategy in the nonlinear solver literature, especially for stiff initial value problems [3, 4, 5, 20, 15]. The idea is that as the iteration progresses

Gσ=𝐱n+1𝐱n/𝐱n𝐱n1𝐺𝜎normsubscript𝐱𝑛1subscript𝐱𝑛normsubscript𝐱𝑛subscript𝐱𝑛1G\approx\sigma=\|{\bf x}_{n+1}-{\bf x}_{n}\|/\|{\bf x}_{n}-{\bf x}_{n-1}\|italic_G ≈ italic_σ = ∥ bold_x start_POSTSUBSCRIPT italic_n + 1 end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ∥ / ∥ bold_x start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT - bold_x start_POSTSUBSCRIPT italic_n - 1 end_POSTSUBSCRIPT ∥

is a very good estimate if $G$ is small enough. In that case

$$\|{\bf x}_n-{\bf x}^*_P\| \le \frac{\|{\bf x}_{n+1}-{\bf x}_n\|}{1-\sigma}$$

and one can terminate the iteration when one predicts that

$$\|{\bf x}_{n+1}-{\bf x}^*_P\| \le \frac{\|{\bf x}_{n+1}-{\bf x}_n\|\sigma}{1-\sigma}$$

is sufficiently small. The algorithm in [9] does this and terminates when the predicted error is less than $u_W$.
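The estimate above can be sketched in a few lines of Python. This is only an illustration of the formulas for $\sigma$ and the predicted error, not the algorithm of [9] or the author's Julia code. The function names `norm_inf` and `predicted_error` are hypothetical, and the choice of the $\ell^\infty$ norm is an assumption.

```python
import math

def norm_inf(v):
    """Infinity norm of a vector stored as a list of floats."""
    return max(abs(t) for t in v)

def predicted_error(x_prev, x_curr, x_next):
    """Estimate the convergence rate sigma from three consecutive iterates
    and return (sigma, bound), where
        bound = ||x_next - x_curr|| * sigma / (1 - sigma)
    is the predicted error of x_next. The bound is None when sigma >= 1,
    because the geometric-series argument does not apply in that case."""
    step_old = norm_inf([a - b for a, b in zip(x_curr, x_prev)])
    step_new = norm_inf([a - b for a, b in zip(x_next, x_curr)])
    if step_old == 0.0:
        # Degenerate case: the previous step did not move the iterate.
        sigma = 0.0 if step_new == 0.0 else math.inf
    else:
        sigma = step_new / step_old
    if sigma >= 1.0:
        return sigma, None
    return sigma, step_new * sigma / (1.0 - sigma)
```

For the scalar model iteration $x_{n+1} = 0.1 x_n + 0.9$ with $x_0 = 0$, the iterates $0$, $0.9$, $0.99$ give $\sigma \approx 0.1$ and a predicted error of about $0.01$, which matches the true error $|0.99 - 1|$. A caller would stop the iteration once the returned bound falls below the chosen tolerance.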

3.2 Cost of Interprecision Transfers

One case where setting $TS=TF$ may be useful is if $TS=Float32$ and $TW=TR=Float64$. In this case, unlike $TF=Float16$, tools such as LAPACK and BLAS have been compiled to work efficiently. The cases of interest in this section are medium-sized problems where the $O(N^3)$ cost of the LU factorization is a few times more than the cost of the triangular solves, but not orders of magnitude more. In these cases the $O(N^2)$ cost of interprecision transfers on the fly is noticeable and could make a difference when many triangular solves are done for each factorization. One example is nonlinear solvers [15, 17], where the factorization of the Jacobian can be reused for many Newton iterations (or even time steps when solving stiff initial value problems [20, 21]).
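To make the two kinds of triangular solve concrete, here is a minimal Python sketch of a forward substitution done both ways. It assumes the usual reading of the two approaches: on-the-fly promotes the Float32 factor entries as they are used and does the arithmetic in Float64, while in-place demotes the right-hand side and does the whole solve in Float32. The names `f32`, `lower_solve_otf`, and `lower_solve_ip` are hypothetical. Float32 is emulated by rounding through the standard `struct` module, so the sketch shows only where the rounding happens and says nothing about timing.

```python
import struct

def f32(x):
    """Round a Python float (Float64) to the nearest Float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lower_solve_otf(L32, r):
    """On-the-fly sketch: L32 holds Float32 values. Each entry is promoted
    when it is used and all arithmetic is carried out in Float64."""
    n = len(r)
    y = [0.0] * n
    for i in range(n):
        s = r[i]
        for j in range(i):
            s -= L32[i][j] * y[j]   # Float64 multiply and subtract
        y[i] = s / L32[i][i]
    return y

def lower_solve_ip(L32, r):
    """In-place sketch: demote r to Float32, emulate Float32 arithmetic by
    rounding after every operation, and return the Float32 values (which
    are trivially promoted because Python floats are Float64)."""
    n = len(r)
    y = [0.0] * n
    for i in range(n):
        s = f32(r[i])
        for j in range(i):
            s = f32(s - f32(L32[i][j] * y[j]))
        y[i] = f32(s / L32[i][i])
    return y
```

In a compiled setting the on-the-fly promotion is the $O(N^2)$ transfer cost discussed above. In this emulation the promotion is free, so only the accuracy difference between the two variants is visible.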

We illustrate this with some CPU timings. The computations in this section were done on an Apple Macintosh Mini Pro with an M2 processor and eight performance cores. We used OpenBLAS, which satisfies (3), rather than the AppleAccelerate Framework, which does not. We used Julia [2] v1.11.0-beta2 with the author’s MultiPrecisionArrays.jl [18, 19] Julia package. We made this choice because Julia v1.11.0 has faster matrix-vector products than the current version v1.10.4.

We used the Julia package BenchmarkTools.jl [8] to get the timings we report in Table 1. This is the standard way to obtain timings in Julia. BenchmarkTools repeats computations and can obtain accurate results even if the compute time per run is very small.

We have put the Julia codes that generate Table 1 from § 3.2.1 in a GitHub repository

3.2.1 Integral Equation Example

We will use a concrete example rather than generating random problems. For a given dimension $N$ let ${\bf G}$ be the matrix corresponding to the composite trapezoid rule discretization of the Green's operator ${\cal G}$ for $-d^2/dx^2$ on $[0,1]$,

$${\cal G}u(x) = \int_0^1 g(x,y) u(y)\,dy \mbox{ where } g(x,y) = \left\{\begin{array}{c} y(1-x);\ x > y \\ x(1-y);\ x \le y \end{array}\right.$$

The eigenvalues of ${\cal G}$ are $1/(n^2\pi^2)$ for $n=1,2,\dots$.

We use ${\bf A} = {\bf I} - 800.0 * {\bf G}$ in this example. The conditioning of ${\bf A}$ is somewhat poor, with an $\ell^\infty$ condition number of roughly $\kappa_\infty({\bf A}) \approx 18{,}253$ for the dimensions we consider in this section.
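As an illustration of how such a test matrix might be assembled, the following Python sketch builds ${\bf G}$ and ${\bf A}$ with plain lists. The uniform grid $x_i = i h$ with $h = 1/(N-1)$ and the function names `greens_matrix` and `build_A` are assumptions made for this sketch. The code in the author's repository may use a different grid or scaling.

```python
def greens_matrix(N):
    """Composite trapezoid rule discretization of the Green's operator for
    -d^2/dx^2 on [0,1], on an assumed uniform grid of N points.
    Returns the grid x and the N x N matrix G as a list of rows."""
    h = 1.0 / (N - 1)
    x = [i * h for i in range(N)]
    G = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            # Kernel g(x_i, y_j) from the piecewise definition above.
            if x[i] > x[j]:
                g = x[j] * (1.0 - x[i])
            else:
                g = x[i] * (1.0 - x[j])
            # Trapezoid weights: h/2 at the two endpoints, h elsewhere.
            w = 0.5 * h if j in (0, N - 1) else h
            G[i][j] = g * w
    return x, G

def build_A(N, scale=800.0):
    """Form A = I - scale * G as a list of rows."""
    _, G = greens_matrix(N)
    return [[(1.0 if i == j else 0.0) - scale * G[i][j] for j in range(N)]
            for i in range(N)]
```

One quick sanity check: for fixed $x$ the kernel is piecewise linear in $y$ with its kink at a grid point, so the trapezoid rule integrates it exactly, and applying ${\bf G}$ to the vector of ones should reproduce $x_i(1-x_i)/2$ up to rounding.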

We terminate the IR iteration with the residual condition from § 3.1 and tabulate the dimension, the time (LU) for copying ${\bf A}$ from TW to TF and performing the LU factorization in TF (column 2 of Table 1), the timings for the two variants of the triangular solves (OTF = on-the-fly: column 3, IP = in-place: column 4), and the times and iteration counts for the IR loop with the two variations of the triangular solves (columns 5–8).

Table 1: Cost of OTF Triangular Solve

| N | LU | OTF | IP | OTF-IR | its | IP-IR | its |
|---|---|---|---|---|---|---|---|
| 200 | 1.6e-04 | 1.5e-05 | 5.7e-06 | 4.5e-05 | 3 | 3.8e-05 | 3 |
| 400 | 4.9e-04 | 5.4e-05 | 1.9e-05 | 2.3e-04 | 4 | 2.3e-04 | 5 |
| 800 | 1.8e-03 | 2.4e-04 | 6.6e-05 | 9.9e-04 | 5 | 7.3e-04 | 5 |
| 1600 | 7.8e-03 | 1.4e-03 | 2.5e-04 | 2.8e-03 | 4 | 2.1e-03 | 4 |
| 3200 | 4.2e-02 | 9.2e-03 | 1.3e-03 | 2.0e-02 | 5 | 1.7e-02 | 5 |
| 6400 | 2.9e-01 | 3.6e-02 | 6.1e-03 | 7.2e-02 | 5 | 5.8e-02 | 5 |

Table 1 shows that the factorization time is between roughly 5 and 11 times that of the on-the-fly triangular solve, indicating that the triangular solves could be significant if the matrix-vector products were fast or one had to solve for many right-hand sides. We saw that effect in the nonlinear examples in [16, 17], where the nonlinear residual could be evaluated in $O(N \log N)$ work. One can also see that the CPU time for the in-place triangular solves is between about 2.6 and 7 times less than for the on-the-fly version.
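The ratios discussed in this paragraph can be recomputed directly from the entries of Table 1. The short Python sketch below does only that arithmetic. The dictionary `TABLE1` copies the LU, OTF, and IP columns of the table, and the name `timing_ratios` is hypothetical.

```python
# Timings in seconds copied from Table 1: N -> (LU, OTF, IP)
TABLE1 = {
    200: (1.6e-04, 1.5e-05, 5.7e-06),
    400: (4.9e-04, 5.4e-05, 1.9e-05),
    800: (1.8e-03, 2.4e-04, 6.6e-05),
    1600: (7.8e-03, 1.4e-03, 2.5e-04),
    3200: (4.2e-02, 9.2e-03, 1.3e-03),
    6400: (2.9e-01, 3.6e-02, 6.1e-03),
}

def timing_ratios(table):
    """Return {N: (LU/OTF, OTF/IP)} for each row of the table."""
    return {n: (lu / otf, otf / ip) for n, (lu, otf, ip) in table.items()}

if __name__ == "__main__":
    for n, (lu_otf, otf_ip) in timing_ratios(TABLE1).items():
        print(f"N={n:5d}  LU/OTF={lu_otf:5.1f}  OTF/IP={otf_ip:4.1f}")
```

The first ratio measures how many on-the-fly triangular solves cost as much as one factorization, and the second measures the speedup of the in-place solve over the on-the-fly solve.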

In the final four columns we see that, while the version with in-place triangular solves is somewhat faster, the difference is not compelling. This is no surprise because the matrix-vector product in the residual precision (double) takes $O(N^2)$ work and is therefore a significant part of the cost for each IR iteration. The number of iterations is the same in all but one case, with in-place triangular solves taking one more iteration in that case. That is consistent with the prediction in [7].

4 Conclusions

We expose the interprecision transfers in iterative refinement and obtain new insights into this classical algorithm. In particular we show that the version in which the residual is evaluated in an extended precision is equivalent to solving a promoted problem and show how interprecision transfers affect the triangular solves.

5 Acknowledgments

The author is very grateful to Ilse Ipsen for listening to him as he worked through the ideas for this paper.

References

  • [1] P. Amestoy, A. Buttari, N. J. Higham, J.-Y. L’Excellent, T. Mary, and B. Vieublé, Five-precision gmres-based iterative refinement, SIAM Journal on Matrix Analysis and Applications, 45 (2024), pp. 529–552, https://doi.org/10.1137/23M1549079.
  • [2] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, Julia: A fresh approach to numerical computing, SIAM Review, 59 (2017), pp. 65–98.
  • [3] K. E. Brenan, S. L. Campbell, and L. R. Petzold, The Numerical Solution of Initial Value Problems in Differential-Algebraic Equations, no. 14 in Classics in Applied Mathematics, SIAM, Philadelphia, 1996.
  • [4] P. N. Brown, G. D. Byrne, and A. C. Hindmarsh, VODE: A variable coefficient ode solver, SIAM J. Sci. Statist. Comput., 10 (1989), pp. 1038–1051.
  • [5] P. N. Brown, A. C. Hindmarsh, and L. R. Petzold, Using Krylov methods in the solution of large-scale differential-algebraic systems, SIAM J. Sci. Comput., 15 (1994), pp. 1467–1488.
  • [6] E. Carson and N. J. Higham, A new analysis of iterative refinement and its application to accurate solution of ill-conditioned sparse linear systems, SIAM Journal on Scientific Computing, 39 (2017), pp. A2834–A2856, https://doi.org/10.1137/17M112291.
  • [7] E. Carson and N. J. Higham, Accelerating the solution of linear systems by iterative refinement in three precisions, SIAM Journal on Scientific Computing, 40 (2018), pp. A817–A847, https://doi.org/10.1137/17M1140819.
  • [8] J. Chen and J. Revels, Robust benchmarking in noisy environments, 2016, https://arxiv.longhoe.net/abs/1608.04295.
  • [9] J. Demmel, Y. Hida, W. Kahan, X. S. Li, S. Mukherjee, and E. J. Riedy, Error bounds from extra-precise iterative refinement, ACM Trans. Math. Soft., (2006), pp. 325–351.
  • [10] J. W. Demmel, Applied Numerical Linear Algebra, SIAM, Philadelphia, 1997.
  • [11] N. J. Higham, Accuracy and Stability of Numerical Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1996, http://www.ma.man.ac.uk/~higham/asna.html.
  • [12] N. J. Higham, Iterative refinement for linear systems and LAPACK, IMA J. Numer. Anal., 17 (1997), pp. 495–509.
  • [13] N. J. Higham, S. Pranesh, and M. Zounon, Squeezing a matrix into half precision, with an application to solving linear systems, SIAM J. Sci. Comp., 41 (2019), pp. A2536–A2551.
  • [14] IEEE Computer Society, IEEE standard for floating-point arithmetic, IEEE Std 754–2019, July 2019.
  • [15] C. T. Kelley, Iterative Methods for Linear and Nonlinear Equations, no. 16 in Frontiers in Applied Mathematics, SIAM, Philadelphia, 1995.
  • [16] C. T. Kelley, Newton’s method in mixed precision, SIAM Review, 64 (2022), pp. 191–211, https://doi.org/10.1137/20M1342902.
  • [17] C. T. Kelley, Solving Nonlinear Equations with Iterative Methods: Solvers and Examples in Julia, no. 20 in Fundamentals of Algorithms, SIAM, Philadelphia, 2022.
  • [18] C. T. Kelley, MultiPrecisionArrays.jl, 2023, https://doi.org/10.5281/zenodo.7521427, https://github.com/ctkelley/MultiPrecisionArrays.jl. Julia Package.
  • [19] C. T. Kelley, Using MultiPrecisonArrays.jl: Iterative refinement in Julia, 2024, https://arxiv.longhoe.net/abs/2311.14616.
  • [20] L. R. Petzold, A description of DASSL: a differential/algebraic system solver, in Scientific Computing, R. S. Stepleman et al., ed., North Holland, Amsterdam, 1983, pp. 65–68.
  • [21] L. F. Shampine, Numerical Solution of Ordinary Differential Equations, Chapman and Hall, New York, 1994.
  • [22] J. H. Wilkinson, Progress report on the automatic computing engine, Tech. Report MA/17/1024, Mathematics Division, Department of Scientific and Industrial Research, National Physical Laboratory, Teddington, UK, 1948, http://www.alanturing.net/turing_archive/archive/l/l10/l10.php.