Learning Preconditioners for Inverse Problems

Matthias J. Ehrhardt
Department of Mathematical Sciences
University of Bath
[email protected]
Patrick Fahy
Department of Mathematical Sciences
University of Bath
[email protected]
Mohammad Golbabaee
Department of Engineering Mathematics
University of Bristol
[email protected]
Corresponding author. Patrick Fahy is supported by a scholarship from the EPSRC Centre for Doctoral Training in Statistical Applied Mathematics at Bath (SAMBa), under the project EP/S022945/1.
Abstract

We explore the application of preconditioning in optimisation algorithms, specifically those appearing in inverse problems in imaging. Such problems are often large-scale and involve an ill-posed forward operator, so computationally efficient algorithms that converge quickly are desirable. Learning-to-optimise addresses this by leveraging training data to accelerate the solution of particular optimisation problems. Many traditional optimisation methods use scalar hyperparameters, which significantly limits their convergence speed on ill-conditioned problems. In contrast, we propose a novel approach that replaces these scalar quantities with matrices learned from data. Preconditioning usually considers only symmetric positive-definite preconditioners; we instead consider multiple parametrisations of the preconditioner that do not require symmetry or positive-definiteness. These parametrisations include full matrices, diagonal matrices, and convolutions. We analyse the convergence properties of these methods and compare their performance against classical optimisation algorithms. We also consider the generalisation performance of these methods, both for in-distribution and out-of-distribution data.

1 Introduction

A linear inverse problem is defined by receiving an observation $y \in \mathcal{Y}$, generated from a ground-truth $x_{\text{true}}$ via some linear forward operator $A: \mathcal{X} \to \mathcal{Y}$, such that

$$y = A x_{\text{true}} + \varepsilon, \tag{1.0.1}$$

where $\varepsilon \in \mathcal{Y}$ is some random noise. In this formulation, $y$ and $A$ are known, and the goal is to recover $x_{\text{true}}$. Such a problem is often ill-posed due to the noise inherent in the observation. To remedy this, one may introduce a data-fidelity term $\mathcal{D}: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ to enforce that $Ax$ and $y$ are "close" and a regularisation function $\mathcal{R}: \mathcal{X} \to \mathbb{R}$ to enforce that the solution has desired properties, such that the minimiser of the function

$$f(x) := \mathcal{D}(Ax, y) + \mathcal{R}(x), \tag{1.0.2}$$

approximates $x_{\text{true}}$.

In this paper, we consider the optimisation problem

$$\min_{x} f(x), \tag{1.0.3}$$

under the assumption that $f: \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable, convex, and $L$-smooth, and that a global minimiser exists.

To approximate a solution to this optimisation problem, one can use gradient descent:

$$x_{t+1} = x_t - \alpha_t \nabla f(x_t). \tag{1.0.4}$$

Various strategies exist for determining the step size $\alpha_t$, including a fixed step size, exact line search, and backtracking line search [18]. However, gradient descent can converge very slowly, especially for ill-conditioned problems.
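
To illustrate the slow convergence on ill-conditioned problems, the following minimal sketch runs (1.0.4) with the fixed step size $1/L$ on two diagonal quadratics. The quadratic objective, the eigenvalues, the iteration count and all function names are illustrative assumptions, not taken from the experiments in this paper.

```python
import numpy as np

def gradient_descent(grad, x0, step, num_iters):
    """Run x_{t+1} = x_t - step * grad(x_t) and return the final iterate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - step * grad(x)
    return x

def make_quadratic_grad(eigenvalues):
    """Gradient of the illustrative objective f(x) = 0.5 * x^T diag(eigenvalues) x."""
    h = np.asarray(eigenvalues, dtype=float)
    return lambda x: h * x

x0 = np.ones(2)

# Condition number 1: the step size 1/L = 1 is ideal for every coordinate.
x_well = gradient_descent(make_quadratic_grad([1.0, 1.0]), x0, 1.0, 100)

# Condition number 1000: the step size 1/L = 1/1000 is tiny for the flat direction.
x_ill = gradient_descent(make_quadratic_grad([1.0, 1000.0]), x0, 1.0 / 1000.0, 100)

# Distance to the minimiser x* = 0 after 100 iterations in each case.
print(np.linalg.norm(x_well), np.linalg.norm(x_ill))
```

With condition number 1 the minimiser is reached after a single step, whereas with condition number 1000 the slowest coordinate contracts by only a factor of $1 - 1/1000$ per iteration.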

1.1 Preconditioning

The issue of slow convergence in gradient descent can be remedied by introducing a matrix-valued step size, otherwise referred to as a preconditioner. Preconditioned gradient descent often considers a symmetric positive-definite matrix $P_t$ such that the update is now given by

$$x_{t+1} = x_t - P_t \nabla f(x_t). \tag{1.1.1}$$

One such choice is the inverse Hessian, $P_t = (\nabla^2 f(x_t))^{-1}$, which gives Newton's method with the update

$$x_{t+1} = x_t - \left(\nabla^2 f(x_t)\right)^{-1} \nabla f(x_t). \tag{1.1.2}$$

Given that $f$ is twice continuously differentiable, $L$-smooth and $\mu$-strongly convex for some $L, \mu > 0$, this method achieves locally quadratic convergence, compared to linear convergence for gradient descent [3]. However, it comes with multiple drawbacks:

  • For ill-conditioned problems, the computation of $(\nabla^2 f(x_t))^{-1}$ may be unstable and lead to an incorrect estimate.

  • To remedy this, one may avoid forming the inverse Hessian explicitly and instead compute the Newton direction $d_t$ as the solution to the linear system

    $$\left(\nabla^2 f(x_t)\right) d_t = \nabla f(x_t). \tag{1.1.3}$$

    This can be approximated, for example, by using the conjugate gradient method.

  • Approximating the solution to equation (1.1.3) can be computationally expensive. This can be an issue when optimising $f$ quickly is important.

  • Storing the inverse Hessian requires storing an $n \times n$ matrix, which may be infeasible for large $n$, as often occurs in imaging inverse problems.
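
As a rough illustration of solving (1.1.3) without forming the inverse Hessian, the sketch below implements a textbook conjugate gradient iteration and uses it for one Newton step on a small quadratic. The test problem $f(x) = \frac{1}{2}x^T H x - b^T x$, the matrix `H`, the vector `b` and the function names are assumptions made for this example.

```python
import numpy as np

def conjugate_gradient(H, g, num_iters=50, tol=1e-12):
    """Approximately solve H d = g for a symmetric positive-definite H."""
    d = np.zeros_like(g)
    r = g.copy()              # residual g - H d for the initial guess d = 0
    p = r.copy()              # current search direction
    rs_old = r @ r
    for _ in range(num_iters):
        if np.sqrt(rs_old) < tol:
            break
        Hp = H @ p
        alpha = rs_old / (p @ Hp)
        d = d + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return d

# Illustrative quadratic f(x) = 0.5 * x^T H x - b^T x, whose Hessian is H.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x0 = np.zeros(2)

grad0 = H @ x0 - b                   # gradient of f at x0
d0 = conjugate_gradient(H, grad0)    # Newton direction from (1.1.3)
x1 = x0 - d0                         # Newton update (1.1.2)
print(x1, H @ x1 - b)                # the gradient at x1 should be close to zero
```

Because the objective here is quadratic, a single Newton step lands on the minimiser; for a general $f$ the linear system would have to be solved again at every iteration, which is the cost referred to in the list above.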

Other choices of $P_t$ include quasi-Newton methods. Such methods construct $P_t$ as an approximation of the inverse Hessian, which can change over iterations. One example is the BFGS algorithm [10], which starts with some symmetric positive-definite matrix $B_0$ and calculates $B_{t+1}$ from $B_t$ using a rank-$2$ update. Quasi-Newton methods lie within the class of 'variable metric' methods [7], which construct a symmetric positive-definite matrix $P_t$ at each iteration. This general class of methods has also been studied for nonsmooth optimisation [4].
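
For concreteness, the sketch below implements the standard BFGS update of an inverse-Hessian approximation, written here as `P` to match the preconditioner notation. The text above only states that a rank-2 update is used, so the exact formula, the variable names and the test vectors are assumptions of this example. Here `s` denotes the difference of consecutive iterates and `dg` the difference of consecutive gradients.

```python
import numpy as np

def bfgs_inverse_update(P, s, dg):
    """Standard BFGS update of an inverse-Hessian approximation P.

    s  = x_{t+1} - x_t                     (step between iterates)
    dg = grad f(x_{t+1}) - grad f(x_t)     (change in gradient)
    Requires the curvature condition dg @ s > 0.
    """
    rho = 1.0 / (dg @ s)
    identity = np.eye(len(s))
    V = identity - rho * np.outer(s, dg)
    return V @ P @ V.T + rho * np.outer(s, s)

# Hypothetical data satisfying the curvature condition (dg @ s = 2.25 > 0).
P0 = np.eye(2)
s = np.array([1.0, 0.5])
dg = np.array([2.0, 0.5])

P1 = bfgs_inverse_update(P0, s, dg)
print(P1)
print(P1 @ dg, s)   # the updated matrix maps dg to s (secant equation)
```

The update keeps `P1` symmetric, and positive definite whenever the curvature condition holds, which is why BFGS falls within the symmetric positive-definite setting described above.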

One application of hand-crafted preconditioners in inverse problems is in parallel MRI. The authors of [12] propose hand-crafted preconditioners with the aim of speeding up the convergence of a plug-and-play approach [21], whereas [14] consider a circulant preconditioner which leads to an acceleration factor of $2.5$. Preconditioning has also been applied to PET imaging [9].

Learning preconditioners offline can remedy the issues of calculating preconditioners online and improve performance on a 'small' class of relevant functions $f$. Learned preconditioners have been considered in [11], where the preconditioner is constrained to the set of symmetric positive-definite matrices by learning a mapping $\Lambda_\theta$ such that the preconditioner is given by $\Lambda_\theta \Lambda_\theta^T$. In [14], a convolutional neural network preconditioner is learned as a function of the observation. However, this preconditioner is not required to be symmetric or positive-definite and, because it is learned, the resulting optimisation algorithm is not necessarily convergent.

1.2 Learning-to-optimise

Although there exist optimisation algorithms that are optimal for large problem classes, practitioners usually only focus on a very narrow subclass. For example, one may only be interested in reconstructing blurred observations $y$ generated from a distribution $y \sim \mathcal{Y}$ with a known constant blurring operator $A$. One might then consider the following class of functions:

$$\mathcal{F} = \left\{ f: \mathbb{R}^n \to \mathbb{R} : f(x) = \frac{1}{2}\|Ax - y\|_2^2 + R(x),\ y \sim \mathcal{Y} \right\}, \tag{1.2.1}$$

where $R: \mathbb{R}^n \to \mathbb{R}$ is a chosen regularisation function. Learning-to-optimise [6] aims to minimise objective functions quickly over a given class of functions (see (1.2.1)) and a distribution of initial points $x_0 \sim \mathcal{X}_0$. If the class of functions chosen is small, an optimisation algorithm that massively accelerates optimisation within this class can likely be learned. However, the performance on functions outside of this class may be poor. If the optimisation algorithm can be parametrised by

$$x_{t+1} = G_{\theta_t}^t(x_t, \nabla f(x_t), z_t), \tag{1.2.2}$$

for $G_{\theta_t}^t: \mathbb{R}^n \times \mathbb{R}^n \times \mathcal{Z}_t \to \mathbb{R}^n$, then the parameters $\theta_t$ can be chosen to satisfy

$$(\theta_0, \cdots, \theta_{T-1}) \in \operatorname*{argmin}_{(\theta_0, \cdots, \theta_{T-1})} \mathbb{E}_{f \sim \mathcal{F},\, x_0 \sim \mathcal{X}_0} \sum_{t=1}^{T} f(x_t), \tag{1.2.3}$$

for some fixed $T > 0$, for example.
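
A minimal sketch of an empirical version of (1.2.3), assuming least-squares objectives $f(x) = \frac{1}{2}\|Ax - y\|_2^2$ (that is, $R = 0$ in (1.2.1)) and taking $G_{\theta_t}^t$ to be a plain gradient step with a learned scalar step size $\theta_t$. Both choices, as well as the function name `unrolled_loss` and the toy data, are assumptions made for illustration; the expectation is replaced by a mean over a finite set of sampled problems.

```python
import numpy as np

def unrolled_loss(step_sizes, problems, initial_points):
    """Mean over problems of sum_{t=1}^T f(x_t), with T = len(step_sizes)."""
    total = 0.0
    for (A, y), x0 in zip(problems, initial_points):
        x = np.array(x0, dtype=float)
        for theta_t in step_sizes:
            x = x - theta_t * (A.T @ (A @ x - y))     # x_{t+1} from a gradient step
            total += 0.5 * np.sum((A @ x - y) ** 2)   # accumulate f(x_{t+1})
    return total / len(problems)

# Toy data: identity forward operator, so a unit step reaches the minimiser at once.
problems = [(np.eye(3), np.ones(3)), (np.eye(3), 2.0 * np.ones(3))]
initial_points = [np.zeros(3), np.zeros(3)]

loss_no_steps = unrolled_loss([0.0, 0.0], problems, initial_points)
loss_unit_steps = unrolled_loss([1.0, 1.0], problems, initial_points)
print(loss_no_steps, loss_unit_steps)
```

Learning then amounts to minimising this quantity over `step_sizes`; richer update maps simply replace the gradient step inside the inner loop.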

Algorithm unrolling [16], otherwise known as unrolling, directly parameterises the update step as a 'neural network', often taking previous iterates and gradients as arguments. Parameters $\theta_0, \cdots, \theta_{T-1}$ are found to approximate the solution of (1.2.3). These methods have been empirically shown to speed up optimisation in various settings. However, many learned optimisation solvers do not have convergence guarantees, including those using reinforcement learning [15] and RNNs [1]. Others come with provable convergence; for example, Banert et al. developed a method [2] for nonsmooth optimisation inspired by proximal splitting methods. Such methods, however, often greatly limit the number of learnable parameters and, therefore, the extent to which the algorithm can be adapted to a particular problem class. Learned optimisation algorithms also exist where the parameters $\theta_t$ are chosen constant throughout iterations, i.e. $\theta_t = \theta$ for all $t \in \{0, \cdots, T-1\}$. For example, [20] learn mirror maps using input-convex neural networks within the mirror descent optimisation algorithm.

1.3 Our approach

We consider learning a preconditioner $P_t$ at each iteration of gradient descent. Therefore, we seek to learn a parametrised update map of the form

$$x_{t+1} = x_t - G_{\theta_t}^t(x_t, \nabla f(x_t), z_t)\, \nabla f(x_t). \tag{1.3.1}$$

We simplify the learning procedure by reducing this to the following parametrisation:

$$x_{t+1} = x_t - G_{\theta_t} \nabla f(x_t). \tag{1.3.2}$$

In other words,

  • The parametrisation is constant throughout different iterations: $G_{\theta_t}^t = G_{\theta_t}$ for all $t$ (but with potentially different parameters).

  • $G_{\theta_t} \in \mathbb{R}^{n \times n}$ is a matrix and does not take any inputs.
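
The simplified update (1.3.2) can be sketched as follows. The function name, the quadratic test objective and the particular matrices are assumptions for this example; in particular, the non-symmetric matrix `G0` is chosen only to show that the update itself does not require symmetry.

```python
import numpy as np

def preconditioned_gd(grad, x0, preconditioners):
    """Apply x_{t+1} = x_t - G_t grad(x_t) for each matrix G_t; return all iterates."""
    iterates = [np.asarray(x0, dtype=float)]
    for G in preconditioners:
        x = iterates[-1]
        iterates.append(x - G @ grad(x))
    return iterates

# Illustrative quadratic f(x) = 0.5 * x^T H x - b^T x.
H = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ H @ x - b @ x
grad = lambda x: H @ x - b

G0 = np.array([[0.2, 0.1], [0.0, 0.3]])   # neither symmetric nor a multiple of I
G1 = np.linalg.inv(H)                     # inverse Hessian, for comparison
iterates = preconditioned_gd(grad, np.zeros(2), [G0, G1])
objective_values = [f(x) for x in iterates]
print(objective_values)
```

Each iteration uses its own matrix, and no state other than the current iterate and its gradient is passed to the update.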

We create an optimisation algorithm that is provably convergent on training data without requiring the learned preconditioners to be symmetric positive-definite. Using greedy learning and specific parametrisations $G_\theta$, we formulate parameter learning as a convex optimisation problem. Therefore, any local minimiser is a global minimiser, removing the issue of being 'stuck' in local optima. We also derive closed-form preconditioners in the case of least-squares objective functions $f$. Finally, we investigate the generalisation properties of these methods on out-of-sample data (data within the class used in training but not seen in training) and out-of-distribution data (data with a different distribution to those used in training).

2 Background

We require the following definitions.

Definition 2.1 (Convexity).
A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex if for all $x, y \in \mathbb{R}^n$ and for all $\lambda \in [0, 1]$:

$$f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y). \tag{2.0.1}$$

Definition 2.2 (Strong convexity).
A function $f: \mathbb{R}^n \to \mathbb{R}$ is strongly convex with parameter $\mu > 0$ if $f - \frac{\mu}{2}\|\cdot\|_2^2$ is convex.

Definition 2.3 ($L$-smoothness).
A function $f: \mathbb{R}^n \to \mathbb{R}$ is $L$-smooth with parameter $L > 0$ if its gradient is Lipschitz continuous, i.e., if for all $x, y \in \mathbb{R}^n$, the following inequality holds:

$$\|\nabla f(x) - \nabla f(y)\|_2 \leq L \|x - y\|_2. \tag{2.0.2}$$

We say

  • $f \in \mathcal{F}_L^{1,1}$ if $f$ is convex, continuously differentiable and $L$-smooth.

  • $f \in \mathcal{F}_{L,\mu}^{1,1}$ if, in addition, $f$ is $\mu$-strongly convex.
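
For a least-squares objective $f(x) = \frac{1}{2}\|Ax - y\|_2^2$, the Hessian is $A^T A$, so $L$ and $\mu$ can be read off as its largest and smallest eigenvalues (with $\mu > 0$ only when $A$ has full column rank). The sketch below computes both for a small matrix and spot-checks inequality (2.0.2) at random points; the matrix `A`, the vector `y` and the helper names are illustrative assumptions.

```python
import numpy as np

def smoothness_and_convexity_constants(A):
    """Return (L, mu) for f(x) = 0.5 * ||A x - y||^2 via the eigenvalues of A^T A."""
    eigenvalues = np.linalg.eigvalsh(A.T @ A)   # sorted in ascending order
    return eigenvalues[-1], eigenvalues[0]

A = np.array([[3.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
grad = lambda x: A.T @ (A @ x - y)

L, mu = smoothness_and_convexity_constants(A)
print(L, mu)

# Spot-check the Lipschitz inequality (2.0.2) on random pairs of points.
rng = np.random.default_rng(0)
pairs = [(rng.standard_normal(2), rng.standard_normal(2)) for _ in range(100)]
lipschitz_ok = all(
    np.linalg.norm(grad(u) - grad(v)) <= L * np.linalg.norm(u - v) + 1e-12
    for u, v in pairs
)
print(lipschitz_ok)
```

The ratio $L / \mu$ is the condition number of the problem, which governs how slowly gradient descent with a scalar step size converges.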

3 Greedy preconditioning

With these definitions, we can now formulate our method. This paper considers learning from a class of functions given by some $\mathcal{F}$. With this in mind, we consider a dataset of functions $\{f_1, \cdots, f_N\}$ with initial points $\{x_1^0, \cdots, x_N^0\}$ and minimisers $\{x_1^*, \cdots, x_N^*\}$, respectively. Throughout this paper, we assume that $f_k \in \mathcal{F}_{L_k}^{1,1}$ for some $L_k > 0$, for each $k \in \{1, \cdots, N\}$.

We parametrise a preconditioner $G_{\theta_t} \in \mathbb{R}^{n \times n}$ at each iteration $t$ in the preconditioned gradient descent algorithm (1.3.2), and restrict $G_{\theta_t}$ such that $G_{\theta_t} \nabla f_k(x_k^t)$ is affine in the parameters $\theta_t \in \mathbb{R}^r$, where $x_k^t$ represents the iterate at iteration $t$ for datapoint $k$. Then there exist $B_k^t \in \mathbb{R}^{n \times r}$ and $v_k^t \in \mathbb{R}^n$ such that

$$x_k^t - G_\theta(\nabla f_k(x_k^t)) = v_k^t - B_k^t \theta. \tag{3.0.1}$$
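
To make (3.0.1) concrete, the sketch below builds $B_k^t$ for a scalar step size, a diagonal matrix and a full matrix, writing $g = \nabla f_k(x_k^t)$. In each case $v_k^t = x_k^t$; the row-major flattening of the full matrix into $\theta$ and all variable names are conventions chosen for this example, not notation from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
x = rng.standard_normal(n)      # stands in for the iterate x_k^t
g = rng.standard_normal(n)      # stands in for the gradient of f_k at x_k^t

# Scalar step size: G = alpha * I, theta = [alpha], B = g as a column (n x 1).
alpha = np.array([0.7])
B_scalar = g[:, None]

# Diagonal matrix: G = diag(p), theta = p, B = diag(g) (n x n).
p = rng.standard_normal(n)
B_diag = np.diag(g)

# Full matrix: G = P, theta = P flattened row by row, B = kron(I, g^T) (n x n^2).
P = rng.standard_normal((n, n))
B_full = np.kron(np.eye(n), g[None, :])

# In every case the update x - G g equals v - B theta with v = x.
checks = [
    np.allclose(x - alpha[0] * g, x - B_scalar @ alpha),
    np.allclose(x - np.diag(p) @ g, x - B_diag @ p),
    np.allclose(x - P @ g, x - B_full @ P.reshape(-1)),
]
print(checks)
```

In all three cases the dependence on $\theta$ is linear, so the affine requirement is met with $r = 1$, $r = n$ and $r = n^2$ parameters respectively.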

Note that when $G_\theta \nabla f_k(x_k^t)$ is affine in $\theta$, the optimisation problem

$$\min_\theta f_k(x_k^t - G_\theta \nabla f_k(x_k^t)) \tag{3.0.2}$$

is convex, as it is the composition of a convex function with an affine function [3]. Therefore, if a local minimiser exists, it is a global minimiser, and we avoid local minima traps. Note that if $G_\theta = \theta I$, then the problem (3.0.2) reduces to exact line search for $f_k$.
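
As a worked instance of the case $G_\theta = \theta I$, take the least-squares objective $f_k(x) = \frac{1}{2}\|Ax - y\|_2^2$ (an assumption for this example). Writing $g = \nabla f_k(x_k^t) = A^T(Ax_k^t - y)$, problem (3.0.2) is a one-dimensional quadratic in $\theta$ whose minimiser is $\theta^* = \|g\|_2^2 / \|Ag\|_2^2$ whenever $Ag \neq 0$. The sketch below checks this formula numerically against a grid of alternative step sizes; the data and names are illustrative.

```python
import numpy as np

def exact_line_search_step(A, y, x):
    """Minimiser over theta of 0.5 * ||A (x - theta * g) - y||^2, g = A^T (A x - y)."""
    g = A.T @ (A @ x - y)
    Ag = A @ g
    return (g @ g) / (Ag @ Ag)      # assumes A g is nonzero

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
y = rng.standard_normal(5)
x = np.zeros(3)

f = lambda z: 0.5 * np.sum((A @ z - y) ** 2)
g = A.T @ (A @ x - y)
theta_star = exact_line_search_step(A, y, x)
value_at_theta_star = f(x - theta_star * g)

# No step size on the grid should give a lower objective value than theta_star.
grid = np.linspace(0.0, 2.0 * theta_star, 201)
best_on_grid = min(f(x - theta * g) for theta in grid)
print(theta_star, value_at_theta_star, best_on_grid)
```

The same reasoning carries over to the richer parametrisations: as long as the update is affine in $\theta$ and $f_k$ is least-squares, the problem remains a linear least-squares problem in $\theta$.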

If instead we chose to learn $T$ sets of parameters $\theta_t$ for $t \in \{0, 1, 2, \cdots, T-1\}$ simultaneously, such that

$$(\theta_0, \cdots, \theta_{T-1}) \in \operatorname*{argmin}_{\tilde{\theta}_0, \cdots, \tilde{\theta}_{T-1}} f(x_T(\tilde{\theta}_0, \cdots, \tilde{\theta}_{T-1})), \tag{3.0.3}$$

we would obtain a nonconvex optimisation problem for $T > 1$ (shown in Appendix A), where

$$x_{t+1}(\tilde{\theta}_0, \cdots, \tilde{\theta}_t) = x_t(\tilde{\theta}_0, \cdots, \tilde{\theta}_{t-1}) - G_{\tilde{\theta}_t}(\nabla f(x_t(\tilde{\theta}_0, \cdots, \tilde{\theta}_{t-1}))). \tag{3.0.4}$$

Consider the optimisation problem at iteration $t$ given by

$$\theta_t^* \in \operatorname*{argmin}_\theta \left\{ g_t(\theta) := \frac{1}{N} \sum_{k=1}^{N} f_k(x_k^t - G_\theta \nabla f_k(x_k^t)) \right\}. \tag{3.0.5}$$

As $G_\theta \nabla f_k(x_k^t)$ is affine in $\theta$ and each function $f_k$ is convex, this optimisation problem is convex. Given the current iterates $x_k^t$ for $k \in \{1, \cdots, N\}$, we seek the optimal greedy parameters $\theta_t$ at iteration $t$, that is, those that minimise the mean of $f_k(x_k^{t+1})$ over $k$. In learning-to-optimise, learning parameters is often a non-convex optimisation problem. Therefore, the performance of learned optimisers is highly dependent on the optimisation algorithm used for training and its hyperparameters. However, as our problem is convex, one can use any convex optimisation algorithm with convergence guarantees.
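
The greedy problem (3.0.5) can be sketched for a diagonal preconditioner $G_p = \operatorname{diag}(p)$ under the assumption that every $f_k$ is a least-squares objective $f_k(x) = \frac{1}{2}\|A_k x - y_k\|_2^2$. Then $f_k(x_k^t - \operatorname{diag}(p) g_k) = \frac{1}{2}\|r_k - A_k \operatorname{diag}(g_k) p\|_2^2$ with $g_k = \nabla f_k(x_k^t)$ and $r_k = A_k x_k^t - y_k$, so $g_t$ is itself a linear least-squares problem in $p$. This is one possible way to solve the problem and is not claimed to be the closed form derived in the paper; the random data and function names are hypothetical.

```python
import numpy as np

def greedy_diagonal_parameters(problems, iterates):
    """Minimise (3.0.5) over p for G_p = diag(p) and least-squares f_k."""
    blocks, residuals = [], []
    for (A, y), x in zip(problems, iterates):
        r = A @ x - y
        g = A.T @ r                 # gradient of f_k at x_k^t
        blocks.append(A * g)        # equals A @ diag(g): column j scaled by g[j]
        residuals.append(r)
    p, *_ = np.linalg.lstsq(np.vstack(blocks), np.concatenate(residuals), rcond=None)
    return p

def mean_objective_after_step(p, problems, iterates):
    """Evaluate g_t(p): the mean of f_k(x_k^t - diag(p) grad f_k(x_k^t))."""
    values = []
    for (A, y), x in zip(problems, iterates):
        g = A.T @ (A @ x - y)
        values.append(0.5 * np.sum((A @ (x - p * g) - y) ** 2))
    return float(np.mean(values))

rng = np.random.default_rng(0)
problems = [(rng.standard_normal((4, 3)), rng.standard_normal(4)) for _ in range(5)]
iterates = [np.zeros(3) for _ in problems]

p_star = greedy_diagonal_parameters(problems, iterates)
loss_greedy = mean_objective_after_step(p_star, problems, iterates)
loss_no_step = mean_objective_after_step(np.zeros(3), problems, iterates)
loss_scalar = min(
    mean_objective_after_step(a * np.ones(3), problems, iterates)
    for a in np.linspace(0.0, 1.0, 101)
)
print(loss_greedy, loss_scalar, loss_no_step)
```

Since every scalar step size corresponds to a particular choice of $p$, the diagonal optimum can never be worse on the training set than the best scalar step, which the comparison with `loss_scalar` illustrates.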

We focus on the following four parameterisations of Gθsubscript𝐺𝜃G_{\theta}italic_G start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT, presented in Table 1.

| Label | Description | Parametrisation |
|---|---|---|
| (P1) | A scalar step size | $G_\alpha = \alpha I,\ \alpha \in \mathbb{R}$ |
| (P2) | A diagonal matrix | $G_{p_t} = \operatorname{diag}(p_t),\ p_t \in \mathbb{R}^n$ |
| (P3) | A full matrix | $G_{P_t} = P_t \in \mathbb{R}^{n \times n}$ |
| (P4) | Image convolution | $G_{\kappa_t} x = \kappa_t \ast x,\ \kappa_t \in \mathbb{R}^{m_1 \times m_2}$ |

Table 1: Parametrisations
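To make Table 1 concrete, the following is a minimal NumPy sketch of how each parametrisation could act on a gradient vector. The function names are hypothetical, and the zero-padded "same" boundary handling and odd kernel size in (P4) are assumptions made for illustration; the table does not fix a boundary condition.

```python
import numpy as np

def apply_scalar(alpha, grad):
    """(P1): G_alpha = alpha * I, a single parameter."""
    return alpha * grad

def apply_diagonal(p, grad):
    """(P2): G_p = diag(p), applied elementwise without forming the matrix."""
    return p * grad

def apply_full(P, grad):
    """(P3): G_P = P, a dense n x n matrix with n^2 parameters."""
    return P @ grad

def apply_convolution(kernel, grad, image_shape):
    """(P4): G_kappa x = kappa * x as a 2D convolution.

    Assumptions (not from the paper): zero padding, output of the same
    size as the image, and odd kernel dimensions.
    """
    img = grad.reshape(image_shape)
    m1, m2 = kernel.shape
    padded = np.pad(img, ((m1 // 2, m1 // 2), (m2 // 2, m2 // 2)))
    flipped = kernel[::-1, ::-1]  # a true convolution flips the kernel
    out = np.zeros(img.shape, dtype=float)
    for i in range(m1):
        for j in range(m2):
            # Accumulate the shifted copy of the image weighted by one kernel entry.
            out += flipped[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out.reshape(-1)

# Parameter counts for an image with n = 64 * 64 pixels and a 5 x 5 kernel.
n, m1, m2 = 64 * 64, 5, 5
param_counts = {"P1": 1, "P2": n, "P3": n * n, "P4": m1 * m2}
```

Under these assumptions, a kernel that is zero except for a one at its centre reproduces the identity, so (P4) contains plain gradient descent with unit step size as a special case.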

The number of parameters varies widely across these four parametrisations when $n$ is large: (P1) has a single parameter, (P2) has $n$, (P3) has $n^2$, and (P4) has $m_1 m_2$. Increasing the number of parameters may make $G_\theta$ more expressive, enabling better performance on training data, but it may also lower generalisation performance on out-of-sample data. Moreover, some parametrisations may be more expressive than others with as many or even fewer parameters, as seen in section 7. Note that a full-matrix preconditioner has the same memory footprint as Newton's method, which stores an $n \times n$ Hessian.

Suppose that training is terminated after $T$ iterations, having learned parameters $\theta_0, \cdots, \theta_{T-1}$, for example due to an a priori choice of stopping iteration $T$ or some stopping condition. Then, preconditioners $G_{\theta_0}, \cdots, G_{\theta_{T-1}}$ are learned and, for a new test function $f$ with initial point $x_0$, the minimum of $f$ can be approximated using the following optimisation algorithm:

$$x_{t+1} = \begin{cases} x_t - G_{\theta_t} \nabla f(x_t), & \text{if } t < T, \\ x_t - G_{\theta_{T-1}} \nabla f(x_t), & \text{otherwise}. \end{cases} \tag{3.0.6}$$

Another option is to 'recycle' the learned parameters $\theta_0, \cdots, \theta_{T-1}$, such that at iteration $t$, the parameters $\theta_{t \bmod T}$ are used.
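The two test-time strategies, reusing the last learned preconditioner as in (3.0.6) or recycling the sequence, differ only in how the index is chosen. A minimal sketch follows; the function name and the representation of each preconditioner as a callable acting on a gradient are assumptions made for illustration.

```python
import numpy as np

def run_learned_gd(grad_f, x0, preconditioners, num_iters, recycle=False):
    """Iterate x_{t+1} = x_t - G_{theta_idx} grad_f(x_t).

    preconditioners: list of T callables, each mapping a gradient g to
    G_theta g (a hypothetical interface, not prescribed by the paper).
    recycle=False uses theta_t for t < T and theta_{T-1} afterwards;
    recycle=True uses theta_{t mod T}.
    """
    T = len(preconditioners)
    x = np.asarray(x0, dtype=float)
    for t in range(num_iters):
        idx = t % T if recycle else min(t, T - 1)
        x = x - preconditioners[idx](grad_f(x))
    return x

# Toy least-squares problem f(x) = 0.5 * ||A x - y||^2 with A = diag(1, 2).
A = np.diag([1.0, 2.0])
y = np.array([1.0, 2.0])
grad_f = lambda x: A.T @ (A @ x - y)

# A single diagonal preconditioner equal to the inverse of A^T A.
learned = [lambda g: np.array([1.0, 0.25]) * g]
x_final = run_learned_gd(grad_f, np.zeros(2), learned, num_iters=3)
```

In this toy case the preconditioner inverts $A^T A$ exactly, so the first step already lands on the minimiser and later steps leave it unchanged.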

In section 4, we restrict each $f_k$ to least-squares problems and derive closed-form solutions for parametrisations (P1), (P2) and (P3). In particular, we will see that a diagonal preconditioner can cause preconditioned gradient descent to converge instantly. In section 5, we will see how to approximate the optimal preconditioner using optimisation for a more general class of functions. Then, in section 6, we provide convergence results, including rates for the closed-form and approximated greedy preconditioners for all parametrisations $G_\theta$. Following this, in section 7, we apply these methods to a series of problems and compare performance with classical optimisation methods and other learned approaches.

4 Closed-form solutions

In this section, we assume each $f_k$ can be written as

$$f_k(x) = \frac{1}{2}\|A_k x - y^k\|_2^2, \tag{4.0.1}$$

with corresponding $y^k \in \mathbb{R}^m$ and forward model $A_k \in \mathbb{R}^{m \times n}$. Under these assumptions, there exists a closed-form solution for affine parametrisations.

Proposition 4.1.

For an affine parametrisation $G_\theta$, let $B_k^t, v_k^t$ be given as in (3.0.1). If every $f_k$ is a least-squares problem of the form (4.0.1), then $\theta_t$ given by

$$\theta_t = \left(\frac{1}{N}\sum_{k=1}^{N}(A_k B_k^t)^T (A_k B_k^t)\right)^{\dagger}\left(\frac{1}{N}\sum_{k=1}^{N}(B_k^t)^T \nabla f_k(v_k^t)\right) \tag{4.0.2}$$

is the least-norm solution to (3.0.5), where $M^{\dagger}$ denotes the Moore-Penrose pseudoinverse of a matrix $M$.

Proof.

Because this problem is convex, any $\theta_t$ at which the gradient vanishes is a global minimiser. First, note that

\begin{align}
f_k(v_k^t - B_k^t\theta) &= \frac{1}{2}\|A_k(v_k^t - B_k^t\theta) - y^k\|_2^2 \tag{4.0.3}\\
&= \frac{1}{2}\|A_k v_k^t - y^k\|_2^2 + \frac{1}{2}\|{-A_k B_k^t\theta}\|_2^2 + \langle -A_k B_k^t\theta, A_k v_k^t - y^k\rangle \tag{4.0.4}\\
&= \frac{1}{2}\|A_k v_k^t - y^k\|_2^2 + \frac{1}{2}\|A_k B_k^t\theta\|_2^2 - \langle \theta, (B_k^t)^T \nabla f_k(v_k^t)\rangle. \tag{4.0.5}
\end{align}
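The step from (4.0.4) to (4.0.5) relies on the gradient of the least-squares objective (4.0.1), which is left implicit. Spelled out, with the transpose moved across the inner product:

$$\nabla f_k(v) = A_k^T(A_k v - y^k), \qquad \langle -A_k B_k^t\theta, A_k v_k^t - y^k\rangle = -\langle \theta, (B_k^t)^T A_k^T (A_k v_k^t - y^k)\rangle = -\langle \theta, (B_k^t)^T \nabla f_k(v_k^t)\rangle.$$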

Now,

\begin{align}
&\nabla_\theta \frac{1}{N}\sum_{k=1}^{N} f_k(v_k^t - B_k^t\theta) \tag{4.0.7}\\
&= \nabla_\theta \frac{1}{N}\sum_{k=1}^{N}\left\{\frac{1}{2}\|A_k v_k^t - y^k\|_2^2 + \frac{1}{2}\|A_k B_k^t\theta\|_2^2 - \langle \theta, (B_k^t)^T \nabla f_k(v_k^t)\rangle\right\} \tag{4.0.8}\\
&= \frac{1}{N}\sum_{k=1}^{N}\left\{(A_k B_k^t)^T (A_k B_k^t\theta) - (B_k^t)^T \nabla f_k(v_k^t)\right\} \tag{4.0.9}
\end{align}

is equal to zero if and only if

$$\frac{1}{N}\sum_{k=1}^{N}(A_k B_k^t)^T (A_k B_k^t\theta) = \frac{1}{N}\sum_{k=1}^{N}(B_k^t)^T \nabla f_k(v_k^t). \tag{4.0.10}$$

Note that

$$\frac{1}{N}\sum_{k=1}^{N}(A_k B_k^t)^T (A_k B_k^t\theta) = \left(\frac{1}{N}\sum_{k=1}^{N}(A_k B_k^t)^T (A_k B_k^t)\right)\theta, \tag{4.0.11}$$

and so $\theta$ can be given by

$$\theta = \left(\frac{1}{N}\sum_{k=1}^{N}(A_k B_k^t)^T (A_k B_k^t)\right)^{\dagger}\left(\frac{1}{N}\sum_{k=1}^{N}(B_k^t)^T \nabla f_k(v_k^t)\right). \tag{4.0.12}$$

Due to the properties of the pseudoinverse, this is the least-norm solution. ∎
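The closed form (4.0.2) translates directly into a few lines of NumPy. The sketch below is illustrative only: the function name is hypothetical, and the choices of $B_k^t$ and $v_k^t$ in the usage example assume that (3.0.1), which lies outside this section, writes the update as $x_k^t - G_\theta \nabla f_k(x_k^t) = v_k^t - B_k^t\theta$, so that $v_k^t = x_k^t$, with $B_k^t = \nabla f_k(x_k^t)$ as a column for (P1) and $B_k^t = \operatorname{diag}(\nabla f_k(x_k^t))$ for (P2).

```python
import numpy as np

def greedy_parameters(A_list, B_list, v_list, y_list):
    """Least-norm solution (4.0.2) for least-squares f_k.

    A_list[k]: (m, n) forward model, B_list[k]: (n, d) matrix B_k^t,
    v_list[k]: (n,) vector v_k^t, y_list[k]: (m,) data y^k.
    """
    N = len(A_list)
    d = B_list[0].shape[1]
    H = np.zeros((d, d))
    b = np.zeros(d)
    for A, B, v, y in zip(A_list, B_list, v_list, y_list):
        AB = A @ B
        H += AB.T @ AB / N                  # average of (A_k B_k)^T (A_k B_k)
        b += B.T @ (A.T @ (A @ v - y)) / N  # average of B_k^T grad f_k(v_k)
    # The pseudoinverse returns the least-norm solution when H is singular.
    return np.linalg.pinv(H) @ b

# Usage on a single toy problem (N = 1) with A = diag(1, 2), starting from x = 0.
A = np.diag([1.0, 2.0])
y = np.array([1.0, 2.0])
x = np.zeros(2)
grad = A.T @ (A @ x - y)

# (P1) scalar step size: B is the gradient as a column vector (assumed form).
alpha = greedy_parameters([A], [grad[:, None]], [x], [y])

# (P2) diagonal preconditioner: B is diag(grad) (assumed form).
p = greedy_parameters([A], [np.diag(grad)], [x], [y])
x_next = x - p * grad
```

For (P1) this reduces to the exact line-search step size for a quadratic, while for (P2) on this single diagonal problem the update reaches the minimiser in one step.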

The following proposition tells us when the parameters satisfying (3.0.5) are unique.

Proposition 4.2.

Suppose $f_k$ is convex and twice continuously differentiable for $k \in \{1,\cdots,N\}$. Furthermore, suppose there exists some $j \in \{1,\cdots,N\}$ for which $B_j^t$ is injective and $f_j$ is $\mu_j$-strongly convex. Then $g_t(\theta)$ defined in (3.0.5) is strongly convex and has a unique global minimiser $\theta_t^*$.
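As a numerical sanity check of the proposition (not a substitute for the proof), the sketch below specialises to least-squares $f_k$, for which $\nabla^2 f_k = A_k^T A_k$ and $\mu_j$ can be taken as the smallest eigenvalue of $A_j^T A_j$. It compares the smallest eigenvalue of $\nabla^2 g_t$ with the lower bound $\mu_j \lambda_{\text{min}}^j / N$, where $\lambda_{\text{min}}^j$ is the smallest eigenvalue of $(B_j^t)^T B_j^t$. The function name, the random test data, and the dimensions are assumptions made for illustration.

```python
import numpy as np

def hessian_lower_bound_check(A_list, B_list, j):
    """Return (smallest eigenvalue of the Hessian of g_t, mu_j * lambda_min^j / N)."""
    N = len(A_list)
    # Hessian of g_t for least squares: average of B_k^T A_k^T A_k B_k.
    hess = sum(B.T @ A.T @ A @ B for A, B in zip(A_list, B_list)) / N
    mu_j = np.linalg.eigvalsh(A_list[j].T @ A_list[j]).min()
    lam_j = np.linalg.eigvalsh(B_list[j].T @ B_list[j]).min()
    return np.linalg.eigvalsh(hess).min(), mu_j * lam_j / N

rng = np.random.default_rng(0)
A_list = [rng.standard_normal((6, 4)) for _ in range(3)]  # m = 6, n = 4
B_list = [rng.standard_normal((4, 2)) for _ in range(3)]  # two parameters
lam_min, bound = hessian_lower_bound_check(A_list, B_list, j=0)
```

Tall random Gaussian matrices are injective with probability one, so both hypotheses of the proposition hold for `j=0` here and `lam_min` should be no smaller than `bound`.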

Proof.

Each $f_k$ is twice continuously differentiable; therefore, $g_t$ is twice continuously differentiable. It is then sufficient to show there exists $m > 0$ such that

$$\nabla^2 g_t(\theta) \succeq mI, \tag{4.0.13}$$

for all $\theta$, as this implies that $g_t$ is strongly convex and has a unique global minimiser. Note that

$$\nabla^2 g_t(\theta) = \frac{1}{N}\sum_{k=1}^{N}(B_k^t)^T \nabla^2 f_k(v_k^t - B_k^t\theta) B_k^t. \tag{4.0.14}$$

Each $f_k$ is convex and so for all $v \in \mathbb{R}^n$,

$$v^T \nabla^2 f_k(v_k^t - B_k^t\theta) v \geq 0, \tag{4.0.15}$$

and $f_j$ is $\mu_j$-strongly convex, therefore

$$v^T \nabla^2 f_j(v_j^t - B_j^t\theta) v \geq \mu_j\|v\|_2^2. \tag{4.0.16}$$

For vn𝑣superscript𝑛v\in\mathbb{R}^{n}italic_v ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT,

vT(1Nk=1N(Bkt)T2fk(vktBktθ)Bkt)vsuperscript𝑣𝑇1𝑁superscriptsubscript𝑘1𝑁superscriptsuperscriptsubscript𝐵𝑘𝑡𝑇superscript2subscript𝑓𝑘superscriptsubscript𝑣𝑘𝑡superscriptsubscript𝐵𝑘𝑡𝜃superscriptsubscript𝐵𝑘𝑡𝑣\displaystyle v^{T}\left(\frac{1}{N}\sum_{k=1}^{N}(B_{k}^{t})^{T}\nabla^{2}f_{% k}(v_{k}^{t}-B_{k}^{t}\theta)B_{k}^{t}\right)vitalic_v start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_θ ) italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) italic_v (4.0.17)
=1Nk=1NvT(Bkt)T2fk(vktBktθ)Bktvabsent1𝑁superscriptsubscript𝑘1𝑁superscript𝑣𝑇superscriptsuperscriptsubscript𝐵𝑘𝑡𝑇superscript2subscript𝑓𝑘superscriptsubscript𝑣𝑘𝑡superscriptsubscript𝐵𝑘𝑡𝜃superscriptsubscript𝐵𝑘𝑡𝑣\displaystyle=\frac{1}{N}\sum_{k=1}^{N}v^{T}(B_{k}^{t})^{T}\nabla^{2}f_{k}(v_{% k}^{t}-B_{k}^{t}\theta)B_{k}^{t}v= divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_v start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_θ ) italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_v (4.0.18)
=1Nk=1N(Bktv)T2fk(vktBktθ)(Bktv)absent1𝑁superscriptsubscript𝑘1𝑁superscriptsuperscriptsubscript𝐵𝑘𝑡𝑣𝑇superscript2subscript𝑓𝑘superscriptsubscript𝑣𝑘𝑡superscriptsubscript𝐵𝑘𝑡𝜃superscriptsubscript𝐵𝑘𝑡𝑣\displaystyle=\frac{1}{N}\sum_{k=1}^{N}(B_{k}^{t}v)^{T}\nabla^{2}f_{k}(v_{k}^{% t}-B_{k}^{t}\theta)(B_{k}^{t}v)= divide start_ARG 1 end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT ( italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_v ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_v start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT - italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_θ ) ( italic_B start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_v ) (4.0.19)
1Nμj(Bjtv)T(Bjtv)absent1𝑁subscript𝜇𝑗superscriptsuperscriptsubscript𝐵𝑗𝑡𝑣𝑇superscriptsubscript𝐵𝑗𝑡𝑣\displaystyle\geq\frac{1}{N}\mu_{j}(B_{j}^{t}v)^{T}(B_{j}^{t}v)≥ divide start_ARG 1 end_ARG start_ARG italic_N end_ARG italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ( italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_v ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_v ) (4.0.20)
=1NμjvT(Bjt)TBjtvabsent1𝑁subscript𝜇𝑗superscript𝑣𝑇superscriptsuperscriptsubscript𝐵𝑗𝑡𝑇superscriptsubscript𝐵𝑗𝑡𝑣\displaystyle=\frac{1}{N}\mu_{j}v^{T}(B_{j}^{t})^{T}B_{j}^{t}v= divide start_ARG 1 end_ARG start_ARG italic_N end_ARG italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_v start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ( italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_B start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT italic_v (4.0.21)
1Nμjλminjv22absent1𝑁subscript𝜇𝑗superscriptsubscript𝜆min𝑗superscriptsubscriptnorm𝑣22\displaystyle\geq\frac{1}{N}\mu_{j}\lambda_{\text{min}}^{j}\|v\|_{2}^{2}≥ divide start_ARG 1 end_ARG start_ARG italic_N end_ARG italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT min end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∥ italic_v ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (4.0.22)
=(1Nμjλminj)v22,absent1𝑁subscript𝜇𝑗superscriptsubscript𝜆min𝑗superscriptsubscriptnorm𝑣22\displaystyle=\left(\frac{1}{N}\mu_{j}\lambda_{\text{min}}^{j}\right)\|v\|_{2}% ^{2},= ( divide start_ARG 1 end_ARG start_ARG italic_N end_ARG italic_μ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT italic_λ start_POSTSUBSCRIPT min end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) ∥ italic_v ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , (4.0.23)

where $\lambda_{\text{min}}^{j}$ is the minimum eigenvalue of $(B_j^t)^T B_j^t$. Since $(B_j^t)^T B_j^t$ is symmetric positive semi-definite, $\lambda_{\text{min}}^{j} \geq 0$, with $\lambda_{\text{min}}^{j} > 0$ if and only if $B_j^t$ is injective. As $B_j^t$ is injective, we have $\lambda_{\text{min}}^{j} > 0$ and

\begin{align}
&v^T \nabla^2 g_t(\theta) v \tag{4.0.25}\\
&\geq \left(\frac{\mu_j \lambda_{\text{min}}^{j}}{N}\right) \|v\|_2^2 \tag{4.0.26}
\end{align}

and therefore $g_t(\theta)$ is strongly convex. ∎
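The step from (4.0.21) to (4.0.22) is the Rayleigh-quotient bound $\|Bv\|_2^2 \geq \lambda_{\text{min}} \|v\|_2^2$. The following NumPy sketch checks this bound numerically. The matrices `B` and `B_singular` are hypothetical stand-ins for $B_j^t$, chosen only for illustration; they are not data from the paper.

```python
import numpy as np

# Hypothetical tall matrix with linearly independent columns (injective),
# standing in for B_j^t in the proof.
B = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 0.0]])

# eigvalsh returns eigenvalues in ascending order, so index 0 is the
# smallest eigenvalue of the symmetric positive semi-definite matrix B^T B.
lam_min = np.linalg.eigvalsh(B.T @ B)[0]


def bound_holds(v, tol=1e-9):
    """Check ||B v||^2 >= lam_min * ||v||^2 for a single vector v."""
    Bv = B @ v
    return np.dot(Bv, Bv) >= lam_min * np.dot(v, v) - tol


rng = np.random.default_rng(0)
all_hold = all(bound_holds(rng.standard_normal(2)) for _ in range(1000))

# A non-injective matrix (second column is twice the first) has a zero
# smallest eigenvalue, so the lower bound degenerates to zero.
B_singular = np.array([[1.0, 2.0],
                       [2.0, 4.0],
                       [3.0, 6.0]])
lam_min_singular = np.linalg.eigvalsh(B_singular.T @ B_singular)[0]
```

In the non-injective case the bound only gives $v^T \nabla^2 g_t(\theta) v \geq 0$, which is why injectivity of $B_j^t$ is needed to conclude strong convexity.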

This result can then be used when considering least-squares functions.

Corollary 4.1.

Uniqueness of optimal parameters in the least-squares case
When our $f_k$ can be written as least-squares functions (4.0.1), then $g_t(\theta)$ has a unique global minimiser $\theta_t^*$ if there exists some $j \in \{1, \cdots, N\}$ for which both $B_j^t$ and $A_j$ are injective.

Proof.

If $A_j$ is injective then $A_j^T A_j$ is invertible, which means that $f_j(x) = \frac{1}{2}\|A_j x - y^j\|_2^2$ is strongly convex. Since $B_j^t$ is also injective, the preceding result shows that $g_t(\theta)$ is strongly convex and therefore has a unique global minimiser. ∎

4.1 Diagonal preconditioning

We first consider the diagonal parametrisation (P2). With this parametrisation, the optimisation problem (3.0.5) with $G_{p_t} = \operatorname{diag}(p_t)$ has the following closed-form solution.

Proposition 4.3.

The vector $p_t$ defined by

$$p_t = \bigg(\frac{1}{N}\sum_{k=1}^{N} \left(\nabla f_k(x_k^t) \nabla f_k(x_k^t)^T\right) \odot (A_k^T A_k)\bigg)^{\dagger} \bigg(\frac{1}{N}\sum_{k=1}^{N} \nabla f_k(x_k^t) \odot \nabla f_k(x_k^t)\bigg) \tag{4.1.1}$$

is the minimal-norm solution to (3.0.5) with $G_{p_t} = \operatorname{diag}(p_t)$, where $\odot$ represents the Hadamard (element-wise) product.

Furthermore, suppose that there exists $j \in \{1, \cdots, N\}$ such that $A_j$ is injective ($A_j^T A_j$ is invertible) and that $[\nabla f_j(x_j^t)]_i \neq 0$ for all $i \in \{1, \cdots, n\}$, where $[v]_i$ denotes the $i^{\text{th}}$ component of the vector $v$. Then the inverse exists, and one can write

$$p_t = \bigg(\frac{1}{N}\sum_{k=1}^{N} \left(\nabla f_k(x_k^t) \nabla f_k(x_k^t)^T\right) \odot (A_k^T A_k)\bigg)^{-1} \bigg(\frac{1}{N}\sum_{k=1}^{N} \nabla f_k(x_k^t) \odot \nabla f_k(x_k^t)\bigg) \tag{4.1.2}$$

Proof.

For a diagonal preconditioner $\operatorname{diag}(p)$ with $p \in \mathbb{R}^n$ we have that

$$x_k^{t+1} = x_k^t - \operatorname{diag}(p) \nabla f_k(x_k^t) = x_k^t - \operatorname{diag}(\nabla f_k(x_k^t))\, p \tag{4.1.3}$$

and so we take

\begin{align}
\theta &= p, \tag{4.1.4}\\
B_k^t &= \operatorname{diag}(\nabla f_k(x_k^t)), \tag{4.1.5}\\
v_k^t &= x_k^t. \tag{4.1.6}
\end{align}

Now,

\begin{align}
(A_k B_k^t)^T (A_k B_k^t) &= \operatorname{diag}(\nabla f_k(x_k^t))\, A_k^T A_k \operatorname{diag}(\nabla f_k(x_k^t)) \tag{4.1.7}\\
&= (A_k^T A_k) \odot (\nabla f_k(x_k^t) \nabla f_k(x_k^t)^T), \tag{4.1.8}
\end{align}

and

$$(B_k^t)^T \nabla f_k(v_k^t) = \operatorname{diag}(\nabla f_k(x_k^t)) \nabla f_k(v_k^t) = \nabla f_k(x_k^t) \odot \nabla f_k(x_k^t). \tag{4.1.9}$$

Inserting these values in (4.0.2) gives

$$p = \left(\frac{1}{N}\sum_{k=1}^{N} (A_k^T A_k) \odot (\nabla f_k(x_k^t) \nabla f_k(x_k^t)^T)\right)^{\dagger} \left(\frac{1}{N}\sum_{k=1}^{N} \nabla f_k(x_k^t) \odot \nabla f_k(x_k^t)\right). \tag{4.1.10}$$

In this case we have $B_k^t = \operatorname{diag}(\nabla f_k(x_k^t))$ and $v_k^t = x_k^t$. Therefore $B_k^t$ is injective if and only if $[\nabla f_k(x_k^t)]_i \neq 0$ for all $i \in \{1, \cdots, n\}$. Under the stated assumptions, Proposition 4.2 then gives a unique solution, and so the inverse exists. ∎
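As an illustration, the closed form (4.1.1) can be evaluated directly with NumPy. The sketch below uses two small hypothetical least-squares problems (the matrices, data and iterates are made up for the example). It assumes that (3.0.5) minimises the average of $f_k$ at the next iterate, which is consistent with the Hessian bound above but is not restated in this section. It checks the Hadamard identity (4.1.7)-(4.1.8) and the first-order optimality condition for $p$.

```python
import numpy as np


def diagonal_preconditioner(As, grads):
    """Evaluate (4.1.1): pinv(mean((g g^T) * (A^T A))) @ mean(g * g)."""
    M = np.mean([np.outer(g, g) * (A.T @ A) for A, g in zip(As, grads)], axis=0)
    b = np.mean([g * g for g in grads], axis=0)
    return np.linalg.pinv(M) @ b


# Two hypothetical problems f_k(x) = 0.5 * ||A_k x - y_k||^2 (illustrative data).
As = [np.array([[2.0, 0.0, 1.0], [0.0, 3.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 2.0]]),
      np.array([[1.0, 1.0, 0.0], [0.0, 2.0, 1.0], [1.0, 0.0, 3.0], [2.0, 1.0, 1.0]])]
ys = [np.array([1.0, -2.0, 0.5, 3.0]), np.array([0.0, 1.0, -1.0, 2.0])]
xs = [np.zeros(3), np.ones(3)]  # current iterates x_k^t

grads = [A.T @ (A @ x - y) for A, x, y in zip(As, xs, ys)]
p = diagonal_preconditioner(As, grads)

# Identity (4.1.7)-(4.1.8): diag(g) A^T A diag(g) equals (A^T A) * (g g^T),
# where * is the element-wise product.
A0, g0 = As[0], grads[0]
identity_ok = np.allclose(np.diag(g0) @ A0.T @ A0 @ np.diag(g0),
                          (A0.T @ A0) * np.outer(g0, g0))

# First-order optimality in p: sum_k g_k * grad f_k(x_k - p * g_k) vanishes.
new_xs = [x - p * g for x, g in zip(xs, grads)]
stationarity = sum(g * (A.T @ (A @ x_new - y))
                   for A, g, x_new, y in zip(As, grads, new_xs, ys))


def mean_loss(points):
    """Average least-squares loss over the problems at the given points."""
    return np.mean([0.5 * np.sum((A @ x - y) ** 2)
                    for A, x, y in zip(As, points, ys)])
```

Because $p = 0$ leaves every iterate unchanged, the learned step cannot increase the average loss in this setting, so `mean_loss(new_xs)` is at most `mean_loss(xs)`.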

Proposition 4.4.

In the case $N = 1$, we consider a lone function $f = f_1$ with an initial point $x_1^0 = x_0$. Then, the preconditioned gradient descent algorithm (1.3.2) with diagonal preconditioner converges in one iteration.

Proof.

Denote by $x_0$ the starting point and by $x^*$ a global minimiser of $f$. As $p_0$ is chosen to be a global minimiser of $g_0(p)$, it is sufficient to show that there exists some diagonal preconditioner which leads to $x_1 = x^*$. Choose the vector $p_0 \in \mathbb{R}^n$ such that

$$[p_0]_i = \begin{cases} \dfrac{[x_0 - x^*]_i}{[\nabla f(x_0)]_i}, & \text{if } [\nabla f(x_0)]_i \neq 0,\\ 0, & \text{otherwise}, \end{cases} \tag{4.1.11}$$

then let

$$P_0 = \operatorname{diag}(p_0). \tag{4.1.12}$$

Due to the fact that if $[\nabla f(x_0)]_i = 0$ then $[x_0 - x^*]_i = 0$, we have

$$p_0 \odot \nabla f(x_0) = x_0 - x^*. \tag{4.1.13}$$

Then

\begin{align}
x_1 &= x_0 - P_0 \nabla f(x_0) \tag{4.1.14}\\
&= x_0 - p_0 \odot \nabla f(x_0) \tag{4.1.15}\\
&= x_0 - (x_0 - x^*) \tag{4.1.16}\\
&= x^*, \tag{4.1.17}
\end{align}

as required. ∎
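The one-step claim can be checked on a small example. The sketch below assumes a single hypothetical least-squares problem with an injective matrix `A` and a starting point whose gradient has no zero entries (the matrix, data and starting point are illustrative choices). It compares the construction (4.1.11) with the closed form (4.1.2) for $N = 1$ and takes one preconditioned step.

```python
import numpy as np

# Hypothetical single problem f(x) = 0.5 * ||A x - y||^2 with injective A,
# so the least-squares minimiser x_star is unique.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 2.0]])
y = np.array([1.0, -2.0, 0.5, 3.0])
x0 = np.zeros(3)

x_star = np.linalg.lstsq(A, y, rcond=None)[0]
g0 = A.T @ (A @ x0 - y)      # gradient of f at the starting point
assert np.all(g0 != 0)       # this example assumes no zero gradient entries

# Construction (4.1.11) when every gradient entry is non-zero.
p_construct = (x0 - x_star) / g0

# Closed form (4.1.2) with N = 1: solve ((g g^T) * (A^T A)) p = g * g.
M = np.outer(g0, g0) * (A.T @ A)
p_closed = np.linalg.solve(M, g0 * g0)

# One step of preconditioned gradient descent with diag(p_closed).
x1 = x0 - p_closed * g0
```

In this example both constructions give the same vector, and `x1` agrees with `x_star` up to floating-point error.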

4.2 Full matrix preconditioning

Next, we consider the full matrix parametrisation (P3), that is, the optimisation problem (3.0.5) with $G_{P_t} = P_t$.

Proposition 4.5.

Let $\theta_t \in \mathbb{R}^{n^2}$ be such that

$$\theta_t = \bigg(\frac{1}{N}\sum_{k=1}^{N} \left(\nabla f_k(x_k^t) \nabla f_k(x_k^t)^T\right) \otimes (A_k^T A_k)\bigg)^{\dagger} \bigg(\frac{1}{N}\sum_{k=1}^{N} \nabla f_k(x_k^t) \otimes \nabla f_k(x_k^t)\bigg). \tag{4.2.1}$$

Then define $P_t$ by

$$P_t = \begin{pmatrix} [\theta_t]_1 & [\theta_t]_{n+1} & \cdots & [\theta_t]_{(n-1)n+1}\\ [\theta_t]_2 & [\theta_t]_{n+2} & \cdots & [\theta_t]_{(n-1)n+2}\\ \vdots & \vdots & \ddots & \vdots\\ [\theta_t]_n & [\theta_t]_{2n} & \cdots & [\theta_t]_{n^2} \end{pmatrix}. \tag{4.2.2}$$

This is the minimal-norm solution to (3.0.5) with $G_{P_t} = P_t$, where $\otimes$ represents the Kronecker product of two matrices, defined as

$$A \otimes B = \begin{bmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B\\ a_{21}B & a_{22}B & \cdots & a_{2n}B\\ \vdots & \vdots & \ddots & \vdots\\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{bmatrix}.$$

Note that the matrix in (4.2.1) is of size $n^2 \times n^2$ and therefore has $n^4$ entries. If $n$ is large, forming or pseudo-inverting this matrix becomes extremely expensive.
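For small $n$, (4.2.1) and the column-wise reshaping (4.2.2) can be evaluated directly. The sketch below does this with NumPy for a single hypothetical least-squares problem ($N = 1$, $n = 3$); the data are illustrative, and the explicit singular-value cutoff passed to `pinv` is a choice made for this example, not something prescribed by the paper.

```python
import numpy as np

# Hypothetical single problem f(x) = 0.5 * ||A x - y||^2 with injective A.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 2.0]])
y = np.array([1.0, -2.0, 0.5, 3.0])
x0 = np.zeros(3)
n = x0.size

x_star = np.linalg.lstsq(A, y, rcond=None)[0]
g0 = A.T @ (A @ x0 - y)      # gradient of f at x0

# (4.2.1) with N = 1: an n^2-by-n^2 matrix (9-by-9 here) and an n^2 vector.
K = np.kron(np.outer(g0, g0), A.T @ A)
rhs = np.kron(g0, g0)

# K has rank n, not n^2, so the pseudo-inverse is essential. The cutoff
# discards singular values that are zero up to rounding error.
theta = np.linalg.pinv(K, rcond=1e-10) @ rhs

# (4.2.2): fill P column by column, i.e. column-major (Fortran) order.
P = theta.reshape((n, n), order="F")

# One step of preconditioned gradient descent with the full matrix P.
x1 = x0 - P @ g0
```

In this single-problem example the minimal-norm solution is the rank-one matrix $(x_0 - x^*) \nabla f(x_0)^T / \|\nabla f(x_0)\|_2^2$, so `x1` coincides with `x_star` up to rounding. Already for $n = 3$ the system has 81 entries, which illustrates the $n^4$ growth noted above.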

Proof.

For a full matrix preconditioner $P \in \mathbb{R}^{n \times n}$, we require

$$x_k^t - P \nabla f_k(x_k^t) = v_k - B_k^t \theta, \tag{4.2.3}$$

where, in this instance, $\theta \in \mathbb{R}^{n^2}$. From (4.2.3) we have that

$$\frac{\partial}{\partial \theta_i}\left\{x_k^t - G_\theta \nabla f_k(x_k^t)\right\} = (B_k)_i. \tag{4.2.4}$$

Note that

\begin{align}
\frac{\partial}{\partial P_{ij}} [x_k^t - P \nabla f_k(x_k^t)]_q &= -\frac{\partial}{\partial P_{ij}} [P \nabla f_k(x_k^t)]_q \tag{4.2.5} \\
&= -\frac{\partial}{\partial P_{ij}} \sum_{r=1}^{n} P_{qr} [\nabla f_k(x_k^t)]_r \tag{4.2.6} \\
&= \begin{cases} 0, & \text{if } q \neq i, \\ -[\nabla f_k(x_k^t)]_j, & \text{otherwise}. \end{cases} \tag{4.2.7}
\end{align}

Therefore,

\[
\frac{\partial}{\partial P_{ij}} \left\{ x_k^t - P \nabla f_k(x_k^t) \right\} = -[\nabla f_k(x_k^t)]_j \delta_i, \tag{4.2.8}
\]

where $\delta_i \in \mathbb{R}^n$ is such that

\[
[\delta_i]_r = \begin{cases} 0, & \text{if } i \neq r, \\ 1, & \text{otherwise}. \end{cases} \tag{4.2.9}
\]

Therefore, $B_k^t$ is the matrix with columns $[\nabla f_k(x_k^t)]_j \delta_i$ for $i, j \in \{1, \cdots, n\}$. Hence,

\[
B_k^t = \begin{bmatrix} [\nabla f_k(x_k^t)]_1 I_n & \cdots & [\nabla f_k(x_k^t)]_n I_n \end{bmatrix} = (\nabla f_k(x_k^t) \otimes I_n)^T, \tag{4.2.10}
\]

which corresponds to $\theta$ defined as

\[
\theta = \begin{pmatrix} P_{11} \\ \vdots \\ P_{n1} \\ P_{12} \\ \vdots \\ P_{n2} \\ \vdots \\ P_{nn} \end{pmatrix}. \tag{4.2.11}
\]
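As a quick numerical sanity check of (4.2.10) and (4.2.11), the following NumPy sketch builds $B_k^t$ from a random vector standing in for $\nabla f_k(x_k^t)$ and confirms that $B_k^t \theta = P \nabla f_k(x_k^t)$ when $\theta$ stacks the columns of $P$. The variable names and the random test data are illustrative assumptions and are not part of the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
P = rng.standard_normal((n, n))   # stand-in for the preconditioner P
g = rng.standard_normal(n)        # stand-in for the gradient of f_k at x_k^t

# B = (g kron I_n)^T = [g_1 I_n, ..., g_n I_n], shape (n, n^2), as in (4.2.10)
B = np.kron(g.reshape(-1, 1), np.eye(n)).T

# theta stacks the columns of P (column-major order), as in (4.2.11)
theta = P.flatten(order="F")

# With this ordering, B @ theta reproduces the matrix-vector product P @ g
print(np.allclose(B @ theta, P @ g))
```

The column-major ordering matters here: flattening $P$ row by row would pair the blocks of $B_k^t$ with the wrong entries of $P$.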

We can also write

\[
v_k^t = x_k^t. \tag{4.2.12}
\]

Note that then

\begin{align}
A_k B_k^t &= \begin{bmatrix} [\nabla f_k(x_k^t)]_1 A_k & \cdots & [\nabla f_k(x_k^t)]_n A_k \end{bmatrix}, \tag{4.2.13} \\
(A_k B_k^t)^T (A_k B_k^t) &= \begin{bmatrix} [\nabla f_k(x_k^t)]_1 A_k^T \\ \vdots \\ [\nabla f_k(x_k^t)]_n A_k^T \end{bmatrix} \begin{bmatrix} [\nabla f_k(x_k^t)]_1 A_k & \cdots & [\nabla f_k(x_k^t)]_n A_k \end{bmatrix} \tag{4.2.14} \\
&= \begin{bmatrix} [\nabla f_k(x_k^t)]_1^2 A_k^T A_k & \cdots & [\nabla f_k(x_k^t)]_1 [\nabla f_k(x_k^t)]_n A_k^T A_k \\ \vdots & \ddots & \vdots \\ [\nabla f_k(x_k^t)]_1 [\nabla f_k(x_k^t)]_n A_k^T A_k & \cdots & [\nabla f_k(x_k^t)]_n^2 A_k^T A_k \end{bmatrix} \tag{4.2.15} \\
&= (\nabla f_k(x_k^t) \nabla f_k(x_k^t)^T) \otimes (A_k^T A_k). \tag{4.2.16}
\end{align}

Secondly,

\begin{align}
(B_k^t)^T \nabla f_k(v_k^t) &= \begin{bmatrix} [\nabla f_k(x_k^t)]_1 I_n \\ \vdots \\ [\nabla f_k(x_k^t)]_n I_n \end{bmatrix} \nabla f_k(x_k^t) \tag{4.2.17} \\
&= \begin{bmatrix} [\nabla f_k(x_k^t)]_1 \nabla f_k(x_k^t) \\ \vdots \\ [\nabla f_k(x_k^t)]_n \nabla f_k(x_k^t) \end{bmatrix} \tag{4.2.18} \\
&= \nabla f_k(x_k^t) \otimes \nabla f_k(x_k^t). \tag{4.2.19}
\end{align}

Therefore, $\theta$, the vectorised form of $P$, can be given by

\[
\theta = \left( \frac{1}{N} \sum_{k=1}^{N} (\nabla f_k(x_k^t) \nabla f_k(x_k^t)^T) \otimes (A_k^T A_k) \right)^{\dagger} \left( \frac{1}{N} \sum_{k=1}^{N} \nabla f_k(x_k^t) \otimes \nabla f_k(x_k^t) \right). \tag{4.2.20}
\]
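The closed form (4.2.20) can be evaluated directly for small problems. The sketch below is a minimal illustration under the assumption that each $f_k$ has the least-squares form $f_k(x) = \frac{1}{2}\|A_k x - y_k\|_2^2$, so that $\nabla f_k(x) = A_k^T(A_k x - y_k)$; the problem sizes, random data, and helper names (`f`, `grad`, `mean_loss`) are hypothetical choices made for this example only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 3, 5, 2   # kept small: the system matrix below is n^2 x n^2

# Assumed least-squares data: f_k(x) = 0.5 * ||A_k x - y_k||^2
A = [rng.standard_normal((m, n)) for _ in range(N)]
y = [rng.standard_normal(m) for _ in range(N)]
x = [rng.standard_normal(n) for _ in range(N)]   # current iterates x_k^t

def f(k, z):
    r = A[k] @ z - y[k]
    return 0.5 * r @ r

def grad(k, z):
    return A[k].T @ (A[k] @ z - y[k])

g = [grad(k, x[k]) for k in range(N)]

# Left factor of (4.2.20): average of (g g^T) kron (A^T A)
M = sum(np.kron(np.outer(g[k], g[k]), A[k].T @ A[k]) for k in range(N)) / N
# Right factor of (4.2.20): average of g kron g
b = sum(np.kron(g[k], g[k]) for k in range(N)) / N

# Pseudoinverse solution, reshaped column-major to match (4.2.11)
theta = np.linalg.pinv(M) @ b
P = theta.reshape((n, n), order="F")

def mean_loss(Q):
    # Average objective value after one preconditioned step with matrix Q
    return np.mean([f(k, x[k] - Q @ g[k]) for k in range(N)])

print(mean_loss(np.zeros((n, n))), mean_loss(P))
```

Because the system matrix has $n^4$ entries, forming it explicitly as above is only practical for very small $n$.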

While diagonal preconditioning obtains instant convergence for one function, full matrix preconditioning can, under certain conditions, obtain instant convergence for all functions in the dataset if $N \leq n$.

Corollary 4.2.

Suppose that $\{\nabla f_1(x_1^0), \cdots, \nabla f_N(x_N^0)\}$ is a linearly independent set. Then, if $N \leq n$, the full matrix preconditioner $P$ causes instant convergence for all datapoints. In particular,

\[
x_k^1 := x_k^0 - P \nabla f_k(x_k^0) = x_k^*, \tag{4.2.21}
\]

for all $k \in \{1, \cdots, N\}$.

Proof.

It is sufficient to show that there exists a matrix $P$ such that (4.2.21) is satisfied for all $k$. We require

\[
\begin{cases}
x_1^* &= x_1^0 - P \nabla f_1(x_1^0), \\
&\vdots \\
x_N^* &= x_N^0 - P \nabla f_N(x_N^0).
\end{cases} \tag{4.2.22}
\]

Each of these equations gives $n$ linear equations in $n^2$ unknowns. There are $N$ such equations, and so we have $nN$ linear equations in $n^2$ unknowns. Rewritten, these read

\[
P \begin{bmatrix} \nabla f_1(x_1^0) \,|\, \cdots \,|\, \nabla f_N(x_N^0) \end{bmatrix} = \begin{bmatrix} x_1^0 - x_1^* \,|\, \cdots \,|\, x_N^0 - x_N^* \end{bmatrix}. \tag{4.2.23}
\]

For such a $P$ to exist we require

  • The columns $\nabla f_k(x_k^0)$ to be linearly independent,

  • $nN \leq n^2$, which is equivalent to $N \leq n$.

Note that in the case $N = n$, the matrix of gradients in (4.2.23) is square and invertible, so we have a unique choice of $P$:

\[
P = \begin{bmatrix} x_1^0 - x_1^* \,|\, \cdots \,|\, x_N^0 - x_N^* \end{bmatrix} \begin{bmatrix} \nabla f_1(x_1^0) \,|\, \cdots \,|\, \nabla f_N(x_N^0) \end{bmatrix}^{-1}. \tag{4.2.24}
\]
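The case $N = n$ can be illustrated numerically by solving the linear system (4.2.23) for $P$ and checking that one preconditioned step lands on each minimiser. This is a sketch under the assumption that $f_k(x) = \frac{1}{2}\|A_k x - y_k\|_2^2$ with randomly generated data; the names `G`, `D`, and `x_star` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = N = 4
m = 6

# Assumed least-squares data: f_k(x) = 0.5 * ||A_k x - y_k||^2
A = [rng.standard_normal((m, n)) for _ in range(N)]
y = [rng.standard_normal(m) for _ in range(N)]
x0 = [rng.standard_normal(n) for _ in range(N)]

# Minimisers x_k^* of the assumed least-squares objectives
x_star = [np.linalg.lstsq(A[k], y[k], rcond=None)[0] for k in range(N)]

# Columns of G are the initial gradients; columns of D are x_k^0 - x_k^*
G = np.column_stack([A[k].T @ (A[k] @ x0[k] - y[k]) for k in range(N)])
D = np.column_stack([x0[k] - x_star[k] for k in range(N)])

# Solve P G = D, written as G^T P^T = D^T to avoid forming an explicit inverse
P = np.linalg.solve(G.T, D.T).T

# One preconditioned step from each starting point
x1 = [x0[k] - P @ G[:, k] for k in range(N)]
print(max(np.linalg.norm(x1[k] - x_star[k]) for k in range(N)))
```

The resulting $P$ is in general neither symmetric nor positive-definite, which is consistent with the parametrisations considered here.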

4.3 Scalar step-size

Consider now the case where we learn scalars $\alpha_t$ in (P1) such that $G_{\alpha_t} = \alpha_t I$.

Proposition 4.6.

If each $f_k$ can be written as a least-squares function (4.0.1), then $\alpha_t$ can be given as

\[
\alpha_t = \begin{cases}
0, & \text{if } A_k \nabla f_k(x_k^t) = \underline{0} \text{ for all } k, \\
\dfrac{\frac{1}{N} \sum_{k=1}^{N} \|\nabla f_k(x_k^t)\|_2^2}{\frac{1}{N} \sum_{k=1}^{N} \|A_k \nabla f_k(x_k^t)\|_2^2}, & \text{otherwise}.
\end{cases} \tag{4.3.1}
\]

Note that in the case $N = 1$, this reduces to exact line search for least-squares functions.

Proof.

In this case, we wish to calculate the optimal greedy scalar step size $\alpha$ for the update

\[
x_k^t - \alpha \nabla f_k(x_k^t). \tag{4.3.2}
\]

Then we take

\begin{align}
B_k^t &= \nabla f_k(x_k^t), \tag{4.3.3} \\
v_k^t &= x_k^t. \tag{4.3.4}
\end{align}

Then (4.0.2) reduces to

\[
\alpha = \begin{cases}
0, & \text{if } A_k \nabla f_k(x_k^t) = 0 \text{ for all } k, \\
\dfrac{\frac{1}{N} \sum_{k=1}^{N} \|\nabla f_k(x_k^t)\|_2^2}{\frac{1}{N} \sum_{k=1}^{N} \|A_k \nabla f_k(x_k^t)\|_2^2}, & \text{otherwise}.
\end{cases} \tag{4.3.5}
\]
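The step size in (4.3.5) is cheap to compute, since it only needs the gradients and one application of each $A_k$. The following sketch assumes $f_k(x) = \frac{1}{2}\|A_k x - y_k\|_2^2$; the function name `greedy_step_size` and the random test data are hypothetical and serve only to illustrate the formula.

```python
import numpy as np

def greedy_step_size(A_list, grads):
    """Evaluate (4.3.5); the common factor 1/N cancels in the ratio."""
    num = sum(g @ g for g in grads)
    den = sum((A @ g) @ (A @ g) for A, g in zip(A_list, grads))
    # If A_k grad_k = 0 for all k, the denominator vanishes and alpha is 0
    return 0.0 if den == 0.0 else num / den

rng = np.random.default_rng(3)
m, n, N = 6, 4, 3
A_list = [rng.standard_normal((m, n)) for _ in range(N)]
y_list = [rng.standard_normal(m) for _ in range(N)]
x_list = [rng.standard_normal(n) for _ in range(N)]

# Gradients of the assumed least-squares objectives at the current iterates
grads = [A.T @ (A @ x - y) for A, y, x in zip(A_list, y_list, x_list)]
alpha = greedy_step_size(A_list, grads)

def mean_loss(a):
    # Average objective after a gradient step of size a on every datapoint
    return np.mean([0.5 * np.sum((A @ (x - a * g) - y) ** 2)
                    for A, y, x, g in zip(A_list, y_list, x_list, grads)])

print(alpha, mean_loss(0.0), mean_loss(alpha))
```

With a single datapoint the same function returns $\|\nabla f\|_2^2 / \|A \nabla f\|_2^2$, the exact line search step for a least-squares objective.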

5 Approximating optimal parameters

In the general case, where the $f_k$ are not simply least-squares functions, a closed-form solution does not exist for choosing $\alpha_t$, $p_t$, $P_t$ in (P1)-(P3). Instead, we require an optimisation algorithm to approximate these quantities. Given knowledge of

  • $\nabla g_t(\theta)$, and

  • $L_{g_t}$, the Lipschitz constant of $\nabla g_t(\theta)$,

one can use a first-order convex optimisation algorithm, such as gradient descent, FISTA, or stochastic methods (especially for large $N$), to approximate $\theta_t^*$. For example, one can start at an initial guess $\theta_t^0$ at iteration $t$ and update via gradient descent

$$\theta_t^{w+1} = \theta_t^{w} - \frac{1}{L_{g_t}} \nabla g_t(\theta_t^{w}). \tag{5.0.1}$$
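A minimal sketch of the update (5.0.1) is given below: a generic inner loop that runs a fixed number of gradient steps with step size $1/L_{g_t}$. The function name `gradient_descent`, the number of steps, and the toy quadratic objective are illustrative assumptions, not part of the method described here.

```python
def gradient_descent(grad, lipschitz, theta0, num_steps):
    """Run theta^{w+1} = theta^w - (1/L) * grad(theta^w) for num_steps steps."""
    theta = list(theta0)
    for _ in range(num_steps):
        g = grad(theta)
        theta = [t - gi / lipschitz for t, gi in zip(theta, g)]
    return theta


# Toy objective (hypothetical): g(theta) = 0.5 * sum_i d_i * (theta_i - c_i)^2,
# whose gradient is Lipschitz with constant max_i d_i.
d = [4.0, 1.0, 0.25]
c = [1.0, -2.0, 3.0]


def toy_grad(theta):
    return [di * (ti - ci) for di, ti, ci in zip(d, theta, c)]


theta_hat = gradient_descent(
    toy_grad, lipschitz=max(d), theta0=[0.0, 0.0, 0.0], num_steps=200
)
```

In this toy problem the coordinate with the smallest curvature converges slowest, which is the behaviour a single scalar step size $1/L$ typically exhibits on ill-conditioned objectives.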

The following result illustrates how these values can be calculated.

Proposition 5.1.

For a general affine preconditioner $G_\theta$, the gradient of $g_t$ with respect to $\theta$ can be calculated as

$$\nabla g_t(\theta) = -\frac{1}{N}\sum_{k=1}^{N} (B_k^t)^T \nabla f_k\big(x_k^t - G_\theta \nabla f_k(x_k^t)\big), \tag{5.0.2}$$

and $g_t$ is $L_{g_t}$-smooth, where

$$L_{g_t} = \frac{1}{N}\sum_{k=1}^{N} L_k \|B_k^t\|^2. \tag{5.0.3}$$
Proof.

As

$$g_t(\theta) = \frac{1}{N}\sum_{k=1}^{N} f_k\big(x_k^t - G_\theta \nabla f_k(x_k^t)\big) = \frac{1}{N}\sum_{k=1}^{N} f_k(v_k^t - B_k^t\theta), \tag{5.0.4}$$

then by the chain rule

$$\nabla g_t(\theta) = -\frac{1}{N}\sum_{k=1}^{N} (B_k^t)^T \nabla f_k(v_k^t - B_k^t\theta), \tag{5.0.5}$$

as required. To calculate the smoothness constant, we have

\begin{align}
\|\nabla g_t(\theta_1) - \nabla g_t(\theta_2)\|_2 &= \left\| \frac{1}{N}\sum_{k=1}^{N} (B_k^t)^T \big(\nabla f_k(v_k^t - B_k^t\theta_2) - \nabla f_k(v_k^t - B_k^t\theta_1)\big) \right\|_2 \tag{5.0.6} \\
&\leq \frac{1}{N}\sum_{k=1}^{N} \left\| (B_k^t)^T \big(\nabla f_k(v_k^t - B_k^t\theta_2) - \nabla f_k(v_k^t - B_k^t\theta_1)\big) \right\|_2 \tag{5.0.7} \\
&\leq \frac{1}{N}\sum_{k=1}^{N} \|B_k^t\| \left\| \nabla f_k(v_k^t - B_k^t\theta_2) - \nabla f_k(v_k^t - B_k^t\theta_1) \right\|_2 \tag{5.0.8} \\
&\leq \frac{1}{N}\sum_{k=1}^{N} L_k \|B_k^t\| \left\| B_k^t(\theta_1 - \theta_2) \right\|_2 \tag{5.0.9} \\
&\leq \frac{1}{N}\sum_{k=1}^{N} L_k \|B_k^t\|^2 \left\| \theta_1 - \theta_2 \right\|_2. \tag{5.0.10}
\end{align}

Here we used the triangle inequality, the definition of the operator norm, and the $L_k$-smoothness of each $f_k$. Therefore $\nabla g_t(\theta)$ is Lipschitz continuous with constant

$$\frac{1}{N}\sum_{k=1}^{N} L_k \|B_k^t\|^2 \tag{5.0.12}$$

as required. ∎

With this result, we can now see how to approximate the optimal diagonal and full matrix preconditioners, and the optimal scalar step size.

Corollary 5.1.

Suppose each $f_k \in \mathcal{F}_{L_k}^{1,1}$.
Diagonal preconditioning
For diagonal preconditioning, $\theta = p \in \mathbb{R}^n$ gives $G_p = \operatorname{diag}(p)$ and $B_k^t = \operatorname{diag}(\nabla f_k(x_k^t))$. Then by (5.0.2)

$$\nabla g_t(p) = -\frac{1}{N}\sum_{k=1}^{N} \nabla f_k\big(x_k^t - \operatorname{diag}(p)\nabla f_k(x_k^t)\big) \odot \nabla f_k(x_k^t), \tag{5.0.13}$$

and the Lipschitz constant of $\nabla_p g$ is given by

$$L_{\nabla_p g} = \frac{1}{N}\sum_{k=1}^{N} L_k \left(\max\{|[\nabla f_k(x_k^t)]_1|, \ldots, |[\nabla f_k(x_k^t)]_n|\}\right)^2. \tag{5.0.14}$$

Full matrix preconditioning
In this case we have $\theta \in \mathbb{R}^{n^2}$, where $\theta$ corresponds to an $n \times n$ matrix $P$ such that $G_\theta = P$. The gradient of $g_t(\theta)$ is given by

$$\nabla g_t(\theta) = -\frac{1}{N}\sum_{k=1}^{N} \nabla f_k\big(x_k^t - P\nabla f_k(x_k^t)\big) \otimes \nabla f_k(x_k^t), \tag{5.0.15}$$

and the Lipschitz constant of $\nabla g_t(\theta)$ is given by

$$\frac{1}{N}\sum_{k=1}^{N} L_k \left\|\nabla f_k(x_k^t)\right\|_2^2. \tag{5.0.16}$$

Scalar step size
We now take $\theta_t = \alpha_t \in \mathbb{R}$. The derivative of $g_t$ with respect to $\alpha$ is given by

$$g'(\alpha) = -\frac{1}{N}\sum_{k=1}^{N} \big\langle \nabla f_k\big(x_k^t - \alpha\nabla f_k(x_k^t)\big), \nabla f_k(x_k^t) \big\rangle, \tag{5.0.17}$$

and the Lipschitz constant of $g'(\alpha)$ is given by

$$\frac{1}{N}\sum_{k=1}^{N} L_k \|\nabla f_k(x_k^t)\|_2^2. \tag{5.0.18}$$
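To make the diagonal case concrete, the sketch below evaluates the gradient (5.0.13) and the constant (5.0.14) and then runs the update (5.0.1) on a toy problem. It assumes separable quadratics $f_k(x) = \frac{1}{2}\sum_i d_{k,i}(x_i - c_{k,i})^2$, for which $L_k = \max_i d_{k,i}$; the function names and the data are hypothetical and serve only as an illustration.

```python
def grad_f(d, c, x):
    """Gradient of f(x) = 0.5 * sum_i d_i * (x_i - c_i)^2."""
    return [di * (xi - ci) for di, xi, ci in zip(d, x, c)]


def f(d, c, x):
    return 0.5 * sum(di * (xi - ci) ** 2 for di, xi, ci in zip(d, x, c))


def g_t(p, problems, iterates):
    """Mean objective after one diagonally preconditioned step."""
    total = 0.0
    for (d, c), x in zip(problems, iterates):
        gk = grad_f(d, c, x)
        total += f(d, c, [xi - pi * gi for xi, pi, gi in zip(x, p, gk)])
    return total / len(problems)


def grad_g_t(p, problems, iterates):
    """Gradient (5.0.13): elementwise product of new and current gradients."""
    out = [0.0] * len(p)
    for (d, c), x in zip(problems, iterates):
        gk = grad_f(d, c, x)
        g_new = grad_f(d, c, [xi - pi * gi for xi, pi, gi in zip(x, p, gk)])
        out = [o - a * b / len(problems) for o, a, b in zip(out, g_new, gk)]
    return out


def lipschitz_g_t(problems, iterates):
    """Constant (5.0.14): mean of L_k * (max_i |[grad f_k(x_k)]_i|)^2."""
    total = 0.0
    for (d, c), x in zip(problems, iterates):
        total += max(d) * max(abs(gi) for gi in grad_f(d, c, x)) ** 2
    return total / len(problems)


# Toy data (hypothetical): two problems in R^3, both starting from the origin.
problems = [([4.0, 1.0, 2.0], [1.0, 1.0, 1.0]), ([3.0, 2.0, 1.0], [-1.0, 2.0, 0.5])]
iterates = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

L = lipschitz_g_t(problems, iterates)
p = [0.0, 0.0, 0.0]
for _ in range(300):
    # Update (5.0.1) with step size 1/L.
    p = [pi - gi / L for pi, gi in zip(p, grad_g_t(p, problems, iterates))]
```

A central finite-difference check of `grad_g_t` against `g_t` is a cheap way to validate an implementation of (5.0.13) before using it inside a training loop.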
Proof.

Diagonal preconditioning

In this case, we have $\theta \in \mathbb{R}^n$ and that

  • $B_k^t = \operatorname{diag}(\nabla f_k(x_k^t))$, and

  • $v_k^t = x_k^t$.

Therefore, $\nabla_\theta g_t(\theta)$ is given by

$$\nabla_\theta g_t(\theta) = -\frac{1}{N}\sum_{k=1}^{N} \nabla f_k(x_k^t - B_k^t\theta) \odot \nabla f_k(x_k^t) \tag{5.0.19}$$

and, since the operator norm of a diagonal matrix is the largest absolute value of its diagonal entries, the smoothness constant of $g_t$ is

$$\frac{1}{N}\sum_{k=1}^{N} L_k \|B_k^t\|^2 = \frac{1}{N}\sum_{k=1}^{N} L_k \left(\max\{|[\nabla f_k(x_k^t)]_1|, \ldots, |[\nabla f_k(x_k^t)]_n|\}\right)^2. \tag{5.0.20}$$

Full matrix preconditioning

For the proof of full matrix preconditioning, we require the following lemmas.

Lemma 5.1.

Let $v, w \in \mathbb{R}^n$. Then

$$(v \otimes I_n) w = v \otimes w. \tag{5.0.21}$$
Proof.

Note that

$$v \otimes I_n = \begin{pmatrix} v_1 I_n \\ \vdots \\ v_n I_n \end{pmatrix}, \tag{5.0.22}$$

then

\begin{align}
(v \otimes I_n) w &= \begin{pmatrix} v_1 I_n \\ \vdots \\ v_n I_n \end{pmatrix} w \tag{5.0.23} \\
&= \begin{pmatrix} v_1 w \\ \vdots \\ v_n w \end{pmatrix} \tag{5.0.24} \\
&= v \otimes w. \tag{5.0.25}
\end{align}

∎

Lemma 5.2.

For vectors $v, w \in \mathbb{R}^n$,

$$\|v \otimes w\|_2 = \|v\|_2 \|w\|_2. \tag{5.0.26}$$
Proof.
\begin{align}
\|v \otimes w\|_2 &= \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n} v_i^2 w_j^2} \tag{5.0.27} \\
&= \sqrt{\sum_{i=1}^{n} v_i^2 \sum_{j=1}^{n} w_j^2} \tag{5.0.28} \\
&= \|v\|_2 \|w\|_2. \tag{5.0.29}
\end{align}

∎

Lemma 5.3.
$$\|\nabla f_k(x_k^t) \otimes I_n\| = \|\nabla f_k(x_k^t)\|_2. \tag{5.0.30}$$
Proof.

Note that for vn𝑣superscript𝑛v\in\mathbb{R}^{n}italic_v ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT,

(fk(xkt)In)v2subscriptnormtensor-productsubscript𝑓𝑘superscriptsubscript𝑥𝑘𝑡subscript𝐼𝑛𝑣2\displaystyle\|(\nabla f_{k}(x_{k}^{t})\otimes I_{n})v\|_{2}∥ ( ∇ italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ⊗ italic_I start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) italic_v ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT =(fk(xkt)v)2absentsubscriptnormtensor-productsubscript𝑓𝑘superscriptsubscript𝑥𝑘𝑡𝑣2\displaystyle=\|(\nabla f_{k}(x_{k}^{t})\otimes v)\|_{2}= ∥ ( ∇ italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ⊗ italic_v ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT (5.0.31)
=fk(xkt)2v2,absentsubscriptnormsubscript𝑓𝑘superscriptsubscript𝑥𝑘𝑡2subscriptnorm𝑣2\displaystyle=\|\nabla f_{k}(x_{k}^{t})\|_{2}\|v\|_{2},= ∥ ∇ italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∥ italic_v ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , (5.0.32)

where both Lemma 5.1 and Lemma 5.2 were used. ∎
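
The identity used in this proof can be checked numerically. The following is a minimal sketch in plain Python, assuming the gradient is treated as a column vector of length $n$ so that its Kronecker product with $I_n$ is an $n^2 \times n$ matrix. The helper names (`kron_with_identity`, `matvec`, `norm2`) and the random test vectors are illustrative and not part of the paper.

```python
import math
import random


def kron_with_identity(a):
    """Return kron(a, I_n) as a list of rows, for a column vector a of length n."""
    n = len(a)
    rows = []
    for a_i in a:                      # block i of the product is a_i * I_n
        for r in range(n):
            rows.append([a_i if c == r else 0.0 for c in range(n)])
    return rows


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]


def norm2(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(v_i * v_i for v_i in v))


random.seed(0)
n = 4
a = [random.gauss(0.0, 1.0) for _ in range(n)]   # stands in for the gradient
v = [random.gauss(0.0, 1.0) for _ in range(n)]   # arbitrary test vector

lhs = norm2(matvec(kron_with_identity(a), v))    # ||(a kron I_n) v||_2
rhs = norm2(a) * norm2(v)                        # ||a||_2 ||v||_2
print(abs(lhs - rhs) < 1e-12)
```

Since the equality holds for every $v$, taking the supremum over unit vectors gives the operator-norm identity (5.0.30).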

In the case of full matrix preconditioning, we have that $\theta_t \in \mathbb{R}^{n^2}$,

  • $B_k^t = (\nabla f_k(x_k^t) \otimes I_n)^T$, and

  • $v_k^t = x_k^t$.

Therefore, the gradient is given by

\begin{align}
\nabla g_t(\theta) &= -\frac{1}{N}\sum_{k=1}^{N} (\nabla f_k(x_k^t) \otimes I_n) \nabla f_k(v_k^t - B_k^t\theta) \tag{5.0.33} \\
&= -\frac{1}{N}\sum_{k=1}^{N} \nabla f_k(v_k^t - B_k^t\theta) \otimes \nabla f_k(x_k^t), \tag{5.0.34}
\end{align}

and the smoothness constant of $g_t$ is given by

\begin{equation}
\frac{1}{N}\sum_{k=1}^{N} L_k \|B_k^t\|^2 = \frac{1}{N}\sum_{k=1}^{N} L_k \|\nabla f_k(x_k^t) \otimes I_n\|^2 = \frac{1}{N}\sum_{k=1}^{N} L_k \|\nabla f_k(x_k^t)\|_2^2, \tag{5.0.35}
\end{equation}

where the last equality follows from Lemma 5.3.
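
As an illustration of the full-matrix gradient, the sketch below compares the closed-form expression with central finite differences. It assumes a single training function ($N = 1$), a least-squares objective $f(x) = \tfrac{1}{2}\|Ax - y\|_2^2$, and that $\theta$ is reshaped into an $n \times n$ matrix `P` so that the update reads $x - P\nabla f(x)$. In that layout the Kronecker expression becomes an outer product. The matrix `A`, the data `y`, and all function names are hypothetical choices made for the example.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


# Hypothetical small least-squares problem f(x) = 0.5 * ||A x - y||^2.
A = [[2.0, 0.0, 1.0],
     [0.0, 0.5, 0.0],
     [1.0, 0.0, 3.0]]
y = [1.0, -2.0, 0.5]


def f(x):
    r = [ax - yi for ax, yi in zip(matvec(A, x), y)]
    return 0.5 * sum(ri * ri for ri in r)


def grad_f(x):
    r = [ax - yi for ax, yi in zip(matvec(A, x), y)]
    return matvec(transpose(A), r)


def g(P, x):
    """g(P) = f(x - P grad_f(x)), the training objective for one sample."""
    step = matvec(P, grad_f(x))
    return f([xi - si for xi, si in zip(x, step)])


def grad_g(P, x):
    """Closed form: minus the outer product of the new and old gradients."""
    d = grad_f(x)
    step = matvec(P, d)
    d_new = grad_f([xi - si for xi, si in zip(x, step)])
    return [[-dn * dj for dj in d] for dn in d_new]


def perturbed(P, i, j, eps):
    Q = [row[:] for row in P]
    Q[i][j] += eps
    return Q


x = [0.3, -0.1, 0.7]
P = [[0.1, 0.0, 0.02], [0.0, 0.2, 0.0], [-0.03, 0.0, 0.05]]

eps = 1e-6
G = grad_g(P, x)
max_err = 0.0
for i in range(3):
    for j in range(3):
        fd = (g(perturbed(P, i, j, eps), x) - g(perturbed(P, i, j, -eps), x)) / (2 * eps)
        max_err = max(max_err, abs(fd - G[i][j]))
print(max_err < 1e-6)
```

Because $g$ is quadratic in `P` for this choice of $f$, the central difference has no truncation error and the two gradients agree up to rounding.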
Scalar step size
In this case, we have $\theta = \alpha \in \mathbb{R}$ and that

  • $B_k^t = \nabla f_k(x_k^t)$, and

  • $v_k^t = x_k^t$.

Therefore, $\nabla_\theta g_t(\theta)$ is given by

\begin{align}
\nabla_\theta g_t(\theta) &= -\frac{1}{N}\sum_{k=1}^{N} \nabla f_k(x_k^t)^T \nabla f_k(x_k^t - B_k^t\theta) \tag{5.0.36} \\
&= -\frac{1}{N}\sum_{k=1}^{N} \langle \nabla f_k(x_k^t), \nabla f_k(x_k^t - B_k^t\theta) \rangle. \tag{5.0.37}
\end{align}

First, note that

\begin{equation}
\|\nabla f_k(x_k^t)\| = \|\nabla f_k(x_k^t)\|_2, \tag{5.0.38}
\end{equation}

and therefore, the smoothness constant of $g_t$ is

\begin{equation}
\frac{1}{N}\sum_{k=1}^{N} L_k \|B_k^t\|^2 = \frac{1}{N}\sum_{k=1}^{N} L_k \|\nabla f_k(x_k^t)\|_2^2. \tag{5.0.39}
\end{equation}
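
For intuition, the scalar case can be worked through on a small quadratic. The sketch below assumes $N = 1$ and $f(x) = \tfrac{1}{2}\|Ax - y\|_2^2$ with a diagonal $A$, so that $L = \max_i a_i^2$ is available in closed form. Under that assumption $g(\alpha) = f(x - \alpha \nabla f(x))$ is a one-dimensional quadratic with curvature $\|A \nabla f(x)\|_2^2$, which the smoothness constant $L\|\nabla f(x)\|_2^2$ bounds from above. The numbers and names are illustrative only.

```python
# Hypothetical diagonal forward operator, so that L = max(a_i^2).
a = [1.0, 3.0]
y = [2.0, -1.0]
L = max(ai * ai for ai in a)


def grad_f(x):
    """Gradient of f(x) = 0.5 * sum_i (a_i x_i - y_i)^2."""
    return [ai * (ai * xi - yi) for ai, xi, yi in zip(a, x, y)]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def grad_g(alpha, x):
    """Derivative of g(alpha) = f(x - alpha * grad_f(x)): minus an inner product."""
    d = grad_f(x)
    x_new = [xi - alpha * di for xi, di in zip(x, d)]
    return -dot(d, grad_f(x_new))


x = [0.5, 0.5]
d = grad_f(x)
Ad = [ai * di for ai, di in zip(a, d)]
curvature = dot(Ad, Ad)                 # g''(alpha) = ||A d||^2 for this quadratic
alpha_star = dot(d, d) / curvature      # stationary point of g

print(abs(grad_g(alpha_star, x)) < 1e-9)   # derivative vanishes at alpha_star
print(curvature <= L * dot(d, d))          # curvature is below the smoothness bound
```

Here `alpha_star` coincides with the exact line-search step for a quadratic, which is what minimising $g_t$ over a single scalar recovers when $N = 1$.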

5.1 Convolutional preconditioning

We now introduce convolutional preconditioning. Unlike diagonal preconditioning, which acts on each pixel individually, it allows the preconditioner to use local information around each pixel.

Let $x \in \mathbb{R}^{m_1 \times m_2}$ (so the corresponding dimension is $n = m_1 m_2$). Define a convolution kernel $\kappa \in \mathbb{R}^{h_2 \times h_1}$ and define $r_i = \left\lfloor \frac{h_i - 1}{2} \right\rfloor$, $i \in \{1, 2\}$. Write

\begin{equation}
\kappa = \begin{pmatrix}
\kappa(-r_1, -r_2) & \cdots & \kappa(r_1 + \delta_1, -r_2) \\
\vdots & \ddots & \vdots \\
\kappa(-r_1, r_2 + \delta_2) & \cdots & \kappa(r_1 + \delta_1, r_2 + \delta_2)
\end{pmatrix} \tag{5.1.1}
\end{equation}

where, for $i \in \{1, 2\}$,

\begin{equation}
\delta_i = \begin{cases}
0, & \text{if $h_i$ is odd}, \\
1, & \text{otherwise}.
\end{cases} \tag{5.1.2}
\end{equation}

The convolution $(\kappa \ast x)(n_1, n_2)$ at coordinate $(n_1, n_2)$ is given by

\begin{align}
(\kappa \ast x)(n_1, n_2) &= \sum_{k_1 = -\infty}^{\infty} \sum_{k_2 = -\infty}^{\infty} \kappa(k_1, k_2)\, x(n_1 - k_1, n_2 - k_2) \tag{5.1.3} \\
&= \sum_{k_1 = -\infty}^{\infty} \sum_{k_2 = -\infty}^{\infty} \kappa(k_1, k_2)\, x(n_1 - k_1, n_2 - k_2). \tag{5.1.4}
\end{align}
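
A direct evaluation of this sum may help fix the indexing. The sketch below assumes that both $\kappa$ and $x$ are extended by zero outside their supports, that the kernel is stored as a dictionary keyed by the offsets $(k_1, k_2)$ with $k_i \in \{-r_i, \dots, r_i + \delta_i\}$, and that pixels are indexed from 0 rather than 1. The functions `offsets` and `conv2d` are illustrative names, not taken from the paper.

```python
def offsets(h):
    """Offsets -r, ..., r + delta for a kernel side of length h."""
    r = (h - 1) // 2
    delta = 0 if h % 2 == 1 else 1
    return list(range(-r, r + delta + 1))


def conv2d(kappa, x):
    """Evaluate (kappa * x)(n1, n2) for every pixel, with zeros outside x.

    kappa maps an offset pair (k1, k2) to a weight; x is a list of rows.
    """
    m1, m2 = len(x), len(x[0])
    out = [[0.0] * m2 for _ in range(m1)]
    for n1 in range(m1):
        for n2 in range(m2):
            for (k1, k2), w in kappa.items():
                i, j = n1 - k1, n2 - k2
                if 0 <= i < m1 and 0 <= j < m2:
                    out[n1][n2] += w * x[i][j]
    return out


x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]

# Two hypothetical kernels with h1 = 3 and h2 = 2.
keys = [(k1, k2) for k1 in offsets(3) for k2 in offsets(2)]
kappa_a = {key: float(key[0] + 2 * key[1]) for key in keys}
kappa_b = {key: 1.0 for key in keys}
c = 0.5
combined = {key: kappa_a[key] + c * kappa_b[key] for key in keys}

# Linearity in the kernel: (kappa_a + c kappa_b) * x = kappa_a * x + c (kappa_b * x).
lhs = conv2d(combined, x)
ya, yb = conv2d(kappa_a, x), conv2d(kappa_b, x)
max_gap = max(abs(lhs[i][j] - (ya[i][j] + c * yb[i][j]))
              for i in range(3) for j in range(3))
print(max_gap < 1e-12)
```

The check at the end confirms that the output depends linearly on the kernel weights for a fixed image.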

Notice that the convolution is linear in its parameters, and so the optimisation problem given by

\begin{equation}
\kappa_t = \min_{\kappa \in \mathbb{R}^{h_1 \times h_2}} \frac{1}{N}\sum_{k=1}^{N} f_k(x_k^t - \kappa \ast \nabla f_k(x_k^t)) \tag{5.1.5}
\end{equation}

for fixed $h_1, h_2$ is convex. The following proposition provides the gradient and smoothness constant of $g_t$ using the convolutional parametrisation.

Proposition 5.2.

Firstly, define

\begin{equation}
\tilde{\kappa} = \begin{pmatrix}
\kappa(-r_1, -r_2) \\
\vdots \\
\kappa(r_1 + \delta_1, -r_2) \\
\kappa(-r_1, -r_2 + 1) \\
\vdots \\
\kappa(r_1 + \delta_1, r_2 + \delta_2)
\end{pmatrix}, \quad
\overline{x} = \begin{pmatrix}
x_{1,1} \\
\vdots \\
x_{m_1,1} \\
\vdots \\
x_{m_1,m_2}
\end{pmatrix}. \tag{5.1.6}
\end{equation}

Finally, denote by $x^{(a_1, a_2)}$ the image $x$ translated by $a_1$ pixels down and $a_2$ pixels right; in other words, for an image $x \in \mathbb{R}^{m_1 \times m_2}$,

\begin{equation}
[x^{(a_1, a_2)}]_{i,j} = \begin{cases}
x_{i + a_1, j + a_2}, & \text{if } i + a_1 \in \{1, \cdots, m_1\} \text{ and } j + a_2 \in \{1, \cdots, m_2\}, \\
0, & \text{otherwise}.
\end{cases} \tag{5.1.7}
\end{equation}

Then

\begin{equation}
B_k^t = \begin{bmatrix}
\overline{\nabla f_k(x_k^t)^{(-r_1, -r_2)}} & \cdots & \overline{\nabla f_k(x_k^t)^{(r_1 + \delta_1, -r_2)}} & \cdots & \overline{\nabla f_k(x_k^t)^{(r_1 + \delta_1, r_2 + \delta_2)}}
\end{bmatrix}. \tag{5.1.8}
\end{equation}

Then the gradient of $g_t$ with respect to $\tilde{\kappa}$ is given by

\begin{equation}
\nabla g_t(\tilde{\kappa}) = -\frac{1}{N}\sum_{k=1}^{N}
\begin{pmatrix}
\overline{\nabla f_k(x_k^t)^{(-r_1, -r_2)}}^T \\
\vdots \\
\overline{\nabla f_k(x_k^t)^{(r_1 + \delta_1, r_2 + \delta_2)}}^T
\end{pmatrix}
\overline{\nabla f_k(x_k^t - \kappa \ast \nabla f_k(x_k^t))}. \tag{5.1.9}
\end{equation}

Furthermore, an upper bound for the smoothness constant of $g_t$ is given by

\begin{equation}
\frac{h_1 h_2}{N}\sum_{k=1}^{N} L_k \left\|\overline{\nabla f_k(x_k^t)}\right\|_2^2 = \frac{h_1 h_2}{N}\sum_{k=1}^{N} L_k \left\|\nabla f_k(x_k^t)\right\|_F^2, \tag{5.1.10}
\end{equation}

where $\|\cdot\|_F$ represents the Frobenius norm of a matrix.
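
The structure in this proposition can be sketched for a single gradient image. The code below builds one column per kernel entry from zero-filled shifted copies of a stand-in gradient `d`, then checks two things: that multiplying these columns by the stacked kernel reproduces the convolution sum, and that the squared Frobenius norm of the resulting matrix is at most $h_1 h_2 \|d\|_F^2$. Several choices are assumptions of the sketch rather than statements from the paper: pixels are indexed from 0, images are flattened with the first index running fastest, and the weight $\kappa(k_1, k_2)$ is paired with the copy of `d` shifted by $(-k_1, -k_2)$, which is the sign that matches the convolution sum under the shift definition used here. All function names are hypothetical.

```python
def offsets(h):
    """Offsets -r, ..., r + delta for a kernel side of length h."""
    r = (h - 1) // 2
    delta = 0 if h % 2 == 1 else 1
    return list(range(-r, r + delta + 1))


def shift(x, a1, a2):
    """Entry (i, j) of the result is x[i + a1][j + a2], or 0 if out of range."""
    m1, m2 = len(x), len(x[0])
    out = [[0.0] * m2 for _ in range(m1)]
    for i in range(m1):
        for j in range(m2):
            if 0 <= i + a1 < m1 and 0 <= j + a2 < m2:
                out[i][j] = x[i + a1][j + a2]
    return out


def flatten(x):
    """Stack the columns of x, so the first index runs fastest."""
    return [x[i][j] for j in range(len(x[0])) for i in range(len(x))]


def conv_at(kappa, x, n1, n2):
    """Convolution sum at one pixel, with zeros outside the image."""
    total = 0.0
    for (k1, k2), w in kappa.items():
        i, j = n1 - k1, n2 - k2
        if 0 <= i < len(x) and 0 <= j < len(x[0]):
            total += w * x[i][j]
    return total


d = [[1.0, -2.0, 0.5],
     [0.0, 3.0, -1.0],
     [2.0, 1.0, 4.0]]          # stand-in for a gradient image
h1, h2 = 3, 2

# One column of B per kernel entry, with k1 running fastest as in the stacked kernel.
keys = [(k1, k2) for k2 in offsets(h2) for k1 in offsets(h1)]
cols = [flatten(shift(d, -k1, -k2)) for (k1, k2) in keys]

kappa = {key: 0.1 * (idx + 1) for idx, key in enumerate(keys)}
kappa_vec = [kappa[key] for key in keys]

# B kappa_vec versus the flattened convolution.
B_kappa = [sum(col[p] * w for col, w in zip(cols, kappa_vec)) for p in range(9)]
conv_flat = flatten([[conv_at(kappa, d, n1, n2) for n2 in range(3)] for n1 in range(3)])
max_gap = max(abs(u - v) for u, v in zip(B_kappa, conv_flat))

frob_B_sq = sum(c * c for col in cols for c in col)
frob_d_sq = sum(c * c for c in flatten(d))
print(max_gap < 1e-12)
print(frob_B_sq <= h1 * h2 * frob_d_sq)
```

The bound holds because each column is a shifted copy of `d` with some entries replaced by zero, so its norm cannot exceed that of `d`, and there are $h_1 h_2$ columns.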

Proof.

We have that

\begin{align}
\frac{\partial}{\partial \kappa(i, j)} (\kappa \ast x)(n_1, n_2) &= \frac{\partial}{\partial \kappa(i, j)} \sum_{k_1 = n_1 - m_1}^{n_1 - 1} \sum_{k_2 = n_2 - m_2}^{n_2 - 1} \kappa(k_1, k_2)\, x(n_1 - k_1, n_2 - k_2) \tag{5.1.12} \\
&= x(n_1 - i, n_2 - j). \tag{5.1.13}
\end{align}

Furthermore, an upper bound for the smoothness constant of $g_t$ is given by

\begin{align}
\frac{1}{N}\sum_{k=1}^{N} L_k \|B_k^t\|_F^2 &= \frac{1}{N}\sum_{k=1}^{N} L_k \left\| \begin{bmatrix} \overline{\nabla f_k(x_k^t)^{(-r_1, -r_2)}} & \cdots & \overline{\nabla f_k(x_k^t)^{(r_1 + \delta_1, -r_2)}} & \cdots & \overline{\nabla f_k(x_k^t)^{(r_1 + \delta_1, r_2 + \delta_2)}} \end{bmatrix} \right\|_F^2 \tag{5.1.15} \\
&= \frac{1}{N}\sum_{k=1}^{N} L_k \sum_{k_1 = -r_1}^{r_1} \sum_{k_2 = -r_2}^{r_2} \left\| \overline{\nabla f_k(x_k^t)^{(k_1, k_2)}} \right\|_2^2 \tag{5.1.16} \\
&\leq \frac{1}{N}\sum_{k=1}^{N} L_k \sum_{k_1 = -r_1}^{r_1} \sum_{k_2 = -r_2}^{r_2} \left\| \overline{\nabla f_k(x_k^t)} \right\|_2^2 \tag{5.1.17}
\end{align}
=h1h2Nk=1NLkfk(xkt)¯22absentsubscript1subscript2𝑁superscriptsubscript𝑘1𝑁subscript𝐿𝑘superscriptsubscriptnorm¯subscript𝑓𝑘superscriptsubscript𝑥𝑘𝑡22\displaystyle=\frac{h_{1}h_{2}}{N}\sum_{k=1}^{N}L_{k}\left\|\overline{\nabla f% _{k}(x_{k}^{t})}\right\|_{2}^{2}= divide start_ARG italic_h start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_h start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG italic_N end_ARG ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ∥ over¯ start_ARG ∇ italic_f start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ( italic_x start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) end_ARG ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT (5.1.18)

6 Convergence results

The following results are required before introducing the convergence results of our learned preconditioning.

Lemma 6.1.

Suppose that each $f_k$ is $L_k$-smooth, and define

$$F(x) = \frac{1}{N}\sum_{k=1}^{N} f_k(x^k), \tag{6.0.1}$$

for

$$x = \begin{pmatrix} x^1 \\ \vdots \\ x^N \end{pmatrix} \in \mathbb{R}^{nN}. \tag{6.0.2}$$

Then $F$ is $L$-smooth, with

$$L = \frac{L_{\text{max}}}{N}, \tag{6.0.3}$$

where $L_{\text{max}} = \max\{L_1, \cdots, L_N\}$.

Proof.

We have

$$\nabla F(x) = \frac{1}{N}\begin{pmatrix} \nabla f_1(x^1) \\ \vdots \\ \nabla f_N(x^N) \end{pmatrix}, \tag{6.0.4}$$

and for any $y \in \mathbb{R}^{nN}$,

\begin{align}
\|x - y\|_2 &= \left\|\begin{pmatrix} x^1 \\ \vdots \\ x^N \end{pmatrix} - \begin{pmatrix} y^1 \\ \vdots \\ y^N \end{pmatrix}\right\|_2 \tag{6.0.5}\\
&= \sqrt{\sum_{k=1}^{N}\sum_{q=1}^{n}\left([x^k]_q - [y^k]_q\right)^2} \tag{6.0.6}\\
&= \sqrt{\sum_{k=1}^{N}\|x^k - y^k\|_2^2}. \tag{6.0.7}
\end{align}

Then

\begin{align}
\|\nabla F(x) - \nabla F(y)\|_2 &= \left\|\frac{1}{N}\begin{pmatrix} \nabla f_1(x^1) \\ \vdots \\ \nabla f_N(x^N) \end{pmatrix} - \frac{1}{N}\begin{pmatrix} \nabla f_1(y^1) \\ \vdots \\ \nabla f_N(y^N) \end{pmatrix}\right\|_2 \tag{6.0.8}\\
&= \frac{1}{N}\sqrt{\sum_{k=1}^{N}\left\|\nabla f_k(x^k) - \nabla f_k(y^k)\right\|_2^2} \tag{6.0.9}\\
&\leq \frac{1}{N}\sqrt{\sum_{k=1}^{N}L_k^2\left\|x^k - y^k\right\|_2^2} \tag{6.0.10}\\
&\leq \frac{\max\{L_1, \cdots, L_N\}}{N}\sqrt{\sum_{k=1}^{N}\left\|x^k - y^k\right\|_2^2} \tag{6.0.11}\\
&= \frac{\max\{L_1, \cdots, L_N\}}{N}\|x - y\|_2. \tag{6.0.12}
\end{align}

Therefore, $F$ is $L$-smooth, where

$$L = \frac{\max\{L_1, \cdots, L_N\}}{N}. \tag{6.0.13}$$

∎
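As a quick numerical sanity check of Lemma 6.1, the following sketch assumes quadratic functions $f_k(x) = \frac{1}{2}x^\top H_k x$ with symmetric positive semi-definite $H_k$, so that $L_k = \lambda_{\max}(H_k)$. The quadratic test functions, the helper name `stacked_smoothness_constant`, and the problem sizes are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def stacked_smoothness_constant(hessians):
    """Return (L_F, L_max / N) for F(x) = (1/N) * sum_k f_k(x^k), f_k quadratic."""
    N = len(hessians)
    n = hessians[0].shape[0]
    # The Hessian of F is block diagonal with blocks H_k / N,
    # because F is separable in the stacked variable x = (x^1, ..., x^N).
    H_F = np.zeros((N * n, N * n))
    for k, H in enumerate(hessians):
        H_F[k * n:(k + 1) * n, k * n:(k + 1) * n] = H / N
    L_F = np.linalg.eigvalsh(H_F).max()
    L_max = max(np.linalg.eigvalsh(H).max() for H in hessians)
    return L_F, L_max / N

rng = np.random.default_rng(0)
hessians = []
for _ in range(4):
    A = rng.standard_normal((6, 3))
    hessians.append(A.T @ A)  # symmetric PSD, so each f_k is convex and L_k-smooth

L_F, L_bound = stacked_smoothness_constant(hessians)
print(L_F, L_bound)
```

For this separable quadratic example the smoothness constant of $F$ coincides with $L_{\text{max}}/N$, since the eigenvalues of a block-diagonal matrix are those of its blocks.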

Lemma 6.2.

Suppose each $f_k$ is $\mu_k$-strongly convex. Then $F$ is $\mu$-strongly convex, with

$$\mu = \frac{\mu_{\text{min}}}{N}, \tag{6.0.14}$$

where $\mu_{\text{min}} = \min\{\mu_1, \cdots, \mu_N\}$.

Proof.

It is sufficient to show that $F(x) - \frac{\min\{\mu_1, \cdots, \mu_N\}}{N}\|x\|_2^2$ is convex, and that this is the largest such constant. We have

\begin{align}
&F(x) - \frac{\min\{\mu_1, \cdots, \mu_N\}}{N}\|x\|_2^2 \tag{6.0.15}\\
&= \frac{1}{N}\sum_{k=1}^{N}\left(f_k(x^k) - \min\{\mu_1, \cdots, \mu_N\}\|x^k\|_2^2\right). \tag{6.0.16}
\end{align}

Notice that

$$f_k(x^k) - \min\{\mu_1, \cdots, \mu_N\}\|x^k\|_2^2 \tag{6.0.17}$$

is convex for all $k$, as each $f_k$ is $\mu_k$-strongly convex and $\mu_k \geq \min\{\mu_1, \cdots, \mu_N\}$. This property would no longer hold if we chose a constant $m$ such that $\mu_k < m$ for some $k$. Therefore (6.0.15) is convex and so $F$ is $\frac{\min\{\mu_1, \cdots, \mu_N\}}{N}$-strongly convex. ∎
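The tightness claim in Lemma 6.2 can be illustrated numerically. The sketch below again assumes quadratics $f_k(x) = \frac{1}{2}x^\top H_k x$ and follows the convention used in the proof above, where a function $g$ is $m$-strongly convex if $g(x) - m\|x\|_2^2$ is convex; under that convention $\mu_k = \lambda_{\min}(H_k)/2$. The helper name `min_eig_after_shift` and the test data are hypothetical.

```python
import numpy as np

def min_eig_after_shift(hessians, m):
    """Smallest Hessian eigenvalue of F(x) - m * ||x||_2^2 for quadratic f_k."""
    N = len(hessians)
    # The Hessian of F is block-diag(H_k / N); its eigenvalues are the
    # union of the eigenvalues of the blocks.
    eigs = np.concatenate([np.linalg.eigvalsh(H) / N for H in hessians])
    # The Hessian of m * ||x||_2^2 is 2 * m * I.
    return eigs.min() - 2.0 * m

rng = np.random.default_rng(0)
N = 4
hessians = []
for _ in range(N):
    A = rng.standard_normal((6, 3))
    hessians.append(A.T @ A + 0.1 * np.eye(3))  # positive definite

mu_k = [np.linalg.eigvalsh(H).min() / 2.0 for H in hessians]
mu_F = min(mu_k) / N

tight = min_eig_after_shift(hessians, mu_F)             # approximately zero
too_large = min_eig_after_shift(hessians, 1.01 * mu_F)  # negative: convexity is lost
print(tight, too_large)
```

A non-negative smallest eigenvalue means the shifted function is still convex; increasing the constant slightly beyond $\mu_{\text{min}}/N$ makes it negative.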

Firstly, define

$$\tau = \begin{cases} \dfrac{2}{\mu_{\text{min}} + L_{\text{max}}}, & \text{if $F$ is $L$-smooth and $\mu$-strongly convex,} \\[1ex] \dfrac{1}{L_{\text{max}}}, & \text{if $F$ is $L$-smooth.} \end{cases} \tag{6.0.18}$$

The following result shows that our parametrisations generalise gradient descent with a constant step size given by $\tau$. This property will be used to prove the convergence rate of our learned preconditioners on the training set.

Lemma 6.3.

For all parametrisations $G_\theta$ (P1-P4) in Table 1, there exists $\tilde{\theta}$ such that

$$G_{\tilde{\theta}} = \tau I. \tag{6.0.19}$$
Proof.
  1. For scalar step sizes, $G_\theta = \theta I$, take $\tilde{\theta} = \tau$.

  2. For diagonal preconditioning, $G_\theta = \operatorname{diag}(\theta)$, take $\tilde{\theta} = \tau\mathbf{1}$.

  3. For full matrix preconditioning, $G_\theta = \theta$, take

    $$(\tilde{\theta})_{ij} = \begin{cases} \tau, & \text{if } i = j, \\ 0, & \text{otherwise.} \end{cases} \tag{6.0.20}$$

  4. For convolutional preconditioning, $G_\theta x = \theta \ast x$, take

    $$\kappa(i, j) = \begin{cases} \tau, & \text{if } i = j = 0, \\ 0, & \text{otherwise.} \end{cases} \tag{6.0.21}$$

∎
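The four choices of $\tilde{\theta}$ in the proof above can be checked numerically. The following is a minimal sketch in NumPy, assuming a zero-padded 2D convolution for the convolutional parametrisation; the boundary handling, the helper `conv2d_same`, and the value of `tau` are illustrative assumptions, as the paper's own implementation is not shown here. With a kernel supported only at its centre the boundary treatment has no effect on the result.

```python
import numpy as np

def conv2d_same(kernel, image):
    """Zero-padded 2D convolution; output has the same shape as `image`.

    The kernel has odd side lengths and its centre entry plays the role of
    kappa(0, 0).
    """
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image, ((ph, ph), (pw, pw)))
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    out = np.zeros(image.shape, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(flipped * padded[i:i + kh, j:j + kw])
    return out

rng = np.random.default_rng(0)
tau = 0.3                            # hypothetical step size
grad = rng.standard_normal((5, 4))   # stand-in for a gradient image
g = grad.ravel()
n = g.size

out_scalar = tau * g                              # P1: theta = tau
out_diag = np.diag(tau * np.ones(n)) @ g          # P2: theta = tau * 1
out_full = (tau * np.eye(n)) @ g                  # P3: theta = tau * I
kernel = np.zeros((3, 3))
kernel[1, 1] = tau                                # P4: kappa(0, 0) = tau, zero elsewhere
out_conv = conv2d_same(kernel, grad).ravel()
```

All four outputs equal `tau * g`, so each parametrisation reproduces a gradient descent step with step size $\tau$.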

Theorem 6.1.

Convergence of the algorithm on the training set.
Assume that $f_k \in \mathcal{F}_{L_k}^{1,1}$ is bounded below for all $k \in \{1, \cdots, N\}$. Then, for any learned optimisation algorithm such that

$$g_t(\theta_t) \leq g_t(\tilde{\theta}), \tag{6.0.22}$$

we have that

$$\nabla f_k(x_k^t) \to 0 \text{ as } t \to \infty. \tag{6.0.23}$$

Furthermore, if we denote

$$x_0 = \begin{pmatrix} x_1^0 \\ \vdots \\ x_N^0 \end{pmatrix}, \quad x^* = \begin{pmatrix} x_1^* \\ \vdots \\ x_N^* \end{pmatrix}, \tag{6.0.24}$$

then

$$F(x_t) - F(x^*) \leq \frac{\max\{L_1, \cdots, L_N\}}{2tN}\|x_0 - x^*\|_2^2. \tag{6.0.25}$$

If, in addition, each $f_k$ is $\mu_k$-strongly convex, then we have linear convergence given by

$$F(x_t) - F(x^*) \leq \left(1 - \frac{\min\{\mu_1, \cdots, \mu_N\}}{\max\{L_1, \cdots, L_N\}}\right)^t (F(x_0) - F(x^*)). \tag{6.0.26}$$
Proof.

We have

\begin{align}
F(x_{t+1}) = g_t(\theta_t) &\leq g_t(\tilde{\theta}) \tag{6.0.27}\\
&= \frac{1}{N}\sum_{k=1}^{N} f_k\left(x_k^t - \tau\nabla f_k(x_k^t)\right) \tag{6.0.28}\\
&= F\left(x_t - \tau\begin{pmatrix} \nabla f_1(x_1^t) \\ \vdots \\ \nabla f_N(x_N^t) \end{pmatrix}\right) \tag{6.0.29}\\
&= F\left(x_t - \tau N\nabla F(x_t)\right) \tag{6.0.30}\\
&= F\left(x_t - \tau_F\nabla F(x_t)\right), \tag{6.0.31}
\end{align}

for

$$\tau_F = \begin{cases} \dfrac{2}{\mu + L}, & \text{if $F$ is $L$-smooth and $\mu$-strongly convex,} \\[1ex] \dfrac{1}{L}, & \text{if $F$ is $L$-smooth.} \end{cases} \tag{6.0.32}$$

By Lemmas 6.1 and 6.2, $F$ is $L$-smooth as each $f_k$ is $L_k$-smooth, and $\mu$-strongly convex if each $f_k$ is $\mu_k$-strongly convex, where

\begin{align}
L &= \frac{\max\{L_1, \cdots, L_N\}}{N}, \tag{6.0.33}\\
\mu &= \frac{\min\{\mu_1, \cdots, \mu_N\}}{N}, \tag{6.0.34}
\end{align}

and therefore, using standard convergence rate results of gradient descent [17], we have

$$F(x_t) - F(x^*) \leq \frac{L}{2t}\|x_0 - x^*\|_2^2, \tag{6.0.35}$$

as $F$ is $L$-smooth, and if $F$ is also $\mu$-strongly convex we have

$$F(x_t) - F(x^*) \leq \left(1 - \frac{\mu}{L}\right)^t (F(x_0) - F(x^*)). \tag{6.0.36}$$

In both cases, we have that $\nabla F(x_t) \to 0$, meaning that $\|\nabla F(x_t)\|_2^2 = \frac{1}{N^2}\sum_{k=1}^{N}\|\nabla f_k(x_k^t)\|_2^2 \to 0$ as $t \to \infty$, which implies that $\nabla f_k(x_k^t) \to 0$ as $t \to \infty$ for all $k \in \{1, \cdots, N\}$. ∎
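To illustrate the condition (6.0.22), the sketch below runs a greedily learned scalar step size on a small set of least-squares problems and checks, at every iteration, that the learned step does at least as well as the gradient descent step with $\tau = 1/L_{\text{max}}$. Several things here are assumptions made for illustration rather than the paper's setup: the functions $f_k(x) = \frac{1}{2}\|A_k x - y_k\|_2^2$, the random data, the problem sizes, and the form $g_t(\theta) = \frac{1}{N}\sum_k f_k(x_k^t - \theta\nabla f_k(x_k^t))$, which is inferred from (6.0.27) and (6.0.28).

```python
import numpy as np

rng = np.random.default_rng(1)
N, m, n, T = 5, 12, 6, 30
A = [rng.standard_normal((m, n)) for _ in range(N)]
y = [rng.standard_normal(m) for _ in range(N)]

def f(k, x):
    r = A[k] @ x - y[k]
    return 0.5 * (r @ r)

def grad(k, x):
    return A[k].T @ (A[k] @ x - y[k])

def g_t(theta, xs):
    """Mean training loss after one step with scalar step size theta."""
    return np.mean([f(k, xs[k] - theta * grad(k, xs[k])) for k in range(N)])

L_max = max(np.linalg.eigvalsh(Ak.T @ Ak).max() for Ak in A)
tau = 1.0 / L_max  # smooth (not strongly convex) case of (6.0.18)

xs = [np.zeros(n) for _ in range(N)]
F_hist = [g_t(0.0, xs)]        # theta = 0 leaves the iterates unchanged, giving F(x_0)
no_worse_than_gd = []
for t in range(T):
    grads = [grad(k, xs[k]) for k in range(N)]
    num = sum(gk @ gk for gk in grads)
    den = sum(np.sum((A[k] @ grads[k]) ** 2) for k in range(N))
    if den == 0.0:             # all gradients vanish: already at a minimiser
        break
    theta_t = num / den        # exact minimiser of g_t over scalar theta
    no_worse_than_gd.append(g_t(theta_t, xs) <= g_t(tau, xs) + 1e-12)
    xs = [xs[k] - theta_t * grads[k] for k in range(N)]
    F_hist.append(g_t(0.0, xs))
```

Because `theta_t` minimises `g_t` exactly, condition (6.0.22) holds by construction in this toy setting, and the recorded values of $F$ are non-increasing.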

Note that this result gives a worst-case convergence bound over the training functions, since it depends on the largest smoothness constant and the smallest strong convexity constant. Nevertheless, convergence is still provable. This is not an issue for a function class with constant smoothness and strong convexity parameters. Furthermore, although the bound is weak, it is likely that one can far exceed this rate when learning is applied to a specific class of functions.

We have proved convergence for the mean of our training functions. The following proposition establishes the same convergence rate, up to a constant factor, for each function in our training set.

Proposition 6.1.

Suppose we have a convergence rate for $F$ of

$$F(x_t) - F^\star \le C(t)\,(F(x_0) - F^\star). \tag{6.0.37}$$

Then the convergence rate for each $f_i$, $i \in \{1,\cdots,N\}$, is given by

$$f_i(x_i^t) - f_i^\star \le M_i\,C(t)\,(f_i(x_i^0) - f_i^\star), \tag{6.0.38}$$

where

$$M_i = 1 + \frac{\sum_{k=1,k\neq i}^{N}(f_k(x_k^0) - f_k^\star)}{f_i(x_i^0) - f_i^\star} \tag{6.0.39}$$

is constant in $t$.

Proof.

Note that we may write

$$\begin{aligned} F(x_t) &= \frac{1}{N} f_i(x_i^t) + \frac{1}{N}\sum_{k=1,k\neq i}^{N} f_k(x_k^t), \\ \implies F^\star &= \frac{1}{N} f_i^\star + \frac{1}{N}\sum_{k=1,k\neq i}^{N} f_k^\star. \end{aligned}$$

Therefore, using our convergence rate on $F$ gives

$$\frac{1}{N} f_i(x_i^t) - \frac{1}{N} f_i^\star + \frac{1}{N}\sum_{k=1,k\neq i}^{N}(f_k(x_k^t) - f_k^\star) \le C(t)\bigg(\frac{1}{N}(f_i(x_i^0) - f_i^\star) + \frac{1}{N}\sum_{k=1,k\neq i}^{N}(f_k(x_k^0) - f_k^\star)\bigg), \tag{6.0.40}$$

which implies

$$f_i(x_i^t) - f_i^\star + \sum_{k=1,k\neq i}^{N}(f_k(x_k^t) - f_k^\star) \le C(t)\bigg((f_i(x_i^0) - f_i^\star) + \sum_{k=1,k\neq i}^{N}(f_k(x_k^0) - f_k^\star)\bigg). \tag{6.0.41}$$

Since $f_k(x_k^t) - f_k^\star \ge 0$ for every $k$, dropping the sum on the left-hand side gives

$$f_i(x_i^t) - f_i^\star \le C(t)\bigg((f_i(x_i^0) - f_i^\star) + \sum_{k=1,k\neq i}^{N}(f_k(x_k^0) - f_k^\star)\bigg). \tag{6.0.42}$$

Note that

$$D_i := \sum_{k=1,k\neq i}^{N}(f_k(x_k^0) - f_k^\star) \tag{6.0.43}$$

is constant in $t$. Then

$$f_i(x_i^t) - f_i^\star \le C(t)\,(f_i(x_i^0) - f_i^\star + D_i). \tag{6.0.44}$$

Let $\tilde{D}_i$ be given by

$$\tilde{D}_i = 1 + \frac{D_i}{f_i(x_i^0) - f_i^\star}. \tag{6.0.45}$$

Then

$$f_i(x_i^t) - f_i^\star \le \tilde{D}_i\,C(t)\,(f_i(x_i^0) - f_i^\star). \tag{6.0.46}$$

Since $\tilde{D}_i = M_i$ by (6.0.39), this is the claimed bound. ∎
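The argument above can be checked numerically on synthetic data. The sketch below is only an illustration under stated assumptions: the per-function optimality gaps are made to decay geometrically at hypothetical rates, $C(t)$ is taken to be the tightest constant satisfying (6.0.37) for the mean, and the function names are invented for this example.

```python
import random

def individual_constants(initial_gaps):
    """M_i = 1 + (sum of the other initial gaps) / (own initial gap), as in (6.0.39)."""
    total = sum(initial_gaps)
    return [1.0 + (total - g) / g for g in initial_gaps]

def check_proposition(initial_gaps, rates, num_iters):
    """Check gap_i(t) <= M_i * C(t) * gap_i(0) for synthetic geometric decays."""
    n = len(initial_gaps)
    m = individual_constants(initial_gaps)
    mean_gap_0 = sum(initial_gaps) / n
    for t in range(num_iters + 1):
        # Hypothetical per-function gaps f_i(x_i^t) - f_i^* at iteration t.
        gaps = [g * r ** t for g, r in zip(initial_gaps, rates)]
        # Tightest C(t) such that the mean satisfies (6.0.37).
        c_t = (sum(gaps) / n) / mean_gap_0
        for i in range(n):
            bound = m[i] * c_t * initial_gaps[i]
            if gaps[i] > bound * (1.0 + 1e-9):  # small slack for rounding
                return False
    return True

random.seed(0)
gaps0 = [random.uniform(0.1, 10.0) for _ in range(5)]
rates = [random.uniform(0.3, 0.99) for _ in range(5)]
print(check_proposition(gaps0, rates, 50))
```

The check passes for any non-negative gaps, because $M_i\,C(t)\,(f_i(x_i^0) - f_i^\star)$ equals the sum of all gaps at iteration $t$. This also shows why $M_i$ can be large when function $i$ starts close to its minimum relative to the others.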

7 Numerical example

We now consider an image deblurring problem, with forward operator $A$ given by a Gaussian blur with $\sigma = 2.0$. We take the following:

  • Ground truth data $x \sim \mathcal{X}$, where $\mathcal{X}$ is the set of $28 \times 28$ pixel MNIST images [8].

  • For $x \sim \mathcal{X}$, generate an observation $y = Ax + \varepsilon$, where $\varepsilon$ is noise sampled from a zero-mean Gaussian distribution, with $\|y - Ax\|_2 / \|Ax\|_2 \approx 0.04$.
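The data generation described in the two items above can be sketched as follows. This is a minimal illustration, not the authors' code: the separable blur with zero padding, the kernel truncation at four standard deviations, and the exact rescaling of the noise to a relative level of $0.04$ are all assumptions, and a random array stands in for an MNIST image.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=4.0):
    """Normalised 1-D Gaussian kernel, truncated at `truncate` standard deviations."""
    radius = int(truncate * sigma + 0.5)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    return kernel / kernel.sum()

def gaussian_blur(image, sigma=2.0):
    """Separable Gaussian blur with zero padding (assumed boundary handling).

    Requires each image side to be at least as long as the kernel.
    """
    k = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def make_observation(x, sigma=2.0, rel_noise=0.04, rng=None):
    """Return y = A x + noise with ||y - A x||_2 / ||A x||_2 equal to rel_noise."""
    rng = np.random.default_rng() if rng is None else rng
    clean = gaussian_blur(x, sigma)
    noise = rng.standard_normal(x.shape)  # zero-mean Gaussian noise
    noise *= rel_noise * np.linalg.norm(clean) / np.linalg.norm(noise)
    return clean + noise

rng = np.random.default_rng(0)
x = rng.random((28, 28))  # stand-in for a 28 x 28 MNIST image
y = make_observation(x, rng=rng)
print(np.linalg.norm(y - gaussian_blur(x)) / np.linalg.norm(gaussian_blur(x)))
```

Rescaling the noise fixes the relative noise level exactly, whereas the text only states that it is approximately $0.04$; sampling with a fixed standard deviation would be an equally plausible reading.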

Figure 1: Observation (Left) and reconstruction (Right)

Figure 1 shows an example observation and ground-truth pair. For the purpose of recovering an approximation to $x_{\text{true}}$ using our observation $y$, we formulate a minimisation problem given by

$$\min_x \left\{ f(x) := \frac{1}{2}\|Ax - y\|_2^2 + \alpha H_\epsilon(x) \right\}, \tag{7.0.1}$$

with $\alpha = 10^{-4}$, and $H_\epsilon$ given as the Huber regularised Total Variation [13, 19] with $\epsilon = 0.01$, defined as

$$H_\varepsilon(\mathrm{D}u) = \sum_{i=1,j=1}^{m,n} h_\varepsilon\left(\sqrt{(\mathrm{D}u)_{i,j,1}^2 + (\mathrm{D}u)_{i,j,2}^2}\right), \tag{7.0.2}$$

with the Huber loss defined as

$$h_\varepsilon(s) = \begin{cases} \frac{1}{2\varepsilon} s^2, & \text{if } |s| \le \varepsilon, \\ |s| - \frac{\varepsilon}{2}, & \text{otherwise}, \end{cases} \tag{7.0.3}$$

and the gradient operator $\mathrm{D}: \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{m \times n \times 2}$ equal to

$$(\mathrm{D}u)_{i,j,1} = \begin{cases} u_{i+1,j} - u_{i,j} & \text{if } 1 \le i < m, \\ 0 & \text{else}, \end{cases} \tag{7.0.4}$$

$$(\mathrm{D}u)_{i,j,2} = \begin{cases} u_{i,j+1} - u_{i,j} & \text{if } 1 \le j < n, \\ 0 & \text{else.} \end{cases} \tag{7.0.5}$$
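The definitions (7.0.2) to (7.0.5) translate fairly directly into array code. The following is a minimal NumPy sketch; the function names `grad_op`, `huber` and `huber_tv` are chosen for this illustration and do not come from the paper.

```python
import numpy as np

def grad_op(u):
    """Forward differences (7.0.4)-(7.0.5); the last row/column difference is zero."""
    du = np.zeros(u.shape + (2,))
    du[:-1, :, 0] = u[1:, :] - u[:-1, :]   # vertical differences u[i+1, j] - u[i, j]
    du[:, :-1, 1] = u[:, 1:] - u[:, :-1]   # horizontal differences u[i, j+1] - u[i, j]
    return du

def huber(s, eps):
    """Huber loss (7.0.3): quadratic for |s| <= eps, linear otherwise."""
    s = np.abs(s)
    return np.where(s <= eps, s ** 2 / (2.0 * eps), s - eps / 2.0)

def huber_tv(u, eps=0.01):
    """Huber-regularised total variation (7.0.2) of a 2-D image u."""
    du = grad_op(u)
    magnitude = np.sqrt(du[..., 0] ** 2 + du[..., 1] ** 2)
    return huber(magnitude, eps).sum()

step_image = np.array([[0.0, 0.0], [1.0, 1.0]])
print(huber_tv(step_image))  # two unit jumps, each contributing 1 - eps / 2
```

A constant image has zero Huber TV, and for jumps much larger than $\varepsilon$ the value is close to the usual isotropic total variation, which is the sense in which the Huber version is a smoothed surrogate.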

Then, each function $f$ is $L$-smooth, where [5]

$$L = \|A\|^2 + \frac{\alpha\|D\|^2}{\varepsilon} \le 1 + \frac{8\alpha}{\varepsilon} = 1.08. \tag{7.0.6}$$
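The two ingredients of (7.0.6), namely $\|D\|^2 \le 8$ and the resulting value $1.08$, can be checked numerically. The sketch below estimates $\|D\|^2$ by power iteration on $D^\top D$ for a $28 \times 28$ grid and then evaluates the bound; it takes $\|A\|^2 \le 1$ as given from the text, and all function names are hypothetical.

```python
import numpy as np

def grad_op(u):
    """Forward differences with zero last row/column, as in (7.0.4)-(7.0.5)."""
    du = np.zeros(u.shape + (2,))
    du[:-1, :, 0] = u[1:, :] - u[:-1, :]
    du[:, :-1, 1] = u[:, 1:] - u[:, :-1]
    return du

def grad_op_adjoint(p):
    """Adjoint of grad_op: <grad_op(u), p> = <u, grad_op_adjoint(p)>."""
    out = np.zeros(p.shape[:2])
    out[:-1, :] -= p[:-1, :, 0]
    out[1:, :] += p[:-1, :, 0]
    out[:, :-1] -= p[:, :-1, 1]
    out[:, 1:] += p[:, :-1, 1]
    return out

def squared_norm_estimate(shape, num_iters=500, seed=0):
    """Power iteration on D^T D; the result never exceeds ||D||^2."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(shape)
    u /= np.linalg.norm(u)
    value = 0.0
    for _ in range(num_iters):
        v = grad_op_adjoint(grad_op(u))
        value = np.linalg.norm(v)
        u = v / value
    return value

def smoothness_bound(alpha, eps, norm_a_sq=1.0, norm_d_sq=8.0):
    """Right-hand side of (7.0.6): ||A||^2 + alpha * ||D||^2 / eps."""
    return norm_a_sq + alpha * norm_d_sq / eps

print(squared_norm_estimate((28, 28)))         # stays below 8
print(smoothness_bound(alpha=1e-4, eps=0.01))  # approximately 1.08
```

On a finite grid the estimate sits slightly below $8$, so the constant $8$ is a convenient, grid-independent upper bound rather than the exact operator norm.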

Learning preconditioners $G_{\theta_t}$:

Let

$$\mathcal{X}_k = \{x \in \mathcal{X} : x \text{ has label } k\}, \tag{7.0.7}$$

then define the training set and two testing sets by the following:

  • Training set: Image set $\mathcal{T}_{1,1} \subset \mathcal{X}_1$ of MNIST ones, with $|\mathcal{T}_{1,1}| = 1000$.

  • Testing set $1$: Image set $\mathcal{T}_{1,2} \subset \mathcal{X}_{1,2}$ of MNIST ones not in the training set, with $|\mathcal{T}_{1,2}| = 100$.

  • Testing set $2$: Image set $\mathcal{T}_2 \subset \mathcal{X} \setminus \mathcal{X}_1$ of MNIST digits in $\{2,3,4,5,6,7,8,9,0\}$, with $|\mathcal{T}_2| = 100$.

The learned convolutional kernel is chosen to be of size $28 \times 28$.

At iteration $t$, for a parametrisation $G_\theta$, we learn $\theta_t$ via the following procedure. For a tolerance $\nu = 10^{-3}$, the stopping condition is given by

$$\frac{\|\nabla g_t(\theta_t^w)\|_2}{\|\nabla g_t(\tilde{\theta})\|_2} < \nu, \tag{7.0.8}$$

at some sub-iteration $w$, for $\tilde{\theta}$ defined in (6.0.19) and $\theta_t^w$ as in (5.0.1). A stopping iteration $\tilde{T} = 5000$ is also used, such that if the optimisation algorithm has not terminated due to the criterion (7.0.8), we set $\theta_t = \theta_t^{\tilde{T}}$.
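The stopping rule (7.0.8), combined with the iteration cap $\tilde{T}$, can be sketched as a generic inner loop. The code below is an illustration under assumptions: the inner solver is taken to be plain gradient descent with a fixed step (the actual update is the one in (5.0.1)), `grad` stands in for $\nabla g_t$, `theta_ref` stands in for $\tilde{\theta}$, and the quadratic test problem is invented.

```python
import numpy as np

def minimise_with_relative_stop(grad, theta_init, theta_ref, step,
                                nu=1e-3, max_iters=5000):
    """Iterate until ||grad(theta)|| < nu * ||grad(theta_ref)|| or max_iters is hit.

    Returns the final iterate and the sub-iteration count w at termination.
    """
    ref_norm = np.linalg.norm(grad(theta_ref))
    theta = np.array(theta_init, dtype=float)
    for w in range(max_iters):
        g = grad(theta)
        if np.linalg.norm(g) < nu * ref_norm:  # criterion (7.0.8)
            return theta, w
        theta = theta - step * g               # assumed inner update
    return theta, max_iters                    # fall back to the capped iterate

# Hypothetical convex quadratic 0.5 * theta^T H theta - b^T theta.
H = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
grad = lambda theta: H @ theta - b

theta, w = minimise_with_relative_stop(grad, np.zeros(2), np.zeros(2), step=0.1)
print(w, np.linalg.norm(grad(theta)))
```

Normalising by the gradient norm at a reference point makes the tolerance scale-free, so the same $\nu$ can be used at every outer iteration $t$ even though the magnitude of $g_t$ changes as the iterates approach the minimisers.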

Preconditioners are learned up to iteration $T = 100$, such that we learn preconditioners $G_{\theta_0}, \cdots, G_{\theta_{99}}$, for our parameterisations (P1-P4) in Table 1.

Comparison of learned parametrisations with classical hand-crafted optimisation algorithms

In Figure 2, we see the performance of the learned preconditioners over the first $100$ iterations against gradient descent with a constant step size equal to $1/L$. In particular, we see that, despite the diagonal preconditioner and the convolutional preconditioner having the same number of parameters, the convolutional preconditioner dramatically outperforms the diagonal preconditioner in this numerical experiment. In this case, there is evidence that adding local information is more important than adding pixel-specific flexibility.

Figure 2: Performance of learned preconditioners on the test set $\mathcal{T}_{1,2}$, compared to a step-size given by $1/L$. On the $x$-axis, we have the iteration number, and on the $y$-axis, we have $f_k(x_k^t) - f_k^*$ averaged over all data points.

Further comparisons are shown in Figure 3. For example, in the left image, we compare our learned preconditioning with FISTA and BFGS. The learned methods initially significantly outperform these hand-crafted methods. However, the hand-crafted methods overtake the learned methods at later iterations, as shown in Figure 4. Figure 5 suggests that these later iterations may be of little practical importance, as the images at these iterations are visually very similar to those at slightly higher objective values.

Figure 3: (Left) Performance of learned preconditioners on the test set $\mathcal{T}_{1,2}$, compared to a step-size given by $1/L$, backtracking line-search, and BFGS. (Right) Performance of learned preconditioners on the test set $\mathcal{T}_2$, compared to a step-size given by $1/L$, backtracking line-search, and BFGS.
Figure 4: A comparison of the learned convolutional and full matrix preconditioners to BFGS on the test set $\mathcal{T}_{1,2}$. We observe much faster initial convergence for the learned methods, but they are eventually overtaken by BFGS. In Figure 5, we see this may not be relevant as the images do not visually change much below a loss of $\frac{5}{10^{4}}$.

Figure 5: A comparison of BFGS, convolution and full matrix preconditioning reconstructions. (Rows: Full, Convolution, BFGS. Columns: Iteration 10, Iteration 20, Iteration 50, Iteration 90.)

Freeze vs recycle

We now compare freezing the final preconditioner against recycling all learned preconditioners. Figure 6 shows that recycling preconditioners can produce more unstable behaviour, for example for the learned step sizes (blue) and the learned full matrix (green). For the learned full matrix, however, the final performance is better when recycling preconditioners. On the other hand, when freezing the final preconditioner, the learned diagonal preconditioner leads to divergence. It is therefore not obvious which choice is better; it depends on the parametrisation under consideration.

Figure 6: Performance comparison of freezing the final preconditioner vs recycling the learned preconditioners on iterations after training, i.e. starting at iteration $100$.

Learned preconditioners:

Now, we visualise the learned parameters for convolutional and diagonal preconditioning and learned step sizes. We see learned convolutional kernels in Figures 8 and 9. Note that these learned kernels contain negative values. While this does not necessarily imply that the corresponding matrices learned are not positive-definite, in Figures 11 and 12, we see that the learned diagonals have negative values, meaning that the learned matrices are not positive-definite!
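The distinction drawn above can be illustrated with a small numerical example. The sketch assumes periodic boundary conditions, so that the convolution matrix is block-circulant and its eigenvalues are the 2-D discrete Fourier transform of the kernel; the boundary handling used in the experiments is not stated here, and the kernel below is invented rather than learned.

```python
import numpy as np

def circulant_eigenvalues(kernel):
    """Eigenvalues of the periodic-convolution matrix: the 2-D DFT of the kernel."""
    return np.fft.fft2(kernel)

def is_positive_definite_diagonal(diagonal):
    """A diagonal matrix is positive-definite iff every diagonal entry is positive."""
    return bool(np.all(np.asarray(diagonal) > 0))

# Hypothetical symmetric kernel with negative off-centre entries.
kernel = np.zeros((8, 8))
kernel[0, 0] = 3.0
kernel[0, 1] = kernel[0, -1] = -0.5
kernel[1, 0] = kernel[-1, 0] = -0.5

eig = circulant_eigenvalues(kernel)
print(kernel.min())                               # negative kernel entries
print(eig.real.min())                             # about 1: all eigenvalues positive
print(is_positive_definite_diagonal([1.0, -0.2])) # a negative diagonal entry rules out PD
```

Here the eigenvalues are $3 - \cos(2\pi k/8) - \cos(2\pi l/8) \ge 1$, so the convolution matrix is positive-definite even though the kernel has negative entries. For a diagonal matrix the entries are the eigenvalues, so a single negative entry is conclusive.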

Figure 8: Learned convolutional kernels for the first $8$ images, on the same scale.

Figure 9: Learned convolutional kernels at iteration $10, 25, 50, 75$.

Figure 11: Learned diagonal preconditioner (reshaped to an image) for the first $8$ images, on the same scale.

Figure 12: Learned diagonal preconditioner (reshaped to an image) at iteration $10, 25, 50, 75$.

Figure 13 shows learned step sizes. Note that these values often fluctuate up and down. Furthermore, the learned step sizes are initially all greater than $2/L$ and, therefore, out of the range of provable convergence. However, after iteration $55$, the learned step sizes fluctuate above and below this threshold.
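A toy example may help explain why iteration-dependent step sizes above $2/L$ need not cause divergence. The sketch below is not the learned sequence from Figure 13: it uses a hypothetical two-dimensional quadratic with curvatures $\mu = 0.1$ and $L = 1.08$, and compares a varying step sequence with a constant step that exceeds $2/L$.

```python
import numpy as np

def gradient_descent(hessian_diag, x0, step_sizes):
    """Gradient descent on f(x) = 0.5 * sum(h_i * x_i^2) with per-iteration steps."""
    x = np.array(x0, dtype=float)
    for tau in step_sizes:
        x = x - tau * hessian_diag * x  # the gradient is h * x elementwise
    return x

mu, L = 0.1, 1.08
h = np.array([mu, L])
x0 = [1.0, 1.0]

# Varying steps: 1/L removes the stiff component, then 1/mu removes the other.
varying = gradient_descent(h, x0, [1.0 / L, 1.0 / mu])
# Constant step 1/mu, which is larger than 2/L, amplifies the stiff component.
constant = gradient_descent(h, x0, [1.0 / mu, 1.0 / mu])

print(1.0 / mu > 2.0 / L)
print(np.linalg.norm(varying))
print(np.abs(constant[1]))
```

The large step $1/\mu$ is harmless once the stiff component has already been removed by the step $1/L$, whereas using it at every iteration multiplies that component by $|1 - L/\mu|$ each time. The bound $2/L$ is a condition for a constant step, which is one reason a learned, varying sequence may sit above it and still make progress.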

Figure 13: Learned step sizes.

8 Conclusions and future work

8.1 Conclusions

This paper introduced how one can learn preconditioners for gradient descent for use in ill-conditioned optimisation problems. We formulated how to generate a sequence of preconditioners learned using a convex optimisation problem on a dataset such that

  • the preconditioners need not be positive definite or symmetric,

  • the preconditioner at iteration $t$ is constant over all functions,

  • the preconditioners have a closed-form expression for least-squares problems,

  • convergence is guaranteed for all functions in the training set,

  • a convergence rate is proved for each training function.

Empirical performance was tested, with good results, especially for convolutional preconditioning in the image deblurring example, with maintained performance on out-of-distribution test images.

8.2 Future Work

As was seen in the numerical experiments, despite the tremendous early-iteration performance of the full-matrix and convolutional preconditioning, we saw that the performance of FISTA eventually overtook that of the learned algorithms. Future research aims to extend these learned methods to include momentum terms. One way of achieving this is to extend the minimisation problem (3.0.5) to include a momentum preconditioner.

One potential drawback of the learned preconditioning is not knowing $\lim_{t \to \infty} G_{\theta_t}$, which limits the convergence analysis on unseen data. To remedy this, one can introduce regularisation of the learned parameters. Another use of regularisation is to encourage preconditioners to exhibit specific behaviours. One may encourage preconditioners to exhibit symmetry or smoothness, for example.

This paper considered only diagonal matrices, full matrices, and convolutions, which is not an exhaustive list of potential parametrisations. Despite significantly fewer parameters learned, we saw that convolution may offer a more expressive update than the diagonal preconditioner. One could also consider an update similar to Quasi-Newton methods.

This paper only considered an explicit dataset of functions $f_1, \cdots, f_N$, equivalent to sampling from the class of functions $\mathcal{F}$ using a Dirac delta distribution. One can instead consider sampling from a class of functions $\mathcal{F}$ using an arbitrary probability distribution.

Finally, in this paper, we only considered learning a matrix preconditioner. Future work would extend this to consider preconditioners as a function of $x_t$ and $\nabla f(x_t)$, for example.

References

  • [1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in neural information processing systems, 29, 2016.
  • [2] Sebastian Banert, Axel Ringh, Jonas Adler, Johan Karlsson, and Ozan Oktem. Data-driven nonsmooth optimization. SIAM Journal on Optimization, 30(1):102–131, 2020.
  • [3] Amir Beck. Introduction to nonlinear optimization: Theory, algorithms, and applications with MATLAB. SIAM, 2014.
  • [4] Silvia Bonettini, Ignace Loris, Federica Porta, and Marco Prato. Variable metric inexact line-search-based methods for nonsmooth optimization. SIAM journal on optimization, 26(2):891–921, 2016.
  • [5] Antonin Chambolle and Thomas Pock. An introduction to continuous optimization for imaging. Acta Numerica, 25:161–319, 2016.
  • [6] Tianlong Chen, Xiaohan Chen, Wuyang Chen, Howard Heaton, Jialin Liu, Zhangyang Wang, and Wotao Yin. Learning to optimize: A primer and a benchmark. Journal of Machine Learning Research, 23(189):1–59, 2022.
  • [7] William C Davidon. Variable metric method for minimization. SIAM Journal on optimization, 1(1):1–17, 1991.
  • [8] Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • [9] Matthias J Ehrhardt, Pawel Markiewicz, and Carola-Bibiane Schönlieb. Faster PET reconstruction with non-smooth priors by randomization and preconditioning. Physics in Medicine & Biology, 64(22):225019, 2019.
  • [10] Roger Fletcher. A new approach to variable metric algorithms. The computer journal, 13(3):317–322, 1970.
  • [11] Paul Häusner, Ozan Öktem, and Jens Sjölund. Neural incomplete factorization: learning preconditioners for the conjugate gradient method. arXiv preprint arXiv:2305.16368, 2023.
  • [12] Tao Hong, Xiaojian Xu, Jason Hu, and Jeffrey A Fessler. Provable preconditioned plug-and-play approach for compressed sensing MRI reconstruction. arXiv preprint arXiv:2405.03854, 2024.
  • [13] Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics: Methodology and distribution, pages 492–518. Springer, 1992.
  • [14] Kirsten Koolstra and Rob Remis. Learning a preconditioner to accelerate compressed sensing reconstructions in MRI. Magnetic Resonance in Medicine, 87(4):2063–2073, 2022.
  • [15] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
  • [16] Vishal Monga, Yuelong Li, and Yonina C Eldar. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. IEEE Signal Processing Magazine, 38(2):18–44, 2021.
  • [17] Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018.
  • [18] Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 1999.
  • [19] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
  • [20] Hong Ye Tan, Subhadip Mukherjee, Junqi Tang, and Carola-Bibiane Schönlieb. Data-driven mirror descent with input-convex neural networks. SIAM Journal on Mathematics of Data Science, 5(2):558–587, 2023.
  • [21] Singanallur V Venkatakrishnan, Charles A Bouman, and Brendt Wohlberg. Plug-and-play priors for model based reconstruction. In 2013 IEEE global conference on signal and information processing, pages 945–948. IEEE, 2013.

Appendix A Nonconvexity for multiple steps

This can be seen as

\begin{align}
f(x_2(\theta_0,\theta_1)) &= f\big(x_1(\theta_0) - G_{\theta_1}(\nabla f(x_1(\theta_0)))\big) \tag{A.0.1}\\
&= f\big(x_0 - G_{\theta_0}(\nabla f(x_0)) - G_{\theta_1}(\nabla f(x_0 - G_{\theta_0}(\nabla f(x_0))))\big) \tag{A.0.2}
\end{align}

is non-convex in general. In particular, if we wish to learn a single step-size $\alpha$ for a least-squares function $f(x)=\frac{1}{2}\|Ax-y\|^2$, then $\nabla f(x)=A^T(Ax-y)$ and

\begin{align}
f(x_2(\alpha)) &= f\big(x_0 - \alpha\nabla f(x_0) - \alpha A^T(A(x_0 - \alpha\nabla f(x_0)) - y)\big) \tag{A.0.4}\\
&= f\big(x_0 - \alpha\nabla f(x_0) - \alpha A^T A x_0 + \alpha^2 A^T A\nabla f(x_0) + \alpha A^T y\big) \tag{A.0.5}
\end{align}

and we see that we obtain the composition of a quadratic function in $\alpha$ with $f$, which is not convex in general. Indeed, consider a convex function $h$ and define $q(x)=h(x^2)$. If $q$ were convex, then, since $q(-x)=q(x)$, we would have

\begin{align}
q(0) = q\left(\frac{1}{2}x+\frac{1}{2}(-x)\right) \leq \frac{1}{2}q(x)+\frac{1}{2}q(-x) = h(x^2) \tag{A.0.7}
\end{align}

for all $x$, that is, $h(0)\leq h(x^2)$. This fails for a convex function such as $h(z)=(z-1)^2$ at $x=1$, where $h(0)=1>0=h(1)$, a contradiction.
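The non-convexity discussed in this appendix can be checked numerically. The sketch below is not taken from the paper's experiments: it assumes scalar preconditioners $G_{\theta_t} = \theta_t I$ in (A.0.1)–(A.0.2), a hypothetical one-dimensional least-squares problem ($A = 1$, $y = 0$, $x_0 = 1$), and reads the composition example as $q(x) = h(x^2)$ with the illustrative choice $h(z) = (z-1)^2$. In both cases it tests whether the value at a midpoint exceeds the average of the endpoint values, which a convex function would not allow.

```python
import numpy as np

def two_step_loss(theta0, theta1, A, y, x0):
    """f(x_2) after two gradient steps with scalar preconditioners theta_t * I."""
    grad = lambda x: A.T @ (A @ x - y)
    x1 = x0 - theta0 * grad(x0)
    x2 = x1 - theta1 * grad(x1)
    return 0.5 * np.linalg.norm(A @ x2 - y) ** 2

# Hypothetical 1-D problem, so that f(x) = x^2 / 2.
A = np.array([[1.0]])
y = np.array([0.0])
x0 = np.array([1.0])

# Two parameter pairs (theta0, theta1) and their midpoint.
loss_p = two_step_loss(1.0, 3.0, A, y, x0)
loss_q = two_step_loss(3.0, 1.0, A, y, x0)
loss_mid = two_step_loss(2.0, 2.0, A, y, x0)

# Convexity would require loss_mid <= average of the endpoint losses.
two_step_violation = loss_mid > 0.5 * (loss_p + loss_q)
print(two_step_violation)  # True

# Composition example: h is convex, but q(x) = h(x^2) is not.
h = lambda z: (z - 1.0) ** 2
q = lambda x: h(x ** 2)
composition_violation = q(0.0) > 0.5 * (q(1.0) + q(-1.0))
print(composition_violation)  # True
```

Here both endpoint pairs reach the minimiser after two steps, while their midpoint overshoots, so the loss as a function of $(\theta_0, \theta_1)$ cannot be convex.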