Learning Preconditioners for Inverse Problems
Abstract
We explore the application of preconditioning in optimisation algorithms, specifically those appearing in Inverse Problems in imaging. Such problems often contain an ill-posed forward operator and are large-scale. Therefore, computationally efficient algorithms which converge quickly are desirable. To remedy these issues, learning-to-optimise leverages training data to accelerate solving particular optimisation problems. Many traditional optimisation methods use scalar hyperparameters, significantly limiting their convergence speed when applied to ill-conditioned problems. In contrast, we propose a novel approach that replaces these scalar quantities with matrices learned using data. Often, preconditioning considers only symmetric positive-definite preconditioners. However, we consider multiple parametrisations of the preconditioner, which do not require symmetry or positive-definiteness. These parametrisations include using full matrices, diagonal matrices, and convolutions. We analyse the convergence properties of these methods and compare their performance against classical optimisation algorithms. Generalisation performance of these methods is also considered, both for in-distribution and out-of-distribution data.
1 Introduction
A linear inverse problem is defined by receiving an observation $y$, generated from a ground truth $x^{\ast}$ via some linear forward operator $A$, such that
(1.0.1) $y = Ax^{\ast} + \varepsilon$,
where $\varepsilon$ is some random noise. In this formulation, $y$ and $A$ are known, and the goal is to recover $x^{\ast}$. Such a problem is often ill-posed due to the noise inherent in the observation. To remedy this, one may introduce a data-fidelity term to enforce that $Ax$ and $y$ are "close" and a regularisation function to enforce that the solution has desired properties, such that the minimiser of the function
(1.0.2) |
approximates the ground truth.
In this paper, we consider solving the optimisation problem given by
(1.0.3) $\min_{x} f(x)$,
with the assumption that $f$ is continuously differentiable, convex, and $L$-smooth, and that a global minimiser exists.
To approximate a solution to this optimisation problem, one can use gradient descent:
(1.0.4) $x_{t+1} = x_t - \alpha_t \nabla f(x_t)$.
Various strategies exist for determining the step size $\alpha_t$, including a fixed step size, exact line search, and backtracking line search [18]. However, especially for ill-conditioned problems, gradient descent can converge very slowly.
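To illustrate the slow convergence on ill-conditioned problems, the following minimal Python sketch runs fixed-step gradient descent on a two-dimensional quadratic. The quadratic, its condition number of 1000, the iteration count and the helper name `gradient_descent` are illustrative assumptions and are not taken from the experiments in this paper.

```python
import numpy as np

def gradient_descent(grad, x0, step, n_iters):
    """Run x_{t+1} = x_t - step * grad(x_t) and return the final iterate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - step * grad(x)
    return x

# Ill-conditioned quadratic f(x) = 0.5 * x^T H x with minimiser 0 (assumed toy problem).
H = np.diag([1.0, 1000.0])
L = 1000.0                       # smoothness constant: largest eigenvalue of H
grad = lambda x: H @ x

x_gd = gradient_descent(grad, x0=[1.0, 1.0], step=1.0 / L, n_iters=100)
# The well-scaled coordinate converges at once; the other is still roughly 0.9.
print(x_gd)
```

The step size $1/L$ is dictated by the steepest direction, so progress along the flat direction is limited to a factor of $1 - 1/1000$ per iteration.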
1.1 Preconditioning
The issue of slow convergence in gradient descent can be remedied by introducing a matrix-valued step size, otherwise referred to as a preconditioner. Preconditioned gradient descent often considers a symmetric positive-definite matrix $P_t$ such that the update is now given by
(1.1.1) $x_{t+1} = x_t - P_t \nabla f(x_t)$.
One such choice of $P_t$ is the inverse Hessian, which gives Newton’s method, with update equation
(1.1.2) $x_{t+1} = x_t - \left(\nabla^2 f(x_t)\right)^{-1} \nabla f(x_t)$.
This method, given that $f$ is twice continuously differentiable, $L$-smooth and $\mu$-strongly convex for some $\mu > 0$, achieves quadratic convergence, compared to linear convergence for gradient descent [3]. However, this method comes with multiple drawbacks:
- For ill-conditioned problems, the computation of the inverse Hessian may be unstable and lead to an inaccurate estimate.
- To remedy this, one may calculate the inverse Hessian as the solution to the equation
(1.1.3) This can be approximated, for example, by using Conjugate Gradient.
- Approximating the solution to equation (1.1.3) can be computationally expensive. This can be an issue when optimising quickly is important.
- Storing the inverse Hessian requires storing an $n \times n$ matrix, which may be infeasible for large $n$, as often occurs in imaging inverse problems.
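As a small illustration of the Newton update and of the advice to solve a linear system rather than form the inverse Hessian, the sketch below applies one Newton step to a strongly convex quadratic, for which a single step reaches the minimiser. The quadratic, its dimension and all variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
H = M.T @ M + 0.1 * np.eye(5)    # symmetric positive-definite Hessian (assumed)
b = rng.standard_normal(5)

# f(x) = 0.5 * x^T H x - b^T x, so grad f(x) = H x - b and the minimiser solves H x = b.
x0 = np.zeros(5)
g0 = H @ x0 - b

# Newton direction: solve H d = g0 instead of computing the inverse of H explicitly.
d = np.linalg.solve(H, g0)
x_newton = x0 - d

x_star = np.linalg.solve(H, b)
print(np.linalg.norm(x_newton - x_star))
```

For large problems the dense solve would be replaced by an iterative method such as Conjugate Gradient, which is where the cost discussed above arises.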
Other choices of the preconditioner include quasi-Newton methods. Such methods construct the preconditioner as an approximation of the inverse Hessian, and this approximation can change over iterations. One example is the BFGS algorithm [10], which starts with some symmetric positive-definite matrix and calculates each new approximation from the previous one using a rank-two update. Quasi-Newton methods lie within ’variable metric’ methods [7], which construct a symmetric, positive-definite matrix at each iteration. This general class of methods has been studied for nonsmooth optimisation [4].
One application of hand-crafted preconditioners in inverse problems is in parallel MRI. The authors of [12] propose hand-crafted preconditioners with the aim of speeding up the convergence of a plug-and-play approach [21], whereas [14] consider a circulant preconditioner which leads to an acceleration factor of . Preconditioning has also been applied for PET imaging [9].
Learning preconditioners offline can remedy the issues of calculating preconditioners online and improve performance on a ’small’ class of relevant functions. Learned preconditioners have been considered in [11], where the preconditioner is constrained to the set of symmetric, positive-definite matrices through a learned mapping. In [14], a convolutional neural network preconditioner is learned as a function of the observation. This preconditioner is not required to be symmetric or positive-definite; however, because the preconditioner is learned, the resulting optimisation algorithm is not necessarily convergent.
1.2 Learning-to-optimise
Although there exist optimisation algorithms that are optimal for large problem classes, practitioners usually only focus on a very narrow subclass. For example, one may only be interested in reconstructing blurred observations generated from a distribution with a known constant blurring operator . One might then consider the following class of functions:
(1.2.1) |
where is a chosen regularisation function. Learning-to-optimise [6] aims to minimise objective functions quickly over a given class of functions (see (1.2.1)) and a distribution of initial points . If the class of functions chosen is small, an optimisation algorithm that massively accelerates optimisation within this class can likely be learned. However, the performance on functions outside of this class may be poor. If the optimisation algorithm can be parametrised by
(1.2.2) |
for , then the parameters can be chosen to satisfy
(1.2.3) |
for some fixed , for example.
Algorithm unrolling [16] directly parameterises the update step as a ’neural network’, often taking previous iterates and gradients as arguments. Parameters are then found to approximately solve (1.2.3). These methods have been shown empirically to speed up optimisation in various settings. Many learned optimisation solvers, including those using reinforcement learning [15] and RNNs [1], do not have convergence guarantees. Others come with provable convergence; for example, Banert et al. [2] developed a method for nonsmooth optimisation inspired by proximal splitting methods. Such methods, however, often greatly limit the number of learnable parameters and, therefore, the extent to which the algorithm can be adapted to a particular problem class. Learned optimisation algorithms also exist in which the parameters are held constant throughout iterations, i.e. for all . For example, [20] learn mirror maps using input-convex neural networks within the mirror descent optimisation algorithm.
1.3 Our approach
We consider learning a preconditioner at each iteration of gradient descent. Therefore, we seek to learn a parametrised update map of the form
(1.3.1) |
We simplify the learning procedure by reducing this to the following parametrisation:
(1.3.2) |
In other words,
-
•
The parametrisation is constant throughout different iterations: for all (but with potentially different parameters).
-
•
is a matrix and does not take any inputs.
We create an optimisation algorithm that is provably convergent on training data without requiring the learned preconditioners to be confined to symmetric positive-definite matrices. We pose parameter learning as a convex optimisation problem using greedy learning and specific parameterisations. Therefore, any local minimiser is a global minimiser, removing the issue of being ’stuck’ in local optima. We also derive closed-form preconditioners in the case of least-squares objective functions. Finally, we investigate the generalisation properties of these methods, with out-of-sample data (data within the class used in training but not seen in training) and out-of-distribution data (data with a different distribution to those used in training).
2 Background
We require the following definitions.
Definition 2.1.
Convexity
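A function $f$ is convex if for all $x, y$ in its domain and for all $\lambda \in [0, 1]$:
(2.0.1) $f(\lambda x + (1 - \lambda) y) \leq \lambda f(x) + (1 - \lambda) f(y)$.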
Definition 2.2.
Strong Convexity
A function $f$ is strongly convex with parameter $\mu > 0$ if $f - \frac{\mu}{2}\|\cdot\|^2$ is convex.
Definition 2.3.
L-smoothness
A function $f$ is L-smooth with parameter $L > 0$ if its gradient is Lipschitz continuous, i.e., if for all $x, y$, the following inequality holds:
(2.0.2) $\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$.
We say
-
•
if is convex, continuously differentiable and -smooth.
-
•
if, in addition, is -strongly convex.
3 Greedy preconditioning
With these definitions, we can now formulate our method. This paper considers learning from a class of functions given by some . With this in mind, we consider a dataset of functions with initial points and minimisers , respectively. Throughout this paper, we assume that for some , for each .
We parametrise a preconditioner at each iteration in the preconditioned gradient descent algorithm (1.3.2), and restrict such that is affine in the parameters , where represents the iterate at iteration for datapoint . Then there exist such that
(3.0.1) |
Note that when is affine in its , then the optimisation problem
(3.0.2) |
is convex as it is the composition of a convex function with an affine function [3]. Therefore, if a local minimiser exists, it is a global minimiser, and we avoid local minima traps. Note that if , then the problem (3.0.2) reduces to exact line search for .
If instead we chose to learn sets of parameters for simultaneously, such that
(3.0.3) |
we have obtained a nonconvex optimisation problem for (shown in Appendix A), where
(3.0.4) |
Consider the optimisation problem at iteration given by
(3.0.5) |
As is affine in and each function is convex, this optimisation problem is convex. In this optimisation problem, given the current iterates for , we seek to choose the optimal greedy parameters at iteration such that we minimise the mean over every . In learning-to-optimise, learning parameters is often a non-convex optimisation problem. Therefore, the performance of learned optimisers is highly dependent on the optimisation algorithm used and its hyperparameters. However, as our problem is convex, one can use any convex optimisation algorithm with convergence guarantees.
We focus on the following four parameterisations of , presented in Table 1.
Label | Description | Parametrisation |
---|---|---|
(P1) | A scalar step size | |
(P2) | A diagonal matrix | |
(P3) | A full matrix | |
(P4) | Image convolution | |
The number of parameters can vary widely between these four parametrisations when the problem dimension is large. Increasing the number of parameters may make the preconditioner more expressive, enabling better performance on training data, but it may also reduce generalisation performance on out-of-sample data. Moreover, some parameterisations may be more expressive than others with as many or even fewer parameters, as seen in section 7. Note that using a full-matrix preconditioner corresponds to the same memory usage as Newton’s method.
Suppose that training is terminated after iterations, after learning parameters , due to an a priori choice of stopping iteration or some stopping condition, for example. Then, preconditioners are learned and, for a new test function with initial point , the minimum of can be approximated using the following optimisation algorithm:
(3.0.6) |
One other choice is to ’recycle’ the learned parameters , such that at iteration , the parameters are used.
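The two ways of using the learned preconditioners beyond the training horizon can be sketched as follows. This minimal Python illustration assumes that "freezing" reuses the last learned preconditioner once the learned sequence is exhausted, and that "recycling" cycles through the sequence with index `t % T`; the function name `run_learned`, the `mode` argument and the toy objective are hypothetical.

```python
import numpy as np

def run_learned(grad, x0, preconditioners, n_iters, mode="freeze"):
    """Preconditioned gradient descent with a finite list of learned matrices.

    For t < T the t-th matrix is used. Afterwards, "freeze" keeps the last
    matrix and "recycle" cycles through the list (assumed interpretation).
    """
    x = np.asarray(x0, dtype=float)
    T = len(preconditioners)
    for t in range(n_iters):
        if t < T:
            idx = t
        elif mode == "freeze":
            idx = T - 1
        else:
            idx = t % T
        x = x - preconditioners[idx] @ grad(x)
    return x

# Toy objective f(x) = 0.5 * ||x||^2 with two hand-picked "learned" matrices.
grad = lambda x: x
learned = [0.5 * np.eye(2), 0.25 * np.eye(2)]
x_freeze = run_learned(grad, np.ones(2), learned, n_iters=4, mode="freeze")
x_recycle = run_learned(grad, np.ones(2), learned, n_iters=4, mode="recycle")
print(x_freeze, x_recycle)
```

Both modes coincide for the first `T` iterations and differ only afterwards, which is the regime compared in the numerical section.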
In section 4, we restrict each to least-squares problems and see closed-form solutions for parameterisations (P1), (P2) and (P3). In particular, we will see that a diagonal preconditioner can cause preconditioned gradient descent to converge instantly. In section 5, we will see how to approximate the optimal preconditioner using optimisation for a more general class of functions. Then, in section 6, we provide convergence results, including rates for the closed-form and approximated greedy preconditioners for all parameterisations . Following this, in section 7, we apply these methods to a series of problems and compare performance with classical optimisation methods and other learned approaches.
4 Closed-form solutions
In this section, we assume each can be written as
(4.0.1) |
with corresponding and forward model . Under these assumptions, there exists a closed-form solution for affine parametrisations.
Proposition 4.1.
Proof.
Because this problem is convex, any point at which the derivative is equal to zero is a global minimiser. First, note that
(4.0.3) | ||||
(4.0.4) | ||||
(4.0.5) |
Now,
(4.0.7) | |||
(4.0.8) | |||
(4.0.9) |
is equal to zero if and only if
(4.0.10) |
Note that
(4.0.11) |
and so can be given by
(4.0.12) |
Due to the properties of the pseudoinverse, this is the least-norm solution. ∎
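For least-squares functions and an affine parametrisation, the greedy problem is itself a linear least-squares problem in the parameters, so a least-norm solution can be computed with a pseudoinverse. The sketch below assumes $f_i(x) = \tfrac{1}{2}\|A_i x - y_i\|^2$ and writes the preconditioned gradient as $B_i \theta$ for some matrix $B_i$; the symbol $B_i$, the function name and the random data are assumptions for illustration, and the normal equations in the code are derived under these assumptions rather than copied from the proposition.

```python
import numpy as np

def greedy_least_squares_params(As, ys, xs, Bs):
    """Least-norm minimiser of sum_i 0.5 * ||A_i (x_i - B_i theta) - y_i||^2."""
    lhs = sum((A @ B).T @ (A @ B) for A, B in zip(As, Bs))
    rhs = sum((A @ B).T @ (A @ x - y) for A, B, x, y in zip(As, Bs, xs, ys))
    return np.linalg.pinv(lhs) @ rhs

rng = np.random.default_rng(0)
N, m, n = 4, 6, 3
As = [rng.standard_normal((m, n)) for _ in range(N)]
ys = [rng.standard_normal(m) for _ in range(N)]
xs = [rng.standard_normal(n) for _ in range(N)]
grads = [A.T @ (A @ x - y) for A, x, y in zip(As, xs, ys)]

# Diagonal parametrisation: diag(theta) @ g equals diag(g) @ theta.
Bs = [np.diag(g) for g in grads]
theta = greedy_least_squares_params(As, ys, xs, Bs)

def mean_objective(th):
    """Mean objective after one preconditioned step with parameters th."""
    return np.mean([0.5 * np.linalg.norm(A @ (x - B @ th) - y) ** 2
                    for A, B, x, y in zip(As, Bs, xs, ys)])

print(mean_objective(np.zeros(n)), mean_objective(theta))
```

Choosing `Bs = [g.reshape(-1, 1) for g in grads]` instead gives the scalar step size under the same assumptions.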
The following proposition tells us when the parameters satisfying (3.0.5) are unique.
Proposition 4.2.
Suppose is convex and twice continuously differentiable for . Furthermore, suppose there exists some for which both is injective and also is -strongly convex. Then defined in (3.0.5) is strongly convex and has a unique global minimiser .
Proof.
Each is twice continuously differentiable; therefore, is twice continuously differentiable. It is then sufficient to show there exists such that
(4.0.13) |
for all , as this implies that is strongly convex and has a unique global minimiser. Note that
(4.0.14) |
Each is convex and so for all ,
(4.0.15) |
and is -strongly convex, therefore
(4.0.16) |
For ,
(4.0.17) | |||
(4.0.18) | |||
(4.0.19) | |||
(4.0.20) | |||
(4.0.21) | |||
(4.0.22) | |||
(4.0.23) |
where is the minimum eigenvalue of . Due to the symmetry of , and is greater than zero if and only if is injective. As is injective, then and
(4.0.25) | |||
(4.0.26) |
and therefore is strongly-convex. ∎
This result can then be used when considering least-squares functions.
Corollary 4.1.
Uniqueness of optimal parameters in the least-squares case
When our can be written as least-squares functions (4.0.1), then has a unique global minimiser if there exists some for which both and are injective.
Proof.
If is injective then is invertible which means that is strongly convex. ∎
4.1 Diagonal preconditioning
We first consider the diagonal parametrisation (P2). With this parametrisation, the optimisation problem (3.0.5) with has the following closed-form solution.
Proposition 4.3.
defined by
(4.1.1) |
is the minimal-norm solution to (3.0.5) with , where represents the Hadamard (element-wise) product.
Furthermore, suppose that there exists such that is injective ( is invertible) and that for all , where denotes the component of the vector . Then the inverse exists, and one can write
(4.1.2) |
Proof.
For a diagonal preconditioner for we have that
(4.1.3) |
and so we take
(4.1.4) | ||||
(4.1.5) | ||||
(4.1.6) |
Now,
(4.1.7) | ||||
(4.1.8) |
and
(4.1.9) |
Inserting these values in (4.0.2) gives
(4.1.10) |
In this case we have and . is therefore injective if and only if for and therefore by proposition 4.2 there is a unique solution, and so the inverse exists.
∎
Proposition 4.4.
In the case , we consider a lone function with an initial point . Then, the preconditioned gradient descent algorithm (1.3.2) with diagonal preconditioner converges in one iteration.
Proof.
Denote by the starting point and a global minimum of . As is chosen to be the global minimum of , it is sufficient to show there exists some diagonal preconditioner which leads to . Choose the vector such that
(4.1.11) |
then let
(4.1.12) |
Due to the fact that if then , we have
(4.1.13) |
Then
(4.1.14) | ||||
(4.1.15) | ||||
(4.1.16) | ||||
(4.1.17) |
as required. ∎
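The construction in this proof can be checked numerically. The sketch below builds a diagonal preconditioner for a single least-squares function, dividing the coordinate-wise distance to the minimiser by the corresponding gradient component wherever that component is non-zero. The random problem, the use of a least-squares objective and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 4))
y = rng.standard_normal(8)

# f(x) = 0.5 * ||A x - y||^2 with unique minimiser x_star (A has full column rank).
x_star = np.linalg.lstsq(A, y, rcond=None)[0]
x0 = rng.standard_normal(4)
g0 = A.T @ (A @ x0 - y)

# Diagonal entries: (x0_j - x_star_j) / g0_j where g0_j != 0, and 0 otherwise.
p = np.zeros_like(x0)
nonzero = g0 != 0
p[nonzero] = (x0[nonzero] - x_star[nonzero]) / g0[nonzero]

x1 = x0 - p * g0                 # one step with the preconditioner diag(p)
print(np.linalg.norm(x1 - x_star))
```

The entries of `p` are not constrained in sign, so the resulting diagonal matrix need not be positive-definite.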
4.2 Full matrix preconditioning
Next, we consider the full matrix parametrisation (P3). We consider the optimisation problem (3.0.5) with .
Proposition 4.5.
Let be such that
(4.2.1) |
Then define by
(4.2.2) |
This is the minimal-norm solution to (3.0.5) with , where represents the Kronecker product of two matrices, defined as
.
Note the matrix in (4.2.1) is of dimension . If is large, this dimension becomes extremely large.
Proof.
For a full matrix preconditioner , we require
(4.2.3) |
where, in this instance . From (4.2.4) we have that
(4.2.4) |
Note that
(4.2.5) | ||||
(4.2.6) | ||||
(4.2.7) |
Therefore,
(4.2.8) |
where , such that
(4.2.9) |
Therefore, is the matrix with columns for . Therefore,
(4.2.10) |
which corresponds to defined as
(4.2.11) |
We can also write
(4.2.12) |
Note that then
(4.2.13) |
(4.2.14) | ||||
(4.2.15) | ||||
(4.2.16) |
Secondly,
(4.2.17) | ||||
(4.2.18) | ||||
(4.2.19) |
Therefore, , the vectorised form of can be given by
(4.2.20) |
∎
While diagonal preconditioning obtains instant convergence for one function, full matrix preconditioning, under certain conditions, can obtain immediate convergence for all functions in the dataset if .
Corollary 4.2.
Suppose that is a linearly independent set, then if , the full matrix preconditioner causes instant convergence for all datapoints. In particular,
(4.2.21) |
for all .
Proof.
It is sufficient to show there exists a matrix such that (4.2.21) is satisfied for all . We require
(4.2.22) |
Each of these equations gives linear equations in unknowns. There are such equations and so we have linear equations in unknowns. Rewritten, these read
(4.2.23) |
For such a to exist we require
-
•
The columns to be linearly independent,
-
•
, which is equivalent to .
Note in the case that , we have a unique choice of :
(4.2.24) |
∎
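The corollary can be illustrated numerically. The sketch below stacks the gradients at the initial points as columns of a matrix `G` and the displacements to the minimisers as columns of `X0 - X_star`, then solves for a single matrix `P` that maps each gradient to its displacement. The least-squares objectives, the sizes (three functions in dimension five) and the use of the pseudoinverse to select one solution are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, m = 5, 3, 7                                   # N <= n functions in dimension n
As = [rng.standard_normal((m, n)) for _ in range(N)]
ys = [rng.standard_normal(m) for _ in range(N)]

X0 = rng.standard_normal((n, N))                    # column i: initial point of f_i
X_star = np.column_stack([np.linalg.lstsq(A, y, rcond=None)[0]
                          for A, y in zip(As, ys)])
G = np.column_stack([A.T @ (A @ X0[:, i] - y)       # column i: gradient of f_i at its initial point
                     for i, (A, y) in enumerate(zip(As, ys))])

# With linearly independent columns, pinv(G) @ G is the identity, so P @ G = X0 - X_star.
P = (X0 - X_star) @ np.linalg.pinv(G)
X1 = X0 - P @ G                                     # one preconditioned step for every function
print(np.linalg.norm(X1 - X_star))
```

When the number of functions equals the dimension, `G` is square and invertible, and the pseudoinverse reduces to the ordinary inverse.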
4.3 Scalar step-size
Consider now the case where we learn scalars in (P1) such that .
Proposition 4.6.
If each can be written as a least-squares function (4.0.1), then can be given as
(4.3.1) |
Note that in the case , this reduces to exact line search for least-squares functions.
Proof.
In this case, we wish to calculate the optimal greedy scalar step size , such that
(4.3.2) |
Then we take
(4.3.3) | ||||
(4.3.4) |
Then (4.0.2) reduces to
(4.3.5) |
∎
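A minimal sketch of the greedy scalar step size follows, assuming $f_i(x) = \tfrac{1}{2}\|A_i x - y_i\|^2$. Under that assumption, setting the derivative of the mean objective with respect to the scalar to zero gives the ratio of summed squared gradient norms to summed squared norms of $A_i \nabla f_i$. The function name and the random data are hypothetical, and the formula in the code is derived under the stated assumption rather than copied from (4.3.1).

```python
import numpy as np

def greedy_scalar_step(As, grads):
    """Minimise sum_i 0.5 * ||A_i (x_i - a * g_i) - y_i||^2 over the scalar a."""
    numerator = sum(g @ g for g in grads)
    denominator = sum(np.linalg.norm(A @ g) ** 2 for A, g in zip(As, grads))
    return numerator / denominator

rng = np.random.default_rng(3)
N, m, n = 4, 6, 3
As = [rng.standard_normal((m, n)) for _ in range(N)]
ys = [rng.standard_normal(m) for _ in range(N)]
xs = [rng.standard_normal(n) for _ in range(N)]
grads = [A.T @ (A @ x - y) for A, x, y in zip(As, xs, ys)]

alpha = greedy_scalar_step(As, grads)

def mean_objective(a):
    """Mean objective after one gradient step with step size a."""
    return np.mean([0.5 * np.linalg.norm(A @ (x - a * g) - y) ** 2
                    for A, x, g, y in zip(As, xs, grads, ys)])

print(alpha, mean_objective(alpha))
```

With a single function the same expression is the classical exact line search step for a quadratic.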
5 Approximating optimal parameters
In the general case, where the functions are not all least-squares functions, a closed-form solution is not available for choosing , , in (P1)-(P3). Instead, we require an optimisation algorithm to approximate these quantities. With knowledge of
-
•
, and
-
•
, the Lipschitz constant of ,
one can use a first-order convex optimisation algorithm, such as gradient descent, FISTA, or stochastic methods (especially for large ), to approximate . For example, one can start at an initial guess at iteration and update via gradient descent
(5.0.1) |
The following result illustrates how these values can be calculated.
Proposition 5.1.
For a general affine preconditioner , the gradient of with respect to can be calculated as
(5.0.2) |
and is -smooth, where
(5.0.3) |
Proof.
As
(5.0.4) |
then by the chain rule
(5.0.5) |
as required. To calculate the smoothness constant, we have
(5.0.6) | ||||
(5.0.7) | ||||
(5.0.8) | ||||
(5.0.9) | ||||
(5.0.10) |
Due to the properties of the triangle inequality, the Cauchy-Schwarz inequality and the operator norm, this bound is tight. Therefore the Lipschitz constant of is given by
(5.0.12) |
as required. ∎
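The chain-rule gradient can be checked against finite differences. The sketch below uses the diagonal parametrisation and a smooth convex objective that is not a least-squares function, namely a least-squares term plus a sum of log-cosh terms. The objective, the names `F` and `grad_F` for the greedy training objective and its gradient, and the random data are all assumptions for illustration.

```python
import numpy as np

def f(x, A, y):
    """Smooth convex test function: least squares plus sum of log(cosh(x_j))."""
    return 0.5 * np.sum((A @ x - y) ** 2) + np.sum(np.logaddexp(x, -x) - np.log(2.0))

def grad_f(x, A, y):
    return A.T @ (A @ x - y) + np.tanh(x)

rng = np.random.default_rng(5)
N, m, n = 3, 6, 4
data = [(rng.standard_normal((m, n)), rng.standard_normal(m), rng.standard_normal(n))
        for _ in range(N)]
grads = [grad_f(x, A, y) for A, y, x in data]       # gradients at the current iterates

def F(theta):
    """Mean objective after the step x - diag(theta) @ g, which is affine in theta."""
    return np.mean([f(x - g * theta, A, y) for (A, y, x), g in zip(data, grads)])

def grad_F(theta):
    """Chain rule: the Jacobian of x - g * theta with respect to theta is -diag(g)."""
    return -np.mean([g * grad_f(x - g * theta, A, y)
                     for (A, y, x), g in zip(data, grads)], axis=0)

theta = 0.01 * rng.standard_normal(n)
eps = 1e-6
finite_diff = np.array([(F(theta + eps * e) - F(theta - eps * e)) / (2 * eps)
                        for e in np.eye(n)])
print(np.max(np.abs(finite_diff - grad_F(theta))))
```

A gradient routine of this form is all that a first-order method needs in order to approximate the greedy parameters.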
With this result, we can now see how to approximate the optimal diagonal and full matrix preconditioners, and the optimal scalar step size.
Corollary 5.1.
Suppose each .
Diagonal preconditioning
For diagonal preconditioning, gives and .
Then by (5.0.2)
(5.0.13) |
and the Lipschitz constant of is given by
(5.0.14) |
Full matrix preconditioning
In this case we have , has a corresponding matrix , such that .
The gradient of is given by
(5.0.15) |
and the Lipschitz constant of is given by
(5.0.16) |
Scalar step size
We now take .
The derivative of with respect to is given by
(5.0.17) |
and the Lipschitz constant of is given by
(5.0.18) |
Proof.
Diagonal preconditioning
In this case, we have and that
-
•
, and
-
•
.
Therefore, is given by
(5.0.19) |
and the smoothness constant of is
(5.0.20) |
Full matrix preconditioning
For the proof of full matrix preconditioning, we require the following lemmas.
Lemma 5.1.
Let . Then
(5.0.21) |
Proof.
Note that
(5.0.22) |
then
(5.0.23) | ||||
(5.0.24) | ||||
(5.0.25) |
∎
Lemma 5.2.
For vectors ,
(5.0.26) |
Proof.
(5.0.27) | ||||
(5.0.28) | ||||
(5.0.29) |
∎
Lemma 5.3.
(5.0.30) |
In the case of full matrix preconditioning, we have that
-
•
, and
-
•
.
Therefore, the gradient is given by
(5.0.33) | ||||
(5.0.34) |
the smoothness constant of is given by
(5.0.35) |
where the last equality is as a result of Lemma 5.3.
Scalar step size
In this case, we have and that
-
•
, and
-
•
.
Therefore, is given by
(5.0.36) | ||||
(5.0.37) |
First, note that
(5.0.38) |
and therefore, the smoothness constant of is
(5.0.39) |
∎
5.1 Convolutional preconditioning
We now introduce convolution preconditioning, which enables the preconditioner to consider local information instead of information at each pixel individually, unlike diagonal preconditioning.
Let (therefore the corresponding dimension is given by ). Define a convolution kernel and define . Define
(5.1.1) |
where, for ,
(5.1.2) |
The convolution at coordinate is given by
(5.1.3) | ||||
(5.1.4) |
Notice that the convolution is linear in its parameters, and so the optimisation problem given by
(5.1.5) |
for fixed is convex. The following proposition provides the gradient and smoothness constant of using the convolutional parametrisation.
Proposition 5.2.
Firstly, define
(5.1.6) |
Finally, denote by the image translated by pixels down and pixels right, in other words, for an image
(5.1.7) |
Then
(5.1.8) |
Then the gradient of with respect to is given by
(5.1.9) |
Furthermore, an upper bound for the smoothness constant of is given by
(5.1.10) |
where represents the Frobenius norm of a matrix.
Proof.
We have that
(5.1.12) | ||||
(5.1.13) |
Furthermore, an upper bound for the smoothness constant of is given by
(5.1.15) | ||||
(5.1.16) | ||||
(5.1.17) | ||||
(5.1.18) |
∎
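The fact that the convolution is linear in its kernel, and that the gradient with respect to each kernel entry is an inner product with a translated image, can be sketched as follows. This illustration assumes zero padding at the image boundary, a square kernel of odd size indexed from its centre, and the simple objective $f(x) = \tfrac{1}{2}\|x - y\|^2$; the helper names `shift`, `convolve` and `kernel_gradient` are hypothetical, and the boundary handling may differ from the one used in the paper.

```python
import numpy as np

def shift(img, a, b):
    """Translate img by a pixels down and b pixels right, filling with zeros."""
    out = np.zeros_like(img)
    h, w = img.shape
    src = img[max(0, -a):h - max(0, a), max(0, -b):w - max(0, b)]
    out[max(0, a):max(0, a) + src.shape[0], max(0, b):max(0, b) + src.shape[1]] = src
    return out

def convolve(kernel, img):
    """Zero-padded convolution written as a kernel-weighted sum of shifted images."""
    r = kernel.shape[0] // 2
    out = np.zeros_like(img, dtype=float)
    for a in range(-r, r + 1):
        for b in range(-r, r + 1):
            out += kernel[a + r, b + r] * shift(img, a, b)
    return out

rng = np.random.default_rng(3)
x = rng.standard_normal((6, 6))          # current iterate (as an image)
y = rng.standard_normal((6, 6))
g = x - y                                # gradient of f(x) = 0.5 * ||x - y||^2
kernel = 0.1 * rng.standard_normal((3, 3))

def objective(k):
    """Objective value after the convolutional step x - k * g."""
    return 0.5 * np.sum((x - convolve(k, g) - y) ** 2)

def kernel_gradient(k):
    """Entry (a, b) is minus the inner product of shift(g, a, b) with grad f at the new point."""
    residual = x - convolve(k, g) - y
    r = k.shape[0] // 2
    return -np.array([[np.sum(shift(g, a, b) * residual) for b in range(-r, r + 1)]
                      for a in range(-r, r + 1)])

print(kernel_gradient(kernel).shape)
```

Because the update is a weighted sum of fixed shifted images, the objective is a convex function of the kernel whenever $f$ is convex.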
6 Convergence results
The following results are required before introducing the convergence results of our learned preconditioning.
Lemma 6.1.
Suppose that each are -smooth then define
(6.0.1) |
for
(6.0.2) |
Then is -smooth, with
(6.0.3) |
where .
Proof.
We have
(6.0.4) |
and for any ,
(6.0.5) | ||||
(6.0.6) | ||||
(6.0.7) |
Then
(6.0.8) | ||||
(6.0.9) | ||||
(6.0.10) | ||||
(6.0.11) | ||||
(6.0.12) |
Therefore, is -smooth, where
(6.0.13) |
∎
Lemma 6.2.
Suppose each are -strongly convex. Then is -strongly convex, with
(6.0.14) |
where .
Proof.
It is sufficient to show that is convex, and that this is the smallest such constant. We have
(6.0.15) | |||
(6.0.16) |
Notice that
(6.0.17) |
is convex for all , as each is -strongly convex and . This property would no longer hold if we chose a constant such that . Therefore (6.0.15) is convex and so is -strongly convex. ∎
Firstly, define
(6.0.18) |
The following result shows that our parametrisations generalise gradient descent with a constant step size given by . This property will be used to prove the convergence rate of our learned preconditioners on the training set.
Lemma 6.3.
For all parametrisations (P1-P4) in Table 1, there exists such that
(6.0.19) |
Proof.
-
1.
For scalar step sizes, , take .
-
2.
For diagonal preconditioning, , take .
-
3.
For full matrix preconditioning, , take
(6.0.20) -
4.
For convolutional preconditioning, , take
(6.0.21)
∎
Theorem 6.1.
Convergence in training set algorithm.
Assume that is bounded below for all . Then, for any learned optimisation algorithm such that
(6.0.22) |
we have that
(6.0.23) |
Furthermore, if we denote
(6.0.24) |
Then
(6.0.25) |
If, in addition, each is -strongly convex, then we have linear convergence given by
(6.0.26) |
Proof.
We have
(6.0.27) | ||||
(6.0.28) | ||||
(6.0.29) | ||||
(6.0.30) | ||||
(6.0.31) |
for
(6.0.32) |
is -smooth as each is -smooth and -strongly convex if each is -strongly convex, where
(6.0.33) | ||||
(6.0.34) |
and therefore, using standard convergence rate results of gradient descent [17], we have
(6.0.35) |
as is smooth, and if is also -strongly convex we have
(6.0.36) |
In both cases, we have that , meaning that as , which implies that as for all . ∎
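The key step of this argument, that a greedy update decreases the mean training objective at least as much as a gradient step with step size $1/L$ from the same iterates, can be observed in a small experiment. The sketch below trains greedy scalar step sizes on random least-squares functions and records both values at every iteration. The least-squares objectives, the closed-form step used in the loop and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
N, m, n, T = 5, 10, 4, 20
As = [rng.standard_normal((m, n)) for _ in range(N)]
ys = [rng.standard_normal(m) for _ in range(N)]
xs = [rng.standard_normal(n) for _ in range(N)]

# Common smoothness constant: largest squared spectral norm over the dataset.
L = max(np.linalg.norm(A, 2) ** 2 for A in As)

def mean_objective(points):
    return np.mean([0.5 * np.linalg.norm(A @ x - y) ** 2
                    for A, x, y in zip(As, points, ys)])

learned_values = [mean_objective(xs)]    # mean objective along the learned iterates
gd_values = []                           # value a 1/L gradient step would give instead
for t in range(T):
    grads = [A.T @ (A @ x - y) for A, x, y in zip(As, xs, ys)]
    # Greedy scalar step: exact minimiser of the mean objective over one scalar.
    alpha = (sum(g @ g for g in grads)
             / sum(np.linalg.norm(A @ g) ** 2 for A, g in zip(As, grads)))
    gd_values.append(mean_objective([x - g / L for x, g in zip(xs, grads)]))
    xs = [x - alpha * g for x, g in zip(xs, grads)]
    learned_values.append(mean_objective(xs))

print(learned_values[0], learned_values[-1])
```

Since the step size $1/L$ is itself a feasible scalar, the greedy value can never be worse, and the mean objective is non-increasing along the learned iterates.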
Note that this result gives a worst-case convergence bound over the training functions; nevertheless, provable convergence is still obtained. Also, note that this is not an issue for a function class with constant smoothness and strong convexity parameters. Furthermore, although this convergence bound is weak, it is very likely that one can far exceed this rate when learning is applied to a specific class of functions.
We have proved convergence for the mean of our training functions. The following proposition proves the same convergence rate for each function in our training set.
Proposition 6.1.
Suppose we have a convergence rate for of
(6.0.37) |
Then the convergence rate for some , is given by
(6.0.38) |
where
(6.0.39) |
is constant in .
Proof.
(6.0.37):
Note that we may write
Therefore, using our convergence rate on gives
(6.0.40) |
which implies
(6.0.41) | ||||
(6.0.42) |
Note that
(6.0.43) |
is constant in . Then
(6.0.44) |
Let be given by
(6.0.45) |
Then
(6.0.46) |
∎
7 Numerical example
We now consider an image deblurring problem, with forward operator given by a Gaussian blur with . We take the following:
-
•
Ground truth data , where is the set of pixel MNIST images [8].
-
•
For , generate an observation , where is noise sampled from a zero-mean Gaussian distribution, with .
![Refer to caption](extracted/5633817/values/obs.png)
![Refer to caption](extracted/5633817/values/recon.png)
Figure 1 shows an example observation and ground-truth pair. For the purpose of recovering an approximation to using our observation , we formulate a minimisation problem given by
(7.0.1) |
with , and given as the Huber regularised Total Variation [13, 19] with , defined as
(7.0.2) |
with the Huber loss defined as
(7.0.3) |
and the gradient operator equal to
(7.0.4) | |||
(7.0.5) |
Then, each function is -smooth, where [5]
(7.0.6) |
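For readers who want to experiment with this regulariser, a scalar Huber function and its derivative can be sketched as below. The sketch assumes the common convention in which the function is quadratic on $[-\epsilon, \epsilon]$ and linear outside, scaled so that the derivative is clipped to $[-1, 1]$; the definition in (7.0.3) may use a different scaling, and the value of `eps` and the function names are hypothetical.

```python
import numpy as np

def huber(s, eps):
    """Huber function: s^2 / (2 eps) for |s| <= eps, and |s| - eps / 2 otherwise."""
    s = np.asarray(s, dtype=float)
    return np.where(np.abs(s) <= eps, s ** 2 / (2 * eps), np.abs(s) - eps / 2)

def huber_grad(s, eps):
    """Derivative of the Huber function: s / eps clipped to [-1, 1]."""
    return np.clip(np.asarray(s, dtype=float) / eps, -1.0, 1.0)

eps = 0.01                               # assumed smoothing parameter
s = np.linspace(-0.05, 0.05, 21)
print(huber(s, eps))
print(huber_grad(s, eps))
```

Under this convention the derivative is Lipschitz with constant $1/\epsilon$, which is why a smaller smoothing parameter yields a larger smoothness constant for the overall objective.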
Learning preconditioners :
Let
(7.0.7) |
then define the training set and two testing sets by the following:
-
•
Training set: Image set of MNIST ones, with .
-
•
Testing set : Image set of MNIST ones not in training set, with .
-
•
Testing set : Image set of MNIST digits in , with .
The learned convolutional kernel is chosen to be of size .
At iteration , for a parametrisation , we learn via the following procedure: for a tolerance , we use the stopping condition given by
(7.0.8) |
at some sub-iteration , for defined in (6.0.19) and as in (5.0.1). A stopping iteration is also used, such that if the optimisation algorithm has not terminated due to the criterion (7.0.8), set .
Preconditioners are learned up to iteration , such that we learn preconditioners , for our parameterisations (P1-P4) in Table 1.
Comparison of learned parametrisations with classical hand-crafted optimisation algorithms
In Figure 2, we see the performance of the learned preconditioners over the first iterations against gradient descent with a constant step size equal to . In particular, we see that, despite the diagonal preconditioner and the convolutional preconditioner having the same number of parameters, the convolutional preconditioner dramatically outperforms the diagonal preconditioner in this numerical experiment. In this case, there is evidence that adding local information is more important than adding pixel-specific flexibility.
![Refer to caption](extracted/5633817/FinalImgs/conv-learned.png)
Furthermore, we see more comparisons in Figure 3. For example, in the left image, we compare our learned preconditioning with FISTA and BFGS. The learned methods initially significantly outperform these hand-crafted methods. However, the hand-crafted methods overtake the learned methods at later iterations, as shown in Figure 4. Figure 5 shows that these later iterations may be of little importance, as the images at these iterations are very similar to those at slightly higher objective values.
![Refer to caption](extracted/5633817/FinalImgs/conv-all-val.png)
![Refer to caption](extracted/5633817/FinalImgs/conv-all-others.png)
![Refer to caption](extracted/5633817/FinalImgs/performance_vs_BFGS_val.png)
Method | Iteration 10 | Iteration 20 | Iteration 50 | Iteration 90 |
---|---|---|---|---|
Full | ![Refer to caption](extracted/5633817/FinalImgs/full_ones_10.png) | ![Refer to caption](extracted/5633817/FinalImgs/full_ones_20.png) | ![Refer to caption](extracted/5633817/FinalImgs/full_ones_50.png) | ![Refer to caption](extracted/5633817/FinalImgs/full_ones_50.png) |
Convolution | ![Refer to caption](extracted/5633817/FinalImgs/conv_ones_10.png) | ![Refer to caption](extracted/5633817/FinalImgs/conv_ones_20.png) | ![Refer to caption](extracted/5633817/FinalImgs/conv_ones_50.png) | ![Refer to caption](extracted/5633817/FinalImgs/conv_ones_90.png) |
BFGS | ![Refer to caption](extracted/5633817/FinalImgs/BFGS_ones_10.png) | ![Refer to caption](extracted/5633817/FinalImgs/BFGS_ones_20.png) | ![Refer to caption](extracted/5633817/FinalImgs/BFGS_ones_50.png) | ![Refer to caption](extracted/5633817/FinalImgs/BFGS_ones_90.png) |
Freeze vs recycle
Now we compare whether freezing our final preconditioner or recycling all learned preconditioners is favourable. Figure 6 shows that recycling preconditioners can produce more unstable behaviour, for example, in the case of the learned step sizes in blue and the learned full matrix in green. In the case of the learned full matrix, however, the final performance is better when recycling preconditioners. On the other hand, when freezing the final preconditioner, the learned diagonal preconditioner leads to divergence. Therefore, it is not obvious which choice is better; it depends on which parameterisation is under consideration.
![Refer to caption](extracted/5633817/FinalImgs/freeze_vs_recycle.png)
Learned preconditioners:
Now, we visualise the learned parameters for convolutional and diagonal preconditioning and the learned step sizes. We see learned convolutional kernels in Figures 8 and 9. Note that these learned kernels contain negative values. While this does not necessarily imply that the corresponding learned matrices are not positive-definite, in Figures 11 and 12, we see that the learned diagonals have negative values, meaning that the learned diagonal matrices are not positive-definite.
![[Uncaptioned image]](extracted/5633817/FinalImgs/iter0_kernel.png)
![[Uncaptioned image]](extracted/5633817/FinalImgs/iter1_kernel.png)
![[Uncaptioned image]](extracted/5633817/FinalImgs/iter2_kernel.png)
![[Uncaptioned image]](extracted/5633817/FinalImgs/iter3_kernel.png)
![Refer to caption](extracted/5633817/FinalImgs/iter4_kernel.png)
![Refer to caption](extracted/5633817/FinalImgs/iter5_kernel.png)
![Refer to caption](extracted/5633817/FinalImgs/iter6_kernel.png)
![Refer to caption](extracted/5633817/FinalImgs/iter7_kernel.png)
![Refer to caption](extracted/5633817/FinalImgs/iter10_kernel.png)
![Refer to caption](extracted/5633817/FinalImgs/iter25_kernel.png)
![Refer to caption](extracted/5633817/FinalImgs/iter50_kernel.png)
![Refer to caption](extracted/5633817/FinalImgs/iter75_kernel.png)
![[Uncaptioned image]](extracted/5633817/FinalImgs/iter0_diagonal.png)
![[Uncaptioned image]](extracted/5633817/FinalImgs/iter1_diag.png)
![[Uncaptioned image]](extracted/5633817/FinalImgs/iter2_diag.png)
![[Uncaptioned image]](extracted/5633817/FinalImgs/iter3_diag.png)
![Refer to caption](extracted/5633817/FinalImgs/iter4_diag.png)
![Refer to caption](extracted/5633817/FinalImgs/iter5_diag.png)
![Refer to caption](extracted/5633817/FinalImgs/iter6_diag.png)
![Refer to caption](extracted/5633817/FinalImgs/iter7_diag.png)
![Refer to caption](extracted/5633817/FinalImgs/iter10_diag.png)
![Refer to caption](extracted/5633817/FinalImgs/iter25_diag.png)
![Refer to caption](extracted/5633817/FinalImgs/iter50_diag.png)
![Refer to caption](extracted/5633817/FinalImgs/iter75_diag.png)
Figure 13 shows the learned step sizes, which often fluctuate from one iteration to the next. Furthermore, the learned step sizes are initially all above the threshold for provable convergence. At later iterations, however, they fluctuate above and below this threshold.
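To illustrate why step sizes above a convergence threshold need not prevent convergence when they vary across iterations, here is a small sketch. It assumes the threshold in question is the classical bound 2/L for gradient descent on an L-smooth convex function, and it uses a toy two-dimensional quadratic that is not taken from the experiments.

```python
import numpy as np

# Toy objective f(x) = 0.5 * x^T diag(eigenvalues) x, so L is the largest eigenvalue.
eigenvalues = np.array([1.0, 10.0])
L = eigenvalues.max()
threshold = 2.0 / L  # assumed classical bound for constant step sizes

def gradient(x):
    return eigenvalues * x

def run_gradient_descent(x0, step_sizes):
    x = np.array(x0, dtype=float)
    for alpha in step_sizes:
        x = x - alpha * gradient(x)
    return x

x0 = [1.0, 1.0]
# First step 1.0 exceeds 2/L = 0.2 but removes the error along the small eigenvalue;
# the second step 0.1 = 1/L then removes the remaining error.
varying = run_gradient_descent(x0, [1.0, 0.1])
# Keeping the large step at every iteration diverges along the large eigenvalue.
constant = run_gradient_descent(x0, [1.0, 1.0])

print("threshold 2/L:", threshold)
print("varying steps:", varying)
print("constant large step:", constant)
```

The point of the sketch is only that a bound derived for constant step sizes does not characterise every convergent step-size sequence.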
![Refer to caption](extracted/5633817/FinalImgs/learned-steps.png)
8 Conclusions and future work
8.1 Conclusions
This paper introduced a method for learning preconditioners for gradient descent on ill-conditioned optimisation problems. We formulated how to generate a sequence of preconditioners, learned by solving a convex optimisation problem on a dataset, such that
- the preconditioners need not be positive-definite or symmetric,
- the preconditioner at each iteration is constant over all functions,
- the preconditioners have a closed-form expression for least-squares problems,
- convergence is guaranteed for all functions in the training set, with proven convergence rates for each training function.

Empirical performance was good, especially for convolutional preconditioning in the image deblurring example, and was maintained on out-of-distribution test images.
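The properties listed above can be illustrated with a small sketch of greedy learning for a diagonal preconditioner on least-squares functions. It rests on assumptions that are not confirmed here: that the update is x - p * grad f(x) with a shared vector p per iteration, and that the training objective at each iteration is the sum of the function values after that one step. The random problems and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, num_functions = 5, 8, 4

# Hypothetical training set: f_k(x) = 0.5 * ||A_k x - y_k||^2.
problems = [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for _ in range(num_functions)]

def mean_loss(iterates):
    return np.mean([0.5 * np.sum((A @ x - y) ** 2)
                    for (A, y), x in zip(problems, iterates)])

def learn_diagonal(iterates):
    """Solve min_p sum_k f_k(x_k - p * grad f_k(x_k)), which is linear least squares in p."""
    blocks, residuals = [], []
    for (A, y), x in zip(problems, iterates):
        r = A @ x - y          # current residual
        g = A.T @ r            # gradient of f_k at x
        blocks.append(A * g)   # equals A @ diag(g): column j of A scaled by g[j]
        residuals.append(r)
    p, *_ = np.linalg.lstsq(np.vstack(blocks), np.concatenate(residuals), rcond=None)
    return p

iterates = [np.zeros(n) for _ in problems]
losses = [mean_loss(iterates)]
for t in range(3):
    p_t = learn_diagonal(iterates)  # one shared diagonal for all functions at iteration t
    iterates = [x - p_t * (A.T @ (A @ x - y)) for (A, y), x in zip(problems, iterates)]
    losses.append(mean_loss(iterates))

print("mean training loss per iteration:", losses)
```

Under these assumptions the mean training loss cannot increase, since p = 0 (no update) is always a feasible choice; nothing in the sketch constrains the entries of p to be positive.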
8.2 Future work
In the numerical experiments, despite the strong early-iteration performance of full-matrix and convolutional preconditioning, FISTA eventually overtook the learned algorithms. Future research aims to extend these learned methods to include momentum terms. One way of achieving this is to extend the minimisation problem (3.0.5) to include a momentum preconditioner.
One potential drawback of the learned preconditioning is not knowing , which limits the convergence analysis on unseen data. To remedy this, one can introduce regularisation of the learned parameters. Another use of regularisation is to encourage preconditioners to exhibit specific behaviours. One may encourage preconditioners to exhibit symmetry or smoothness, for example.
This paper considered only diagonal matrices, full matrices, and convolutions, which is not an exhaustive list of potential parametrisations. Despite learning significantly fewer parameters, convolution may offer a more expressive update than the diagonal preconditioner. One could also consider an update similar to Quasi-Newton methods.
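One intuition for the comparison above is that a convolution mixes neighbouring gradient entries, whereas a diagonal matrix can only rescale each entry in place. The sketch below illustrates this together with the parameter counts; the image size, kernel size, and periodic boundary handling are hypothetical choices, not those of the experiments.

```python
import numpy as np

height, width, kernel_size = 32, 32, 5  # hypothetical sizes
n = height * width
parameter_counts = {
    "full": n * n,
    "diagonal": n,
    "convolution": kernel_size ** 2,
}

def apply_diagonal(diagonal, grad):
    # Each pixel of the gradient is scaled independently.
    return diagonal * grad

def apply_convolution(kernel, grad):
    """Periodic 2D convolution of grad with a small centred kernel, built from shifts."""
    half = kernel.shape[0] // 2
    out = np.zeros_like(grad)
    for i in range(kernel.shape[0]):
        for j in range(kernel.shape[1]):
            out += kernel[i, j] * np.roll(grad, (i - half, j - half), axis=(0, 1))
    return out

# A gradient that is nonzero at a single pixel.
impulse = np.zeros((height, width))
impulse[16, 16] = 1.0

diag_update = apply_diagonal(np.ones((height, width)), impulse)
conv_update = apply_convolution(np.ones((kernel_size, kernel_size)), impulse)

print(parameter_counts)
print("pixels changed by diagonal update:", np.count_nonzero(diag_update))
print("pixels changed by convolutional update:", np.count_nonzero(conv_update))
```

With these sizes the convolution has far fewer parameters than the diagonal, yet its update reaches a whole neighbourhood of the nonzero gradient pixel.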
This paper only considered an explicit dataset of functions, equivalent to sampling from the class of functions using a Dirac delta distribution. One can instead consider sampling from a class of functions using an arbitrary probability distribution.
Finally, in this paper, we only considered learning a matrix preconditioner. Future work would extend this to consider preconditioners as a function of and , for example.
References
- [1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, and Nando De Freitas. Learning to learn by gradient descent by gradient descent. Advances in neural information processing systems, 29, 2016.
- [2] Sebastian Banert, Axel Ringh, Jonas Adler, Johan Karlsson, and Ozan Oktem. Data-driven nonsmooth optimization. SIAM Journal on Optimization, 30(1):102–131, 2020.
- [3] Amir Beck. Introduction to nonlinear optimization: Theory, algorithms, and applications with MATLAB. SIAM, 2014.
- [4] Silvia Bonettini, Ignace Loris, Federica Porta, and Marco Prato. Variable metric inexact line-search-based methods for nonsmooth optimization. SIAM journal on optimization, 26(2):891–921, 2016.
- [5] Antonin Chambolle and Thomas Pock. An introduction to continuous optimization for imaging. Acta Numerica, 25:161–319, 2016.
- [6] Tianlong Chen, Xiaohan Chen, Wuyang Chen, Howard Heaton, Jialin Liu, Zhangyang Wang, and Wotao Yin. Learning to optimize: A primer and a benchmark. Journal of Machine Learning Research, 23(189):1–59, 2022.
- [7] William C Davidon. Variable metric method for minimization. SIAM Journal on optimization, 1(1):1–17, 1991.
- [8] Li Deng. The mnist database of handwritten digit images for machine learning research [best of the web]. IEEE signal processing magazine, 29(6):141–142, 2012.
- [9] Matthias J Ehrhardt, Pawel Markiewicz, and Carola-Bibiane Schönlieb. Faster pet reconstruction with non-smooth priors by randomization and preconditioning. Physics in Medicine & Biology, 64(22):225019, 2019.
- [10] Roger Fletcher. A new approach to variable metric algorithms. The computer journal, 13(3):317–322, 1970.
- [11] Paul Häusner, Ozan Öktem, and Jens Sjölund. Neural incomplete factorization: learning preconditioners for the conjugate gradient method. arXiv preprint arXiv:2305.16368, 2023.
- [12] Tao Hong, Xiaojian Xu, Jason Hu, and Jeffrey A Fessler. Provable preconditioned plug-and-play approach for compressed sensing mri reconstruction. arXiv preprint arXiv:2405.03854, 2024.
- [13] Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics: Methodology and distribution, pages 492–518. Springer, 1992.
- [14] Kirsten Koolstra and Rob Remis. Learning a preconditioner to accelerate compressed sensing reconstructions in mri. Magnetic Resonance in Medicine, 87(4):2063–2073, 2022.
- [15] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
- [16] Vishal Monga, Yuelong Li, and Yonina C Eldar. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. IEEE Signal Processing Magazine, 38(2):18–44, 2021.
- [17] Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018.
- [18] Jorge Nocedal and Stephen J Wright. Numerical optimization. Springer, 1999.
- [19] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
- [20] Hong Ye Tan, Subhadip Mukherjee, Junqi Tang, and Carola-Bibiane Schönlieb. Data-driven mirror descent with input-convex neural networks. SIAM Journal on Mathematics of Data Science, 5(2):558–587, 2023.
- [21] Singanallur V Venkatakrishnan, Charles A Bouman, and Brendt Wohlberg. Plug-and-play priors for model based reconstruction. In 2013 IEEE global conference on signal and information processing, pages 945–948. IEEE, 2013.
Appendix A Nonconvexity for multiple steps
This can be seen as
(A.0.1) | ||||
(A.0.2) |
is non-convex in general. In particular, if we wish to learn a single step-size for a least-squares function , then and
(A.0.4) | ||||
(A.0.5) |
and we see we get the composition of a quadratic function in with , which is not convex in general. If we consider convex and define , then assuming that is convex, we have
(A.0.7) |
a contradiction.
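Two small worked examples may help make the argument concrete. Both are illustrative sketches: the symbols $g$, $h$, $\alpha$, $\alpha_0$, $\alpha_1$ and the specific functions are assumptions chosen for the example and are not the notation of the equations above. The first shows that composing a convex function with a quadratic map can fail to be convex; the second shows non-convexity of the two-step objective in two separate step sizes for a scalar quadratic with $x_0 \neq 0$.

```latex
% Example 1 (assumed choice): g(z) = (z - 1)^2 is convex, and h(\alpha) = g(\alpha^2).
\[
h(\alpha) = (\alpha^2 - 1)^2, \qquad
h(0) = 1 \;>\; 0 = \tfrac{1}{2}\,h(-1) + \tfrac{1}{2}\,h(1),
\]
% so h violates the convexity inequality at the midpoint of -1 and 1.

% Example 2 (assumed choice): f(x) = x^2 / 2, two gradient steps with step sizes
% \alpha_0 and \alpha_1, i.e. x_{t+1} = x_t - \alpha_t f'(x_t) = (1 - \alpha_t) x_t.
\[
x_2 = (1-\alpha_1)(1-\alpha_0)\,x_0, \qquad
f(x_2) = \tfrac{1}{2}\,(1-\alpha_0)^2 (1-\alpha_1)^2\, x_0^2 .
\]
% The objective vanishes at (\alpha_0,\alpha_1) = (1,0) and (0,1),
% but is strictly positive at their midpoint (1/2, 1/2):
\[
f(x_2)\big|_{(1,0)} = f(x_2)\big|_{(0,1)} = 0
\;<\; \tfrac{1}{32}\,x_0^2 = f(x_2)\big|_{(1/2,\,1/2)} .
\]
```

In both cases a convex combination of two points gives a larger value than the corresponding combination of the function values, which is exactly the failure of convexity.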