Sampling Strategies in Bayesian Inversion: A Study of RTO and Langevin Methods

†This work was funded by a Villum Investigator grant (no. 25893) from the Villum Foundation.

Rémi Laumont
Department of Applied Mathematics and Computer Science
Technical University of Denmark
Kongens Lyngby, 2800
[email protected]

Yiqiu Dong
Department of Applied Mathematics and Computer Science
Technical University of Denmark
Kongens Lyngby, 2800
[email protected]

Martin Skovgaard Andersen
Department of Applied Mathematics and Computer Science
Technical University of Denmark
Kongens Lyngby, 2800
[email protected]
Corresponding author

Abstract

This paper studies two classes of sampling methods for the solution of inverse problems, namely Randomize-Then-Optimize (RTO), which is rooted in sensitivity analysis, and Langevin methods, which are rooted in the Bayesian framework. The two classes of methods correspond to different assumptions and yield samples from different target distributions. We highlight the main conceptual and theoretical differences between the two approaches and compare them from a practical point of view by tackling two classical inverse problems in imaging: deblurring and inpainting. We show that the choice of the sampling method has a significant impact on the quality of the reconstruction and that the RTO method is more robust to the choice of the parameters.

Keywords Inverse problems, sampling, RTO, Langevin methods, deblurring, inpainting, parameter selection

1 Introduction

A typical inverse problem in imaging is to retrieve an image $x\in\mathbb{R}^{d}$ from a degraded observation $y\in\mathbb{R}^{m}$ . A common observation model is the additive noise model, which can be formulated as

y=\mathcal{A}(x)+n,

(1)

where $\mathcal{A}\colon\mathbb{R}^{d}\to\mathbb{R}^{m}$ is a so-called forward operator that models the deterministic aspects of the observation process. The term $n$ represents additive noise, and we will henceforth assume that $n$ is zero-mean Gaussian white noise with covariance $\sigma^{2}I$ . We will also assume that the forward operator $\mathcal{A}$ is known and linear, which allows us to express the measurement model as $y=Ax+n$ with $A\in\mathbb{R}^{m\times d}$ .

The problem of estimating $x$ from the observation vector $y$ is often ill-posed. Generally speaking, this means that the observation may not correspond to a unique reconstruction or that the reconstruction is very sensitive to perturbations of $y$ . Regularization in the form of prior information on $x$ is then necessary to obtain a well-posed problem [18]. In the variational framework, the reconstruction problem takes the form of an optimization problem,

x^{*}\in\operatorname*{argmin}_{x}\,\{f(Ax,y)+g(x)\},

(2)

where $x^{*}$ denotes the reconstruction. The objective function consists of a data-fidelity term $f(Ax,y)$ and a regularization term $g(x)$. The data-fidelity term measures the discrepancy between the observation $y$ and the model output $Ax$, whereas the regularization term $g(x)$ penalizes images with undesirable properties, e.g., images outside a neighborhood of some sub-manifold. In many cases, the minimization problem in (2) can be solved efficiently using advanced optimization methods. However, a solution to the problem in (2) does not provide information about the inherent uncertainty that may arise because of noisy measurements, discretization, and/or model errors.

Information about uncertainty can be obtained by adopting a probabilistic approach, and the Bayesian framework is a natural choice for this purpose. Assuming that the true image $x$ and the observation $y$ are considered realizations of random variables, the posterior probability density function (pdf) of $x$ given $y$ can be expressed in terms of the pdf $\pi_{\mathbbm{y}|\mathbbm{x}=x}(y)$ , which characterizes the observation model, and a prior pdf $\pi_{\mathbbm{x}}(x)$ , which encodes any prior information we may have about $x$ before observing $y$ . Using Bayes’ formula, we can express the posterior density as

\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)=\frac{\pi_{\mathbbm{x}}(x)\ \pi_{\mathbbm{y}|\mathbbm{x}=x}(y)}{\pi_{\mathbbm{y}}(y)},\qquad\pi_{\mathbbm{y}}(y)=\int_{\mathbb{R}^{d}}\pi_{\mathbbm{x}}(x)\ \pi_{\mathbbm{y}|\mathbbm{x}=x}(y)\,\mathrm{d}x.

(3)

From a Bayesian point of view, the posterior $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$ is the solution to the inverse problem since it provides a complete characterization of the uncertainty. The posterior density can be used to compute point estimates, such as a maximum a posteriori (MAP) estimate or the posterior mean, and to quantify uncertainty in the form of credible intervals or second-order moments.

Although the Bayesian framework is well-established from a theoretical point of view, the posterior density is often intractable and cannot be computed in closed form. Thus, the posterior density is often explored using sampling methods. These generate a set of samples $\{x^{(i)}\}_{i=1}^{N}$ from the posterior density $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$ , allowing us to compute point estimates and quantify uncertainty using Monte Carlo approximations to high-dimensional integrals.
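As a minimal sketch of how such Monte Carlo approximations are typically formed, the snippet below computes a posterior mean, a component-wise standard deviation, and an equal-tailed credible interval from an array of samples. The function name `summarize_samples`, the array layout, and the synthetic Gaussian samples are illustrative assumptions and not part of the methods studied in this paper.

```python
import numpy as np

def summarize_samples(samples, level=0.95):
    """Monte Carlo summaries from an array of samples with shape (N, d)."""
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)          # approximates the posterior mean
    std = samples.std(axis=0, ddof=1)    # approximates the posterior standard deviation
    tail = 100.0 * (1.0 - level) / 2.0
    # Equal-tailed credible interval, computed component-wise.
    lower, upper = np.percentile(samples, [tail, 100.0 - tail], axis=0)
    return mean, std, lower, upper

# Synthetic samples stand in for the output of a sampler (an assumption).
rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, scale=0.5, size=(5000, 3))
mean, std, lower, upper = summarize_samples(samples)
```

The same summaries apply to samples produced by any of the samplers discussed below, provided the samples are stored row-wise.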

The purpose of this paper is to contrast and compare two classes of sampling methods for solving inverse problems within the Bayesian framework, namely Randomize-Then-Optimize (RTO) [4] and Langevin sampling methods [34, 13, 14]. On the one hand, RTO is deeply rooted within the variational framework as it solves a perturbed optimization problem in order to generate one sample. On the other hand, Langevin methods stem from the discretization of a stochastic differential equation (SDE) whose stationary distribution is the target distribution. The two classes of methods are similar in many ways, but they are based on different principles and assumptions, and hence they lead to different densities. This is illustrated in Figure 1, which shows different solution densities obtained from an observation $y=x+n$ where $n$ is standard normal and $x\in[a,b]$ is an unknown parameter. Adopting a uniform prior on $[a,b]$ leads to a truncated Gaussian posterior. The Langevin approach requires smoothness and leads to a smooth approximate truncated Gaussian posterior. In contrast, the RTO approach yields a density that is a mixture of a truncated Gaussian distribution and a distribution on the boundary of the interval. The example highlights some key differences: unlike the smooth approximate truncated Gaussian posterior, the RTO density assigns non-zero probability to the boundary of the interval and has compact support. This RTO density is associated with an implicit prior that is supported on $[a,b]$ and depends on the observation $y$. Consequently, it violates the typical Bayesian assumption that the prior is independent of the observation. Using somewhat unconventional terminology, we will refer to the RTO density as the RTO posterior.

Figure 1: Truncated Gaussian posterior, a smooth approximation, and the RTO density for a Gaussian observation model. The vertical dotted line marks the target mean relative to the observation $y$, which is the mode of the likelihood.

As the example illustrates, the choice of prior has an important impact on the posterior $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$. A good prior is typically related to the nature of the inverse problem. In the Bayesian imaging literature, there are many examples of priors that promote sparsity or piecewise regularity in some transformed domain (e.g., involving the $l_{1}$ norm or the total-variation (TV) pseudo-norm [35, 9, 25, 28]). Priors can also come in the form of a Markov random field [6], a learned prior like a patch-based Gaussian, or a Gaussian mixture model [44, 43, 1, 36, 21]. Choices pertaining to the prior are often informed by the resulting tractability of the posterior [28, 14, 32, 17, 11] or the ability to derive convergence guarantees [28, 14, 32, 17, 11]. Moreover, recent progress on neural networks has spurred an interest in data-driven approaches where a prior is learned from a large dataset $\{x_{i}\}_{i=1}^{N}\sim\pi_{\mathbbm{x}}(x)$ [22, 24, 20, 8, 19].

In the following sections, we compare two sampling techniques: Randomize-Then-Optimize (RTO) and the Moreau–Yoshida Unadjusted Langevin Algorithm (MYULA) [14]. The latter finds frequent application in the field of imaging science. We begin by contrasting the theoretical and practical aspects of these methods in Section 2, and we quantitatively compare and evaluate RTO and MYULA in Section 3 on two classical imaging inverse problems, namely deblurring and inpainting. Finally, we conclude in Section 4 by summarizing the main findings and discussing future research directions.

2 Theory and Methods

The posterior density $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$ is often intractable and cannot be computed in closed form. In such cases, the posterior density is explored using sampling methods. In this section, we will study two such methods, namely Randomize-Then-Optimize (RTO) and Langevin methods with a focus on MYULA. We will compare the two classes of methods from a theoretical point of view, and discuss their practical implications. We note that besides RTO and Langevin methods, there are many other efficient sampling methods for high-dimensional densities based on Gibbs samplers, see e.g. [38, 30]. These are mainly designed to sample from high-dimensional Gaussian distributions, and their adaptations to more complex target densities frequently involve Langevin methods [38].

2.1 Randomize-Then-Optimize (RTO) sampling method

The RTO sampling method for Bayesian inverse problems was proposed in [4] as a way to approximate samples from a posterior distribution arising from a nonlinear inverse problem with a Gaussian likelihood and a Gaussian prior. The basic idea is to perturb the observation vector $y$ by noise and solve a MAP-like optimization problem to obtain a sample. Specifically, if we consider the inverse problem in (1) with a Gaussian likelihood $\mathcal{N}(\mathcal{A}(x),\sigma^{2}I)$ and a Gaussian prior $\mathcal{N}(x_{0},S)$ , the RTO method generates a sample from the posterior distribution by solving the perturbed optimization problem

x^{\dagger}\in\operatorname*{argmin}_{x}\,\left\{\frac{1}{2\sigma^{2}}\|\mathcal{A}(x)-\hat{y}\|_{2}^{2}+g(x)\right\},\qquad g(x)=(x-x_{0})^{T}S^{-1}(x-x_{0}),

(4)

where the perturbed data $\hat{y}$ is drawn from the Gaussian pdf $\mathcal{N}(y,\sigma^{2}I)$ . For a linear inverse problem, this procedure yields independent samples from the posterior and reduces to a technique for Gaussian sampling proposed in [27]. When dealing with a nonlinear forward operator $\mathcal{A}$ , the problem (4) transforms into a nonlinear least-squares problem, and RTO generates samples from an approximate posterior. However, samples from the exact posterior can be obtained by incorporating a Metropolis-Hastings step.

Several extensions of RTO have been proposed in the literature. For example, in [39], RTO was extended to Laplace priors by converting it to a standard Gaussian prior using a variable transformation. Moreover, the RTO method has been extended to linear inverse problems with implicit priors such as nonnegativity constraints [2, 3], polyhedral constraints [16], and more generally, implicit log-priors with a polyhedral hypograph [15]. In the latter case, the function $g$ in (4) is a convex piecewise linear function that can be expressed as

g(x)=\gamma\left(i_{\mathcal{C}}(x)+\max_{i\in\mathcal{I}}(c_{i}^{T}x+d_{i})\right),

(5)

where $\mathcal{C}\subseteq\mathbb{R}^{d}$ is a polyhedral set, $\gamma>0$ is a constant, $\mathcal{I}$ is a finite index set, and $(c_{i},d_{i})\in\mathbb{R}^{d}\times\mathbb{R}$ for $i\in\mathcal{I}$ . This class of implicit priors includes nonnegativity, Besov priors, and $l_{1}$ -norm based priors such as the anisotropic total variation (TV). As shown in [16], RTO generates samples from a well-defined probability density when $g$ is of the form (5) and $\operatorname{rank}(A)=d$ . The resulting density assigns a positive probability to low-dimensional sets that correspond to the faces of the polyhedral epigraph of $g$ . In the end, RTO reads


\mathbbm{z}\sim\mathcal{N}(0,\sigma^{2}I),

(6a)

\mathbbm{x}=\operatorname{prox}_{g}^{\Sigma^{-1}}(A^{\dagger}(y+\mathbbm{z})),

(6b)

where $\Sigma^{-1}=A^{T}A/\sigma^{2}$, $A^{\dagger}=(A^{T}A)^{-1}A^{T}$, and $\operatorname{prox}_{g}^{\Sigma^{-1}}$ is the proximal operator with respect to the norm induced by $\Sigma^{-1}$, defined for all $x\in\mathbb{R}^{d}$ by $\operatorname{prox}_{g}^{\Sigma^{-1}}(x)=\operatorname{argmin}_{u}\{g(u)+\frac{1}{2}\|x-u\|_{\Sigma^{-1}}^{2}\}$. In the case where $\operatorname{rank}(A)<d$, additional regularization may be necessary to guarantee the positive definiteness of $A^{T}A$ and the existence of a well-defined posterior density. For example, this can be achieved by adding a quadratic term of the form $\frac{\alpha}{2}\|x\|_{2}^{2}$ to the optimization problem in (4) [16].
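To make (6) concrete, the following sketch revisits the one-dimensional example from the introduction, $y=x+n$ with $x\in[a,b]$. It assumes $A=1$ and takes $g$ to be the indicator function of $[a,b]$, in which case the weighted proximal operator in (6b) reduces to the Euclidean projection onto the interval. The function name `rto_box_samples` and all numerical values are hypothetical choices made for illustration only.

```python
import numpy as np

def rto_box_samples(y, a, b, sigma, n_samples, rng):
    """RTO samples for y = x + n with x constrained to [a, b] (assumes A = 1)."""
    z = rng.normal(0.0, sigma, size=n_samples)   # (6a): perturbation in data space
    # (6b): with A = 1 and g the indicator of [a, b], the proximal operator
    # is the projection onto [a, b], i.e. a clipping operation.
    return np.clip(y + z, a, b)

rng = np.random.default_rng(0)
x = rto_box_samples(y=0.8, a=0.0, b=1.0, sigma=1.0, n_samples=100_000, rng=rng)
mass_at_a = np.mean(x == 0.0)   # fraction of samples exactly on the lower bound
mass_at_b = np.mean(x == 1.0)   # fraction of samples exactly on the upper bound
```

In this toy setting, a non-negligible fraction of the samples lies exactly on the boundary of the interval, which mirrors the mixture structure of the RTO density described in the introduction.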

Although RTO is conceptually simple and can be used to sample from a wide range of densities, we recall that it does not define a posterior distribution in a rigorous manner, as its associated implicit prior is observation-dependent and Bayes' rule does not hold. RTO is rather anchored in the sensitivity analysis framework, as it aims at quantifying the uncertainty resulting from perturbations occurring in the data space. Ultimately, its underlying distribution describes possible solutions given an observation. From a more practical point of view, we emphasize that the cost of solving an optimization problem for each sample can be prohibitive for high-dimensional problems. However, in practice it is not necessary to solve the optimization problem to high accuracy to obtain useful samples.

2.2 Langevin methods

Langevin sampling methods arise from the Langevin stochastic differential equation (SDE), which reads

\mathrm{d}x_{t}=\nabla\log\pi_{\mathbbm{x}|\mathbbm{y}=y}(x_{t})\ \mathrm{d}t+\sqrt{2}\ \mathrm{d}b_{t}\ ,

(7)

where $(b_{t})_{t\geq 0}$ denotes a $d$ -dimensional Brownian motion, and the posterior density $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$ is the target density. Under mild assumptions on $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$ , for any initial condition $x_{0}$ , (7) admits a unique strong solution $(x_{t})_{t\geq 0}$ with the target density as the unique stationary density [34]. However, (7) in general cannot be solved analytically, and we have to discretize it in order to solve it numerically.

The unadjusted Langevin algorithm (ULA) is obtained by discretizing the Langevin SDE (7) using the Euler–Maruyama scheme. This yields the homogeneous Markov chain

\mathbbm{x}_{k+1}=x_{k}+\delta\nabla\log\pi_{\mathbbm{x}|\mathbbm{y}=y}(x_{k})+\sqrt{2\delta}\ \mathbbm{z}_{k+1},

(8)

where $(\mathbbm{z}_{k+1})_{k\in\mathbb{N}}$ is a sequence of independent standard normal random vectors, and $\delta>0$ is the discretization step-size that controls the trade-off between accuracy and convergence speed. Under mild assumptions on the target density, this homogeneous Markov chain admits a unique stationary density $\pi_{\mathbbm{x}|\mathbbm{y}=y}^{\delta}(x)$ [34], and non-asymptotic bounds on the distance between the target density $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$ and the stationary density $\pi_{\mathbbm{x}|\mathbbm{y}=y}^{\delta}(x)$ are provided in [13]. By combining ULA with a Metropolis–Hastings step, one obtains the Metropolis-adjusted Langevin algorithm (MALA) [34], which produces samples from the target posterior. ULA requires the target log-density to be differentiable and is a popular method as it is straightforward to implement and scales well with the dimension $d$ [13]. It has been extended to nondifferentiable log-concave target densities by considering a surrogate density as proposed in [28, 14]. Specifically, if the target density can be expressed as $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)\propto\exp(-f(x,y)-g(x))$ with nondifferentiable $g(x)$ and with $f(\cdot,y)$ and $g$ both convex, the Moreau–Yoshida ULA (MYULA) is obtained by using the Moreau–Yoshida envelope ${g}_{\alpha}$ of $g$ [5] to construct a surrogate target density $\pi_{\mathbbm{x}|\mathbbm{y}=y}^{\alpha}(x)\propto\pi_{\mathbbm{y}|\mathbbm{x}=x}(y)\ \pi_{\mathbbm{x}}^{\alpha}(x)$ with $\pi_{\mathbbm{x}}^{\alpha}(x)\propto\exp(-g_{\alpha}(x))$. The function $g_{\alpha}$ is continuously differentiable with $1/\alpha$-Lipschitz gradient and satisfies $\nabla g_{\alpha}(x)=(x-\operatorname{prox}_{g}^{\alpha}(x))/\alpha$, where $\operatorname{prox}_{g}^{\alpha}=\operatorname{prox}_{g}^{\alpha^{-1}I}$. The resulting homogeneous Markov chain reads


\mathbbm{x}_{k+1}=x_{k}+\delta\nabla\log\pi_{\mathbbm{x}|\mathbbm{y}=y}^{\alpha}(x_{k})+\sqrt{2\delta}\ \mathbbm{z}_{k+1}

(9a)

=x_{k}-\delta[\nabla f(x_{k},y)+\nabla g_{\alpha}(x_{k})]+\sqrt{2\delta}\ \mathbbm{z}_{k+1}

(9b)

=\left(1-\frac{\delta}{\alpha}\right)x_{k}-\delta\nabla f(x_{k},y)+\frac{\delta}{\alpha}\operatorname{prox}_{g}^{\alpha}(x_{k})+\sqrt{2\delta}\ \mathbbm{z}_{k+1},

(9c)

where $\operatorname{prox}_{g}^{\alpha}$ denotes the proximal operator associated with $g$ and parameter $\alpha>0$ [5]. In other words, the Markov transition kernel is Gaussian and reads


\tilde{x}_{k+1}(x_{k})=\left(1-\frac{\delta}{\alpha}\right)x_{k}-\delta\nabla f(x_{k},y)+\frac{\delta}{\alpha}\operatorname{prox}_{g}^{\alpha}(x_{k}),

(10a)

(\mathbbm{x}_{k+1}|\mathbbm{x}_{k}=x_{k})\sim\mathcal{N}(\tilde{x}_{k+1}(x_{k}),2\delta I).

(10b)

Non-asymptotic bounds on the distance between the target density $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$ and the stationary density associated with MYULA $\pi_{\mathbbm{x}|\mathbbm{y}=y}^{\alpha,\delta}(x)$ are provided in [14].

MYULA is conceptually simple. The evaluation of the proximal operator $\operatorname{prox}_{g}^{\alpha}$ requires the solution of a convex optimization problem, and hence the cost of a sample can be high if no closed-form expression is available. However, as shown empirically in [14], evaluating the proximal operator inaccurately can still produce good results.
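The update (9c) can be sketched in a few lines for a toy denoising problem. The example below assumes $A=I$, so that $f(x,y)=\|x-y\|_{2}^{2}/(2\sigma^{2})$, and takes $g(x)=\gamma\|x\|_{1}$, whose proximal operator is the closed-form soft-thresholding operator. The function names, the choice $\alpha=\sigma^{2}$, the step size, the burn-in length, and the data are illustrative assumptions rather than the settings used in the experiments of this paper.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||.||_1 (closed form)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def myula_chain(y, sigma, gamma, alpha, delta, n_iter, rng):
    """MYULA for f(x, y) = ||x - y||^2 / (2 sigma^2) and g(x) = gamma * ||x||_1."""
    x = y.copy()                          # initialize the chain at the data
    chain = np.empty((n_iter, y.size))
    for k in range(n_iter):
        grad_f = (x - y) / sigma**2
        prox = soft_threshold(x, alpha * gamma)      # prox_g^alpha(x_k)
        z = rng.standard_normal(y.size)
        # Update (9c): gradient step on f, proximal term for g, then Gaussian noise.
        x = ((1.0 - delta / alpha) * x - delta * grad_f
             + (delta / alpha) * prox + np.sqrt(2.0 * delta) * z)
        chain[k] = x
    return chain

rng = np.random.default_rng(0)
sigma, gamma = 0.5, 2.0
y = np.array([1.0, 0.0, -2.0])
alpha = sigma**2                                  # assumed: alpha = 1 / L with L = 1 / sigma^2
delta = 1.0 / (2.0 / alpha + 2.0 / sigma**2)      # assumed step size
chain = myula_chain(y, sigma, gamma, alpha, delta, n_iter=20_000, rng=rng)
posterior_mean = chain[5_000:].mean(axis=0)       # discard an assumed burn-in period
```

Each iteration costs one gradient evaluation and one proximal evaluation, which is cheap here because the proximal operator has a closed form. When it does not, an inner iterative solver is needed, as discussed above.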

2.3 Conceptual differences between RTO and MYULA

We now compare RTO and Langevin methods from a theoretical point of view while paying special attention to models with non-differentiable priors. Specifically, we will focus on the RTO method proposed in [15, 16] and MYULA [28, 14]. Table 1 lists the main conceptual differences, which we will now explain in more detail. To simplify this comparison, we focus on a Gaussian likelihood as the behavior of RTO can be characterized in this setting.

Property	RTO	MYULA
Log-prior	polyhedral hypograph	concave
Target density	implicit form	explicit form
Random perturbation	in data space	in image space
Samples	independent	correlated
Burn-in	no	required
Sample generation cost	high: computing (6b)	low: computing (10a)
Parameter selection	online (hierarchical approach)	offline (empirical approach)
Parallelization	yes	no

Table 1: Comparisons between RTO and MYULA from a theoretical point of view for a Gaussian likelihood.

Firstly, it is important to note that the two methods are based on different assumptions. RTO's analysis relies on the assumption that the likelihood is Gaussian and that the log-prior has a polyhedral hypograph. In contrast, MYULA is compatible with more general posteriors that are formed from a continuously differentiable log-concave likelihood with a Lipschitz gradient and a log-concave prior. Because of the discretization of the Langevin SDE defined in (7) and the use of a surrogate log-prior density, MYULA does not actually produce samples from the target posterior but from a surrogate posterior density. The distance between the surrogate density and the target density can be quantified in a non-asymptotic sense. The samples generated by RTO, in turn, follow a density that is only available in implicit form and that also differs from the target posterior.

Both RTO and MYULA involve a stochastic perturbation to produce a sample, but at different stages. RTO applies a perturbation in the data space before solving an instance of the minimization problem, as shown in (6). The operation in (6b) in fact corresponds to a generalization of an oblique projection and tends to concentrate the probability mass in some areas of the space. This partially explains why RTO can assign a strictly positive probability mass to low-dimensional subsets and why RTO samples can be sparse. In MYULA, the perturbation is applied in the image space after performing an explicit gradient descent step with respect to the surrogate posterior, see (10). That is why Langevin-based methods can be interpreted as a perturbed gradient descent [42]. Because MYULA uses a surrogate model to handle the non-differentiable prior, it also requires a proximal operation, but only on the prior, see (10a). Computing this proximal operation is often much cheaper than computing the one in RTO, since the forward operator $A$ is not involved. The accuracy with which $\operatorname{prox}_{g}^{\alpha}$ is computed has an impact on the quality of the samples, which is investigated in Section 3.1. Because the density sampled by MYULA is continuous on the whole image space, MYULA is unable to enforce constraints.

In terms of computational cost, generating an RTO sample requires solving an optimization problem, i.e., computing (6b). With MYULA, in contrast, each iteration yields one sample and involves only the computation of a proximal operator, i.e., computing (10a). In RTO, the perturbations are independent, and hence so are the samples. Furthermore, RTO can be parallelized to generate several samples at the same time. MYULA, like all Markov chain Monte Carlo (MCMC) methods, requires a burn-in period in order for the Markov chain to enter its stationary regime and actually produce samples from its stationary density. The length of the burn-in period is unknown a priori. In addition, samples generated by MYULA and other Langevin methods are correlated, since each sample is generated from the previous one. Moreover, in contrast to RTO, this type of method cannot be parallelized.

Finally, we compare how readily the two methods extend to model parameter selection. The posteriors often include unknown parameters, e.g., $\sigma$ in (4), $\gamma$ in (5), and $\alpha$ in (9a). The choice of these parameters can have a great impact on the posterior. RTO allows us to perform automatic parameter selection by considering a hierarchical model, i.e., we add prior distributions on the hyperparameters $\lambda=1/\sigma^{2}$ and $\gamma$ and consider the posterior

\pi_{\mathbbm{x},\bblambda,\bbgamma|\mathbbm{y}=y}(x,\lambda,\gamma)=\pi_{\mathbbm{x}|\mathbbm{y}=y,\bblambda=\lambda,\bbgamma=\gamma}(x)\pi_{\bblambda}(\lambda)\pi_{\bbgamma}(\gamma).

(11)

To utilize the conjugacy with Gaussian distributions, we impose $\Gamma$ -distributions for both $\bblambda$ and $\bbgamma$ . The negative-logarithm of the conditional distribution $\pi_{\mathbbm{x}|\mathbbm{y}=y,\bblambda=\lambda,\bbgamma=\gamma}$ is proportional to

\lambda\|\mathcal{A}(x)-\hat{y}\|_{2}^{2}+g(x)

(12)

with $g(x)$ given in (5).

Algorithm 1 summarizes the procedure to sample from such a hierarchical model within the RTO framework.

Algorithm 1 RTO hierarchical Gibbs sampler for $(\mathbbm{x},\bblambda,\bbgamma)$

Input: $N\in\mathbb{N}^{*}$, $x_{0}\in\mathbb{R}^{d}$.
Output: $(x_{k})_{k\in\{1,\ldots,N\}}$.
for $k=1$ to $N$
    Step 1: Sample $\lambda_{k}\sim\pi_{\bblambda|\mathbbm{x}=x_{k-1},\mathbbm{y}=y}(\lambda)$ and $\gamma_{k}\sim\pi_{\bbgamma|\mathbbm{x}=x_{k-1},\mathbbm{y}=y}(\gamma)$, both of which follow $\Gamma$-distributions.
    Step 2: Sample $x_{k}\sim\pi_{\mathbbm{x}|\bblambda=\lambda_{k},\bbgamma=\gamma_{k},\mathbbm{y}=y}(x)$ by RTO.
end for
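The structure of Algorithm 1 can be illustrated with a toy problem in which every step has a closed form. The sketch below assumes $A=I$ and $g(x)=\gamma(i_{\mathcal{C}}(x)+\mathbf{1}^{T}x)$ with $\mathcal{C}$ the nonnegative orthant, which is of the form (5) with a single affine piece. The Gamma hyperpriors (shape `a`, rate `b`), the specific conditional updates, which are derived here by treating $\exp(-g)$ as an exponential prior with normalization constant $\gamma^{d}$, and the function name `rto_hierarchical_gibbs` are all assumptions made for illustration; the source does not state the updates explicitly.

```python
import numpy as np

def rto_hierarchical_gibbs(y, n_samples, rng, a=1.0, b=1e-4):
    """Toy version of Algorithm 1 for A = I and g(x) = gamma * (i_{x >= 0}(x) + sum(x))."""
    d = y.size
    x = np.maximum(y, 0.0)                # x_0: a feasible starting point
    xs = np.empty((n_samples, d))
    lams = np.empty(n_samples)
    gams = np.empty(n_samples)
    for k in range(n_samples):
        # Step 1: assumed conjugate Gamma updates (shape, rate);
        # NumPy parameterizes the Gamma distribution by scale = 1 / rate.
        lam = rng.gamma(a + d / 2.0, 1.0 / (b + 0.5 * np.sum((x - y) ** 2)))
        gam = rng.gamma(a + d, 1.0 / (b + np.sum(x)))
        # Step 2: RTO draw. Perturb the data, then solve the optimization problem,
        # whose solution is max(y_hat - gamma / lambda, 0) in this toy setting.
        y_hat = y + rng.standard_normal(d) / np.sqrt(lam)
        x = np.maximum(y_hat - gam / lam, 0.0)
        xs[k], lams[k], gams[k] = x, lam, gam
    return xs, lams, gams

rng = np.random.default_rng(0)
x_true = np.concatenate([np.zeros(150), rng.exponential(1.0, size=50)])
y = x_true + 0.1 * rng.standard_normal(x_true.size)
xs, lams, gams = rto_hierarchical_gibbs(y, n_samples=500, rng=rng)
```

For a general forward operator, Step 2 would require an iterative solver instead of the closed-form thresholding used here.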

Selecting posterior parameters in the Langevin-based methods is much more difficult. As far as we know, no Langevin-based methods can update or simultaneously estimate the model parameters. Recently, Vidal et al. proposed to perform parameter selection based on maximum likelihood estimation [37, 12], which can simultaneously estimate multiple model parameters and is theoretically well-founded. But the procedure has to be implemented offline, i.e., the parameters have to be predetermined before sampling.

3 Experimental study

In this section, we aim at numerically investigating the differences between RTO and MYULA methods for sampling a high-dimensional posterior model. To do so, we tackle two kinds of imaging inverse problems described by (1): deblurring and inpainting. In both cases, the forward operator $A$ is known and linear.

Figure 2 shows the two 256-by-256 test images with intensity range $[0,1]$ that are used in this section. These images have been selected because of their differences in content and level of detail. For example, Traffic presents a mix of piecewise constant and more textured areas. We consider a prior of the form $p(x)\propto\exp(-g(x))$ with

g(x)=\gamma\|\nabla x\|_{1,1}+i_{\mathcal{C}}(x)

(13)

and $\mathcal{C}=[0,1]^{d}$ . It is a non-differentiable log-concave prior which promotes images with sparse gradients. We compare the results obtained by RTO with the results of MYULA. Note that both RTO and MYULA sample their target posterior densities approximately, and their corresponding approximations are different.

We generate 1000 RTO samples. To solve the minimization problem (4) with $g(x)$ defined in (13) in RTO, we apply the alternating direction method of multipliers (ADMM) with the stopping criterion suggested in [7, Eq (3.12)]. According to the theory in [15, 16], to guarantee a well-defined posterior density we need to add a term $\frac{\alpha}{2}\|x\|_{2}^{2}$ when $A$ is rank-deficient, which is the case here, and we set $\alpha=10^{-8}$ to ensure little impact on the results.

According to $g(x)$ defined in (13), MYULA targets the surrogate posterior density $\pi_{\alpha_{1},\alpha_{2}}(x|y)$ :

\pi_{\mathbbm{x}|\mathbbm{y}=y}^{\alpha_{1},\alpha_{2}}(x)\propto\exp\left(-\frac{1}{2\sigma^{2}}\|Ax-y\|_{2}^{2}-[\gamma\|\nabla.\|_{1,1}]_{\alpha_{1}}(x)-[i_{c}]_{\alpha_{2}}(x)\right)\ ,

(14)

where $[\gamma\|\nabla.\|_{1,1}]_{\alpha_{1}}$ and $[i_{c}]_{\alpha_{2}}$ denote the Moreau-Yoshida envelopes of the two terms in $g$, respectively. To apply MYULA defined in (9a), we need $\operatorname{prox}_{\gamma\|\nabla.\|_{1,1}}^{\alpha_{1}}$. However, since $\operatorname{prox}_{\gamma\|\nabla.\|_{1,1}}^{\alpha_{1}}$ has no closed-form expression, we apply the first-order primal-dual algorithm introduced in [10] to approximate it. In the following, we set the number of iterations used to approximate this proximal operator to $n_{pd}=50$. We initialize MYULA at the data $y$ and run $10^{6}$ iterations with the first $2.5\times 10^{4}$ iterations as burn-in. In addition, we set the thinning parameter to $250$, i.e., we store a sample every $250$ iterations. This yields $3900$ samples in total. As suggested in [14], we set $\alpha_{1}=\alpha_{2}=1/(\|A^{T}A\|/\sigma^{2})$ and $\delta=1/(2/\alpha_{1}+2/\alpha_{2}+2\|A^{T}A\|/\sigma^{2})$.
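The parameter rules and the sample count above amount to simple arithmetic, which the following sketch spells out. The helper name `myula_parameters` is hypothetical, and the values $\|A^{T}A\|=1$ and $\sigma^{2}=10^{-3}$ are assumed here purely for illustration.

```python
def myula_parameters(op_norm_sq, sigma2, n_iter, burn_in, thinning):
    """Smoothing parameters, step size, and stored-sample count as described in the text.

    op_norm_sq stands for ||A^T A||; its value is problem dependent.
    """
    lipschitz = op_norm_sq / sigma2          # Lipschitz constant of the gradient of f
    alpha1 = alpha2 = 1.0 / lipschitz
    delta = 1.0 / (2.0 / alpha1 + 2.0 / alpha2 + 2.0 * lipschitz)
    n_stored = (n_iter - burn_in) // thinning
    return alpha1, alpha2, delta, n_stored

# Assumed values for illustration: ||A^T A|| = 1 and sigma^2 = 1e-3.
alpha1, alpha2, delta, n_stored = myula_parameters(
    op_norm_sq=1.0, sigma2=1e-3, n_iter=10**6, burn_in=25_000, thinning=250
)
```

With $10^{6}$ iterations, a burn-in of $2.5\times 10^{4}$ iterations, and a thinning parameter of $250$, the count is $(10^{6}-2.5\times 10^{4})/250=3900$. Since $\alpha_{1}=\alpha_{2}=1/L$ with $L=\|A^{T}A\|/\sigma^{2}$, the step-size rule simplifies to $\delta=1/(6L)$.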

To quantitatively evaluate the reconstruction quality, we use the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) [41, 40].
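For reference, PSNR can be computed as in the sketch below, which uses the standard definition $10\log_{10}(R^{2}/\mathrm{MSE})$ with $R$ the intensity range. The function name `psnr` and the synthetic test image are assumptions; SSIM is more involved and is not reproduced here.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images with the given intensity range."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0.0:
        return np.inf                      # identical images
    return 10.0 * np.log10(data_range**2 / mse)

# Synthetic image and noise level chosen for illustration only.
rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(64, 64))
noisy = image + np.sqrt(1e-3) * rng.standard_normal(image.shape)
value = psnr(image, noisy)
```

For images in $[0,1]$ and a mean squared error close to $10^{-3}$, the PSNR is close to 30 dB.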

3.1 Deblurring

In this section, images are degraded by a $9\times 9$ uniform blurring kernel and additive white Gaussian noise with zero mean and variance $\sigma^{2}=0.001$. We assume periodic boundary conditions, so that $A$ can be diagonalized by the discrete Fourier transform. In Figure 3 we show the degraded images.
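The Fourier diagonalization of a periodic blur can be sketched as follows. The code builds the eigenvalues of a $9\times 9$ uniform blur from the 2-D DFT of its kernel, applies $A$ as a pointwise product in the Fourier domain, and simulates a degraded observation. The function names and the random test image are assumptions for illustration and are not the test images used in the experiments.

```python
import numpy as np

def uniform_blur_eigenvalues(shape, size=9):
    """Eigenvalues of a periodic uniform blur, given by the 2-D DFT of its kernel."""
    psf = np.zeros(shape)
    psf[:size, :size] = 1.0 / size**2
    # Center the kernel at the origin so that the blur introduces no shift.
    psf = np.roll(psf, (-(size // 2), -(size // 2)), axis=(0, 1))
    return np.fft.fft2(psf)

def apply_blur(x, eigenvalues):
    """Compute A x as a pointwise product in the Fourier domain."""
    return np.real(np.fft.ifft2(eigenvalues * np.fft.fft2(x)))

rng = np.random.default_rng(0)
eig = uniform_blur_eigenvalues((256, 256))
x = rng.uniform(0.0, 1.0, size=(256, 256))     # stand-in for a test image
sigma2 = 1e-3
y = apply_blur(x, eig) + np.sqrt(sigma2) * rng.standard_normal(x.shape)
# For a circulant (hence normal) operator, ||A^T A|| is the largest squared eigenvalue modulus.
op_norm_sq = np.max(np.abs(eig)) ** 2
```

Because the kernel is nonnegative and sums to one, the largest eigenvalue modulus is attained at the zero frequency, so `op_norm_sq` equals one up to rounding.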

For the stopping criteria in ADMM for RTO, we set the tolerance of the primal and dual residuals to $\texttt{tol}=10^{-4}$ and the maximum iteration number to $\texttt{maxiter}=2000$. In addition, we set the regularization parameter $\gamma$ in $g(x)$ to $\gamma=5$. For MYULA, we use $\gamma=10$.

Remark.

As explained in Section 2, regularization parameters can have a significant impact on the results. For the regularization parameter $\gamma$ in RTO, we use the value that achieves the largest PSNR when solving the unperturbed optimization problem. This choice generally delivers good results in the RTO framework. It means that the maximum a posteriori (MAP) point estimate can be a good reference for selecting $\gamma$, and it highlights the connection between optimization and sampling within the RTO framework. Finding a good $\gamma$ for MYULA is more complicated, since we found that the regularization parameter that yields a MAP estimate with high PSNR and SSIM scores for the unperturbed optimization problem often leads to overly noisy samples in MYULA. Consequently, we had to run several Markov chains with different regularization parameters $\gamma$ in order to find a good value, which is much more time-consuming than for RTO.

Results and discussions: In Figure 4 we show the minimum mean square error (MMSE) estimates of RTO and MYULA together with the MAP estimates for comparison. The MMSE estimate of RTO achieves the highest PSNR and SSIM scores. Visually, it resembles the MAP estimate, including the stair-casing artifacts caused by the TV regularization. The MMSE estimate computed by MYULA does not exhibit stair-casing artifacts, but a grid pattern degrades the restoration.

Image	RTO	MYULA	MAP
Simpson	PSNR=25.69/SSIM=0.79	PSNR=24.97/SSIM=0.72	PSNR=25.17/SSIM=0.77
Traffic	PSNR=24.00/SSIM=0.66	PSNR=23.18/SSIM=0.63	PSNR=23.69/SSIM=0.65

Figure 5 shows the standard deviation maps of RTO and MYULA, respectively. For both methods we observe higher uncertainty around edges, which is expected since high-frequency information is more challenging to restore. However, the uncertainty of MYULA is more spread out: edges are less uncertain than with RTO, whereas the constant regions have larger standard deviations than with RTO. These differences relate to where the perturbation is performed. In RTO we perturb the data before computing the proximal operator, whereas in MYULA the perturbation takes place after taking a gradient step in the image space.

[Standard deviation maps; panels: RTO, MYULA]

To check the correlation among the samples, we compute the sample auto-correlation functions (ACFs), which measure how fast samples become uncorrelated. A fast decay of the ACFs indicates weak linear dependence among the samples and is also interpreted as a short mixing time for Markov chains, which in turn leads to accurate Monte Carlo estimates. As images are high-dimensional objects, directly estimating the $d$-dimensional ACFs is not feasible. However, the convergence speeds can be inferred from the posterior covariance matrix [29]. We assume that the shape of the posterior covariance is mostly determined by the likelihood. Therefore, we approximate the posterior covariance using the directions provided by the diagonalization basis of the forward operator $A$, i.e., the Fourier basis, for the deblurring experiments as in [24]. Figure 6 shows the estimated ACFs of both RTO and the complete Markov chain of MYULA. The ACFs of the RTO samples immediately drop to zero, which is consistent with the RTO samples being independent, while samples generated by MYULA only become uncorrelated after $500$ iterations. This means that in order to get $1000$ uncorrelated samples, we need to run MYULA for $5\times 10^{5}$ iterations. Note that the number of iterations before MYULA samples become uncorrelated depends on the inverse problem at hand and cannot be evaluated a priori, but only after running the Markov chain. Furthermore, MYULA requires a burn-in phase that corresponds to the time required by the chain to enter its stationary regime, i.e., when samples generated by the Markov chain are sampled from $\pi_{\delta,\alpha_{1},\alpha_{2}}(x|y)$. The length of the burn-in phase also cannot be assessed a priori, but only after running the chain.
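A sample ACF for a scalar summary of the chain, such as the projection of each sample onto one Fourier basis vector, can be estimated as in the following sketch. The function name `sample_acf` is hypothetical, and the independent sequence and the AR(1) process are synthetic stand-ins for independent samples and a correlated Markov chain, respectively. They are not outputs of RTO or MYULA.

```python
import numpy as np

def sample_acf(chain, max_lag):
    """Sample autocorrelation function of a scalar chain for lags 0..max_lag."""
    chain = np.asarray(chain, dtype=float)
    n = chain.size
    centered = chain - chain.mean()
    var = np.dot(centered, centered) / n
    acf = np.empty(max_lag + 1)
    for lag in range(max_lag + 1):
        acf[lag] = np.dot(centered[: n - lag], centered[lag:]) / (n * var)
    return acf

rng = np.random.default_rng(0)
n = 50_000
iid = rng.standard_normal(n)            # stand-in for independent samples
noise = rng.standard_normal(n)
ar1 = np.empty(n)                       # stand-in for a correlated Markov chain
ar1[0] = 0.0
for k in range(1, n):
    ar1[k] = 0.95 * ar1[k - 1] + noise[k]
acf_iid = sample_acf(iid, max_lag=50)
acf_ar1 = sample_acf(ar1, max_lag=50)
```

For the independent sequence the estimated ACF is close to zero at every positive lag, whereas for the AR(1) process it decays geometrically, so consecutive draws carry much less information than independent ones.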

In Figure 7 we show a few samples generated by RTO and MYULA, respectively. It is clear that MYULA samples are much noisier than those of RTO. The reason is that RTO samples are in fact MAP estimates computed from a perturbed observation. MYULA samples, on the other hand, are generated by a gradient descent step with added noise. In addition, we can see that MYULA samples do not belong to $\mathcal{C}$, unlike RTO samples.
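The difference between the two sampling mechanisms can be illustrated with a scalar toy problem. The sketch below assumes $A=I$, no total variation term, and $g$ equal to the indicator function of $\mathcal{C}=[0,1]$, so that the perturbed MAP problem reduces to a projection; it also uses a single Moreau-Yosida smoothing parameter `alpha`, whereas our MYULA target involves $\alpha_{1}$ and $\alpha_{2}$. All numerical values are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

y, sigma = 0.05, 0.1        # scalar observation and noise level (toy values)
lo, hi = 0.0, 1.0           # constraint set C = [0, 1]
n = 5000

def proj_C(x):
    """Projection onto C, i.e., the proximal operator of its indicator."""
    return np.clip(x, lo, hi)

# RTO-style sampling: perturb the data, then solve the constrained problem.
# With A = I, the minimiser of (x - y_pert)^2 / (2 sigma^2) over C is the
# projection of y_pert onto C.
y_pert = y + sigma * rng.standard_normal(n)
rto = proj_C(y_pert)

# MYULA-style chain: gradient step on the smoothed potential, then add noise.
alpha = sigma**2                                  # smoothing parameter (assumed)
delta = 0.4 / (1.0 / sigma**2 + 1.0 / alpha)      # step size below 1 / (L + 1/alpha)
myula = np.empty(n)
x = y
for k in range(n):
    grad = (x - y) / sigma**2 + (x - proj_C(x)) / alpha
    x = x - delta * grad + np.sqrt(2.0 * delta) * rng.standard_normal()
    myula[k] = x
```

In this sketch every RTO-style sample lies in $\mathcal{C}$, and a positive fraction of them sits exactly on the boundary, while the noise added after the gradient step lets the MYULA-style iterates leave $\mathcal{C}$.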

RTO	PSNR=24.76/SSIM=0.74	PSNR=24.72/SSIM=0.74	PSNR=24.72/SSIM=0.74
MYULA	PSNR=21.17/SSIM=0.37	PSNR=21.18/SSIM=0.37	PSNR=21.18/SSIM=0.37

Figure 8 shows the relative errors of the samples generated by RTO and MYULA with respect to their corresponding MMSE estimates. The relative error is higher for MYULA, which means that the RTO distribution is more concentrated around its mean than that of MYULA. In RTO the perturbation takes place in the data space and is then smoothed by the regularization.
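For concreteness, the summary statistics used in this section can be computed from a stack of samples as sketched below. The helper `summarize_samples` is hypothetical, and the relative error is assumed here to be $\|x_{k}-\hat{x}_{\mathrm{MMSE}}\|_{2}/\|\hat{x}_{\mathrm{MMSE}}\|_{2}$, which may differ from the exact normalization used for Figure 8. The two synthetic sample sets only illustrate a concentrated and a dispersed distribution.

```python
import numpy as np

def summarize_samples(samples):
    """Return the MMSE estimate, pixelwise standard deviation, and relative errors.

    samples: array of shape (N, H, W) holding N image samples.
    """
    samples = np.asarray(samples, dtype=float)
    mmse = samples.mean(axis=0)                    # sample mean (MMSE estimate)
    std_map = samples.std(axis=0, ddof=1)          # pixelwise standard deviation
    diffs = (samples - mmse).reshape(samples.shape[0], -1)
    rel_err = np.linalg.norm(diffs, axis=1) / np.linalg.norm(mmse)
    return mmse, std_map, rel_err

rng = np.random.default_rng(2)
truth = np.ones((8, 8))                            # synthetic image (assumption)
tight = truth + 0.01 * rng.standard_normal((200, 8, 8))
spread = truth + 0.10 * rng.standard_normal((200, 8, 8))

mmse_tight, std_tight, err_tight = summarize_samples(tight)
mmse_spread, std_spread, err_spread = summarize_samples(spread)
```

A distribution that is more concentrated around its mean yields smaller relative errors and a smaller standard deviation map, as in the comparison between the two synthetic sets.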

Regarding the computational time, it took around $6$ hours to generate 1000 RTO samples and around $35$ hours to generate 3900 MYULA samples after burn-in and thinning on an Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz. To reduce the computational cost, we can decrease the accuracy when solving the optimization problems in the sampling methods, i.e., the computation of $\operatorname{prox}_{\gamma\|\nabla.\|_{1,1}}^{\alpha_{1}}$ in MYULA and of the MAP estimate on the perturbed data in RTO. In Figure 9 we show the results produced by MYULA when setting $n_{pd}=10$ for Simpson. In this case, the running time is reduced to around $10$ hours. Compared with the results in Figure 4 and Figure 5, where we set $n_{pd}=50$, the MMSE estimate is noisier and the uncertainty is not well captured. In addition, the ACFs show that we need many more iterations in order to obtain uncorrelated samples. In RTO we change tol in ADMM from $10^{-4}$ to $10^{-2}$ and show the results in Figure 10. The MMSE estimate is nearly identical to the one in Figure 4. However, the standard deviations close to the edges are slightly smaller than those in Figure 5. This means that a moderate reduction of the solution accuracy has a minor influence on the MMSE estimate, but can lead to insufficiently explored uncertainty.

3.2 Inpainting

In this section, we study the inpainting inverse problem, where the forward operator $A$ indicates the locations of the known pixel values. Figure 11 shows the observations $y$, where the black regions mark the pixels whose information is lost. In this test, $58549$ of $65536$ pixels are observed, i.e., $A\in\mathbb{R}^{58549\times 65536}$. In addition, $y$ is corrupted by additive white Gaussian noise with zero mean and standard deviation $\sigma=0.02$. For both methods we set the regularization parameter $\gamma=8$. To ensure good results in a reasonable amount of time, we stop ADMM in RTO when the primal residual is smaller than $\texttt{tol}=2\times 10^{-3}$ or the number of iterations is larger than $500$.
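The inpainting forward operator is a row-selection matrix and never needs to be formed explicitly. The sketch below applies $A$ and its adjoint through index arrays, using the dimensions stated above ($65536=256\times 256$ pixels, of which $58549$ are observed). The mask is drawn at random purely for illustration, whereas the masks in our experiments consist of structured missing regions; the names `A`, `At`, and `observed` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)

h = w = 256                      # 256 * 256 = 65536 pixels
d = h * w
m = 58549                        # number of observed pixels, as in the text
sigma = 0.02                     # noise standard deviation, as in the text

# Hypothetical mask: indices of observed pixels drawn uniformly at random.
observed = np.sort(rng.choice(d, size=m, replace=False))

def A(x):
    """Forward operator: keep only the observed pixels of an (h, w) image."""
    return x.reshape(-1)[observed]

def At(v):
    """Adjoint operator: place observed values back, zeros elsewhere."""
    out = np.zeros(d)
    out[observed] = v
    return out.reshape(h, w)

x = rng.random((h, w))                            # synthetic image (assumption)
y = A(x) + sigma * rng.standard_normal(m)         # noisy, masked observation
v = rng.standard_normal(m)                        # test vector for the adjoint check
```

With this construction $AA^{T}=I$ and $A^{T}A$ is a diagonal matrix with entries in $\{0,1\}$, so the operator is diagonal in the pixel domain.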

Results and discussions: Figure 12 shows the MMSE estimates obtained by RTO and MYULA as well as the MAP estimates. The MMSE estimates of both methods are very similar to the corresponding MAP estimates, and the masked regions are well filled except for the regions with rich textures, see the tree leaves in Traffic, where the inpainting task is the most difficult. This good visual impression is confirmed by the high PSNR and SSIM scores. However, both methods tend to struggle to correctly fill the gaps, as we can see on the contours of the dress in Simpson. In addition, the MMSE estimates of MYULA are slightly noisier, and MYULA seems to have more difficulty filling the regions with rich textures, see the tree leaves in Traffic.

The standard deviation maps associated with RTO and MYULA for Simpson and Traffic are displayed in Figure 13. The standard deviations of RTO show larger uncertainties around edges, especially at the edges where the pixel information is lost. It is also interesting to note that the hidden constant areas are restored with high confidence by RTO. This means that the solutions to the perturbed optimization problems are very similar in these regions and that the perturbations do not affect the restorations in these areas. MYULA behaves very differently from RTO. Its high uncertainties cover the entire regions where the information is lost. Further, in these regions the standard deviations from MYULA are nearly double the RTO values.

Finally, we consider the efficiency of both methods. The RTO experiments took around $10$ hours while for MYULA they took around $35$ hours.

In Figure 14 we show the evolution of the ACFs for both methods. We computed them in the pixel domain, where the forward operator $A$ is diagonal, as we assume that the likelihood imposes the posterior covariance shape. It is obvious that samples generated by RTO are independent, but samples generated with MYULA exhibit correlations. For Traffic, the complete Markov chain associated with MYULA needs more than $6000$ iterations in order to eventually achieve uncorrelated samples.

3.3 Automatic parameter selection

In Sections 3.1 and 3.2 we chose the regularization parameter $\gamma$ that gives the best results in terms of PSNR and SSIM, which requires many additional experiments. Furthermore, we assumed that we have access to the noise level $\sigma$, which is not always known. Being able to automatically select the parameters in a robust fashion is therefore of prime importance. As explained in Section 2, Langevin methods like MYULA do not allow us to perform online parameter selection, but RTO can be plugged into an augmented hierarchical model in order to automatically select $\sigma^{2}$ and $\gamma$ without any additional computational cost [15, 16], see Algorithm 1. This turns RTO into a nearly parameter-free method. To ensure efficient sampling from the conditional distributions with respect to $\sigma^{2}$ and $\gamma$, we are limited to $\Gamma$-priors for $\lambda=1/\sigma^{2}$ and $\gamma$, and only for some specific choices of $g$. For more detailed discussions, we refer to [16].
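To illustrate why $\Gamma$-priors lead to efficient conditional sampling, the sketch below uses the textbook conjugate update for a Gaussian likelihood: if $y\,|\,x,\lambda\sim\mathcal{N}(Ax,\lambda^{-1}I)$ and $\lambda\sim\Gamma(a,b)$ with shape $a$ and rate $b$, then $\lambda\,|\,x,y\sim\Gamma(a+m/2,\,b+\|Ax-y\|_{2}^{2}/2)$. This is an assumption made for illustration; the conditional distributions actually used in Algorithm 1 are those derived in [16] and may differ. The function `sample_precision`, the hyperparameters, and the synthetic residual are hypothetical.

```python
import numpy as np

def sample_precision(residual, a, b, rng):
    """Draw lambda ~ Gamma(shape = a + m/2, rate = b + ||residual||^2 / 2)."""
    r = np.asarray(residual, dtype=float).ravel()
    shape = a + 0.5 * r.size
    rate = b + 0.5 * float(np.dot(r, r))
    # NumPy parameterises the Gamma distribution by scale = 1 / rate.
    return rng.gamma(shape, 1.0 / rate)

rng = np.random.default_rng(4)
true_lambda = 1000.0                      # toy precision, i.e., sigma^2 = 1e-3
m = 65536
# Synthetic residual standing in for A x - y at the current image sample.
residual = rng.standard_normal(m) / np.sqrt(true_lambda)

draws = np.array([sample_precision(residual, a=1.0, b=1e-4, rng=rng)
                  for _ in range(500)])
```

Each draw costs a single residual norm and one Gamma variate, which is negligible compared with solving the perturbed optimization problem for the image sample.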

In Figure 15 we show the results for the same deblurring problem as in Section 3.1 but with a slightly different $g(x)$ compared with (13). Here we use the non-negativity constraint instead of the box constraint, i.e., $g(x)=\gamma\|\nabla x\|_{1,1}+i_{\mathbb{R}^{+}}(x)$, since in this case the conditional distribution $\pi_{\bbgamma|\mathbbm{x},\mathbbm{y}}(\gamma)$ can be efficiently sampled. Comparing the results shown in Figure 15 with those in Figures 4 and 5, we can see that they are very similar visually as well as quantitatively. In Table 2, we list the means and the standard deviations of $\lambda$ and $\gamma$ obtained from the hierarchical Gibbs sampler. The estimates for both $\gamma$ and $\lambda$ are comparable with the values used in Section 3.1. Finally, Figure 16 shows the ACFs of the image samples and trace plots of $\lambda$ and $\gamma$. Because of the Gibbs sampler, we cannot expect independent samples, but we notice that the ACFs still decay to zero immediately. Further, the trace plots show very good mixing of the samples.

Parameter	Hierarchical Gibbs sampler, mean (standard deviation)	Used in RTO in Section 3.1
$\gamma$	3.97 (0.44)	5
$\lambda$	1015.49 (7.19)	1000

Table 2: (Hierarchical Gibbs sampler) Empirical mean and standard deviation of the noise level and regularization parameters $\lambda$ and $\gamma$.

4 Conclusion

We compared two classes of sampling methods for solving inverse problems in imaging, RTO and the Langevin method MYULA, and highlighted their main conceptual and theoretical differences.

RTO is derived from the sensitivity analysis framework and samples from a target distribution by solving perturbed optimization problems where the perturbation occurs in the data space. The RTO target distribution can assign non-zero probability mass to subsets of measure zero. However, it is not anchored in the Bayesian framework, as the target density corresponds to an implicit prior that depends on the observed data. In addition, RTO can be incorporated into a hierarchical model in order to perform automatic parameter selection. The main limitation of RTO is that it has only been characterized for posteriors with Gaussian likelihoods and log-priors with a polyhedral hypograph. In contrast, MYULA is firmly rooted in the Bayesian framework and is applicable to a broader range of posteriors. Although it samples the posterior density only approximately, the distribution of its samples can be characterized with respect to the posterior density. Like other Langevin methods, MYULA suffers from typical MCMC drawbacks, such as slow convergence and correlated samples. Through two classical imaging inverse problems, deblurring and inpainting, we compared RTO and MYULA numerically with particular attention to computational cost. Both methods produced accurate results for deblurring, but MYULA struggled with severely ill-posed problems like inpainting. Additionally, while RTO concentrates the sample mass around the MMSE estimate, MYULA results in a more dispersed distribution.

One future research direction is to extend RTO to more general posteriors. It would be valuable to explore other noise models, such as Poisson noise as in [3], and to investigate how we can characterize the RTO distribution both theoretically and practically. In addition, motivated by the work in [24], it would also be interesting to extend RTO to data-driven regularization [31, 22, 26], particularly to generative models [33, 23].

References

Aguerrebere et al. [2017] Cecilia Aguerrebere, Andres Almansa, Julie Delon, Yann Gousseau, and Pablo Muse. A Bayesian Hyperprior Approach for Joint Image Denoising and Interpolation, With an Application to HDR Imaging. IEEE Transactions on Computational Imaging, 3(4):633–646, dec 2017. ISSN 2333-9403. doi:10.1109/TCI.2017.2704439. URL https://nounsse.github.io/HBE_project/.
Bardsley and Fox [2012] J. M. Bardsley and C. Fox. An MCMC method for uncertainty quantification in nonnegativity constrained inverse problems. Inverse Problems in Science and Engineering, 20(4):477–498, June 2012. ISSN 1741-5985. doi:10.1080/17415977.2011.637208. URL http://dx.doi.org/10.1080/17415977.2011.637208.
Bardsley and Hansen [2020] Johnathan M Bardsley and Per Christian Hansen. MCMC algorithms for computational UQ of nonnegativity constrained linear inverse problems. SIAM Journal on Scientific Computing, 42(2):A1269–A1288, 2020.
Bardsley et al. [2014] Johnathan M Bardsley, Antti Solonen, Heikki Haario, and Marko Laine. Randomize-then-optimize: A method for sampling from posterior distributions in nonlinear inverse problems. SIAM Journal on Scientific Computing, 36(4):A1895–A1910, 2014.
Bauschke et al. [2017] Heinz H Bauschke and Patrick L Combettes. Correction to: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2017.
Blake et al. [2011] A. Blake, P. Kohli, and C. Rother. Markov Random Fields for vision and image processing. EBSCO ebook academic collection. MIT Press, 2011. ISBN 9780262015776. doi:10.7551/mitpress/8579.003.0001.
Boyd et al. [2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
Cai et al. [2023] Ziruo Cai, Junqi Tang, Subhadip Mukherjee, Jinglai Li, Carola Bibiane Schönlieb, and Xiaoqun Zhang. NF-ULA: Langevin Monte Carlo with normalizing flow prior for imaging inverse problems, 2023.
Chambolle [2004] A Chambolle. An algorithm for Total Variation Minimization and Applications. Journal of Mathematical Imaging and Vision, 20:89–97, 2004. doi:10.1023/B:JMIV.0000011325.36760.1e.
Chambolle and Pock [2011] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of mathematical imaging and vision, 40:120–145, 2011.
Chen et al. [2014] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 1683–1691. PMLR, 22–24 Jun 2014. URL https://proceedings.mlr.press/v32/cheni14.html.
De Bortoli et al. [2020] Valentin De Bortoli, Alain Durmus, Marcelo Pereyra, and Ana Fernandez Vidal. Maximum likelihood estimation of regularization parameters in high-dimensional inverse problems: An empirical Bayesian approach. part ii: Theoretical analysis. SIAM Journal on Imaging Sciences, 13(4):1990–2028, 2020.
Durmus and Moulines [2017] Alain Durmus and Éric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Annals of Applied Probability, 27(3):1551–1587, 2017.
Durmus et al. [2018] Alain Durmus, Eric Moulines, and Marcelo Pereyra. Efficient Bayesian computation by proximal Markov chain Monte Carlo: When Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018. doi:10.1137/16M1108340.
Everink et al. [2023a] Jasper M Everink, Yiqiu Dong, and Martin S Andersen. Sparse Bayesian inference with regularized Gaussian distributions. Inverse Problems, 39(11):115004, oct 2023a. doi:10.1088/1361-6420/acf9c5. URL https://dx.doi.org/10.1088/1361-6420/acf9c5.
Everink et al. [2023b] Jasper M Everink, Yiqiu Dong, and Martin S Andersen. Bayesian inference with projected densities. SIAM/ASA Journal on Uncertainty Quantification, 11(3):1025–1043, 2023b.
Girolami and Calderhead [2011] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011. doi:10.1111/j.1467-9868.2010.00765.x.
Hansen et al. [2021] Per Christian Hansen, Jakob Jørgensen, and William RB Lionheart. Computed tomography: algorithms, insight, and just enough theory. SIAM, 2021.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Holden et al. [2022] Matthew Holden, Marcelo Pereyra, and Konstantinos C Zygalakis. Bayesian imaging with data-driven priors encoded by neural networks. SIAM Journal on Imaging Sciences, 15(2):892–924, 2022.
Houdard et al. [2018] Antoine Houdard, Charles Bouveyron, and Julie Delon. High-Dimensional Mixture Models For Unsupervised Image Denoising (HDMI). SIAM Journal on Imaging Sciences, 11(4):2815–2846, 2018. doi:10.1137/17M1135694.
Hurault et al. [2022] Samuel Hurault, Arthur Leclaire, and Nicolas Papadakis. Proximal denoiser for convergent plug-and-play optimization with nonconvex regularization. In International Conference on Machine Learning, pages 9483–9505. PMLR, 2022.
Kingma et al. [2019] Diederik P Kingma, Max Welling, et al. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019.
Laumont et al. [2022] Rémi Laumont, Valentin De Bortoli, Andrés Almansa, Julie Delon, Alain Durmus, and Marcelo Pereyra. Bayesian imaging using plug & play priors: when Langevin meets Tweedie. SIAM Journal on Imaging Sciences, 15(2):701–737, 2022.
Louchet and Moisan [2013] Cécile Louchet and Lionel Moisan. Posterior expectation of the total variation model: Properties and experiments. SIAM Journal on Imaging Sciences, 6(4):2640–2684, dec 2013. ISSN 19364954. doi:10.1137/120902276.
Mukherjee et al. [2024] S Mukherjee, S Dittmer, Z Shumaylov, S Lunz, O Öktem, and C-B Schönlieb. Data-driven convex regularizers for inverse problems. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 13386–13390. IEEE, 2024.
Papandreou and Yuille [2010] G. Papandreou and A. L. Yuille. Gaussian sampling by local perturbations. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010. URL https://proceedings.neurips.cc/paper_files/paper/2010/file/d09bf41544a3365a46c9077ebb5e35c3-Paper.pdf.
Pereyra [2016] Marcelo Pereyra. Proximal Markov chain Monte Carlo algorithms. Statistics and Computing, 26(4):745–760, jul 2016. ISSN 0960-3174. doi:10.1007/s11222-015-9567-4.
Pereyra et al. [2020] Marcelo Pereyra, Luis Vargas Mieles, and Konstantinos C Zygalakis. Accelerating proximal Markov chain Monte Carlo by using an explicit stabilized method. SIAM Journal on Imaging Sciences, 13(2):905–935, 2020.
Pereyra et al. [2023] Marcelo Pereyra, Luis A Vargas-Mieles, and Konstantinos C Zygalakis. The split Gibbs sampler revisited: improvements to its algorithmic structure and augmented target distribution. SIAM Journal on Imaging Sciences, 16(4):2040–2071, 2023.
Pesquet et al. [2021] Jean-Christophe Pesquet, Audrey Repetti, Matthieu Terris, and Yves Wiaux. Learning maximally monotone operators for image recovery. SIAM Journal on Imaging Sciences, 14(3):1206–1237, 2021.
Repetti et al. [2019] Audrey Repetti, Marcelo Pereyra, and Yves Wiaux. Scalable Bayesian Uncertainty Quantification in Imaging Inverse Problems via Convex Optimization. SIAM Journal on Imaging Sciences, 12(1):87–118, 2019. ISSN 1936-4954. doi:10.1137/18M1173629.
Rezende and Mohamed [2015] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530–1538. PMLR, 2015.
Roberts and Tweedie [1996] Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.
Rudin et al. [1992] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992. ISSN 01672789. doi:10.1016/0167-2789(92)90242-F.
Teodoro et al. [2018] Afonso M. Teodoro, José M. Bioucas-Dias, and Mário A. T. Figueiredo. Scene-Adapted Plug-and-Play Algorithm with Guaranteed Convergence: Applications to Data Fusion in Imaging. pages 1–11, jan 2018.
Vidal et al. [2020] Ana Fernandez Vidal, Valentin De Bortoli, Marcelo Pereyra, and Alain Durmus. Maximum likelihood estimation of regularization parameters in high-dimensional inverse problems: An empirical Bayesian approach part i: Methodology and experiments. SIAM Journal on Imaging Sciences, 13(4):1945–1989, 2020.
Vono et al. [2019] Maxime Vono, Nicolas Dobigeon, and Pierre Chainais. Split-and-augmented Gibbs sampler—application to large-scale inference problems. IEEE Transactions on Signal Processing, 67(6):1648–1661, 2019.
Wang et al. [2017] Zheng Wang, Johnathan M Bardsley, Antti Solonen, Tiangang Cui, and Youssef M Marzouk. Bayesian inverse problems with $l_{1}$ priors: a randomize-then-optimize approach. SIAM Journal on Scientific Computing, 39(5):S140–S166, 2017.
Wang and Bovik [2009] Zhou Wang and Alan C Bovik. Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE signal processing magazine, 26(1):98–117, 2009.
Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
Welling and Teh [2011] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688, 2011.
Yu et al. [2011] Guoshen Yu, Guillermo Sapiro, and Stéphane Mallat. Solving Inverse Problems with Piecewise Linear Estimators: From Gaussian Mixture Models to Structured Sparsity. IEEE Transactions on Image Processing, 21(5):2481–2499, 2011. doi:10.1109/TIP.2011.2176743.
Zoran and Weiss [2011] Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In 2011 International Conference on Computer Vision, pages 479–486. IEEE, nov 2011. ISBN 978-1-4577-1102-2. doi:10.1109/ICCV.2011.6126278. URL http://people.csail.mit.edu/danielzoran/EPLLICCVCameraReady.pdf.

	RTO	MYULA	MAP
Simpson	PSNR=35.01/SSIM=0.94	PSNR=33.86/SSIM=0.89	PSNR=34.99/SSIM=0.95
Traffic	PSNR=30.77/SSIM=0.92	PSNR=30.43/SSIM=0.89	PSNR=30.63/SSIM=0.92


PSNR=21.75/SSIM=0.47	PSNR=19.90/SSIM=0.37

MMSE (PSNR=25.71/SSIM=0.79)	Standard deviation

PSNR=16.45/SSIM=0.68	PSNR=15.89/SSIM=0.74

MMSE	Standard deviation
PSNR=26.00/SSIM=0.79
