Sampling Strategies in Bayesian Inversion: A Study of RTO and Langevin Methods

This work was funded by a Villum Investigator grant (no. 25893) from the Villum Foundation.

Rémi Laumont
Department of Applied Mathematics and Computer Science
Technical University of Denmark
Kongens Lyngby, 2800
[email protected]
Yiqiu Dong
Department of Applied Mathematics and Computer Science
Technical University of Denmark
Kongens Lyngby, 2800
[email protected]
Martin Skovgaard Andersen
Department of Applied Mathematics and Computer Science
Technical University of Denmark
Kongens Lyngby, 2800
[email protected]
Corresponding author
Abstract

This paper studies two classes of sampling methods for the solution of inverse problems, namely Randomize-Then-Optimize (RTO), which is rooted in sensitivity analysis, and Langevin methods, which are rooted in the Bayesian framework. The two classes of methods correspond to different assumptions and yield samples from different target distributions. We highlight the main conceptual and theoretical differences between the two approaches and compare them from a practical point of view by tackling two classical inverse problems in imaging: deblurring and inpainting. We show that the choice of the sampling method has a significant impact on the quality of the reconstruction and that the RTO method is more robust to the choice of the parameters.

Keywords Inverse problems, sampling, RTO, Langevin methods, deblurring, inpainting, parameter selection

1 Introduction

A typical inverse problem in imaging is to retrieve an image $x\in\mathbb{R}^{d}$ from a degraded observation $y\in\mathbb{R}^{m}$. A common observation model is the additive noise model, which can be formulated as

\[ y=\mathcal{A}(x)+n, \tag{1} \]

where $\mathcal{A}\colon\mathbb{R}^{d}\to\mathbb{R}^{m}$ is a so-called forward operator that models the deterministic aspects of the observation process. The term $n$ represents additive noise, and we will henceforth assume that $n$ is zero-mean Gaussian white noise with covariance $\sigma^{2}I$. We will also assume that the forward operator $\mathcal{A}$ is known and linear, which allows us to express the measurement model as $y=Ax+n$ with $A\in\mathbb{R}^{m\times d}$.
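The linear observation model can be simulated in a few lines. The following is a minimal sketch only: the periodic moving-average blur, the test signal, and the noise level are hypothetical choices made for illustration and are not taken from the experiments in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, m = 64, 64
sigma = 0.05  # noise standard deviation (illustrative value)

# Hypothetical forward operator: 1D periodic moving-average blur of width 5.
kernel = np.zeros(d)
kernel[[0, 1, 2, -1, -2]] = 1.0 / 5.0
A = np.stack([np.roll(kernel, i) for i in range(m)])  # circulant, shape (m, d)

# Piecewise-constant test signal standing in for an image.
x_true = np.zeros(d)
x_true[20:40] = 1.0

# Additive zero-mean Gaussian white noise with covariance sigma^2 I.
n = sigma * rng.standard_normal(m)
y = A @ x_true + n
```

Each row of `A` averages five neighboring entries of the signal, so `y` is a blurred and noisy version of `x_true`.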

The problem of estimating $x$ from the observation vector $y$ is often ill-posed. Generally speaking, this means that the observation may not correspond to a unique reconstruction or that the reconstruction is very sensitive to perturbations of $y$. Regularization in the form of prior information on $x$ is then necessary to obtain a well-posed problem [18]. In the variational framework, the reconstruction problem takes the form of an optimization problem,

\[ x^{*}\in\operatorname*{argmin}_{x}\,\{f(Ax,y)+g(x)\}, \tag{2} \]

where $x^{*}$ denotes the reconstruction. The objective function consists of a data-fidelity term $f(Ax,y)$ and a regularization term $g(x)$. The data-fidelity term measures the discrepancy between the observation $y$ and the model output $Ax$, whereas the regularization term $g(x)$ penalizes images with undesirable properties, e.g., images outside a neighborhood of some sub-manifold. In many cases, the minimization problem in (2) can be solved efficiently using advanced optimization methods. However, a solution to the problem in (2) does not provide information about the inherent uncertainty that may arise because of noisy measurements, discretization, and/or model errors.
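As a concrete instance of (2), consider the quadratic choices $f(Ax,y)=\tfrac{1}{2}\|Ax-y\|_2^2$ and $g(x)=\tfrac{\lambda}{2}\|x\|_2^2$ (Tikhonov regularization). These choices, the helper name `tikhonov_reconstruction`, and the test data are assumptions made for this sketch. In this case the minimizer solves the normal equations $(A^TA+\lambda I)x^{*}=A^Ty$.

```python
import numpy as np

def tikhonov_reconstruction(A, y, lam):
    """Minimize 0.5*||A x - y||^2 + 0.5*lam*||x||^2 via the normal equations."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ y)

rng = np.random.default_rng(1)
lam = 1e-2
A = rng.standard_normal((30, 10))
x_true = rng.standard_normal(10)
y = A @ x_true + 0.01 * rng.standard_normal(30)

x_star = tikhonov_reconstruction(A, y, lam)

# First-order optimality check: the gradient of the objective vanishes at x_star.
grad = A.T @ (A @ x_star - y) + lam * x_star
```

The result is a single point estimate. It carries no information about how much the reconstruction would change under a different noise realization, which is the limitation discussed above.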

Information about uncertainty can be obtained by adopting a probabilistic approach, and the Bayesian framework is a natural choice for this purpose. Assuming that the true image $x$ and the observation $y$ are considered realizations of random variables, the posterior probability density function (pdf) of $x$ given $y$ can be expressed in terms of the pdf $\pi_{\mathbbm{y}|\mathbbm{x}=x}(y)$, which characterizes the observation model, and a prior pdf $\pi_{\mathbbm{x}}(x)$, which encodes any prior information we may have about $x$ before observing $y$. Using Bayes’ formula, we can express the posterior density as

\[ \pi_{\mathbbm{x}|\mathbbm{y}=y}(x)=\frac{\pi_{\mathbbm{x}}(x)\,\pi_{\mathbbm{y}|\mathbbm{x}=x}(y)}{\pi_{\mathbbm{y}}(y)},\qquad \pi_{\mathbbm{y}}(y)=\int_{\mathbb{R}^{d}}\pi_{\mathbbm{x}}(x)\,\pi_{\mathbbm{y}|\mathbbm{x}=x}(y)\,\mathrm{d}x. \tag{3} \]

From a Bayesian point of view, the posterior $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$ is the solution to the inverse problem since it provides a complete characterization of the uncertainty. The posterior density can be used to compute point estimates, such as a maximum a posteriori (MAP) estimate or the posterior mean, and to quantify uncertainty in the form of credible intervals or second-order moments.
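In one dimension, Bayes' formula (3) can be evaluated directly on a grid. The sketch below assumes a scalar model $y=x+n$ with a uniform prior on $[a,b]$ and standard normal noise. The numerical values of $a$, $b$, and $y$ are illustrative, and the helper `trapezoid` is a hand-written quadrature rule introduced only for this example.

```python
import numpy as np

def trapezoid(f, dx):
    """Composite trapezoidal rule on a uniform grid."""
    return dx * (0.5 * f[0] + f[1:-1].sum() + 0.5 * f[-1])

a, b = 0.0, 1.0      # support of the uniform prior (illustrative values)
y, sigma = 1.2, 1.0  # observation and noise standard deviation

x = np.linspace(a, b, 2001)
dx = x[1] - x[0]

prior = np.full_like(x, 1.0 / (b - a))
likelihood = np.exp(-0.5 * ((y - x) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

unnormalized = prior * likelihood
evidence = trapezoid(unnormalized, dx)       # pi_y(y), the denominator in (3)
posterior = unnormalized / evidence          # truncated Gaussian on [a, b]

posterior_mean = trapezoid(x * posterior, dx)
x_map = x[np.argmax(posterior)]              # MAP estimate on the grid
```

Because the observation lies to the right of the interval, the MAP estimate sits on the boundary $b$ while the posterior mean lies strictly inside $[a,b]$. Grid evaluation of this kind is only feasible in very low dimensions, which motivates the sampling methods studied here.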

Although the Bayesian framework is well-established from a theoretical point of view, the posterior density is often intractable and cannot be computed in closed form. Thus, the posterior density is often explored using sampling methods. These generate a set of samples $\{x^{(i)}\}_{i=1}^{N}$ from the posterior density $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$, allowing us to compute point estimates and quantify uncertainty using Monte Carlo approximations to high-dimensional integrals.
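Once samples are available, the Monte Carlo approximations reduce to simple sample statistics. The following sketch uses draws from a known Gaussian as a stand-in for sampler output. The function name `summarize_samples` and the test distribution are hypothetical and serve only to illustrate the computation.

```python
import numpy as np

def summarize_samples(samples, level=0.95):
    """Pointwise summaries of samples with shape (N, d), one sample per row."""
    mean = samples.mean(axis=0)                 # Monte Carlo posterior mean
    std = samples.std(axis=0, ddof=1)           # pointwise standard deviation
    lo, hi = np.quantile(samples, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return mean, std, lo, hi                    # [lo, hi] is an equal-tailed credible interval

rng = np.random.default_rng(2)
# Stand-in for sampler output: N = 20000 draws from a known 2D Gaussian.
samples = rng.normal(loc=[0.0, 3.0], scale=[1.0, 0.5], size=(20000, 2))

mean, std, lo, hi = summarize_samples(samples)
```

For correlated chains such as those produced by Langevin methods, the same estimators apply, but their accuracy depends on the effective sample size rather than on $N$.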

The purpose of this paper is to contrast and compare two classes of sampling methods for solving inverse problems within the Bayesian framework, namely Randomize-Then-Optimize (RTO) [4] and Langevin sampling methods [34, 13, 14]. On the one hand, RTO is deeply rooted within the variational framework as it solves a perturbed optimization problem in order to generate one sample. On the other hand, Langevin methods stem from the discretization of a stochastic differential equation (SDE) whose stationary distribution is the target distribution. The two classes of methods are similar in many ways, but they are based on different principles and assumptions, and hence they lead to different densities. This is illustrated in Figure 1, which shows different solution densities obtained from an observation $y=x+n$ where $n$ is standard normal and $x\in[a,b]$ is an unknown parameter. Adopting a uniform prior on $[a,b]$ leads to a truncated Gaussian posterior. The Langevin approach requires smoothness and leads to a smooth approximate truncated Gaussian posterior. In contrast, the RTO approach yields a density that is a mixture of a truncated Gaussian distribution and a distribution on the boundary of the interval. The example highlights some key differences: unlike the smooth approximate truncated Gaussian posterior, the RTO density assigns non-zero probability to the boundary of the interval and has compact support. This RTO density is associated with an implicit prior that is supported on $[a,b]$ and depends on the observation $y$. Consequently, it violates the typical Bayesian assumption that the prior is independent of the observation. Using somewhat unconventional terminology, we will refer to the RTO density as the RTO posterior.

Figure 1: Truncated Gaussian posterior, a smooth approximation, and the RTO density for a Gaussian observation model. The vertical dotted line marks the target mean relative to the observation $y$, which is the mode of the likelihood.
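The boundary mass of the RTO density in this scalar example can be reproduced numerically. The sketch assumes that, for the constraint $x\in[a,b]$, an RTO sample is the projection of the perturbed observation $y+z$ onto $[a,b]$, which is what the constrained least-squares problem described in Section 2.1 reduces to in one dimension. The values of $a$, $b$, and $y$ are illustrative.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
a, b, y = 0.0, 1.0, 1.2   # illustrative interval and observation
N = 200_000

# RTO-style samples: perturb the observation, then solve the constrained
# least-squares problem, which in 1D is a projection onto [a, b].
rto = np.clip(y + rng.standard_normal(N), a, b)

# Truncated Gaussian posterior samples via simple rejection sampling.
proposals = y + rng.standard_normal(N)
trunc = proposals[(proposals >= a) & (proposals <= b)]

def std_normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

# Empirical and exact probability mass that RTO places on each endpoint.
mass_at_a = np.mean(rto == a)
mass_at_b = np.mean(rto == b)
expected_at_a = std_normal_cdf(a - y)
expected_at_b = 1.0 - std_normal_cdf(b - y)
```

The RTO samples hit the endpoints exactly with positive probability, whereas the truncated Gaussian samples almost surely lie in the open interval. Between the endpoints, both sets of samples follow the same truncated Gaussian shape.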

As the example illustrates, the choice of prior has an important impact on the posterior $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$. A good prior is typically related to the nature of the inverse problem. In the Bayesian imaging literature, there are many examples of priors that promote sparsity or piecewise regularity in some transformed domain (e.g., involving the $l_{1}$ norm or the total-variation (TV) pseudo-norm [35, 9, 25, 28]). Priors can also come in the form of a Markov random field [6], a learned prior like a patch-based Gaussian, or a Gaussian mixture model [44, 43, 1, 36, 21]. Choices pertaining to the prior are often informed by the resulting tractability of the posterior [28, 14, 32, 17, 11] or the ability to derive convergence guarantees [28, 14, 32, 17, 11]. Moreover, recent progress on neural networks has spurred an interest in data-driven approaches where a prior is learned from a large dataset $\{x_{i}\}_{i=1}^{N}\sim\pi_{\mathbbm{x}}(x)$ [22, 24, 20, 8, 19].

In the following sections, we compare two sampling techniques: Randomize-Then-Optimize (RTO) and the Moreau–Yoshida Unadjusted Langevin Algorithm (MYULA) [14]. The latter finds frequent application in the field of imaging science. We begin by contrasting the theoretical and practical aspects of these methods in Section 2, and we quantitatively compare and evaluate RTO and MYULA in Section 3 on two classical imaging inverse problems, namely deblurring and inpainting. Finally, we conclude in Section 4 by summarizing the main findings and discussing future research directions.

2 Theory and Methods

The posterior density $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$ is often intractable and cannot be computed in closed form. In such cases, the posterior density is explored using sampling methods. In this section, we will study two such methods, namely Randomize-Then-Optimize (RTO) and Langevin methods with a focus on MYULA. We will compare the two classes of methods from a theoretical point of view, and discuss their practical implications. We note that besides RTO and Langevin methods, there are many other efficient sampling methods for high-dimensional densities based on Gibbs samplers, see e.g. [38, 30]. These are mainly designed to sample from high-dimensional Gaussian distributions, and their adaptations to more complex target densities frequently involve Langevin methods [38].

2.1 Randomize-Then-Optimize (RTO) sampling method

The RTO sampling method for Bayesian inverse problems was proposed in [4] as a way to approximate samples from a posterior distribution arising from a nonlinear inverse problem with a Gaussian likelihood and a Gaussian prior. The basic idea is to perturb the observation vector $y$ by noise and solve a MAP-like optimization problem to obtain a sample. Specifically, if we consider the inverse problem in (1) with a Gaussian likelihood $\mathcal{N}(\mathcal{A}(x),\sigma^{2}I)$ and a Gaussian prior $\mathcal{N}(x_{0},S)$, the RTO method generates a sample from the posterior distribution by solving the perturbed optimization problem

\[ x^{\dagger}\in\operatorname*{argmin}_{x}\,\left\{\frac{1}{2\sigma^{2}}\|\mathcal{A}(x)-\hat{y}\|_{2}^{2}+g(x)\right\},\qquad g(x)=(x-x_{0})^{T}S^{-1}(x-x_{0}), \tag{4} \]

where the perturbed data $\hat{y}$ is drawn from the Gaussian pdf $\mathcal{N}(y,\sigma^{2}I)$. For a linear inverse problem, this procedure yields independent samples from the posterior and reduces to a technique for Gaussian sampling proposed in [27]. When dealing with a nonlinear forward operator $\mathcal{A}$, the problem (4) transforms into a nonlinear least-squares problem, and RTO generates samples from an approximate posterior. However, samples from the exact posterior can be obtained by incorporating a Metropolis–Hastings step.

Several extensions of RTO have been proposed in the literature. For example, in [39], RTO was extended to Laplace priors by converting it to a standard Gaussian prior using a variable transformation. Moreover, the RTO method has been extended to linear inverse problems with implicit priors such as nonnegativity constraints [2, 3], polyhedral constraints [16], and more generally, implicit log-priors with a polyhedral hypograph [15]. In the latter case, the function $g$ in (4) is a convex piecewise linear function that can be expressed as

\[ g(x)=\gamma\left(i_{\mathcal{C}}(x)+\max_{i\in\mathcal{I}}(c_{i}^{T}x+d_{i})\right), \tag{5} \]

where $\mathcal{C}\subseteq\mathbb{R}^{d}$ is a polyhedral set, $\gamma>0$ is a constant, $\mathcal{I}$ is a finite index set, and $(c_{i},d_{i})\in\mathbb{R}^{d}\times\mathbb{R}$ for $i\in\mathcal{I}$. This class of implicit priors includes nonnegativity, Besov priors, and $l_{1}$-norm based priors such as the anisotropic total variation (TV). As shown in [16], RTO generates samples from a well-defined probability density when $g$ is of the form (5) and $\operatorname{rank}(A)=d$. The resulting density assigns a positive probability to low-dimensional sets that correspond to the faces of the polyhedral epigraph of $g$. In the end, RTO reads

\begin{align*}
\mathbbm{z} &\sim \mathcal{N}(0,\sigma^{2}I) \tag{6a}\\
\mathbbm{x} &= \operatorname{prox}_{g}^{\Sigma^{-1}}(A^{\dagger}(y+\mathbbm{z})), \tag{6b}
\end{align*}

where $\Sigma^{-1}=A^{T}A/\sigma^{2}$, $A^{\dagger}=(A^{T}A)^{-1}A^{T}$, and $\operatorname{prox}_{g}^{\Sigma^{-1}}$ is the proximal operator with respect to the norm induced by $\Sigma^{-1}$, defined for all $x\in\mathbb{R}^{d}$ by $\operatorname{prox}_{g}^{\Sigma^{-1}}(x)=\operatorname{argmin}_{u}\{g(u)+\frac{1}{2}\|x-u\|_{\Sigma^{-1}}^{2}\}$.
In the case where $\operatorname{rank}(A)<d$, additional regularization may be necessary to guarantee the positive definiteness of $A^{T}A$ and the existence of a well-defined posterior density. For example, this can be achieved by adding a quadratic term of the form $\frac{\alpha}{2}\|x\|_{2}^{2}$ to the optimization problem in (4) [16].
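To illustrate (6a)-(6b), the sketch below draws RTO-style samples for a nonnegativity constraint, i.e., $g=i_{\mathcal{C}}$ with $\mathcal{C}=\{x\geq 0\}$, and a full-column-rank $A$. In this case the weighted proximal step in (6b) is equivalent to the constrained least-squares problem $\min_{u\geq 0}\|Au-(y+\mathbbm{z})\|_2^2$, since the two objectives differ only by a constant and a positive scaling. The projected-gradient solver, the function name `rto_nonneg_sample`, and the test problem are assumptions of this sketch and not the solver used in [15, 16].

```python
import numpy as np

def rto_nonneg_sample(A, y, sigma, rng, n_iter=300):
    """One RTO-style sample for g = indicator of {x >= 0}, A of full column rank."""
    y_hat = y + sigma * rng.standard_normal(y.shape)   # (6a): perturb the data
    step = 1.0 / np.linalg.norm(A, 2) ** 2             # 1/L for 0.5*||A u - y_hat||^2
    u = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ u - y_hat)
        u = np.maximum(u - step * grad, 0.0)           # projection onto the orthant
    return u

rng = np.random.default_rng(4)
sigma = 0.5
A = rng.standard_normal((40, 5))
x_true = np.array([1.0, 0.0, 2.0, 0.0, 0.5])
y = A @ x_true + sigma * rng.standard_normal(40)

samples = np.array([rto_nonneg_sample(A, y, sigma, rng) for _ in range(100)])
fraction_zero = np.mean(samples == 0.0, axis=0)        # per-coordinate mass on x_i = 0
```

Coordinates whose true value is zero are set exactly to zero in a sizable fraction of the samples, which reflects the positive probability that the RTO density assigns to faces of the constraint set.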

Although RTO is conceptually simple and can be used to sample from a wide range of densities, we recall that it does not define a posterior distribution in a rigorous manner: its associated implicit prior is observation-dependent, so Bayes’ rule does not hold. RTO is instead anchored in the sensitivity analysis framework, as it aims at quantifying the uncertainty resulting from perturbations occurring in the data space. Ultimately, its underlying distribution describes possible solutions given an observation. From a more practical point of view, we emphasize that the cost of solving an optimization problem for each sample can be prohibitive for high-dimensional problems. However, in practice it is not necessary to solve the optimization problem to high accuracy to obtain useful samples.

2.2 Langevin methods

Langevin sampling methods arise from the Langevin stochastic differential equation (SDE), which reads

\[ \mathrm{d}x_{t}=\nabla\log\pi_{\mathbbm{x}|\mathbbm{y}=y}(x_{t})\,\mathrm{d}t+\sqrt{2}\,\mathrm{d}b_{t}, \tag{7} \]

where $(b_{t})_{t\geq 0}$ denotes a $d$-dimensional Brownian motion, and the posterior density $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$ is the target density. Under mild assumptions on $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$, for any initial condition $x_{0}$, (7) admits a unique strong solution $(x_{t})_{t\geq 0}$ with the target density as the unique stationary density [34]. However, (7) in general cannot be solved analytically, and we have to discretize it in order to solve it numerically.

The unadjusted Langevin algorithm (ULA) is obtained by discretizing the Langevin SDE (7) using the Euler–Maruyama scheme. This yields the homogeneous Markov chain

\[ \mathbbm{x}_{k+1}=x_{k}+\delta\nabla\log\pi_{\mathbbm{x}|\mathbbm{y}=y}(x_{k})+\sqrt{2\delta}\,\mathbbm{z}_{k+1}, \tag{8} \]

where $(\mathbbm{z}_{k+1})_{k\in\mathbb{N}}$ is a sequence of independent standard normal random vectors, and $\delta>0$ is the discretization step-size that controls the trade-off between accuracy and convergence speed. Under mild assumptions on the target density, this homogeneous Markov chain admits a unique stationary density $\pi_{\mathbbm{x}|\mathbbm{y}=y}^{\delta}(x)$ [34]. Non-asymptotic bounds on the distance between the target density $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$ and the stationary density $\pi_{\mathbbm{x}|\mathbbm{y}=y}^{\delta}(x)$ are provided in [13]. By combining ULA with a Metropolis–Hastings step, one obtains the Metropolis-adjusted Langevin algorithm (MALA) [34], which produces samples from the target posterior. ULA requires the target log-density to be differentiable and is a popular method as it is straightforward to implement and scales well with the dimension $d$ [13]. It has been extended to nondifferentiable log-concave target densities by considering a surrogate density as proposed in [28, 14].
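The recursion (8) and the discretization bias it introduces can be illustrated on a toy target. The sketch below runs ULA on a standard normal target, for which $\nabla\log\pi(x)=-x$, using many independent scalar chains in parallel. The target, the step-size, and the function name `ula` are choices made for this illustration. For this linear recursion, a short calculation shows that the stationary variance is $1/(1-\delta/2)$ rather than $1$.

```python
import numpy as np

def ula(grad_log_target, x0, delta, n_steps, rng):
    """Run the ULA recursion (8) for n_steps and return the final state."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + delta * grad_log_target(x) + np.sqrt(2.0 * delta) * z
    return x

rng = np.random.default_rng(5)
delta = 0.2
# 50,000 independent scalar chains targeting N(0, 1), all started at zero.
x_final = ula(lambda x: -x, np.zeros(50_000), delta, n_steps=200, rng=rng)

empirical_var = x_final.var()
stationary_var = 1.0 / (1.0 - delta / 2.0)   # variance of the biased stationary law
```

The empirical variance matches the biased stationary value and exceeds the target variance of one. Reducing $\delta$ shrinks this bias at the cost of slower exploration, which is the accuracy versus convergence-speed trade-off mentioned above.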
Specifically, if the target density can be expressed as $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)\propto\exp(-f(x,y)-g(x))$ with nondifferentiable $g(x)$ and with $f(\cdot,y)$ and $g$ both convex, the Moreau–Yoshida ULA (MYULA) is obtained by using the Moreau–Yoshida envelope $g_{\alpha}$ of $g$ [5] to construct a surrogate target density $\pi_{\mathbbm{x}|\mathbbm{y}=y}^{\alpha}(x)\propto\pi_{\mathbbm{y}|\mathbbm{x}=x}(y)\,\pi_{\mathbbm{x}}^{\alpha}(x)$ with $\pi_{\mathbbm{x}}^{\alpha}(x)\propto\exp(-g_{\alpha}(x))$.
The function $g_{\alpha}$ is continuously differentiable with $1/\alpha$-Lipschitz gradient and such that $\nabla g_{\alpha}(x)=(x-\operatorname{prox}_{g}^{\alpha}(x))/\alpha$, where $\operatorname{prox}_{g}^{\alpha}=\operatorname{prox}_{g}^{\alpha^{-1}I}$. The resulting homogeneous Markov chain reads

\begin{align*}
\mathbbm{x}_{k+1} &= x_{k}+\delta\nabla\log\pi_{\mathbbm{x}|\mathbbm{y}=y}^{\alpha}(x_{k})+\sqrt{2\delta}\,\mathbbm{z}_{k+1} \tag{9a}\\
&= x_{k}-\delta[\nabla f(x_{k},y)+\nabla g_{\alpha}(x_{k})]+\sqrt{2\delta}\,\mathbbm{z}_{k+1} \tag{9b}\\
&= \left(1-\frac{\delta}{\alpha}\right)x_{k}-\delta\nabla f(x_{k},y)+\frac{\delta}{\alpha}\operatorname{prox}_{g}^{\alpha}(x_{k})+\sqrt{2\delta}\,\mathbbm{z}_{k+1}, \tag{9c}
\end{align*}

where $\operatorname{prox}_{g}^{\alpha}$ denotes the proximal operator associated with $g$ and parameter $\alpha>0$ [5]. In other words, the transition Markov kernel is Gaussian and reads

\begin{align*}
\tilde{x}_{k+1}(x_{k}) &= \left(1-\frac{\delta}{\alpha}\right)x_{k}-\delta\nabla f(x_{k},y)+\frac{\delta}{\alpha}\operatorname{prox}_{g}^{\alpha}(x_{k}), \tag{10a}\\
(\mathbbm{x}_{k+1}|\mathbbm{x}_{k}=x_{k}) &\sim \mathcal{N}(\tilde{x}_{k+1}(x_{k}),2\delta I). \tag{10b}
\end{align*}

Non-asymptotic bounds on the distance between the target density $\pi_{\mathbbm{x}|\mathbbm{y}=y}(x)$ and the stationary density associated with MYULA, $\pi_{\mathbbm{x}|\mathbbm{y}=y}^{\alpha,\delta}(x)$, are provided in [14].

MYULA is conceptually simple. The evaluation of the proximal operator $\operatorname{prox}_{g}^{\alpha}$ requires the solution of a convex optimization problem, and hence the cost of a sample can be high if no closed-form expression is available. However, as shown empirically in [14], evaluating the proximal operator inaccurately can still produce good results.
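To make the update (10a)-(10b) concrete, the following is a minimal sketch of one MYULA iteration on a toy problem. The toy model (identity forward operator, $g(x)=\gamma\|x\|_{1}$ with its closed-form soft-thresholding prox), the step size, and all function names are assumptions made for illustration and are not the setup used in this paper.

```python
import numpy as np

def soft_threshold(x, t):
    """Closed-form proximal operator of t * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def myula_step(x, grad_f, prox_g, delta, alpha, rng):
    """One MYULA iteration: deterministic drift (10a) plus N(0, 2*delta*I) noise (10b)."""
    drift = ((1.0 - delta / alpha) * x
             - delta * grad_f(x)
             + (delta / alpha) * prox_g(x, alpha))
    return drift + np.sqrt(2.0 * delta) * rng.standard_normal(x.shape)

# Toy model (illustrative): f(x) = ||x - y||^2 / (2 sigma^2), g(x) = gamma * ||x||_1.
rng = np.random.default_rng(0)
y = np.array([1.0, -0.5, 0.0, 2.0])
sigma2, gamma = 0.1, 1.0
grad_f = lambda x: (x - y) / sigma2
# prox_g^alpha(x) = argmin_u g(u) + ||u - x||^2 / (2 alpha), i.e. threshold alpha * gamma.
prox_g = lambda x, alpha: soft_threshold(x, alpha * gamma)

alpha = sigma2                               # assumed smoothing parameter
delta = 1.0 / (2.0 / alpha + 2.0 / sigma2)   # assumed conservative step size

x = y.copy()
chain = []
for _ in range(5000):
    x = myula_step(x, grad_f, prox_g, delta, alpha, rng)
    chain.append(x)
chain = np.array(chain)
posterior_mean = chain[1000:].mean(axis=0)   # discard the first 1000 iterations as burn-in
```

Each iteration costs one gradient evaluation and one prox evaluation, which is why the per-sample cost stays low whenever the prox is cheap.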

2.3 Conceptual differences between RTO and MYULA

We now compare RTO and Langevin methods from a theoretical point of view while paying special attention to models with non-differentiable priors. Specifically, we will focus on the RTO method proposed in [15, 16] and MYULA [28, 14]. Table 1 lists the main conceptual differences, which we will now explain in more detail. To simplify this comparison, we focus on a Gaussian likelihood as the behavior of RTO can be characterized in this setting.

| | RTO | MYULA |
|---|---|---|
| Log-prior | polyhedral hypograph | concave |
| Target density | implicit form | explicit form |
| Random perturbation | in data space | in image space |
| Samples | independent | correlated |
| Burn-in | no | required |
| Sample generation cost | high: computing (6b) | low: computing (10a) |
| Parameter selection | online, hierarchical approach | offline, empirical approach |
| Parallelization | yes | no |

Table 1: Comparisons between RTO and MYULA from a theoretical point of view for a Gaussian likelihood.

First, it is important to note that the two methods are based on different assumptions. RTO's analysis relies on the assumption that the likelihood is Gaussian and that the log-prior has a polyhedral hypograph. In contrast, MYULA is compatible with more general posteriors that are formed from a continuously differentiable log-concave likelihood with a Lipschitz gradient and a log-concave prior. Because of the discretization of the Langevin SDE defined in (7) and the use of a surrogate log-prior density, MYULA does not actually produce samples from the target posterior but from a surrogate posterior density. The distance between the surrogate density and the target density can be quantified in a non-asymptotic sense. The samples generated by RTO, in contrast, follow a pdf that is only available in implicit form and that also differs from the target posterior.

Both RTO and MYULA involve a stochastic perturbation to produce a sample, but at different stages. RTO applies a perturbation in the data space before solving an instance of the minimization problem, as shown in (6). The operation in (6b) in fact corresponds to a generalization of an oblique projection and tends to concentrate the probability mass in some areas of the space. This partially explains why RTO can assign a strictly positive probability mass to low-dimensional subsets and why RTO samples can be sparse. In MYULA, the perturbation is applied in the image space after performing an explicit gradient descent step with respect to the surrogate posterior, see (10). That is why Langevin-based methods can be interpreted as a perturbed gradient descent [42]. In MYULA, because a surrogate model is used to handle the non-differentiable prior, we also need to perform a proximal operation, but only on the prior, see (10a). The computation of this proximal operation is often much cheaper than the one in RTO, since the forward operator $A$ is not involved. The accuracy of computing $\operatorname{prox}_{g}^{\alpha}$ has an impact on the quality of the samples, which is investigated in Section 3.1. Because the density sampled by MYULA is continuous on the whole image space, MYULA is unable to enforce constraints.

Comparing the computational costs, generating one RTO sample requires solving an optimization problem, i.e., computing (6b). With MYULA, in contrast, we obtain one sample at each iteration, which involves the computation of a proximal operator, i.e., computing (10a). In RTO, as the perturbations are independent, the samples are independent. Furthermore, we can parallelize RTO to generate several samples at the same time. MYULA, like all Markov chain Monte Carlo (MCMC) methods, requires a burn-in period in order for the Markov chain to enter its stationary regime and actually produce samples from its stationary density. The length of the burn-in period is unknown a priori. In addition, samples generated by MYULA, as well as by other Langevin methods, are correlated, since each sample is generated from the previous one. Further, methods of this type cannot be parallelized, contrary to RTO.

Finally, we compare how easily the two methods extend to model parameter selection. Often the posteriors include some unknown parameters, e.g., $\sigma$ in (4), $\gamma$ in (5) and $\alpha$ in (9a). The choice of those parameters can have a great impact on the posterior. RTO allows us to perform automatic parameter selection by considering a hierarchical model, i.e., we add prior distributions on the hyperparameters $\lambda=1/\sigma^{2}$ and $\gamma$ and consider the posterior

$$\pi_{\mathbbm{x},\bblambda,\bbgamma|\mathbbm{y}=y}(x,\lambda,\gamma)=\pi_{\mathbbm{x}|\mathbbm{y}=y,\bblambda=\lambda,\bbgamma=\gamma}(x)\,\pi_{\bblambda}(\lambda)\,\pi_{\bbgamma}(\gamma). \tag{11}$$

To utilize the conjugacy with Gaussian distributions, we impose $\Gamma$-distributions for both $\bblambda$ and $\bbgamma$. The negative logarithm of the conditional distribution $\pi_{\mathbbm{x}|\mathbbm{y}=y,\bblambda=\lambda,\bbgamma=\gamma}$ is proportional to

$$\lambda\|\mathcal{A}(x)-\hat{y}\|_{2}^{2}+g(x) \tag{12}$$

with $g(x)$ given in (5).

Algorithm 1 summarizes the procedure to sample from such a hierarchical model within the RTO framework.

Algorithm 1 RTO Hierarchical Gibbs sampler for $(\mathbbm{x},\bblambda,\bbgamma)$
Require: $N\in\mathbb{N}^{*},\ x_{0}\in\mathbb{R}^{d}$.
Ensure: $(x_{k})_{k\in\{1,\dots,N\}}$
for $k=1$ to $N$ do
     Step 1: Sample $\lambda_{k}\sim\pi_{\bblambda|\mathbbm{x}=x_{k-1},\mathbbm{y}=y}(\lambda)$ and $\gamma_{k}\sim\pi_{\bbgamma|\mathbbm{x}=x_{k-1},\mathbbm{y}=y}(\gamma)$; both conditionals are $\Gamma$-distributions.
     Step 2: Sample $x_{k}\sim\pi_{\mathbbm{x}|\bblambda=\lambda_{k},\bbgamma=\gamma_{k},\mathbbm{y}=y}(x)$ by RTO.
end for
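As an illustration of Step 1, the sketch below draws $\lambda$ from its conditional under the standard Gamma-Gaussian conjugacy. It assumes a likelihood $\mathcal{N}(Ax,\lambda^{-1}I)$ on $m$ data points and a $\Gamma(a,b)$ prior (shape $a$, rate $b$) on $\lambda$, which gives $\lambda\,|\,x,y\sim\Gamma(a+m/2,\ b+\tfrac{1}{2}\|Ax-y\|_{2}^{2})$. The hyperparameter values and the function name are hypothetical, and the exact conditionals used in [15, 16], including the one for $\gamma$, may differ in their constants.

```python
import numpy as np

def sample_lambda(residual, a, b, rng):
    """Draw lambda | x, y ~ Gamma(shape = a + m/2, rate = b + ||residual||^2 / 2)."""
    m = residual.size
    shape = a + 0.5 * m
    rate = b + 0.5 * float(np.sum(residual ** 2))
    return rng.gamma(shape, 1.0 / rate)   # NumPy uses scale = 1 / rate

rng = np.random.default_rng(1)
sigma = 0.05
# Stand-in for the residual A x_{k-1} - y when the noise level is sigma.
residual = sigma * rng.standard_normal(10_000)
draws = np.array([sample_lambda(residual, a=1.0, b=1e-4, rng=rng) for _ in range(200)])
print(draws.mean(), 1.0 / sigma ** 2)     # the two values should be close
```

With many data points the conditional concentrates near $1/\sigma^{2}$, so this step adds very little to the cost of one Gibbs sweep compared with the RTO solve in Step 2.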

Selecting posterior parameters in the Langevin-based methods is much more difficult. As far as we know, no Langevin-based methods can update or simultaneously estimate the model parameters. Recently, Vidal et al. proposed to perform parameter selection based on maximum likelihood estimation [37, 12], which can simultaneously estimate multiple model parameters and is theoretically well-founded. But the procedure has to be implemented offline, i.e., the parameters have to be predetermined before sampling.

3 Experimental study

In this section, we numerically investigate the differences between RTO and MYULA for sampling a high-dimensional posterior model. To do so, we tackle two kinds of imaging inverse problems described by (1): deblurring and inpainting. In both cases, the forward operator $A$ is known and linear.

Figure 2: Original images used for the deblurring and inpainting experiments. Panels: Simpson, Traffic.

Figure 2 shows the two 256-by-256 test images, with intensity range $[0,1]$, that are used in this section. These images have been selected because of their differences in content and level of detail. For example, Traffic presents a mix of piecewise constant and more textured areas. We consider a prior of the form $p(x)\propto\exp(-g(x))$ with

$$g(x)=\gamma\|\nabla x\|_{1,1}+i_{\mathcal{C}}(x) \tag{13}$$

and $\mathcal{C}=[0,1]^{d}$. It is a non-differentiable log-concave prior which promotes images with sparse gradients. We compare the results obtained by RTO with the results of MYULA. Note that both RTO and MYULA sample their target posterior densities approximately, and their corresponding approximations are different.
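For readers who want to evaluate the prior (13) numerically, here is a minimal sketch. It assumes that $\|\nabla x\|_{1,1}$ denotes the anisotropic total variation computed with forward differences and periodic boundaries; the discretization and the function names are assumptions made for illustration.

```python
import numpy as np

def tv_aniso(x):
    """Anisotropic TV: sum of absolute forward differences (periodic boundary assumed)."""
    dx = np.roll(x, -1, axis=1) - x   # horizontal differences
    dy = np.roll(x, -1, axis=0) - x   # vertical differences
    return np.abs(dx).sum() + np.abs(dy).sum()

def project_box(x, lo=0.0, hi=1.0):
    """Projection onto C = [lo, hi]^d, i.e. the proximal operator of the indicator i_C."""
    return np.clip(x, lo, hi)

def neg_log_prior(x, gamma):
    """g(x) = gamma * TV(x) + i_C(x); returns +inf outside the box."""
    if x.min() < 0.0 or x.max() > 1.0:
        return np.inf
    return gamma * tv_aniso(x)

# A piecewise constant image has a small TV value, so it is favored by the prior.
step = np.zeros((4, 4))
step[:, 2:] = 1.0
print(tv_aniso(step), neg_log_prior(step, gamma=5.0))
```

The indicator term is what makes the prior enforce the constraint exactly, and its prox is the cheap clipping operation shown in `project_box`.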

We generate 1000 RTO samples. To solve the minimization problem (4) with $g(x)$ defined in (13) in RTO, we apply the alternating direction method of multipliers (ADMM) with the stopping criterion suggested in [7, Eq. (3.12)]. According to the theory in [15, 16], to guarantee a well-defined posterior density we need to add a term $\frac{\alpha}{2}\|x\|_{2}^{2}$ when $A$ is rank-deficient, which is the case here, and we set $\alpha=10^{-8}$ to ensure little impact on the results.

With $g(x)$ defined in (13), MYULA targets the surrogate posterior density $\pi_{\alpha_{1},\alpha_{2}}(x|y)$:

$$\pi_{\mathbbm{x}|\mathbbm{y}=y}^{\alpha_{1},\alpha_{2}}(x)\propto\exp\left(-\frac{1}{2\sigma^{2}}\|Ax-y\|_{2}^{2}-[\gamma\|\nabla\cdot\|_{1,1}]_{\alpha_{1}}(x)-[i_{\mathcal{C}}]_{\alpha_{2}}(x)\right), \tag{14}$$

where $[\gamma\|\nabla\cdot\|_{1,1}]_{\alpha_{1}}$ and $[i_{\mathcal{C}}]_{\alpha_{2}}$ denote the Moreau-Yosida envelopes of the two terms in $g$, respectively. To apply MYULA defined in (9a), we need $\operatorname{prox}_{\gamma\|\nabla\cdot\|_{1,1}}^{\alpha_{1}}$. However, since $\operatorname{prox}_{\gamma\|\nabla\cdot\|_{1,1}}^{\alpha_{1}}$ has no closed-form expression, we apply the first-order primal-dual algorithm introduced in [10] to estimate it. In the following, we set the number of iterations used to estimate this proximal operator to $n_{pd}=50$. We initialize MYULA at the data $y$ and run $10^{6}$ iterations, with the first $2.5\times 10^{4}$ iterations as burn-in. In addition, we set the thinning parameter to $250$, i.e., we store a sample every $250$ iterations. This yields $3900$ samples in total. As suggested in [14], we set $\alpha_{1}=\alpha_{2}=1/(\|A^{T}A\|/\sigma^{2})$ and $\delta=1/(2/\alpha_{1}+2/\alpha_{2}+2\|A^{T}A\|/\sigma^{2})$.
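The parameter rules and the sample count above can be checked with a few lines of arithmetic. The sketch below assumes $\|A^{T}A\|=1$, which holds for a normalized blur kernel under periodic boundary conditions, and uses $\sigma^{2}=0.001$ as in Section 3.1; the function names are hypothetical.

```python
def myula_parameters(norm_AtA, sigma2):
    """Smoothing parameters and step size following the rules quoted from [14]."""
    lipschitz = norm_AtA / sigma2            # Lipschitz constant of the likelihood gradient
    alpha1 = alpha2 = 1.0 / lipschitz
    delta = 1.0 / (2.0 / alpha1 + 2.0 / alpha2 + 2.0 * lipschitz)
    return alpha1, alpha2, delta

def stored_samples(n_iter, burn_in, thinning):
    """Number of samples kept after discarding burn-in and thinning the chain."""
    return (n_iter - burn_in) // thinning

alpha1, alpha2, delta = myula_parameters(norm_AtA=1.0, sigma2=1e-3)  # norm_AtA = 1 is assumed
n_kept = stored_samples(n_iter=10**6, burn_in=25_000, thinning=250)
print(alpha1, delta, n_kept)
```

Under these assumptions the step size is about $1.7\times 10^{-4}$, and $(10^{6}-2.5\times 10^{4})/250=3900$ samples are stored, in agreement with the count reported above.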

To quantitatively evaluate the reconstruction quality, we use the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) [41, 40].
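For reference, PSNR has a simple closed form, sketched below for images with intensity range $[0,1]$. The function name and the `data_range` argument are our own choices for illustration. SSIM is more involved and is usually taken from a library, e.g. `structural_similarity` in scikit-image.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(data_range^2 / MSE)."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

# A uniform error of 0.1 on a [0, 1] image gives MSE = 0.01, hence 20 dB.
reference = np.zeros((8, 8))
estimate = np.full((8, 8), 0.1)
print(psnr(reference, estimate))
```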

3.1 Deblurring

In this section, images are degraded by a $9\times 9$ uniform blurring kernel and additive white Gaussian noise with zero mean and variance $\sigma^{2}=0.001$. We assume periodic boundary conditions, so that $A$ can be diagonalized by the Fourier transform. In Figure 3 we show the degraded images.

Figure 3: (Deblurring) degraded images. Scores of the two panels, in order: PSNR=21.75/SSIM=0.47 and PSNR=19.90/SSIM=0.37.
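The degradation model of this section can be sketched with the FFT, since a periodic convolution is diagonal in the Fourier basis. The code below is an illustration under our own conventions (kernel centered at pixel $(0,0)$, random test image, hypothetical function names), not the code used for the experiments.

```python
import numpy as np

def uniform_blur_otf(shape, size=9):
    """Fourier multiplier of a size x size uniform kernel with periodic boundary."""
    kernel = np.zeros(shape)
    kernel[:size, :size] = 1.0 / size ** 2
    # Shift the kernel so that its center sits at index (0, 0).
    kernel = np.roll(kernel, (-(size // 2), -(size // 2)), axis=(0, 1))
    return np.fft.fft2(kernel)

def blur(x, otf):
    """Apply A x as a pointwise product in the Fourier domain."""
    return np.real(np.fft.ifft2(otf * np.fft.fft2(x)))

def degrade(x, otf, sigma2, rng):
    """Observation y = A x + n with n ~ N(0, sigma2 * I)."""
    return blur(x, otf) + np.sqrt(sigma2) * rng.standard_normal(x.shape)

rng = np.random.default_rng(0)
otf = uniform_blur_otf((256, 256))
x = rng.random((256, 256))             # placeholder image with values in [0, 1]
y = degrade(x, otf, sigma2=1e-3, rng=rng)
print(np.abs(otf).max())               # largest singular value of A
```

Because the kernel is normalized, the largest modulus of the Fourier multiplier is 1, which gives $\|A^{T}A\|=1$ for this operator.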

For the stopping criteria in ADMM for RTO, we set the tolerance of the primal and dual residuals to $\texttt{tol}=10^{-4}$ and the maximum iteration number to $\texttt{maxiter}=2000$. In addition, we set the regularization parameter $\gamma$ in $g(x)$ to $\gamma=5$. For MYULA, we use $\gamma=10$.

Remark.

As explained in Section 2, regularization parameters can have a significant impact on the results. For the regularization parameter $\gamma$ in RTO, we use the value achieving the largest PSNR when solving the unperturbed optimization problem. It generally delivers good results in the RTO framework. This means that the maximum-a-posteriori (MAP) point estimate can be a good reference for selecting $\gamma$, and it highlights the connection between optimization and sampling within the RTO framework. Finding a good $\gamma$ for MYULA is more complicated, since we found that the regularization parameter generating a MAP estimate with high PSNR and SSIM scores for the unperturbed optimization problem often leads to overly noisy samples in MYULA. Consequently, we had to run several Markov chains with different regularization parameters $\gamma$ in order to find a good value, which is much more time-consuming than for RTO.

Results and discussions: In Figure 4 we show the minimum mean square error (MMSE) estimates of RTO and MYULA together with the MAP estimates for comparison. We can see that the MMSE estimate of RTO achieves the highest PSNR and SSIM scores. Visually it resembles the MAP estimate, including the stair-casing artifacts due to the use of TV regularization. In contrast, we do not observe stair-casing artifacts in the MMSE estimate computed by MYULA, but a grid pattern degrades the restoration.

Figure 4: (Deblurring) MMSE estimates computed respectively with RTO and MYULA together with MAP estimates.
Simpson: RTO PSNR=25.69/SSIM=0.79; MYULA PSNR=24.97/SSIM=0.72; MAP PSNR=25.17/SSIM=0.77.
Traffic: RTO PSNR=24.00/SSIM=0.66; MYULA PSNR=23.18/SSIM=0.63; MAP PSNR=23.69/SSIM=0.65.

Figure 5 shows the standard deviation maps of RTO and MYULA, respectively. In the results of both methods we observe higher uncertainties around edges, which is expected as high-frequency information is more challenging to restore. However, the standard deviation of MYULA exhibits a more spread-out uncertainty: edges are less uncertain than with RTO, whereas the constant regions have larger standard deviations than those of RTO. These differences relate to where the perturbation is performed. In RTO we perturb the data before computing the proximal operator, whereas in MYULA the perturbation takes place after taking a gradient step in the image space.

Figure 5: (Deblurring) Marginal posterior standard deviation computed with the samples generated by RTO and MYULA. Panel rows: RTO, MYULA.
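The MMSE estimates and standard deviation maps discussed here are pixelwise statistics over the stored samples. A minimal sketch, assuming the samples are stacked along the first axis of an array and using a hypothetical function name, could read:

```python
import numpy as np

def posterior_summaries(samples):
    """Pixelwise sample mean (MMSE estimate) and sample standard deviation.

    `samples` has shape (n_samples, height, width).
    """
    samples = np.asarray(samples, dtype=float)
    mmse = samples.mean(axis=0)
    std = samples.std(axis=0, ddof=1)   # unbiased variance estimate
    return mmse, std

# Placeholder samples standing in for RTO or MYULA output.
rng = np.random.default_rng(0)
samples = 0.5 + 0.1 * rng.standard_normal((1000, 16, 16))
mmse, std = posterior_summaries(samples)
print(mmse.shape, std.mean())
```

For a correlated chain such as MYULA the same formulas apply, but the Monte Carlo error of both maps is larger than for the same number of independent samples.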

To check the correlation among the samples, we compute the sample auto-correlation functions (ACFs), which measure how fast samples become uncorrelated. A fast decay of the ACFs indicates no linear dependency among the samples and is also interpreted as a short mixing time for Markov chains. It comes with accurate Monte Carlo estimates. As images are high-dimensional objects, directly estimating the $d$-dimensional ACFs is not realistically doable. However, the convergence speeds can be inferred from the posterior covariance matrix [29]. We assume that the shape of the posterior covariance is mostly determined by the likelihood. Therefore, we approximate the posterior covariance using the directions provided by the diagonalization basis of the forward operator $A$, i.e., the Fourier basis, for the deblurring experiments as in [24]. Figure 6 shows the evolution of the estimated ACFs of both RTO and the complete Markov chain of MYULA. The ACFs of the RTO samples immediately drop to zero, which is consistent with the independence of the RTO samples, while samples generated by MYULA only become uncorrelated after $500$ iterations. This means that in order to get $1000$ uncorrelated samples, we need to run MYULA for $5\times 10^{5}$ iterations. Note that the number of iterations before MYULA samples become uncorrelated actually depends on the inverse problem we deal with, and cannot be evaluated a priori but only after running the Markov chain. Furthermore, MYULA requires a burn-in phase that corresponds to the time required by the chain to enter its stationary regime, i.e., when samples generated by the Markov chain are sampled from $\pi_{\delta,\alpha_{1},\alpha_{2}}(x|y)$. The number of burn-in iterations also cannot be assessed a priori but only after running the chain.
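A sample ACF for a scalar summary of the chain can be estimated as sketched below. The scalar chains here are synthetic stand-ins (an i.i.d. sequence and an AR(1) process), chosen only to illustrate the two behaviors discussed above; in practice the scalar could be, for example, the coefficient of each sample along one Fourier direction.

```python
import numpy as np

def sample_acf(chain, max_lag):
    """Sample autocorrelation of a 1-D chain for lags 0..max_lag."""
    chain = np.asarray(chain, dtype=float)
    centered = chain - chain.mean()
    denom = np.dot(centered, centered)
    n = centered.size
    return np.array([np.dot(centered[: n - lag], centered[lag:]) / denom
                     for lag in range(max_lag + 1)])

rng = np.random.default_rng(0)
iid = rng.standard_normal(5000)            # mimics independent samples
ar1 = np.empty(5000)                       # mimics a correlated Markov chain
ar1[0] = 0.0
for k in range(1, ar1.size):
    ar1[k] = 0.9 * ar1[k - 1] + rng.standard_normal()

acf_iid = sample_acf(iid, max_lag=50)
acf_ar1 = sample_acf(ar1, max_lag=50)
print(acf_iid[1], acf_ar1[1])
```

The lag at which the ACF becomes negligible suggests a thinning interval, which is how a count such as $1000$ uncorrelated samples translates into a total number of iterations.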

Figure 6: (Deblurring) The comparison of ACFs of RTO and MYULA on both test images. Panel rows: RTO, MYULA; panel columns: Simpson, Traffic.

In Figure 7 we show a few samples generated by RTO and MYULA, respectively. It is clear that MYULA samples are much noisier than those of RTO. The reason is that RTO samples are in fact MAP estimates with a perturbed observation. On the other hand, MYULA samples are generated by a gradient descent step with added noise. In addition, we can see that MYULA samples do not belong to $\mathcal{C}$, unlike RTO samples.

Figure 7: (Deblurring) Samples generated by RTO and MYULA.
RTO samples: PSNR=24.76/SSIM=0.74; PSNR=24.72/SSIM=0.74; PSNR=24.72/SSIM=0.74.
MYULA samples: PSNR=21.17/SSIM=0.37; PSNR=21.18/SSIM=0.37; PSNR=21.18/SSIM=0.37.

Figure 8 shows the relative errors of the samples generated by RTO and MYULA with respect to their corresponding MMSE estimates. It is interesting to note that the relative error is higher in MYULA's case. This means that the RTO posterior is more concentrated around its mean than MYULA's. In RTO the perturbation takes place in the data space and is then smoothed by the regularization.

Figure 8: (Deblurring) Relative errors of samples generated by RTO and MYULA with respect to MMSE estimates. Panel rows: RTO, MYULA; panel columns: Simpson, Traffic.
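One plausible way to compute such relative errors is sketched below. It assumes the definition $\|x_{k}-\bar{x}\|_{2}/\|\bar{x}\|_{2}$, where $\bar{x}$ is the sample mean; this definition and the function name are assumptions made for illustration, since the exact formula is not stated here.

```python
import numpy as np

def relative_errors(samples):
    """Relative l2 distance of each sample to the sample mean (MMSE estimate)."""
    samples = np.asarray(samples, dtype=float)
    mmse = samples.mean(axis=0)
    deviations = (samples - mmse).reshape(len(samples), -1)
    return np.linalg.norm(deviations, axis=1) / np.linalg.norm(mmse)

# Two constant 2 x 2 "images" with values 1 and 3: the mean is 2, each deviates by 1.
samples = np.stack([np.ones((2, 2)), 3.0 * np.ones((2, 2))])
print(relative_errors(samples))
```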

Regarding the computational time, it took around $6$ hours to generate 1000 RTO samples and around $35$ hours to generate 3900 MYULA samples after burn-in and thinning on an Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz. To reduce the computational cost, we can decrease the accuracy when solving the optimization problems in the sampling methods, i.e., the computation of $\operatorname{prox}_{\gamma\|\nabla\cdot\|_{1,1}}^{\alpha_{1}}$ in MYULA and of the MAP estimate on the perturbed data in RTO. In Figure 9 we show the results produced by MYULA when setting $n_{pd}=10$ for Simpson. In this case, the running time is reduced to around $10$ hours. Comparing with the results in Figure 4 and Figure 5, where we set $n_{pd}=50$, we can see that the MMSE estimate is noisier and the uncertainty is not well captured. In addition, the ACF plot shows that we need many more iterations in order to obtain uncorrelated samples. In RTO we change tol in ADMM from $10^{-4}$ to $10^{-2}$ and show the results in Figure 10. It is clear that the MMSE estimate is nearly identical to the one in Figure 4. However, the standard deviations close to the edges are slightly smaller than the ones in Figure 5. This means that reasonably reducing the solution accuracy has a minor influence on the MMSE estimate, but can lead to insufficiently explored uncertainty.

Figure 9: (Deblurring) Results generated by MYULA with $n_{pd}=10$. Panels: MMSE (PSNR=25.06/SSIM=0.66), standard deviation, ACFs.
Figure 10: (Deblurring) Results generated by RTO with $\texttt{tol}=10^{-2}$. Panels: MMSE (PSNR=25.71/SSIM=0.79), standard deviation.

3.2 Inpainting

In this section, we study the inpainting inverse problem, where the forward operator $A$ selects the locations of the known pixel values. Figure 11 shows the observations $y$, where the black regions mark the pixels whose information is lost. In this test, $58549$ of $65536$ pixels are observed, i.e., $A\in\mathbb{R}^{58549\times 65536}$. In addition, $y$ is also corrupted by additive white Gaussian noise with zero mean and standard deviation $\sigma=0.02$. For both methods we set the regularization parameter $\gamma=8$. To ensure good results in a reasonable amount of time, we stop ADMM in RTO when the primal residual is smaller than $\texttt{tol}=2\times 10^{-3}$ or the number of iterations is larger than $500$.

Figure 11: (Inpainting) Observations for the inpainting problem. Scores of the two panels, in order: PSNR=16.45/SSIM=0.68 and PSNR=15.89/SSIM=0.74.
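The inpainting operator and its adjoint can be sketched with a boolean mask. The mask below is drawn at random purely for illustration, whereas the masks in the experiments consist of regions; the function names and the placeholder image are likewise assumptions.

```python
import numpy as np

def make_mask(shape, n_observed, rng):
    """Boolean mask with exactly n_observed observed pixels (random locations, illustrative)."""
    mask = np.zeros(shape[0] * shape[1], dtype=bool)
    mask[rng.choice(mask.size, size=n_observed, replace=False)] = True
    return mask.reshape(shape)

def forward(x, mask):
    """A x: keep only the observed pixel values."""
    return x[mask]

def adjoint(v, mask):
    """A^T v: put the values back in place and fill the unobserved pixels with zeros."""
    out = np.zeros(mask.shape)
    out[mask] = v
    return out

rng = np.random.default_rng(0)
mask = make_mask((256, 256), n_observed=58549, rng=rng)
x = rng.random((256, 256))                       # placeholder image with values in [0, 1]
y = forward(x, mask) + 0.02 * rng.standard_normal(58549)
print(y.shape)
```

Since $A^{T}A$ is a diagonal matrix with entries 0 or 1, this $A$ is rank-deficient and diagonal in the pixel domain.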

Results and discussions: Figure 12 shows the MMSE estimates obtained by RTO and MYULA as well as the MAP estimates. The MMSE estimates of both methods are very similar to the corresponding MAP estimates, and the masked regions are well filled except in regions with rich textures, such as the tree leaves in Traffic, where the inpainting task is the most difficult. This good visual impression is confirmed by the high PSNR and SSIM scores. However, both methods tend to struggle to correctly fill the gaps along some edges, as we can see on the contours of the dress in Simpson. In addition, the MMSE estimates of MYULA are slightly noisier, and MYULA seems to have more difficulty filling the regions with rich textures.

Figure 12: (Inpainting) MMSE estimates obtained with RTO and MYULA together with MAP estimates.
Simpson: RTO PSNR=35.01/SSIM=0.94; MYULA PSNR=33.86/SSIM=0.89; MAP PSNR=34.99/SSIM=0.95.
Traffic: RTO PSNR=30.77/SSIM=0.92; MYULA PSNR=30.43/SSIM=0.89; MAP PSNR=30.63/SSIM=0.92.

The standard deviation maps associated with RTO and MYULA for Simpson and Traffic are displayed in Figure 13. We can see that the standard deviations of RTO show larger uncertainties around edges, especially at the edges where the pixel information is lost. It is also interesting to note that the hidden constant areas are restored with high confidence by RTO. This means that the solutions to the perturbed optimization problems are very similar in these regions and that the perturbations do not affect the restorations in these areas. MYULA behaves very differently from RTO. High uncertainties appear over the entire regions where the information is lost. Further, in these regions the standard deviations from MYULA are nearly double the RTO values.

Figure 13: (Inpainting) The standard deviations computed with the samples generated by RTO and MYULA. Panel rows: RTO, MYULA.

Finally, we consider the efficiency of both methods. The RTO experiments took around $10$ hours, while the MYULA experiments took around $35$ hours.

In Figure 14 we show the evolution of the ACFs for both methods. We computed them in the pixel domain, where the forward operator $A$ is diagonal, as we assume that the likelihood determines the shape of the posterior covariance. The ACFs indicate that samples generated by RTO are uncorrelated, as expected for independent samples, whereas samples generated with MYULA exhibit correlations. For Traffic, the complete Markov chain associated with MYULA needs more than $6000$ iterations in order to eventually achieve uncorrelated samples.

Figure 14: (Inpainting) ACFs from samples generated by RTO and MYULA, respectively. Panel rows: RTO, MYULA; panel columns: Simpson, Traffic.

3.3 Automatic parameter selection

In Sections 3.1 and 3.2 we chose the regularization parameter $\gamma$ such that it gives the best results in terms of PSNR and SSIM, which requires many additional experiments. Furthermore, we assumed that we have access to the noise level $\sigma$, which is not always known. Being able to automatically select the parameters in a robust fashion is a matter of prime importance. As explained in Section 2, Langevin methods like MYULA do not allow us to perform online parameter selection, but RTO can be plugged into an augmented hierarchical model in order to automatically select $\sigma^{2}$ and $\gamma$ without any additional computational cost [15, 16], see Algorithm 1. It turns RTO into a nearly parameter-free method. To ensure efficient sampling from the conditional distributions with respect to $\sigma^{2}$ and $\gamma$, we are limited to $\Gamma$-priors for $\lambda=1/\sigma^{2}$ and $\gamma$, and only for some specific choices of $g$. For more detailed discussions, we refer to [16].

In Figure 15 we show the results for the same deblurring problem as in Section 3.1 but with a slightly different $g(x)$ compared with (13). Here we use the non-negativity constraint instead of the box constraint, i.e., $g(x)=\gamma\|\nabla x\|_{1,1}+i_{\mathbb{R}^{+}}(x)$, since in this case the conditional distribution $\pi_{\bbgamma|\mathbbm{x},\mathbbm{y}}(\gamma)$ can be efficiently sampled. Comparing the results shown in Figure 15 with those in Figures 4 and 5, we can see that they are very similar visually as well as quantitatively. In Table 2, we list the means and the standard deviations of $\lambda$ and $\gamma$ obtained from the hierarchical Gibbs sampler. The estimates for both $\gamma$ and $\lambda$ are comparable with the values used in Section 3.1. Finally, Figure 16 shows the ACFs of the image samples and the trace plots of $\lambda$ and $\gamma$. Because of the Gibbs sampler, we cannot expect independent samples, but we observe that the ACFs still decay to zero immediately. Further, the trace plots show very good mixing of the samples.
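For concreteness, the following sketch evaluates a regularizer of the form $g(x)=\gamma\|\nabla x\|_{1,1}+i_{\mathbb{R}^{+}}(x)$ on a small image. It assumes that $\nabla$ is a forward-difference operator without boundary wrap-around and that $\|\cdot\|_{1,1}$ sums the absolute values of the horizontal and vertical differences; the discretization used in the experiments may differ, and the function name `tv_nonneg_regularizer` is hypothetical.

```python
import numpy as np


def tv_nonneg_regularizer(x, gamma):
    """Evaluate gamma * ||grad x||_{1,1} plus the indicator of x >= 0.

    Assumes forward differences with no wrap-around at the image boundary.
    """
    x = np.asarray(x, dtype=float)
    if np.any(x < 0):
        return np.inf                          # indicator of the non-negative orthant
    horizontal = np.abs(np.diff(x, axis=1)).sum()
    vertical = np.abs(np.diff(x, axis=0)).sum()
    return gamma * (horizontal + vertical)


image = np.array([[0.0, 1.0],
                  [1.0, 1.0]])
value_feasible = tv_nonneg_regularizer(image, gamma=2.0)
value_infeasible = tv_nonneg_regularizer(-image, gamma=2.0)
print(value_feasible)    # one horizontal and one vertical jump of size 1, times gamma
print(value_infeasible)  # infinite, because the constraint is violated
```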

[Figure 15 panels. Left: MMSE (PSNR=26.00/SSIM=0.79). Right: Standard deviation.]
Figure 15: (Hierarchical Gibbs sampler) MMSE and marginal posterior standard deviations computed using the hierarchical Gibbs sampler presented in Algorithm 1 for image deblurring problem defined in Section 3.1.
[Figure 16 panels, left to right: ACFs, trace of $\lambda_{k}$, trace of $\gamma_{k}$.]
Figure 16: (Hierarchical Gibbs sampler) Analysis of the sample correlation of the augmented Gibbs sampler. Left: ACFs of the image samples. Middle and right: traces of the noise level and regularization parameters.
| Parameter | Obtained from hierarchical Gibbs sampler: mean (standard deviation) | Used in RTO in Section 3.1 |
|---|---|---|
| $\gamma$ | 3.97 (0.44) | 5 |
| $\lambda$ | 1015.49 (7.19) | 1000 |

Table 2: (Hierarchical Gibbs sampler) Empirical mean and standard deviation of the noise level and regularization parameters $\lambda$ and $\gamma$.

4 Conclusion

We compared two classes of sampling methods for solving inverse problems in imaging, RTO and the Langevin method MYULA, and highlighted their main conceptual and theoretical differences.

RTO is derived from the sensitivity analysis framework and samples from a target distribution by solving perturbed optimization problems where the perturbation occurs in the data space. The RTO target distribution can assign non-zero probability mass to subsets of measure zero. However, it is not anchored in the Bayesian framework, as the target distribution corresponds to an implicit prior that depends on the observed data. In addition, RTO can be incorporated into a hierarchical model in order to perform automatic parameter selection. The main limitation of RTO is that it has only been characterized for posteriors with a Gaussian likelihood and a log-prior with polyhedral hypograph.

In contrast, MYULA is firmly rooted in the Bayesian framework and is applicable to a broader range of posteriors. Although it samples the posterior density only approximately, the distribution of its samples can be characterized with respect to the posterior density. Like other Langevin methods, MYULA suffers from typical MCMC drawbacks, such as slow convergence and correlated samples.

Using two classical imaging inverse problems, deblurring and inpainting, we compared RTO and MYULA numerically with particular attention to computational cost. Both methods produced accurate results for deblurring, but MYULA struggled with severely ill-posed problems like inpainting. Additionally, while RTO concentrates the sample mass around the MMSE estimate, MYULA results in a more dispersed distribution.

One future research direction is to extend RTO to more general posteriors. It would be valuable to explore other noise models, such as Poisson noise as in [3], and to investigate how the RTO distribution can be characterized both theoretically and practically. In addition, motivated by the work in [24], it would also be interesting to extend RTO to data-driven regularization [31, 22, 26], particularly to generative models [33, 23].

References

  • Aguerrebere et al. [2017] Cecilia Aguerrebere, Andres Almansa, Julie Delon, Yann Gousseau, and Pablo Muse. A Bayesian Hyperprior Approach for Joint Image Denoising and Interpolation, With an Application to HDR Imaging. IEEE Transactions on Computational Imaging, 3(4):633–646, dec 2017. ISSN 2333-9403. doi:10.1109/TCI.2017.2704439. URL https://nounsse.github.io/HBE_project/.
  • Bardsley and Fox [2012] J. M. Bardsley and C. Fox. An MCMC method for uncertainty quantification in nonnegativity constrained inverse problems. Inverse Problems in Science and Engineering, 20(4):477–498, June 2012. ISSN 1741-5985. doi:10.1080/17415977.2011.637208. URL http://dx.doi.org/10.1080/17415977.2011.637208.
  • Bardsley and Hansen [2020] Johnathan M Bardsley and Per Christian Hansen. MCMC algorithms for computational UQ of nonnegativity constrained linear inverse problems. SIAM Journal on Scientific Computing, 42(2):A1269–A1288, 2020.
  • Bardsley et al. [2014] Johnathan M Bardsley, Antti Solonen, Heikki Haario, and Marko Laine. Randomize-then-optimize: A method for sampling from posterior distributions in nonlinear inverse problems. SIAM Journal on Scientific Computing, 36(4):A1895–A1910, 2014.
  • Bauschke and Combettes [2017] Heinz H Bauschke and Patrick L Combettes. Correction to: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2017.
  • Blake et al. [2011] A. Blake, P. Kohli, and C. Rother. Markov Random Fields for vision and image processing. EBSCO ebook academic collection. MIT Press, 2011. ISBN 9780262015776. doi:10.7551/mitpress/8579.003.0001.
  • Boyd et al. [2011] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
  • Cai et al. [2023] Ziruo Cai, Junqi Tang, Subhadip Mukherjee, Jinglai Li, Carola Bibiane Schönlieb, and Xiaoqun Zhang. NF-ULA: Langevin Monte Carlo with normalizing flow prior for imaging inverse problems, 2023.
  • Chambolle [2004] A Chambolle. An algorithm for Total Variation Minimization and Applications. Journal of Mathematical Imaging and Vision, 20:89–97, 2004. doi:10.1023/B:JMIV.0000011325.36760.1e.
  • Chambolle and Pock [2011] Antonin Chambolle and Thomas Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of mathematical imaging and vision, 40:120–145, 2011.
  • Chen et al. [2014] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research, pages 1683–1691. PMLR, 22–24 Jun 2014. URL https://proceedings.mlr.press/v32/cheni14.html.
  • De Bortoli et al. [2020] Valentin De Bortoli, Alain Durmus, Marcelo Pereyra, and Ana Fernandez Vidal. Maximum likelihood estimation of regularization parameters in high-dimensional inverse problems: An empirical Bayesian approach. part ii: Theoretical analysis. SIAM Journal on Imaging Sciences, 13(4):1990–2028, 2020.
  • Durmus and Moulines [2017] Alain Durmus and Éric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Annals of Applied Probability, 27(3):1551–1587, 2017.
  • Durmus et al. [2018] Alain Durmus, Eric Moulines, and Marcelo Pereyra. Efficient Bayesian computation by proximal Markov chain Monte Carlo: When Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018. doi:10.1137/16M1108340.
  • Everink et al. [2023a] Jasper M Everink, Yiqiu Dong, and Martin S Andersen. Sparse Bayesian inference with regularized Gaussian distributions. Inverse Problems, 39(11):115004, oct 2023a. doi:10.1088/1361-6420/acf9c5. URL https://dx.doi.org/10.1088/1361-6420/acf9c5.
  • Everink et al. [2023b] Jasper M Everink, Yiqiu Dong, and Martin S Andersen. Bayesian inference with projected densities. SIAM/ASA Journal on Uncertainty Quantification, 11(3):1025–1043, 2023b.
  • Girolami and Calderhead [2011] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011. doi:10.1111/j.1467-9868.2010.00765.x.
  • Hansen et al. [2021] Per Christian Hansen, Jakob Jørgensen, and William RB Lionheart. Computed tomography: algorithms, insight, and just enough theory. SIAM, 2021.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Holden et al. [2022] Matthew Holden, Marcelo Pereyra, and Konstantinos C Zygalakis. Bayesian imaging with data-driven priors encoded by neural networks. SIAM Journal on Imaging Sciences, 15(2):892–924, 2022.
  • Houdard et al. [2018] Antoine Houdard, Charles Bouveyron, and Julie Delon. High-Dimensional Mixture Models For Unsupervised Image Denoising (HDMI). SIAM Journal on Imaging Sciences, 11(4):2815–2846, 2018. doi:10.1137/17M1135694.
  • Hurault et al. [2022] Samuel Hurault, Arthur Leclaire, and Nicolas Papadakis. Proximal denoiser for convergent plug-and-play optimization with nonconvex regularization. In International Conference on Machine Learning, pages 9483–9505. PMLR, 2022.
  • Kingma et al. [2019] Diederik P Kingma, Max Welling, et al. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019.
  • Laumont et al. [2022] Rémi Laumont, Valentin De Bortoli, Andrés Almansa, Julie Delon, Alain Durmus, and Marcelo Pereyra. Bayesian imaging using plug & play priors: when Langevin meets Tweedie. SIAM Journal on Imaging Sciences, 15(2):701–737, 2022.
  • Louchet and Moisan [2013] Cécile Louchet and Lionel Moisan. Posterior expectation of the total variation model: Properties and experiments. SIAM Journal on Imaging Sciences, 6(4):2640–2684, dec 2013. ISSN 19364954. doi:10.1137/120902276.
  • Mukherjee et al. [2024] S Mukherjee, S Dittmer, Z Shumaylov, S Lunz, O Öktem, and C-B Schönlieb. Data-driven convex regularizers for inverse problems. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 13386–13390. IEEE, 2024.
  • Papandreou and Yuille [2010] G. Papandreou and A. L. Yuille. Gaussian sampling by local perturbations. In J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc., 2010. URL https://proceedings.neurips.cc/paper_files/paper/2010/file/d09bf41544a3365a46c9077ebb5e35c3-Paper.pdf.
  • Pereyra [2016] Marcelo Pereyra. Proximal Markov chain Monte Carlo algorithms. Statistics and Computing, 26(4):745–760, jul 2016. ISSN 0960-3174. doi:10.1007/s11222-015-9567-4.
  • Pereyra et al. [2020] Marcelo Pereyra, Luis Vargas Mieles, and Konstantinos C Zygalakis. Accelerating proximal Markov chain Monte Carlo by using an explicit stabilized method. SIAM Journal on Imaging Sciences, 13(2):905–935, 2020.
  • Pereyra et al. [2023] Marcelo Pereyra, Luis A Vargas-Mieles, and Konstantinos C Zygalakis. The split Gibbs sampler revisited: improvements to its algorithmic structure and augmented target distribution. SIAM Journal on Imaging Sciences, 16(4):2040–2071, 2023.
  • Pesquet et al. [2021] Jean-Christophe Pesquet, Audrey Repetti, Matthieu Terris, and Yves Wiaux. Learning maximally monotone operators for image recovery. SIAM Journal on Imaging Sciences, 14(3):1206–1237, 2021.
  • Repetti et al. [2019] Audrey Repetti, Marcelo Pereyra, and Yves Wiaux. Scalable Bayesian Uncertainty Quantification in Imaging Inverse Problems via Convex Optimization. SIAM Journal on Imaging Sciences, 12(1):87–118, 2019. ISSN 1936-4954. doi:10.1137/18M1173629.
  • Rezende and Mohamed [2015] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530–1538. PMLR, 2015.
  • Roberts and Tweedie [1996] Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.
  • Rudin et al. [1992] Leonid I. Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992. ISSN 01672789. doi:10.1016/0167-2789(92)90242-F.
  • Teodoro et al. [2018] Afonso M. Teodoro, José M. Bioucas-Dias, and Mário A. T. Figueiredo. Scene-Adapted Plug-and-Play Algorithm with Guaranteed Convergence: Applications to Data Fusion in Imaging. pages 1–11, jan 2018.
  • Vidal et al. [2020] Ana Fernandez Vidal, Valentin De Bortoli, Marcelo Pereyra, and Alain Durmus. Maximum likelihood estimation of regularization parameters in high-dimensional inverse problems: An empirical Bayesian approach part i: Methodology and experiments. SIAM Journal on Imaging Sciences, 13(4):1945–1989, 2020.
  • Vono et al. [2019] Maxime Vono, Nicolas Dobigeon, and Pierre Chainais. Split-and-augmented Gibbs sampler—application to large-scale inference problems. IEEE Transactions on Signal Processing, 67(6):1648–1661, 2019.
  • Wang et al. [2017] Zheng Wang, Johnathan M Bardsley, Antti Solonen, Tiangang Cui, and Youssef M Marzouk. Bayesian inverse problems with $l_{1}$ priors: a randomize-then-optimize approach. SIAM Journal on Scientific Computing, 39(5):S140–S166, 2017.
  • Wang and Bovik [2009] Zhou Wang and Alan C Bovik. Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE signal processing magazine, 26(1):98–117, 2009.
  • Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • Welling and Teh [2011] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 681–688, 2011.
  • Yu et al. [2011] Guoshen Yu, Guillermo Sapiro, and Stéphane Mallat. Solving Inverse Problems with Piecewise Linear Estimators: From Gaussian Mixture Models to Structured Sparsity. IEEE Transactions on Image Processing, 21(5):2481–2499, 2011. doi:10.1109/TIP.2011.2176743.
  • Zoran and Weiss [2011] Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In 2011 International Conference on Computer Vision, pages 479–486. IEEE, nov 2011. ISBN 978-1-4577-1102-2. doi:10.1109/ICCV.2011.6126278. URL http://people.csail.mit.edu/danielzoran/EPLLICCVCameraReady.pdf.