User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient

Arnak S. Dalalyan Avetik Karagulyan CREST, ENSAE, 5 av. Henry Le Chatelier, 91120 Palaiseau, France.

Abstract

In this paper, we study the problem of sampling from a given probability density function that is known to be smooth and strongly log-concave. We analyze several methods of approximate sampling based on discretizations of the (highly overdamped) Langevin diffusion and establish guarantees on their error measured in the Wasserstein-2 distance. Our guarantees improve or extend the state-of-the-art results in three directions. First, we provide an upper bound on the error of the first-order Langevin Monte Carlo (LMC) algorithm with optimized varying step-size. This result has the advantage of being horizon free (we do not need to know the target precision in advance) and of improving by a logarithmic factor on the corresponding result for the constant step-size. Second, we study the case where accurate evaluations of the gradient of the log-density are unavailable, but one has access to approximations of this gradient. In such a situation, we consider both deterministic and stochastic approximations of the gradient and provide an upper bound on the sampling error of the first-order LMC that quantifies the impact of the gradient evaluation inaccuracies. Third, we establish upper bounds for two versions of the second-order LMC, which leverage the Hessian of the log-density. We provide nonasymptotic guarantees on the sampling error of these second-order LMCs. These guarantees reveal that the second-order LMC algorithms improve on the first-order LMC in ill-conditioned settings.

keywords:

Markov Chain Monte Carlo, Approximate sampling, Rates of convergence, Langevin algorithm, Gradient descent

MSC:

[2010] Primary 62J05, Secondary 62H12

Journal: Stochastic Processes and their Applications

1 Introduction

The problem of sampling a random vector distributed according to a given target distribution is central in many applications. In the present paper, we consider this problem in the case of a target distribution having a smooth and log-concave density $\pi$ and when the sampling is performed by a version of the Langevin Monte Carlo algorithm (LMC). More precisely, for a positive integer $p$ , we consider a continuously differentiable function $f:\mathbb{R}^{p}\to\mathbb{R}$ satisfying the following assumption: For some positive constants $m$ and $M$ , it holds

\begin{cases}f(\boldsymbol{\theta})-f(\boldsymbol{\theta}^{\prime})-\nabla f(\boldsymbol{\theta}^{\prime})^{\top}(\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime})\geq(\nicefrac{m}{2})\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\|_{2}^{2},\\ \|\nabla f(\boldsymbol{\theta})-\nabla f(\boldsymbol{\theta}^{\prime})\|_{2}\leq M\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\|_{2},\end{cases}\qquad\forall\boldsymbol{\theta},\boldsymbol{\theta}^{\prime}\in\mathbb{R}^{p},

(1)

where $\nabla f$ stands for the gradient of $f$ and $\|\cdot\|_{2}$ is the Euclidean norm. The target distributions considered in this paper are those having a density with respect to the Lebesgue measure on $\mathbb{R}^{p}$ given by

\pi(\boldsymbol{\theta})=\frac{e^{-f(\boldsymbol{\theta})}}{\int_{\mathbb{R}^{p}}e^{-f(\boldsymbol{u})}\,d\boldsymbol{u}}.

(2)

We say that the density $\pi(\boldsymbol{\theta})\propto e^{-f(\boldsymbol{\theta})}$ is log-concave (resp. strongly log-concave) if the function $f$ satisfies the first inequality of (1) with $m=0$ (resp. $m>0$ ).

Most of this work focuses on the analysis of the LMC algorithm, which can be seen as the sampling analogue of the gradient descent algorithm for optimization. For a sequence of positive parameters $\boldsymbol{h}=\{h_{k}\}_{k\in\mathbb{N}}$ , referred to as the step-sizes, and for an initial point $\boldsymbol{\vartheta}_{0,\boldsymbol{h}}\in\mathbb{R}^{p}$ that may be deterministic or random, the iterations of the LMC algorithm are defined by the update rule

\boldsymbol{\vartheta}_{k+1,\boldsymbol{h}}=\boldsymbol{\vartheta}_{k,\boldsymbol{h}}-h_{k+1}\nabla f(\boldsymbol{\vartheta}_{k,\boldsymbol{h}})+\sqrt{2h_{k+1}}\;\boldsymbol{\xi}_{k+1};\qquad k=0,1,2,\ldots

(3)

where $\boldsymbol{\xi}_{1},\ldots,\boldsymbol{\xi}_{k},\ldots$ is a sequence of mutually independent, and independent of $\boldsymbol{\vartheta}_{0,\boldsymbol{h}}$ , centered Gaussian vectors with covariance matrices equal to identity.
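The update rule (3) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' implementation; the names `lmc` and `grad_f` and the standard Gaussian target (for which condition (1) holds with $m=M=1$) are assumptions made for the example.

```python
import math
import random

def lmc(grad_f, theta0, step_sizes, rng):
    """Apply update (3): theta <- theta - h * grad_f(theta) + sqrt(2h) * xi."""
    theta = list(theta0)
    for h in step_sizes:
        g = grad_f(theta)
        theta = [t - h * gi + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)
                 for t, gi in zip(theta, g)]
    return theta

# Illustrative target (an assumption): f(theta) = ||theta||^2 / 2, i.e. pi = N(0, I).
def grad_f(theta):
    return list(theta)

rng = random.Random(0)
p, h, K = 2, 0.05, 200
# 500 independent chains, all started from the deterministic point (5, 5).
samples = [lmc(grad_f, [5.0] * p, [h] * K, rng) for _ in range(500)]
mean_first = sum(s[0] for s in samples) / len(samples)
var_first = sum((s[0] - mean_first) ** 2 for s in samples) / len(samples)
print(mean_first, var_first)
```

For this particular target the chain is linear, so its stationary variance is $1/(1-h/2)$ rather than $1$; the gap illustrates the discretization bias that the bounds of this paper quantify.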

When all the $h_{k}$ 's are equal to some value $h>0$ , we will call the sequence in (3) the constant step LMC and will denote it by $\boldsymbol{\vartheta}_{k+1,h}$ . When $f$ satisfies assumptions (1), if $h$ is small and $k$ is large (so that the product $kh$ is large), the distribution of $\boldsymbol{\vartheta}_{k,h}$ is known to be a good approximation to the distribution with density $\pi(\boldsymbol{\theta})$ . An important question is to quantify the quality of this approximation. An appealing approach to this question is to establish non-asymptotic upper bounds on the sampling error; bounds of this kind are particularly useful for deriving a stopping rule for the LMC algorithm, as well as for understanding the computational complexity of sampling methods in high-dimensional problems. In the present paper we establish such bounds with a focus on their user-friendliness: our bounds are easy to interpret, hold under conditions that are not difficult to check, and lead to a simple, theoretically grounded choice of the number of iterations and the step-size.

In the present work, we measure the error of sampling in the Wasserstein-Monge-Kantorovich distance $W_{2}$ . For two measures $\mu$ and $\nu$ defined on $(\mathbb{R}^{p},\mathscr{B}(\mathbb{R}^{p}))$ , and for a real number $q\geq 1$ , $W_{q}$ is defined by

W_{q}(\mu,\nu)=\Big(\inf_{\varrho\in\varrho(\mu,\nu)}\int_{\mathbb{R}^{p}\times\mathbb{R}^{p}}\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\|_{2}^{q}\,d\varrho(\boldsymbol{\theta},\boldsymbol{\theta}^{\prime})\Big)^{1/q},

(4)

where the $\inf$ is with respect to all joint distributions $\varrho$ having $\mu$ and $\nu$ as marginal distributions. For statistical and machine learning applications, we believe that this distance is more suitable for assessing the quality of approximate sampling schemes than other metrics such as the total variation or the Kullback-Leibler divergence. Indeed, bounds on the Wasserstein distance—unlike the bounds on the total-variation—provide direct guarantees on the accuracy of approximating the first and the second order moments.
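To make definition (4) concrete, here is a small sketch for the special case $p=1$ and two empirical measures with the same number of atoms. In that case the infimum in (4) is attained by matching sorted samples, a standard fact about optimal transport on the real line. The function name `empirical_wq_1d` is hypothetical, and the example says nothing about how to compute $W_{q}$ in higher dimension.

```python
def empirical_wq_1d(xs, ys, q=2.0):
    """W_q between two equal-size empirical measures on the real line."""
    if len(xs) != len(ys):
        raise ValueError("both samples must have the same size")
    # On the real line, the optimal coupling pairs the i-th smallest atoms.
    pairs = zip(sorted(xs), sorted(ys))
    return (sum(abs(x - y) ** q for x, y in pairs) / len(xs)) ** (1.0 / q)

xs = [0.0, 1.0, 2.0]
ys = [3.0, 1.0, 2.0]  # after sorting, every atom of xs is shifted by +1
w2 = empirical_wq_1d(xs, ys)
print(w2)
```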

Asymptotic properties of the LMC algorithm, also known as Unadjusted Langevin Algorithm (ULA), and its Metropolis adjusted version, MALA, have been studied in a number of papers ²⁸, ²⁶, ³⁰, ³¹, ²⁰, ²⁷. These results do not emphasize the effect of the dimension on the computational complexity of the algorithm, which is roughly proportional to the number of iterations. Non asymptotic bounds on the total variation error of the LMC for log-concave and strongly log-concave distributions have been established by ¹⁴. If a warm start is available, the results in ¹⁴ imply that after $O(p/\epsilon^{2})$ iterations the LMC algorithm has an error bounded from above by $\epsilon$ . Furthermore, if we assume that in addition to (1) the function $f$ has a Lipschitz continuous Hessian, then a modified version of the LMC, the LMC with Ozaki discretization (LMCO), needs $O(p/\epsilon)$ iterations to achieve a precision level $\epsilon$ . These results were improved and extended to the Wasserstein distance by ¹⁶, ¹⁵. More precisely, they removed the condition of the warm start and proved that under the Lipschitz continuity assumption on the Hessian of $f$ , it is not necessary to modify the LMC for getting the rate $O(p/\epsilon)$ . The last result is closely related to an error bound between a diffusion process and its Euler discretization established by ¹.

On a related note, ⁸ studied the convergence of the LMC algorithm with reflection at the boundary of a compact set, which makes it possible to sample from a compactly supported density (see also ⁷). Extensions to non-smooth densities were presented in ¹⁷, ²¹. ¹⁰ obtained guarantees similar to those in ¹⁴ when the error is measured by the Kullback-Leibler divergence. Very recently, ¹¹ derived non asymptotic guarantees for the kinetic LMC which turned out to improve on the previously known results. Langevin dynamics was used in ⁴, ⁶ in order to approximate normalizing constants of target distributions. ¹⁹ established tight bounds in Wasserstein distance between the invariant distributions of two (Langevin) diffusions; the bounds involve mixing rates of the diffusions and the deviation in their drifts.

The goal of the present work is to push further the study of the LMC and its variants both by improving the existing guarantees and by extending them in some directions. Our main contributions can be summarized as follows:

1.

We state simplified guarantees in Wasserstein distance with improved constants both for the LMC and the LMCO when the step-size is constant, see Theorem 1 and Theorem 6.
2.

We propose a varying-step LMC which avoids a logarithmic factor in the number of iterations required to achieve a precision level $\epsilon$ , see Theorem 2.
3.

We extend the previous guarantees to the case where accurate evaluations of the gradient are unavailable. Thus, at each iteration of the algorithm, the gradient is computed within an error that has a deterministic and a stochastic component. Theorem 4 deals with functions $f$ satisfying (1), whereas Theorem 5 requires the additional assumption of the smoothness of the Hessian of $f$ .
4.

We propose a new second-order sampling algorithm termed LMCO’. It has a per-iteration computational cost comparable to that of the LMC and enjoys nearly the same guarantees as the LMCO, when the Hessian of $f$ is Lipschitz continuous, see Theorem 6.
5.

We provide a detailed discussion of the relations between, on the one hand, the sampling methods and guarantees of their convergence and, on the other hand, optimization methods and guarantees of their convergence (see Section 5).

We have to emphasize right away that Theorem 1 is a corrected version of ¹³ Theorem 1, whereas Theorem 4 extends ¹³ Theorem 3 to more general noise. In particular, Theorem 4 removes the unbiasedness and independence conditions. Furthermore, thanks to a shrewd use of a recursive inequality, the upper bound in Theorem 4 is tighter than the one in ¹³ Theorem 3.

As an illustration of the first two items in the above summary of our contributions, let us consider the following example. Assume that $m=10$ , $M=20$ and we have at our disposal an initial sampling distribution $\nu_{0}$ satisfying $W_{2}^{2}(\nu_{0},\pi)=p+(p/m)$ . The main inequalities in Theorem 1 and Theorem 2 imply that after $K$ iterations, the distribution $\nu_{K}$ obtained by the LMC algorithm satisfies

W_{2}(\nu_{K},\pi)\leq(1-mh)^{K}W_{2}(\nu_{0},\pi)+1.65(M/m)(hp)^{1/2}

(5)

for the constant step LMC and

W_{2}(\nu_{K},\pi)\leq\frac{3.5M\sqrt{p}}{m\sqrt{M+m+(\nicefrac{2}{3})m(K-K_{1})}}

(6)

for the varying-step LMC, where $K_{1}$ is an integer the precise value of which is provided in Theorem 2. One can compare these inequalities with the corresponding bound in ¹⁵: adapted to the constant-step, it takes the form

\begin{aligned}W_{2}^{2}(\nu_{K},\pi)\leq{}&2\Big(1-\frac{mMh}{m+M}\Big)^{K}W^{2}_{2}(\nu_{0},\pi)\\ &+\frac{Mhp}{m}(m+M)\Big(h+\frac{m+M}{2mM}\Big)\Big(2+\frac{M^{2}h}{m}+\frac{M^{2}h^{2}}{6}\Big).\end{aligned}

(7), (8)

For any $\epsilon>0$ , we can derive from these guarantees the smallest number of iterations, $K_{\epsilon}$ , for which there is an $h>0$ such that the corresponding upper bound is smaller than $\epsilon$ . The logarithms of these values $K_{\epsilon}$ for varying $\epsilon\in\{0.001,0.005,0.02\}$ and $p\in\{25,\ldots,1000\}$ are plotted in Figure 1. We observe that for all the considered values of $\epsilon$ and $p$ , the number of iterations derived from (6) (referred to as Theorem 2) is smaller than those derived from (5) (referred to as Theorem 1) and from (8) (referred to as DM bound). The difference between the varying-step LMC and the constant step LMC becomes more pronounced as the target precision level $\epsilon$ gets smaller. On average over all values of $p$ , when $\epsilon=0.001$ , the number of iterations derived from (8) is 4.6 times larger than that derived from (6), and almost $3$ times larger than the number of iterations derived from (5).

Figure 1: Plots showing the logarithm of the number of iterations as a function of the dimension $p$ for several values of $\epsilon$ . The plotted values are derived from (5)-(8) using the data $m=10$ , $M=20$ , $W_{2}^{2}(\nu_{0},\pi)=p+(p/m)$ .

2 Guarantees in the Wasserstein distance with accurate gradient

The rationale behind the LMC (3) is simple: the Markov chain $\{\boldsymbol{\vartheta}_{k,\boldsymbol{h}}\}_{k\in\mathbb{N}}$ is the Euler discretization of a continuous-time diffusion process $\{\boldsymbol{L}_{t}:t\in\mathbb{R}_{+}\}$ , known as Langevin diffusion. The latter is defined by the stochastic differential equation

d\boldsymbol{L}_{t}=-\nabla f(\boldsymbol{L}_{t})\,dt+\sqrt{2}\;d\boldsymbol{W}\!_{t},\qquad t\geq 0,

(9)

where $\{\boldsymbol{W}\!_{t}:t\geq 0\}$ is a $p$ -dimensional Brownian motion. When $f$ satisfies condition (1), equation (9) has a unique strong solution, which is a Markov process. Furthermore, the process $\boldsymbol{L}$ has $\pi$ as invariant density ⁵ Thm. 3.5. Let $\nu_{k}$ be the distribution of the $k$ -th iterate of the LMC algorithm, that is $\vartheta_{k,\boldsymbol{h}}\sim\nu_{k}$ . In what follows, we present user-friendly guarantees on the closeness of $\nu_{k}$ and $\pi$ , when $f$ is strongly convex.

2.1 Reminder on guarantees for the constant-step LMC

When the function $f$ is $m$ -strongly convex and $M$ -gradient Lipschitz, upper bounds on the sampling error measured in Wasserstein distance of the LMC algorithm have been established in ¹⁵, ¹³. We state below a slightly adapted version of their result, which will serve as a benchmark for the bounds obtained in this work.

Theorem 1.

Assume that $h\in(0,\nicefrac{{2}}{{M}})$ and $f$ satisfies condition (1). The following claims hold:

(a)

If $h\leq\nicefrac{{2}}{{(m+M)}}$ then $W_{2}(\nu_{K},\pi)\leq(1-mh)^{K}W_{2}(\nu_{0},\pi)+1.65(\frac{M}{m})(hp)^{1/2}$ .
(b)

If $h\geq\nicefrac{{2}}{{(m+M)}}$ then $W_{2}(\nu_{K},\pi)\leq(Mh-1)^{K}W_{2}(\nu_{0},\pi)+\frac{1.65Mh}{2-Mh}(hp)^{1/2}$ .

We refer the readers interested in the proof of this theorem either to ¹³ or to Section 7, where the latter is obtained as a direct consequence of Theorem 4. The factor $1.65$ is obtained by upper bounding $7\sqrt{2}/6$ .

In practice, a relevant approach to getting an accuracy of at most $\epsilon$ is to minimize the upper bound provided by Theorem 1 with respect to $h$ , for a fixed $K$ . Then, one can choose the smallest $K$ for which the obtained upper bound is smaller than $\epsilon$ . One useful observation is that the upper bound of case (b) is an increasing function of $h$ . Its minimum is always attained at $h=2/(m+M)$ , which means that one can always look for a step-size in the interval $(0,2/(m+M)]$ by minimizing the upper bound in (a). This can be done using standard line-search methods such as the bisection algorithm.
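The procedure just described can be sketched as follows. This is only an illustration under stated assumptions: a plain log-spaced grid search over $(0,2/(m+M)]$ stands in for the line search mentioned above (the bound need not be unimodal in $h$), and the names `bound_a`, `min_bound` and `smallest_K` are hypothetical. The search over $K$ uses the fact that, for each fixed $h$, the bound in (a) is non-increasing in $K$.

```python
import math

def bound_a(h, K, m, M, p, w0):
    """Right-hand side of Theorem 1(a); assumes 0 < h <= 2/(m+M)."""
    return (1.0 - m * h) ** K * w0 + 1.65 * (M / m) * math.sqrt(h * p)

def min_bound(K, m, M, p, w0, grid=400):
    """Minimize the bound over a log-spaced grid of step sizes; returns (value, h)."""
    h_max = 2.0 / (m + M)
    hs = [h_max * 10.0 ** (-6.0 * i / grid) for i in range(grid + 1)]
    return min((bound_a(h, K, m, M, p, w0), h) for h in hs)

def smallest_K(eps, m, M, p, w0):
    """Smallest K whose grid-minimized bound is at most eps (doubling, then bisection)."""
    hi = 1
    while min_bound(hi, m, M, p, w0)[0] > eps:
        hi *= 2
    lo = hi // 2  # the bound at lo exceeds eps, or lo == 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if min_bound(mid, m, M, p, w0)[0] <= eps:
            hi = mid
        else:
            lo = mid
    return hi

# Illustrative values (assumptions): m = 10, M = 20, p = 25, W_2(nu_0, pi)^2 = p + p/m.
m, M, p, eps = 10.0, 20.0, 25, 0.1
w0 = math.sqrt(p + p / m)
K_eps = smallest_K(eps, m, M, p, w0)
value, h_star = min_bound(K_eps, m, M, p, w0)
print(K_eps, h_star, value)
```

The grid bottoms out at $10^{-6}\cdot 2/(m+M)$, so very small values of `eps` would require a finer or deeper grid.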

Note that if the initial value $\boldsymbol{\vartheta}_{0}=\boldsymbol{\theta}_{0}$ is deterministic then, using the notation $\boldsymbol{\theta}^{*}=\operatorname*{arg\,min}_{\boldsymbol{\theta}\in\mathbb{R}^{p}}f(\boldsymbol{\theta})$ , in view of ¹⁵ Proposition 1, we have

W_{2}(\nu_{0},\pi)^{2}=\int_{\mathbb{R}^{p}}\|\boldsymbol{\theta}_{0}-\boldsymbol{\theta}\|_{2}^{2}\,\pi(d\boldsymbol{\theta})\leq\|\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{*}\|_{2}^{2}+p/m.

(10)

Finally, let us remark that if we choose $h$ and $K$ so that

h\leq\nicefrac{2}{(m+M)},\qquad e^{-mhK}W_{2}(\nu_{0},\pi)\leq\varepsilon/2,\quad 1.65(M/m)(hp)^{1/2}\leq\varepsilon/2,

(11)

then we have $W_{2}(\nu_{K},\pi)\leq\varepsilon$ . In other words, conditions (11) are sufficient for the density of the output of the LMC algorithm after $K$ iterations to be within the precision $\varepsilon$ of the target density when the precision is measured using the Wasserstein distance. This readily yields

h\leq\frac{m^{2}\varepsilon^{2}}{11M^{2}p}\wedge\frac{2}{m+M}\quad\text{and}\quad hK\geq\frac{1}{m}\log\Big(\frac{2(\|\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{*}\|_{2}^{2}+p/m)^{1/2}}{\varepsilon}\Big).

(12)

Assuming $m,M$ and $\|\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{*}\|_{2}^{2}/p$ to be constants, we can deduce from the last display that $K=C(p/\varepsilon^{2})\log(p/\varepsilon^{2})$ iterations suffice to reach the precision level $\varepsilon$ . This fact was first established in ¹⁴ for the LMC algorithm with a warm start and the total-variation distance. It was later improved by ¹⁶, ¹⁵, where the authors showed that the same result holds for any starting point and established similar bounds for the Wasserstein distance. Theorem 1 above can be seen as a user-friendly version of the corresponding result established by ¹⁵.
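As a worked instance of (12), the sketch below takes the largest admissible $h$ and the smallest integer $K$ satisfying the second condition, using the bound (10) on $W_{2}(\nu_{0},\pi)$ . The function name `constant_step_schedule` and the numerical values are assumptions chosen for illustration.

```python
import math

def constant_step_schedule(eps, m, M, p, dist0_sq):
    """Step size h and iteration count K satisfying (12).

    dist0_sq stands for ||theta_0 - theta*||_2^2 (assumed known here).
    """
    h = min(m ** 2 * eps ** 2 / (11.0 * M ** 2 * p), 2.0 / (m + M))
    w0_bound = math.sqrt(dist0_sq + p / m)  # upper bound (10) on W_2(nu_0, pi)
    K = math.ceil(math.log(2.0 * w0_bound / eps) / (m * h))
    return h, max(K, 0), w0_bound

m, M, p, eps = 10.0, 20.0, 25, 0.1
h, K, w0_bound = constant_step_schedule(eps, m, M, p, dist0_sq=25.0)
# Plug (h, K) back into Theorem 1(a) to confirm that the bound is at most eps.
bound = (1.0 - m * h) ** K * w0_bound + 1.65 * (M / m) * math.sqrt(h * p)
print(h, K, bound)
```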

Remark 2.1.

Although (10) is relevant for understanding the order of magnitude of $W_{2}(\nu_{0},\pi)$ , it has limited applicability since the distance $\|\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{*}\|$ might be hard to evaluate. As mentioned in ¹³, an attractive alternative to that bound is given by the following inequality, in which the second line follows from strong convexity whereas the third line is a consequence of the fact that $\boldsymbol{\theta}^{*}$ is a stationary point of $f$ :

\begin{aligned}mW_{2}(\nu_{0},\pi)^{2}&\leq m\|\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{*}\|_{2}^{2}+p\\ &\leq 2\big(f(\boldsymbol{\theta}_{0})-f(\boldsymbol{\theta}^{*})-\nabla f(\boldsymbol{\theta}^{*})^{\top}(\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{*})\big)+p\\ &=2\big(f(\boldsymbol{\theta}_{0})-f(\boldsymbol{\theta}^{*})\big)+p.\end{aligned}

(13), (14), (15)

If $f$ is lower bounded by some known constant, for instance if $f\geq 0$ , the last inequality provides the computable upper bound $W_{2}(\nu_{0},\pi)^{2}\leq\big{(}2f(\boldsymbol{\theta}_{0})+p\big{)}/m$ .

2.2 Guarantees under strong convexity for the varying step LMC

The result of the previous section provides a guarantee for the constant step LMC. One may wonder whether using variable step sizes $\boldsymbol{h}=\{h_{k}\}_{k\in\mathbb{N}}$ can improve the convergence. Note that in ¹⁵ Theorem 5, guarantees for the variable step LMC are established. However, they do not lead to a clear message on the choice of the step-sizes. The next result fills this gap by showing that an appropriate selection of step-sizes improves on the constant step LMC with an improvement factor logarithmic in $p/\epsilon^{2}$ .

Theorem 2.

Let us consider the LMC algorithm with varying step-size $h_{k+1}$ defined by

h_{k+1}=\frac{2}{M+m+(\nicefrac{{2}}{{3}})m(k-K_{1})_{+}},\qquad k=1,2,\ldots

(16)

where $K_{1}$ is the smallest non-negative integer satisfying (17) below (combining the definition of $K_{1}$ and the upper bound in (10), one easily checks that if $\|\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{*}\|_{\infty}$ is bounded, then $K_{1}$ is upper bounded by a constant that does not depend on the dimension $p$ ):

K_{1}\geq\frac{\ln\big(W_{2}(\nu_{0},\pi)/\sqrt{p}\big)+\ln(m/M)+(\nicefrac{1}{2})\ln(M+m)}{\ln(1+\nicefrac{2m}{(M-m)})}.

(17)

If $f$ satisfies (1), then for every $k\geq K_{1}$ , we have

W_{2}(\nu_{k},\pi)\leq\frac{3.5M\sqrt{p}}{m\sqrt{M+m+(\nicefrac{2}{3})m(k-K_{1})}}.

(18)

The step size (16) has two important advantages over constant steps. The first is that it is independent of the target precision level $\epsilon$ . The second is that we get rid of the logarithmic terms in the number of iterations required to achieve the precision level $\epsilon$ . Indeed, $K=K_{1}+(27M^{2}/2m^{3})(p/\epsilon^{2})$ iterations suffice to make the right-hand side of (18) smaller than $\epsilon$ , where $K_{1}$ depends neither on the dimension $p$ nor on the precision level $\epsilon$ .
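The quantities appearing in Theorem 2 are straightforward to evaluate. The sketch below computes $K_{1}$ from (17), the step sizes (16) and the right-hand side of (18); the function names are hypothetical and the code assumes $M>m$ so that the denominator in (17) is finite. The iteration count in `iterations_for` is obtained by inverting (18) directly, so it can be compared with a closed-form count.

```python
import math

def k1(w0, m, M, p):
    """Smallest non-negative integer satisfying (17); assumes M > m."""
    num = math.log(w0 / math.sqrt(p)) + math.log(m / M) + 0.5 * math.log(M + m)
    return max(0, math.ceil(num / math.log(1.0 + 2.0 * m / (M - m))))

def step_size(k, K1, m, M):
    """Step size h_{k+1} defined in (16)."""
    return 2.0 / (M + m + (2.0 / 3.0) * m * max(k - K1, 0))

def bound_18(k, K1, m, M, p):
    """Right-hand side of (18), valid for k >= K1."""
    return 3.5 * M * math.sqrt(p) / (m * math.sqrt(M + m + (2.0 / 3.0) * m * (k - K1)))

def iterations_for(eps, K1, m, M, p):
    """Smallest k >= K1 for which the right-hand side of (18) is at most eps."""
    need = (3.5 * M / (m * eps)) ** 2 * p - (M + m)
    return K1 + max(0, math.ceil(1.5 * need / m))

# Illustrative values (assumptions): m = 10, M = 20, p = 25, W_2(nu_0, pi)^2 = p + p/m.
m, M, p = 10.0, 20.0, 25
K1 = k1(math.sqrt(p + p / m), m, M, p)
k_eps = iterations_for(0.1, K1, m, M, p)
print(K1, step_size(0, K1, m, M), k_eps, bound_18(k_eps, K1, m, M, p))
```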

Since the choice of $h_{k+1}$ in (16) might appear mysterious, we provide below a quick explanation of the main computations underpinning this choice. The main step of the proof of upper bounds on $W_{2}(\nu_{k},\pi)$ is the following recursive inequality (see Proposition 2 in Section 7)

W_{2}(\nu_{k+1},\pi)\leq(1-mh_{k+1})W_{2}(\nu_{k},\pi)+1.65M\sqrt{p}\,h_{k+1}^{3/2}.

(19)

Using the notation $B_{k}=\frac{2(m/3)^{3/2}}{1.65M\sqrt{p}}W_{2}(\nu_{k},\pi)$ , this inequality can be rewritten as

B_{k+1}\leq(1-mh_{k+1})B_{k}+2(mh_{k+1}/3)^{3/2}.

Minimizing the right hand side with respect to $h_{k+1}$ , we find that the minimum is attained at the stationary point

\displaystyle h_{k+1}=\frac{3}{m}B_{k}^{2}.

(20)

With this $h_{k+1}$ , one checks that the sequence $B_{k}$ satisfies the recursive inequality

B_{k+1}^{2}\leq B_{k}^{2}(1-B_{k}^{2})^{2}\leq\frac{B_{k}^{2}}{1+B_{k}^{2}}.

The function $g(x)=x/(1+x)$ being increasing in $(0,\infty)$ , we get

B_{k+1}^{2}\leq\frac{B_{k}^{2}}{1+B_{k}^{2}}\leq\frac{\frac{B_{k-1}^{2}}{1+B_{k-1}^{2}}}{1+\frac{B_{k-1}^{2}}{1+B_{k-1}^{2}}}=\frac{B_{k-1}^{2}}{1+2B_{k-1}^{2}}.

By repetitive application of the same argument, we get

B_{k+1}^{2}\leq\frac{B_{K_{1}}^{2}}{1+(k+1-K_{1})B_{K_{1}}^{2}}.

The integer $K_{1}$ was chosen so that $B_{K_{1}}^{2}\leq\frac{2m}{3(M+m)}$ , see (66). Inserting this upper bound in the right hand side of the last display, we get

B_{k+1}^{2}\leq\frac{2m}{3(M+m)+2m(k+1-K_{1})}.

Finally, replacing in (20) $B_{k}^{2}$ by its upper bound derived from the last display, we get the suggested value for $h_{k+1}$ .
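The chain of inequalities above can be checked numerically. With $h_{k+1}=3B_{k}^{2}/m$ , the right-hand side of the recursion for $B_{k}$ equals $B_{k}(1-B_{k}^{2})$ ; the sketch below iterates this extremal case and compares it with the bound $B_{0}^{2}/(1+kB_{0}^{2})$ . The function name and the starting value are illustrative assumptions.

```python
def check_recursion(b0_sq, steps):
    """Iterate B_{k+1} = B_k (1 - B_k^2), tracked through B_k^2, and
    check B_k^2 <= B_0^2 / (1 + k B_0^2) for k = 1, ..., steps."""
    b_sq = b0_sq
    for k in range(1, steps + 1):
        b_sq = b_sq * (1.0 - b_sq) ** 2
        if b_sq > b0_sq / (1.0 + k * b0_sq) + 1e-15:
            return False
    return True

# Starting value 2m / (3(M + m)) with m = 10, M = 20, matching the bound on B_{K_1}^2.
ok = check_recursion(2.0 * 10.0 / (3.0 * 30.0), 1000)
print(ok)
```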

2.3 Extension to mixtures of strongly log-concave densities

We describe here a simple setting in which a suitable version of the LMC algorithm yields an efficient sampling algorithm for a target density that is not log-concave. Indeed, let us assume that

\pi(\boldsymbol{\theta})=\int_{H}\pi_{1}(\boldsymbol{\theta}|\boldsymbol{\eta})\,\pi_{0}(d\boldsymbol{\eta}),

(21)

where $H$ is an arbitrary measurable space, $\pi_{0}$ is a probability distribution on $H$ and $\pi_{1}(\cdot|\cdot)$ is a Markov kernel on $\mathbb{R}^{p}\times H$ . This means that $\pi_{2}(d\boldsymbol{\theta},d\boldsymbol{\eta})=\pi_{1}(\boldsymbol{\theta}|\boldsymbol{\eta})\,\pi_{0}(d\boldsymbol{\eta})\,d\boldsymbol{\theta}$ defines a probability measure on $\mathbb{R}^{p}\times H$ of which $\pi$ is the first marginal.

Theorem 3.

Assume that $\pi_{1}(\boldsymbol{\theta}|\boldsymbol{\eta})=\exp\{-f_{\boldsymbol{\eta}}(\boldsymbol{\theta})\}$ so that for every $\boldsymbol{\eta}\in H$ , $f_{\boldsymbol{\eta}}$ satisfies assumption (1). Define the mixture LMC (MLMC) algorithm as follows: sample $\boldsymbol{\eta}\sim\pi_{0}$ and choose an initial value $\boldsymbol{\vartheta}_{0}\sim\nu_{0}$ , then compute

\boldsymbol{\vartheta}_{k+1}^{\rm MLMC}=\boldsymbol{\vartheta}^{\rm MLMC}_{k}-h_{k+1}\nabla f_{\boldsymbol{\eta}}(\boldsymbol{\vartheta}_{k}^{\rm MLMC})+\sqrt{2h_{k+1}}\;\boldsymbol{\xi}_{k+1};\qquad k=0,1,2,\ldots

(22)

where $h_{k}$ is defined by (16) and $\boldsymbol{\xi}_{1},\ldots,\boldsymbol{\xi}_{k},\ldots$ is a sequence of mutually independent, and independent of $(\boldsymbol{\eta},\boldsymbol{\vartheta}_{0})$ , centered Gaussian vectors with covariance matrices equal to identity. It holds that, for every positive integer $k\geq K_{1}$ (see eq. 17 for the definition of $K_{1}$ ),

W_{2}(\nu_{k},\pi)\leq\frac{3.5M\sqrt{p}}{m\sqrt{M+m+(\nicefrac{2}{3})m(k-K_{1})}}.

(23)

This result extends the applicability of Langevin based techniques to a wider framework than the one of strongly log-concave distributions. The proof, postponed to Section 7, is a straightforward consequence of Theorem 2.
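A minimal sketch of the MLMC iteration (22) is given below for a hypothetical two-component mixture on the real line, $\pi=\tfrac12\mathcal{N}(-3,1)+\tfrac12\mathcal{N}(3,1)$ , which is not log-concave. The label $\boldsymbol{\eta}\in\{-3,3\}$ is drawn once per chain and $f_{\boldsymbol{\eta}}(\theta)=(\theta-\boldsymbol{\eta})^{2}/2$ up to an additive constant, so $m=M=1$ . Since (17) degenerates when $M=m$ , the step sizes follow (16) with $K_{1}=0$ , which is an assumption of this example rather than a value derived from the theorem.

```python
import math
import random

def mlmc_sample(sample_eta, grad_f_eta, theta0, step_sizes, rng):
    """One draw from the MLMC scheme (22): sample eta once, then run LMC on f_eta."""
    eta = sample_eta(rng)
    theta = theta0
    for h in step_sizes:
        theta = (theta - h * grad_f_eta(eta, theta)
                 + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0))
    return theta

# Hypothetical mixture: eta is -3 or 3 with probability 1/2 each.
def sample_eta(rng):
    return rng.choice([-3.0, 3.0])

def grad_f_eta(eta, theta):
    return theta - eta  # gradient of (theta - eta)^2 / 2

m = M = 1.0
steps = [2.0 / (M + m + (2.0 / 3.0) * m * k) for k in range(300)]  # (16) with K_1 = 0
rng = random.Random(1)
draws = [mlmc_sample(sample_eta, grad_f_eta, 0.0, steps, rng) for _ in range(400)]
frac_positive = sum(d > 0 for d in draws) / len(draws)
mean_abs = sum(abs(d) for d in draws) / len(draws)
print(frac_positive, mean_abs)
```

Both modes are visited in roughly equal proportion because the label is sampled exactly from $\pi_{0}$ ; no chain has to cross the low-density region between the modes.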

3 Guarantees for the inaccurate gradient version

In some situations, the precise evaluation of the gradient $\nabla f(\boldsymbol{\theta})$ is computationally expensive or practically impossible, but it is possible to obtain noisy evaluations of $\nabla f$ at any point. This is the setting considered in the present section. More precisely, we assume that at any point $\boldsymbol{\vartheta}_{k,h}\in\mathbb{R}^{p}$ of the LMC algorithm, we can observe the value

\boldsymbol{Y}_{k,h}=\nabla f(\boldsymbol{\vartheta}_{k,h})+\boldsymbol{\zeta}_{k},

(24)

where $\{\boldsymbol{\zeta}_{k}:\,k=0,1,\ldots\}$ is a sequence of random (noise) vectors. The noisy LMC (nLMC) algorithm is defined as

\boldsymbol{\vartheta}_{k+1,h}=\boldsymbol{\vartheta}_{k,h}-h\boldsymbol{Y}_{k,h}+\sqrt{2h}\;\boldsymbol{\xi}_{k+1};\qquad k=0,1,2,\ldots

(25)

where $h>0$ and $\boldsymbol{\xi}_{k+1}$ are as in (3). The noise $\{\boldsymbol{\zeta}_{k}:\,k=0,1,\ldots\}$ is assumed to satisfy the following condition.

Condition N: for some $\delta>0$ and $\sigma>0$ and for every $k\in\mathbb{N}$ ,

1.

(bounded bias) $\mathbf{E}\big[\big\|\mathbf{E}(\boldsymbol{\zeta}_{k}|\boldsymbol{\vartheta}_{k,h})\big\|_{2}^{2}\big]\leq\delta^{2}p$ ,
2.

(bounded variance) $\mathbf{E}[\|\boldsymbol{\zeta}_{k}-\mathbf{E}(\boldsymbol{\zeta}_{k}|\boldsymbol{\vartheta}_{k,h})\|_{2}^{2}]\leq\sigma^{2}p$ ,
3.

(independence of updates) $\boldsymbol{\xi}_{k+1}$ in (25) is independent of $(\boldsymbol{\zeta}_{0},\ldots,\boldsymbol{\zeta}_{k})$ .

We emphasize right away that the random vectors $\boldsymbol{\zeta}_{k}$ are not assumed to be independent, as opposed to what is done in ¹³. The next theorem extends the guarantees of Theorem 1 to the inaccurate-gradient setting and to the nLMC algorithm.

Theorem 4.

Let $\boldsymbol{\vartheta}_{K,h}$ be the $K$ -th iterate of the nLMC algorithm (25) and $\nu_{K}$ be its distribution. If the function $f$ satisfies condition (1) and $h\leq\nicefrac{{2}}{{(m+M)}}$ then

\begin{aligned}W_{2}(\nu_{K},\pi)\leq{}&(1-mh)^{K}W_{2}(\nu_{0},\pi)+1.65(M/m)(hp)^{1/2}\\ &+\frac{\delta\sqrt{p}}{m}+\frac{\sigma^{2}(hp)^{1/2}}{1.65M+\sigma\sqrt{m}}.\end{aligned}

(26), (27)
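To see how the four terms of the bound in Theorem 4 compare for given constants, one can evaluate them directly. The helper below is a sketch with a hypothetical name; it simply transcribes the right-hand side of (26)-(27) and reduces to Theorem 1(a) when $\delta=\sigma=0$ .

```python
import math

def nlmc_bound(K, h, m, M, p, w0, delta, sigma):
    """Right-hand side of (26)-(27); Theorem 4 assumes 0 < h <= 2/(m+M)."""
    if not 0.0 < h <= 2.0 / (m + M):
        raise ValueError("Theorem 4 assumes 0 < h <= 2/(m+M)")
    contraction = (1.0 - m * h) ** K * w0
    discretization = 1.65 * (M / m) * math.sqrt(h * p)
    bias = delta * math.sqrt(p) / m
    variance = sigma ** 2 * math.sqrt(h * p) / (1.65 * M + sigma * math.sqrt(m))
    return contraction + discretization + bias + variance

# Illustrative constants (assumptions): m = 10, M = 20, p = 25, W_2(nu_0, pi) = 5.
exact = nlmc_bound(100, 0.01, 10.0, 20.0, 25, 5.0, delta=0.0, sigma=0.0)
noisy = nlmc_bound(100, 0.01, 10.0, 20.0, 25, 5.0, delta=0.1, sigma=2.0)
print(exact, noisy)
```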

To the best of our knowledge, the first result providing guarantees for sampling from a distribution when precise evaluations of the log-density or its gradient are not available was established in ¹³. Prior to that work, some asymptotic results had been established in ³. The closely related problem of computing an average value with respect to a distribution, when the gradient of its log-density is known up to an additive noise, has been studied by ³², ³³, ²³, ⁹. Note that these settings are of the same flavor as those of stochastic approximation, an active area of research in optimization and machine learning.

As compared to the analogous result in ¹³, Theorem 4 above has several advantages. First, it extends the applicability of the result to the case of a biased noise. In other words, it allows for $\boldsymbol{\zeta}_{k}$ with nonzero means. Second, it considerably relaxes the independence assumption on the sequence $\{\boldsymbol{\zeta}_{k}\}$ , by replacing it with the independence of the updates. Third, and perhaps most importantly, Theorem 4 improves the dependence of the upper bound on $\sigma$ . Indeed, while the last term in the upper bound in Theorem 4 is $O(\sigma^{2})$ when $\sigma\to 0$ , the corresponding term in ¹³ Th. 3 is only $O(\sigma)$ .

To understand the potential scope of applicability of Theorem 4, let us consider a generic example in which $f(\boldsymbol{\theta})$ is the average of $n$ functions defined through independent random variables $X_{1},\ldots,X_{n}$ :

f(\boldsymbol{\theta})=\frac{1}{n}\sum_{i=1}^{n}\ell(\boldsymbol{\theta},X_{i}).

When the gradient of $\ell(\boldsymbol{\theta},X_{i})$ with respect to parameter $\boldsymbol{\theta}$ is hard to compute, one can replace the evaluation of $\nabla f(\boldsymbol{\vartheta}_{k,h})$ at each step $k$ by that of $Y_{k}=\nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\vartheta}_{k,h},X_{N_{k}})$ , where $N_{k}$ is a random variable uniformly distributed in $\{1,\ldots,n\}$ and independent of $\boldsymbol{\vartheta}_{k,h}$ . Under suitable assumptions, this random vector satisfies the conditions of Theorem 4 with $\delta=0$ and constant $\sigma^{2}$ . Therefore, if we analyze the upper bound provided by (26), we see that the last term, due to the subsampling, is of the same order of magnitude as the second term. Thus, using the subsampled gradient in the LMC algorithm does not cause a significant deterioration of the precision while reducing considerably the computational burden.
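The subsampling scheme just described can be sketched as follows for a toy one-dimensional loss $\ell(\theta,x)=(\theta-x)^{2}/2$ , an assumption chosen so that $f$ (the average of the $\ell(\theta,X_{i})$ ) satisfies (1) with $m=M=1$ and $\pi$ is Gaussian with mean $\bar{X}$ . The function name `nlmc_subsampled` and the data are hypothetical.

```python
import math
import random

def nlmc_subsampled(data, theta0, h, K, rng):
    """nLMC (25) where Y_k is the gradient of l(theta, X_{N_k}) at one random index."""
    theta = theta0
    for _ in range(K):
        x = data[rng.randrange(len(data))]  # N_k uniform over the n data points
        y = theta - x                       # gradient of (theta - x)^2 / 2 in theta
        theta = theta - h * y + math.sqrt(2.0 * h) * rng.gauss(0.0, 1.0)
    return theta

data = [-1.0, 0.0, 0.5, 2.5]                # illustrative sample; its mean is 0.5
rng = random.Random(2)
finals = [nlmc_subsampled(data, 0.0, 0.05, 200, rng) for _ in range(500)]
mean_final = sum(finals) / len(finals)
print(mean_final)
```

Here the noise $\zeta_{k}=\bar{X}-X_{N_{k}}$ has zero conditional mean, so $\delta=0$ , and its variance is the empirical variance of the data, which does not grow with $n$ because $f$ is an average rather than a sum.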

Note that Theorem 4 makes it possible to handle situations in which the approximations of the gradient are biased. This bias is controlled by the parameter $\delta$ . Such a bias can appear when using deterministic approximations of integrals or differentials. For instance, in statistical models with latent variables, the gradient of the log-likelihood often has an integral form. Such integrals can be approximated using quadrature rules, yielding a bias term, or Monte Carlo methods, yielding a variance term.

In the preliminary version ¹³ of this work, we made a mistake by claiming that the stochastic gradient version of the LMC, introduced in ³⁴ and often referred to as Stochastic Gradient Langevin Dynamics (SGLD), has an error of the same order as its non-stochastic version. This claim is wrong: when $f(\boldsymbol{\theta})=\sum_{i=1}^{n}\ell(\boldsymbol{\theta},X_{i})$ with a strongly convex function $\boldsymbol{\theta}\mapsto\ell(\boldsymbol{\theta},x)$ and iid variables $X_{1},\ldots,X_{n}$ , we have $m$ and $M$ proportional to $n$ . Therefore, choosing $Y_{k}=n\nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\vartheta}_{k,h},X_{N_{k}})$ as a noisy version of the gradient (where $N_{k}$ is a random variable uniformly distributed over $\{1,\ldots,n\}$ and independent of $\boldsymbol{\vartheta}_{k,h}$ ), we get $\delta=0$ but $\sigma^{2}$ proportional to $n^{2}$ . Consequently, the last term in (26) is of order $(nhp)^{1/2}$ and dominates the other terms. Furthermore, replacing $Y_{k}$ by $Y_{k}=\frac{n}{s}\sum_{j=1}^{s}\nabla_{\boldsymbol{\theta}}\ell(\boldsymbol{\vartheta}_{k,h},X_{N^{j}_{k}})$ with iid variables $N_{k}^{1},\ldots,N_{k}^{s}$ does not help, since then $\sigma^{2}$ is of order $n^{2}/s$ and the last term in (26) is of order $(nhp/s)^{1/2}$ , which is still larger than the term $(M/m)(hp)^{1/2}$ . This discussion shows that Theorem 4 applied to SGLD is of limited interest. For a more in-depth analysis of the SGLD, we refer the reader to ²³, ²⁵, ³⁶.

It is also worth mentioning here that another example of approximate gradient—based on a quadratic approximation of the log-likelihood of the generalized linear model—has been considered in ¹⁹ Section 5. It corresponds, in terms of condition N, to a situation in which the variance $\sigma^{2}$ vanishes but the bias $\delta$ is non-zero.

An important ingredient of the proof of Theorem 4 is the following simple result, which can be useful in other contexts as well (for a proof, see Lemma 7 in Section 7.7 below).

Lemma 1.

Let $A$ , $B$ and $C$ be given non-negative numbers such that $A\in(0,1)$ . Assume that the sequence of non-negative numbers $\{x_{k}\}_{k=0,1,2,\ldots}$ satisfies the recursive inequality

x^{2}_{k+1}\leq[(1-A)x_{k}+C]^{2}+B^{2}

(28)

for every integer $k\geq 0$ . Then, for all integers $k\geq 0$ ,

$\displaystyle x_{k}\leq(1-A)^{k}x_{0}+\frac{C}{A}+\frac{B^{2}}{C+\sqrt{A}\,B}.$ (29)

Thanks to this lemma, the upper bound on the Wasserstein distance provided by (26) is sharper than the one proposed in ¹³.
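To make Lemma 1 concrete, the following Python sketch iterates the recursion (28) with equality, which is the largest sequence the lemma allows, and compares every iterate with the right-hand side of (29). The function names and the numerical values of $A$ , $B$ , $C$ and $x_{0}$ are illustrative assumptions and do not come from the paper.

```python
import math

def lemma1_bound(k, x0, A, B, C):
    """Right-hand side of (29)."""
    return (1 - A) ** k * x0 + C / A + B ** 2 / (C + math.sqrt(A) * B)

def extremal_sequence(n, x0, A, B, C):
    """Sequence satisfying (28) with equality at every step."""
    xs = [x0]
    for _ in range(n):
        xs.append(math.sqrt(((1 - A) * xs[-1] + C) ** 2 + B ** 2))
    return xs

# Hypothetical values, chosen only for illustration (A must lie in (0, 1)).
A, B, C, x0 = 0.1, 0.7, 0.3, 5.0
xs = extremal_sequence(200, x0, A, B, C)
lemma1_holds = all(
    x <= lemma1_bound(k, x0, A, B, C) + 1e-12 for k, x in enumerate(xs)
)
```

Since any sequence satisfying (28) is dominated by the extremal one started from the same $x_{0}$ , checking the extremal sequence is enough for a numerical sanity check of this kind.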

4 Guarantees under additional smoothness

When the function $f$ has Lipschitz continuous Hessian, one can get improved rates of convergence. This has been noted by ¹⁴, where the author proposed to use a modified version of the LMC algorithm, the LMC with Ozaki discretization, in order to take advantage of the smoothness of the Hessian. On the other hand, it has been proved in ¹, ² that the boundedness of the third order derivative of $f$ (equivalent to the boundedness of the second-order derivative of the drift of the Langevin diffusion) implies that the Wasserstein distance between the marginals of the Langevin diffusion and its Euler discretization are of order $h\sqrt{\log(1/h)}$ . Note however, that in ² there is no evaluation of the impact of the dimension on the quality of the Euler approximation. This evaluation has been done by ¹⁵ by showing that the Wasserstein error of the Euler approximation is of order $hp$ . This raises the following important question: is it possible to get advantage of the Lipschitz continuity of the Hessian of $f$ in order to improve the guarantees on the quality of sampling by the standard LMC algorithm. The answer of this question is affirmative and is stated in the next theorem.

In what follows, for any matrix $\mathbf{M}$ , we denote by $\|\mathbf{M}\|$ and $\|\mathbf{M}\|_{F}$ , respectively, the spectral norm and the Frobenius norm of $\mathbf{M}$ . We write $\mathbf{M}\preceq\mathbf{M}^{\prime}$ or $\mathbf{M}^{\prime}\succeq\mathbf{M}$ to indicate that the matrix $\mathbf{M}^{\prime}-\mathbf{M}$ is positive semi-definite.

Condition F: the function $f$ is twice differentiable and for some positive numbers $m$ , $M$ and $M_{2}$ ,

1. (strong convexity) $\nabla^{2}f(\boldsymbol{\theta})\succeq m\mathbf{I}_{p}$ , for every $\boldsymbol{\theta}\in\mathbb{R}^{p}$ ,

2. (bounded second derivative) $\nabla^{2}f(\boldsymbol{\theta})\preceq M\mathbf{I}_{p}$ , for every $\boldsymbol{\theta}\in\mathbb{R}^{p}$ ,

3. (further smoothness) $\|\nabla^{2}f(\boldsymbol{\theta})-\nabla^{2}f(\boldsymbol{\theta}^{\prime})\|\leq M_{2}\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\|_{2}$ , for every $\boldsymbol{\theta},\boldsymbol{\theta}^{\prime}\in\mathbb{R}^{p}$ .

Theorem 5.

Let $\boldsymbol{\vartheta}_{K,h}$ be the $K$ -th iterate of the nLMC algorithm (25) and $\nu_{K}$ be its distribution. Assume that conditions F and N are satisfied. Then, for every $h\leq\nicefrac{{2}}{{(m+M)}}$ , we have

$\displaystyle W_{2}(\nu_{K},\pi)\leq(1-mh)^{K}W_{2}(\nu_{0},\pi)+\frac{M_{2}hp}{2m}+\frac{11Mh\sqrt{Mp}}{5m}$ (30)
$\displaystyle\qquad\qquad\qquad+\frac{\delta\sqrt{p}}{m}+\frac{2\sigma^{2}\sqrt{hp}}{M_{2}\sqrt{hp}+2\sigma\sqrt{m}}.$ (31)

In the last inequality, $11/5$ is an upper bound for $0.5+2\sqrt{2/3}\approx 2.133$ .

When applying the nLMC algorithm to sample from a target density, the user may usually specify four parameters: the step-size $h$ , the number of iterations $K$ , the tolerated precision $\delta$ of the deterministic approximation and the precision $\sigma$ of the stochastic approximation. An attractive feature of Theorem 5 is that the contributions of these four parameters are well separated, especially if we upper bound the last term by $2\sigma^{2}/M_{2}$ . As a consequence, in order to have an error of order $\epsilon$ in Wasserstein distance, we might choose: $\sigma$ at most of order $\sqrt{\epsilon}$ , $\delta$ at most of order $m\epsilon/\sqrt{p}$ , $h$ of order $\epsilon/p$ and $K$ of order $(p/m\epsilon)\log(p/\epsilon)$ . Akin to Theorem 2, one can use variable step-sizes to avoid the logarithmic factor; we leave these computations to the reader.
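The separation of the four parameters can be illustrated with a short Python sketch that evaluates each term of the bound (30)-(31) individually. The function name `theorem5_terms`, the dictionary keys and all numerical constants below are hypothetical choices made for illustration; only the formulas are taken from Theorem 5.

```python
import math

def theorem5_terms(K, h, m, M, M2, p, delta, sigma, w0):
    """Return the five terms on the right-hand side of (30)-(31)."""
    if not 0 < h <= 2.0 / (m + M):
        raise ValueError("Theorem 5 requires 0 < h <= 2/(m+M)")
    sqrt_hp = math.sqrt(h * p)
    return {
        "initial": (1 - m * h) ** K * w0,            # decays with K
        "hessian": M2 * h * p / (2 * m),             # driven by h
        "discretization": 11 * M * h * math.sqrt(M * p) / (5 * m),
        "bias": delta * math.sqrt(p) / m,            # driven by delta
        "noise": 2 * sigma ** 2 * sqrt_hp
                 / (M2 * sqrt_hp + 2 * sigma * math.sqrt(m)),
    }

# Hypothetical problem constants and target precision.
m, M, M2, p, w0 = 1.0, 4.0, 1.0, 10, 3.0
eps = 0.1
h = eps / p                                   # step-size of order eps/p
K = math.ceil((p / (m * eps)) * math.log(p / eps))
delta = m * eps / math.sqrt(p)                # bias of order m*eps/sqrt(p)
sigma = math.sqrt(eps)                        # noise level of order sqrt(eps)
terms = theorem5_terms(K, h, m, M, M2, p, delta, sigma, w0)
total = sum(terms.values())
```

With these choices each term is of order $\epsilon$ up to factors depending on $m$ , $M$ and $M_{2}$ , and the noise term never exceeds the cruder bound $2\sigma^{2}/M_{2}$ mentioned above.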

Note that if we instantiate Theorem 5 to the case of accurate gradient evaluations, that is when $\sigma=\delta=0$ , we recover the constant step-size version of ¹⁵ Theorem 8, with optimized constants. Indeed, for constant step-size, ¹⁵ Theorem 8 yields

$\displaystyle W_{2}(\nu_{K},\pi)\leq\Big\{2(1-\bar{m}\,h)^{K}W_{2}(\nu_{0},\pi)^{2}+2ph^{2}\Big(\frac{M^{2}}{\bar{m}}+\frac{M^{4}}{3m\bar{m}^{2}}+\frac{M_{2}^{2}p}{3\bar{m}^{2}}+O(h)\Big)\Big\}^{1/2},$ (32)

where $\bar{m}=\frac{mM}{m+M}\in[m/2,m)$ and the term $O(h)$ can be given explicitly. A visual comparison of the optimal number of iterations obtained from this bound to that obtained from Theorem 5 (with $\delta=\sigma=0$ ) is provided in Figure 2.

Under the assumption of Lipschitz continuity of the Hessian of $f$ , one may wonder whether second-order methods that make use of the Hessian in addition to the gradient are able to outperform the standard LMC algorithm. The most relevant candidate algorithms for this are the LMC with Ozaki discretization (LMCO) and a variant of it, LMCO’, a slightly modified version of an algorithm introduced in ¹⁴. The LMCO is a recursive algorithm the update rule of which is defined as follows: For every $k\geq 0$ , we set $\mathbf{H}_{k}=\nabla^{2}f(\boldsymbol{\vartheta}_{k,h}^{\rm LMCO})$ , which is an invertible $p\times p$ matrix since $f$ is strongly convex, and define

	$\displaystyle\mathbf{M}_{k}=\big{(}\mathbf{I}_{p}-e^{-h\mathbf{H}_{k}}\big{)}% \mathbf{H}_{k}^{-1},\qquad\boldsymbol{\Sigma}_{k}=\big{(}\mathbf{I}_{p}-e^{-2h% \mathbf{H}_{k}}\big{)}\mathbf{H}_{k}^{-1},$		(33)
	$\displaystyle\boldsymbol{\vartheta}_{k+1,h}^{\rm LMCO}=\boldsymbol{\vartheta}_% {k,h}^{\rm LMCO}-\mathbf{M}_{k}\nabla f\big{(}\boldsymbol{\vartheta}_{k,h}^{% \rm LMCO}\big{)}+\boldsymbol{\Sigma}_{k}^{1/2}\boldsymbol{\xi}_{k+1},$		(34)

where $\{\boldsymbol{\xi}_{k}:k\in\mathbb{N}\}$ is a sequence of independent random vectors distributed according to the $\mathcal{N}_{p}(0,\mathbf{I}_{p})$ distribution. The LMCO’ algorithm is based on approximating the matrix exponentials by linear functions, more precisely, for $\mathbf{H}_{k}^{\prime}=\nabla^{2}f(\boldsymbol{\vartheta}_{k,h}^{\rm LMCO^{% \prime}})$ ,

	$\displaystyle\boldsymbol{\vartheta}_{k+1,h}^{\rm LMCO^{\prime}}=\,$	$\displaystyle\boldsymbol{\vartheta}_{k,h}^{\rm LMCO^{\prime}}-h\Big{(}\mathbf{% I}_{p}-\frac{1}{2}h\mathbf{H}_{k}^{\prime}\Big{)}\nabla f\big{(}\boldsymbol{% \vartheta}_{k,h}^{\rm LMCO^{\prime}}\big{)}$		(35)
		$\displaystyle+\sqrt{2h}\Big{(}\mathbf{I}_{p}-h\mathbf{H}_{k}^{\prime}+\frac{1}% {3}h^{2}(\mathbf{H}_{k}^{\prime})^{2}\Big{)}^{1/2}\boldsymbol{\xi}_{k+1}.$		(36)

Let us mention right away that the stochastic perturbation present in the last display can be computed in practice without taking the matrix square-root. Indeed, it suffices to generate two independent standard Gaussian vectors ${\boldsymbol{\eta}}_{k+1}$ and ${\boldsymbol{\eta}}_{k+1}^{\prime}$ ; then the random vector

\displaystyle\big{(}\mathbf{I}_{p}-(\nicefrac{{1}}{{2}})h\mathbf{H}_{k}^{% \prime}\big{)}{\boldsymbol{\eta}}_{k+1}+(\nicefrac{{\sqrt{3}}}{{6}})\,h\,% \mathbf{H}_{k}^{\prime}{\boldsymbol{\eta}}^{\prime}_{k+1}

(37)

has exactly the same distribution as $\big{(}\mathbf{I}_{p}-h\mathbf{H}_{k}^{\prime}+(\nicefrac{{1}}{{3}})h^{2}(% \mathbf{H}_{k}^{\prime})^{2}\big{)}^{1/2}\boldsymbol{\xi}_{k+1}$ .
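The equality in distribution follows because both vectors are centered Gaussian and their covariance matrices coincide: $(\mathbf{I}_{p}-\frac{1}{2}h\mathbf{H})^{2}+\frac{1}{12}h^{2}\mathbf{H}^{2}=\mathbf{I}_{p}-h\mathbf{H}+\frac{1}{3}h^{2}\mathbf{H}^{2}$ for any symmetric $\mathbf{H}$ . The Python sketch below draws the vector (37) using only two matrix-vector products and checks the variance identity along eigen-directions of $\mathbf{H}$ . The helper names, the $2\times 2$ matrix standing in for the Hessian and the step-size are assumptions made for illustration.

```python
import math
import random

def matvec(H, v):
    """Product of a square matrix (list of rows) with a vector."""
    return [sum(h_ij * v_j for h_ij, v_j in zip(row, v)) for row in H]

def lmco_prime_noise(H, h, rng):
    """Draw the vector (37): no matrix square root is needed."""
    p = len(H)
    eta = [rng.gauss(0.0, 1.0) for _ in range(p)]
    eta_prime = [rng.gauss(0.0, 1.0) for _ in range(p)]
    H_eta = matvec(H, eta)
    H_eta_prime = matvec(H, eta_prime)
    c = math.sqrt(3.0) / 6.0
    return [e - 0.5 * h * a + c * h * b
            for e, a, b in zip(eta, H_eta, H_eta_prime)]

def variance_gap(lam, h):
    """Variance of (37) minus variance of the square-root form,
    along an eigen-direction of H with eigenvalue lam."""
    two_gaussians = (1 - 0.5 * h * lam) ** 2 + (math.sqrt(3) / 6 * h * lam) ** 2
    square_root_form = 1 - h * lam + (h * lam) ** 2 / 3
    return two_gaussians - square_root_form

H = [[2.0, 0.5], [0.5, 1.0]]   # hypothetical symmetric positive definite Hessian
sample = lmco_prime_noise(H, 0.1, random.Random(0))
max_gap = max(abs(variance_gap(lam, 0.1)) for lam in (0.1, 1.0, 10.0, 100.0))
```

To obtain the stochastic term of (36), the sampled vector still has to be multiplied by $\sqrt{2h}$ .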

In the rest of this section, we provide guarantees for methods LMCO and LMCO’. Note that we consider only the case where the gradient and the Hessian of $f$ are computed exactly, that is without any approximation.

Theorem 6.

Let $\nu_{K}^{\rm LMCO}$ and $\nu_{K}^{\rm LMCO^{\prime}}$ be, respectively, the distributions of the $K$ -th iterate of the LMCO algorithm (34) and the LMCO’ algorithm (36) with an initial distribution $\nu_{0}$ . Assume that conditions F and N are satisfied. Then, for every $h\leq\nicefrac{{m}}{{M^{2}}}$ ,

\displaystyle W_{2}(\nu_{K}^{\rm LMCO},\pi)\leq(1-0.25mh)^{K}W_{2}(\nu_{0},\pi% )+\frac{11.5M_{2}h(p+1)}{m}.

(38)

If, in addition, $h\leq\nicefrac{{3m}}{{4M^{2}}}$ , then

\displaystyle W_{2}(\nu_{K}^{\rm LMCO^{\prime}},\pi)

\displaystyle\leq(1-0.25mh)^{K}W_{2}(\nu_{0},\pi)+\frac{1.3M^{2}h^{2}\sqrt{Mp}% }{m}+\frac{7.3M_{2}h(p+1)}{m}.

(39)

A very rough consequence of this theorem is that one has similar theoretical guarantees for the LMCO and the LMCO’ algorithms, since in most situations the middle term in the right-hand side of (39) is smaller than the last term. On the other hand, the per-iteration cost of the modified algorithm LMCO’ is significantly smaller than the per-iteration cost of the original LMCO. Indeed, for the LMCO’ there is no need to compute matrix exponentials or to invert matrices; one only needs to perform matrix-vector multiplications for $p\times p$ matrices. Note that for many matrices such a multiplication might be very cheap using the fast Fourier transform or other similar techniques. In addition, the computational complexity of the Hessian-vector product is provably of the same order as that of evaluating the gradient, see ¹⁸. Therefore, one iteration of the LMCO’ algorithm is not more costly than one iteration of the LMC. At the same time, the error bound (39) for the LMCO’ is smaller than the one for the LMC provided by Theorem 5. Indeed, the term $Mh\sqrt{Mp}$ present in the bound of Theorem 5 is generally of larger order than the term $(Mh)^{2}\sqrt{Mp}$ appearing in (39).

5 Relation with optimization

We have already mentioned that the LMC algorithm is very close to the gradient descent algorithm for computing the minimum $\boldsymbol{\theta}^{*}$ of the function $f$ . However, when we compare the guarantees of Theorem 1 with those available for the optimization problem, we remark the following striking difference. The approximate computation of $\boldsymbol{\theta}^{*}$ requires a number of steps of the order of $\log(1/\varepsilon)$ to reach the precision $\varepsilon$ , whereas, for reaching the same precision in sampling from $\pi$ , the LMC algorithm needs a number of iterations proportional to $(p/\varepsilon^{2})\log(p/\varepsilon)$ . The goal of this section is to explain that this behavior of the LMC algorithm, disappointing at first sight, is in fact consistent with the exponential convergence of the gradient descent. Furthermore, the latter is obtained from the guarantees on the LMC by letting a temperature parameter go to zero.

The main ingredient for the explanation is that the function $f(\boldsymbol{\theta})$ and the function $f_{\tau}(\boldsymbol{\theta})=f(\boldsymbol{\theta})/\tau$ have the same point of minimum $\boldsymbol{\theta}^{*}$ , whatever the real number $\tau>0$ . In addition, if we define the density function $\pi_{\tau}(\boldsymbol{\theta})\propto\exp\big{(}-f_{\tau}(\boldsymbol{\theta}% )\big{)}$ , then the average value

\bar{\boldsymbol{\theta}}_{\tau}=\int_{\mathbb{R}^{p}}\boldsymbol{\theta}\,\pi% _{\tau}(\boldsymbol{\theta})\,d\boldsymbol{\theta}

tends to the minimum point $\boldsymbol{\theta}^{*}$ when $\tau$ goes to zero. Furthermore, the distribution $\pi_{\tau}(d\boldsymbol{\theta})$ tends to the Dirac measure at $\boldsymbol{\theta}^{*}$ . Clearly, $f_{\tau}$ satisfies (1) with the constants $m_{\tau}=m/\tau$ and $M_{\tau}=M/\tau$ . Therefore, on the one hand, we can apply to $\pi_{\tau}$ claim (a) of Theorem 1, which tells us that if we choose $h=1/M_{\tau}=\tau/M$ , then

W_{2}(\nu_{K},\pi_{\tau})\leq\Big{(}1-\frac{m}{M}\Big{)}^{K}W_{2}(\delta_{% \boldsymbol{\theta}_{0}},\pi_{\tau})+1.65\Big{(}\frac{M}{m}\Big{)}\Big{(}\frac% {p\tau}{M}\Big{)}^{1/2}.

(40)

On the other hand, the LMC algorithm with the step-size $h=\tau/M$ applied to $f_{\tau}$ reads as

\boldsymbol{\vartheta}_{k+1,h}=\boldsymbol{\vartheta}_{k,h}-\frac{1}{M}\nabla f% (\boldsymbol{\vartheta}_{k,h})+\sqrt{\frac{2\tau}{M}}\;\boldsymbol{\xi}_{k+1};% \qquad k=0,1,2,\ldots

(41)

When the parameter $\tau$ goes to zero, the LMC sequence (41) tends to the gradient descent sequence $\boldsymbol{\theta}_{k}$ . Therefore, the limiting case of (40) corresponding to $\tau\to 0$ reads

$\displaystyle\|\boldsymbol{\theta}_{K}-\boldsymbol{\theta}^{*}\|_{2}\leq\Big(1-\frac{m}{M}\Big)^{K}\|\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{*}\|_{2},$ (42)

which is a well-known result in Optimization. This clearly shows that Theorem 1 is a natural extension of the results of convergence from optimization to sampling.
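The passage from (41) to gradient descent can be observed numerically. The Python sketch below runs the update (41) on a toy quadratic function with $m=1$ and $M=4$ , once with $\tau=0$ (plain gradient descent) and once with a very small $\tau$ . The toy function, the constant names and the starting point are assumptions introduced for illustration only.

```python
import math
import random

M_CONST, m_CONST = 4.0, 1.0   # hypothetical smoothness / strong-convexity constants

def grad_f(theta):
    """Gradient of the toy function f(theta) = (m*theta_1^2 + M*theta_2^2)/2,
    whose minimizer is the origin."""
    return [m_CONST * theta[0], M_CONST * theta[1]]

def tempered_lmc(theta0, tau, K, seed=0):
    """Run K steps of (41): gradient step of size 1/M plus noise of
    scale sqrt(2*tau/M)."""
    rng = random.Random(seed)
    theta = list(theta0)
    noise_scale = math.sqrt(2.0 * tau / M_CONST)
    for _ in range(K):
        g = grad_f(theta)
        theta = [t - gi / M_CONST + noise_scale * rng.gauss(0.0, 1.0)
                 for t, gi in zip(theta, g)]
    return theta

def norm(v):
    return math.sqrt(sum(x * x for x in v))

theta0, K = [2.0, -1.0], 20
gd_iterate = tempered_lmc(theta0, tau=0.0, K=K)      # exact gradient descent
cold_iterate = tempered_lmc(theta0, tau=1e-8, K=K)   # nearly deterministic chain
rate_bound = (1 - m_CONST / M_CONST) ** K * norm(theta0)  # right-hand side of (42)
distance = norm([a - b for a, b in zip(gd_iterate, cold_iterate)])
```

In this example the gradient descent iterate satisfies (42), and the low-temperature chain stays within a small distance of it, as the limiting argument suggests.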

Such an analogy holds true for the Newton method as well. Its counterpart in sampling is the LMCO algorithm. Indeed, one easily checks that if $f$ is replaced by $f_{\tau}$ with $\tau$ going to zero, then, for any fixed step-size $h$ , the matrix $\boldsymbol{\Sigma}_{k}$ in (34) tends to zero. This implies that the stochastic perturbation vanishes. On the other hand, the term $\mathbf{M}_{k,\tau}\nabla f_{\tau}(\boldsymbol{\vartheta}_{k,h}^{\rm LMCO})$ tends to $\{\nabla^{2}f(\boldsymbol{\vartheta}_{k,h}^{\rm LMCO})\}^{-1}\nabla f(% \boldsymbol{\vartheta}_{k,h}^{\rm LMCO})$ , as $\tau\to 0$ . Thus, the updates of the Newton algorithm can be seen as the limit case, when $\tau$ goes to zero, of the updates of the LMCO.

However, if we replace $f$ by $f_{\tau}$ in the upper bounds stated in Theorem 6 and we let $\tau$ go to zero, we do not retrieve the well-known guarantees for the Newton method. The main reason is that Theorem 6 describes the behavior of the LMCO algorithm in the regime of small step-sizes $h$ , whereas Newton’s method corresponds to (a limit case of) the LMCO with a fixed $h$ . Using arguments similar to those employed in the proof of Theorem 6, one can establish the following result, the proof of which is postponed to Section 7.

Proposition 1.

Let $\nu_{K}^{\rm LMCO}$ be the distribution of the $K$ -th iterate of the LMCO algorithm (34) with an initial distribution $\nu_{0}$ . Assume that condition F is satisfied. Then, for every $h>0$ and $K\in\mathbb{N}$ ,

\displaystyle W_{2}(\nu_{K}^{\rm LMCO},\pi)

\displaystyle\leq\frac{2m}{M_{2}}\big{(}w_{K}\exp(v_{K}w_{K}^{-2^{K}})\big{)}^% {2^{K}}

(43)

with

\displaystyle w_{K}

\displaystyle=\frac{M_{2}W_{2^{K+1}}(\nu_{0},\pi)}{2m}+\frac{1}{2}e^{-mh},\ % \text{and}\ v_{K}=\frac{2M_{2}M^{3/2}\sqrt{2p+2^{K}}}{m^{3}}+e^{-mh}.

(44)

If we replace in the right hand side of (43) the quantities $m$ , $M$ and $M_{2}$ , respectively, by $m_{\tau}=m/\tau$ , $M_{\tau}=M/\tau$ and $M_{2,\tau}=M_{2}/\tau$ , and we let $\tau$ go to zero, then it is clear that the term $v_{K}$ vanishes. On the other hand, if $\nu_{0}$ is the Dirac mass at some point $\boldsymbol{\theta}_{0}$ , then $w_{K}$ converges to $M_{2}\|\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{*}\|_{2}/(2m)$ . Therefore, for Newton’s algorithm as a limiting case of (43) we get

\displaystyle\|\boldsymbol{\theta}_{K}^{\rm Newton}-\boldsymbol{\theta}^{*}\|_% {2}\leq\frac{2m}{M_{2}}\bigg{(}\frac{M_{2}\|\boldsymbol{\theta}_{0}-% \boldsymbol{\theta}^{*}\|_{2}}{2m}\bigg{)}^{2^{K}}.

(45)

The latter provides the so called quadratic rate of convergence, which is a well-known result that can be found in many textbooks; see, for instance, ¹² Theorem 9.1.
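As a sanity check of (45), the following Python sketch runs Newton's method on a one-dimensional function satisfying condition F and compares the error with the right-hand side of (45). The test function $f(x)=x^{2}/2+\log\cosh(x)$ is our own choice and does not appear in the paper; for it, $f^{\prime\prime}(x)=1+1/\cosh^{2}(x)\geq 1$ and $|f^{\prime\prime\prime}(x)|\leq 4/(3\sqrt{3})$ , so that one can take $m=1$ and $M_{2}=4/(3\sqrt{3})$ , and the minimizer is $x^{*}=0$ .

```python
import math

m, M2 = 1.0, 4.0 / (3.0 * math.sqrt(3.0))   # constants of the toy function below

def newton_step(x):
    """One Newton update for f(x) = x^2/2 + log(cosh(x)), minimized at 0."""
    grad = x + math.tanh(x)
    hess = 1.0 + 1.0 / math.cosh(x) ** 2
    return x - grad / hess

def bound_45(K, x0):
    """Right-hand side of (45) with theta* = 0."""
    return (2 * m / M2) * (M2 * abs(x0) / (2 * m)) ** (2 ** K)

x0 = 1.0
iterates = [x0]
for _ in range(5):
    iterates.append(newton_step(iterates[-1]))

# Pairs (error after K steps, bound (45) for the same K).
errors_and_bounds = [(abs(x), bound_45(K, x0)) for K, x in enumerate(iterates)]
```

The bound is informative only when $M_{2}\|\boldsymbol{\theta}_{0}-\boldsymbol{\theta}^{*}\|_{2}<2m$ , which holds for the starting point used here.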

A particularly promising remark made in Section 2.3 is that all the results established for the problem of approximate sampling from a log-concave distribution can be carried over to distributions that can be written as a mixture of (strongly) log-concave distributions. The only required condition is to be able to sample from the mixing distribution. This provides a well identified class of (posterior) distributions for which the problem of finding the mode is difficult (because of nonconvexity) whereas the sampling problem can be solved efficiently.

There are certainly other interesting connections to uncover between sampling and optimization. In particular, in ²², it was shown that in the case of mixture distributions, sampling algorithms scale linearly with the model dimension, as opposed to those of optimization, which have exponential scaling. One can think of lower bounds for sampling or finding a sampling counterpart of Nesterov acceleration. Some recent advances on the gradient flow ³⁵ might be useful for achieving these goals.

6 Conclusion

We have presented easy-to-use finite-sample guarantees for sampling from a strongly log-concave density using the Langevin Monte-Carlo algorithm with a fixed step-size and extended it to the case where the gradient of the log-density can be evaluated up to some error term. Our results cover both deterministic and random error terms. We have also demonstrated that if the log-density $f$ has a Lipschitz continuous second-order derivative, then one can choose a larger step-size and obtain improved convergence rate.

We have also uncovered some analogies between sampling and optimization. The underlying principle is that an optimization algorithm may be seen as a limit case of a sampling algorithm. Therefore, the results characterizing the convergence of the optimization schemes should have their counterparts for sampling strategies. We have described these analogues for the steepest gradient descent and for the Newton algorithm. However, while in optimization the relevant characteristics of the problem are the dimension $p$ , the desired accuracy $\epsilon$ and the condition number $M/m$ , the sampling problem involves an additional characteristic, which is the scale given by the strong-convexity constant $m$ . Indeed, if we increase $m$ while keeping the condition number $M/m$ constant, the number of iterations for the LMC to reach the precision $\epsilon$ will decrease. In this respect, we have shown that the LMC with Ozaki discretization, termed LMCO, has a better dependence on the overall scale of $f$ than the original LMC algorithm. However, the weakness of the LMCO is the high computational cost of each iteration. Therefore, we have proposed a new algorithm, LMCO’, that improves on the LMC in terms of its dependence on the scale, while each iteration of LMCO’ is computationally much cheaper than each iteration of the LMCO.

Another interesting finding is that, in the case of accurate gradient evaluations (i.e., when there is no error in the gradient computation), a suitably chosen variable step-size leads to logarithmic improvement in the convergence rate of the LMC algorithm.

Interesting directions for future research are establishing lower bounds in the spirit of those existing in optimization, obtaining user-friendly guarantees for computing the posterior mean or for sampling from a non-smooth density. Some of these problems have already been tackled in several papers mentioned in previous sections, but we believe that the techniques developed in the present work might be helpful for revisiting and deepening the existing results.

7 Proofs

The basis of the proofs of all the theorems stated in previous sections is a recursive inequality that upper bounds the error at the step $k+1$ , $W_{2}(\nu_{k+1},\pi)$ , by an expression involving the error of the previous step, $W_{2}(\nu_{k},\pi)$ . To this end, we use the fact that for a suitably chosen Langevin diffusion, $\boldsymbol{L}$ , in stationary regime, we have $W_{2}(\nu_{k},\pi)^{2}=\mathbf{E}[\|\vartheta_{k}-\boldsymbol{L}_{kh}\|_{2}^{2}]$ and $W_{2}(\nu_{k+1},\pi)^{2}\leq\mathbf{E}[\|\vartheta_{k+1}-\boldsymbol{L}_{(k+1)% h}\|_{2}^{2}]$ . The goal is then to upper bound the latter by an expression that involves the former and some suitably controlled remainder terms. This leads to a recursive inequality and the last step of the proof is to unfold the recursion. Since different chains $\vartheta_{k,h}$ are considered in this paper, we get different recursive inequalities. Lemma 7 and Lemma 8 are the new technical tools that are used for solving the encountered recursive inequalities. The remainder terms appearing in the recursive inequalities are evaluated by using stochastic calculus and the smoothness properties of $f$ . The main building blocks for these evaluations are Lemma 3, Lemma 4 and Lemma 6, the latter being used only in the results assuming the Hessian-Lipschitz condition.

We will also make repeated use of the Minkowski inequality and its integral version

$\displaystyle\bigg\{\mathbf{E}\bigg[\bigg(\int_{a}^{b}X_{t}\,dt\bigg)^{p}\bigg]\bigg\}^{1/p}\leq\int_{a}^{b}\big\{\mathbf{E}\big[|X_{t}|^{p}\big]\big\}^{1/p}\,dt,\qquad\forall p\in\mathbb{N}^{*},$ (46)

where $X$ is a random process almost all paths of which are integrable over the interval $[a,b]$ . Furthermore, for any random vector $\boldsymbol{X}$ , we define the norm $\|\boldsymbol{X}\|_{L_{2}}=(\mathbf{E}[\|\boldsymbol{X}\|_{2}^{2}])^{1/2}$ .

The next result is the central ingredient of the proofs of Theorems 1, 2 and 4. Readers interested only in the proofs of Theorems 1 and 2 are invited, in the next proof, to consider the random vectors $\boldsymbol{\zeta}_{k}$ as equal to $\mathbf{0}$ and $\boldsymbol{Y}_{k,\boldsymbol{h}}$ as equal to $\nabla f(\boldsymbol{\vartheta}_{k,\boldsymbol{h}})$ . This implies, in particular, that $\sigma=\delta=0$ .

Proposition 2.

Let us introduce $\varrho_{k+1}=\max(1-mh_{k+1},Mh_{k+1}-1)$ (since $h\in(0,\nicefrac{{2}}{{M}})$ , this value $\varrho$ satisfies $0<\varrho<1$ ). If $f$ satisfies (1) and $h_{k+1}\leq 2/M$ , then

$\displaystyle W_{2}(\nu_{k+1},\pi)^{2}\leq\big\{\varrho_{k+1}W_{2}(\nu_{k},\pi)+\alpha M(h_{k+1}^{3}p)^{1/2}+h_{k+1}\delta\sqrt{p}\big\}^{2}+\sigma^{2}h_{k+1}^{2}p,$ (47)

with $\alpha=7\sqrt{2}/6\leq 1.65$ .

Proof.

To simplify notation, and since there is no risk of confusion, we will write $h$ instead of $h_{k+1}$ . The main steps of the proof are the following. We use a synchronous coupling for approximating the distribution of the LMC sequence by that of a continuous-time Langevin diffusion. We then take advantage of the strong convexity of $f$ for showing that, for $h$ small enough, the error at round $k+1$ is upper bounded, up to an additive remainder term, by the error at round $k$ multiplied by a factor strictly smaller than one, see Lemma 2. The smoothness of the gradient of $f$ ensures that the aforementioned remainder term is small, see Lemma 3 and Lemma 4 below.

Let $\boldsymbol{L}_{0}$ be a random vector drawn from $\pi$ such that $W_{2}(\nu_{k},\pi)=\|\boldsymbol{L}_{0}-\boldsymbol{\vartheta}_{k,\boldsymbol{% h}}\|_{L_{2}}$ and $\mathbf{E}[\boldsymbol{\zeta}_{k}|\boldsymbol{\vartheta}_{k,\boldsymbol{h}},% \boldsymbol{L}_{0}]=\mathbf{E}[\boldsymbol{\zeta}_{k}|\boldsymbol{\vartheta}_{% k,\boldsymbol{h}}]$ . Let $\boldsymbol{W}\!$ be a $p$ -dimensional Brownian Motion independent of $(\boldsymbol{\vartheta}_{k,\boldsymbol{h}},\boldsymbol{L}_{0},\boldsymbol{% \zeta}_{k})$ , such that $\boldsymbol{W}\!_{h}=\sqrt{h}\,\boldsymbol{\xi}_{k+1}$ . We define the stochastic process $\boldsymbol{L}$ so that

\displaystyle\boldsymbol{L}_{t}

\displaystyle=\boldsymbol{L}_{0}-\int_{0}^{t}\nabla f(\boldsymbol{L}_{s})\,ds+% \sqrt{2}\,\boldsymbol{W}\!_{t},\qquad\forall\,t>0.

(48)

It is clear that this equation implies that

	$\displaystyle\displaystyle\boldsymbol{L}_{h}$	$\displaystyle=\boldsymbol{L}_{0}-\int_{0}^{h}\nabla f(\boldsymbol{L}_{s})\,ds+% \sqrt{2}\,\boldsymbol{W}\!_{h}$		(49)
		$\displaystyle=\boldsymbol{L}_{0}-\int_{0}^{h}\nabla f(\boldsymbol{L}_{s})\,ds+% \sqrt{2h}\,\boldsymbol{\xi}_{k+1}.$		(50)

Furthermore, $\{\boldsymbol{L}_{t}:t\geq 0\}$ is a diffusion process having $\pi$ as the stationary distribution. Since the initial value $\boldsymbol{L}_{0}$ is drawn from $\pi$ , we have $\boldsymbol{L}_{t}\sim\pi$ for every $t\geq 0$ .

Let us denote $\boldsymbol{\Delta}_{k}=\boldsymbol{L}_{0}-\boldsymbol{\vartheta}_{k,% \boldsymbol{h}}$ and $\boldsymbol{\Delta}_{k+1}=\boldsymbol{L}_{h}-\boldsymbol{\vartheta}_{k+1,% \boldsymbol{h}}$ . We have

$\displaystyle\boldsymbol{\Delta}_{k+1}$	$\displaystyle=\boldsymbol{\Delta}_{k}+h\boldsymbol{Y}_{k,\boldsymbol{h}}-\int_% {0}^{h}\nabla f(\boldsymbol{L}_{t})\,dt$	(51)
	$\displaystyle=\boldsymbol{\Delta}_{k}-h\big{(}\underbrace{\nabla f(\boldsymbol% {\vartheta}_{k,\boldsymbol{h}}+\boldsymbol{\Delta}_{k})-\nabla f(\boldsymbol{% \vartheta}_{k,\boldsymbol{h}})}_{:=\boldsymbol{U}}\big{)}+h\boldsymbol{\zeta}_% {k}$	(52)
	$\displaystyle\qquad-\underbrace{\int_{0}^{h}\big{(}\nabla f(\boldsymbol{L}_{t}% )-\nabla f(\boldsymbol{L}_{0})\big{)}\,dt}_{:=\boldsymbol{V}}.$	(53)

Using the equalities $\mathbf{E}[\boldsymbol{\zeta}_{k}|\boldsymbol{\Delta}_{k},\boldsymbol{U},% \boldsymbol{V}]=\mathbf{E}[\boldsymbol{\zeta}_{k}|\boldsymbol{\vartheta}_{k,% \boldsymbol{h}},\boldsymbol{L}_{0},\boldsymbol{W}\!]=\mathbf{E}[\boldsymbol{% \zeta}_{k}|\boldsymbol{\vartheta}_{k,\boldsymbol{h}},\boldsymbol{L}_{0}]=% \mathbf{E}[\boldsymbol{\zeta}_{k}|\boldsymbol{\vartheta}_{k,\boldsymbol{h}}]$ , we get

$\displaystyle\|\boldsymbol{\Delta}_{k+1}\|_{L_{2}}^{2}=\big\|\boldsymbol{\Delta}_{k}-h\boldsymbol{U}-\boldsymbol{V}+h\mathbf{E}[\boldsymbol{\zeta}_{k}|\boldsymbol{\vartheta}_{k,\boldsymbol{h}}]\big\|_{L_{2}}^{2}+h^{2}\big\|\boldsymbol{\zeta}_{k}-\mathbf{E}[\boldsymbol{\zeta}_{k}|\boldsymbol{\vartheta}_{k,\boldsymbol{h}}]\big\|_{L_{2}}^{2}$ (54)
$\displaystyle\qquad\qquad\leq\big\|\boldsymbol{\Delta}_{k}-h\boldsymbol{U}-\boldsymbol{V}+h\mathbf{E}[\boldsymbol{\zeta}_{k}|\boldsymbol{\vartheta}_{k,\boldsymbol{h}}]\big\|_{L_{2}}^{2}+\sigma^{2}h^{2}p$ (55)
$\displaystyle\qquad\qquad\leq\big\{\|\boldsymbol{\Delta}_{k}-h\boldsymbol{U}\|_{L_{2}}+h\delta\sqrt{p}+\|\boldsymbol{V}\|_{L_{2}}\big\}^{2}+\sigma^{2}h^{2}p.$ (56)

We need now three technical lemmas. Lemma 2 and Lemma 3 are borrowed from ¹³, whereas Lemma 4 is an improved version of ¹³ Lemma 3. For the sake of self-containedness, we provide proofs of these lemmas in Section 7.7.

Lemma 2.

Let $f$ be $m$ -strongly convex and the gradient of $f$ be Lipschitz with constant $M$ . If $h<2/M$ , then the mapping $(\mathbf{I}_{p}-h\nabla f)$ is a contraction in the sense that

\big{\|}\boldsymbol{x}-\boldsymbol{y}-h\big{(}\nabla f(\boldsymbol{x})-\nabla f% (\boldsymbol{y})\big{)}\big{\|}_{2}\leq\big{\{}(1-mh)\vee(Mh-1)\big{\}}\|% \boldsymbol{x}-\boldsymbol{y}\|_{2},

(57)

for all $\boldsymbol{x},\boldsymbol{y}\in\mathbb{R}^{p}$ . In particular, using notations in (53), it holds that $\|\boldsymbol{\Delta}_{k}-h\boldsymbol{U}\|_{2}\leq\varrho\|\boldsymbol{\Delta% }_{k}\|_{2}$ .

Lemma 3.

If the function $f$ is continuously differentiable and the gradient of $f$ is Lipschitz with constant $M$ , then $\int_{\mathbb{R}^{p}}\|\nabla f(\boldsymbol{x})\|_{2}^{2}\,\pi(\boldsymbol{x})% \,d\boldsymbol{x}\leq Mp$ .

Lemma 4.

If the function $f$ is continuously differentiable and its gradient is Lipschitz with constant $M$ , $\boldsymbol{L}$ is the Langevin diffusion (48) and $\boldsymbol{V}(a)=\int_{a}^{a+h}\big(\nabla f(\boldsymbol{L}_{t})-\nabla f(\boldsymbol{L}_{a})\big)\,dt$ for some $a\geq 0$ , then

\displaystyle\|\boldsymbol{V}(a)\|_{L_{2}}

\displaystyle\leq\frac{1}{2}\big{(}h^{4}M^{3}p\big{)}^{1/2}+\frac{2}{3}(2h^{3}% p)^{1/2}M.

(58)

Using Lemma 2 and Lemma 4 above, as well as the inequality $W_{2}(\nu_{k+1},\pi)^{2}\leq\mathbf{E}[\|\boldsymbol{\Delta}_{k+1}\|_{2}^{2}]$ , we get the recursion

$\displaystyle W_{2}(\nu_{k+1},\pi)^{2}$	$\displaystyle\leq\big{\{}\varrho W_{2}(\nu_{k},\pi)+(\nicefrac{{1}}{{2}})\big{% (}h^{4}M^{3}p\big{)}^{1/2}+(\nicefrac{{2}}{{3}})(2h^{3}p)^{1/2}M+h\delta\sqrt{% p}\big{\}}^{2}+\sigma^{2}h^{2}p$	(59)
	$\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\big{\{}\varrho W_{2}(\nu_{k}% ,\pi)+(\nicefrac{{1}}{{2}})\big{(}2h^{3}M^{2}p\big{)}^{1/2}+(\nicefrac{{2}}{{3% }})(2h^{3}p)^{1/2}M+h\delta\sqrt{p}\big{\}}^{2}+\sigma^{2}h^{2}p$	(60)
	$\displaystyle\stackrel{{\scriptstyle(b)}}{{\leq}}\big{\{}\varrho W_{2}(\nu_{k}% ,\pi)+\alpha M\big{(}h^{3}p\big{)}^{1/2}+h\delta\sqrt{p}\big{\}}^{2}+\sigma^{2% }h^{2}p,$	(61)

where in $(a)$ we have used the condition $h\leq 2/M$ whereas in $(b)$ we have put $\alpha=7\sqrt{2}/6\leq 1.65$ . ∎

7.1 Proof of Theorem 1

Using Proposition 2 with $\sigma=\delta=0$ , we get $W_{2}(\nu_{k+1},\pi)\leq\varrho\,W_{2}(\nu_{k},\pi)+\|\boldsymbol{V}\|_{L_{2}}$ for all $k\in\mathbb{N}$ . In view of Lemma 4, this yields

\displaystyle W_{2}(\nu_{k+1},\pi)\leq\varrho\,W_{2}(\nu_{k},\pi)+\alpha M(h^{% 3}p)^{1/2}.

(62)

Using this inequality repeatedly for $k+1,k,k-1,\ldots,1$ , we get

	$\displaystyle W_{2}(\nu_{k+1},\pi)$	$\displaystyle\leq\varrho^{k+1}\,W_{2}(\nu_{0},\pi)+\alpha M(h^{3}p)^{1/2}(1+% \varrho+\ldots+\varrho^{k})$		(63)
		$\displaystyle\leq\varrho^{k+1}\,W_{2}(\nu_{0},\pi)+\alpha M(h^{3}p)^{1/2}(1-% \varrho)^{-1}.$		(64)

This completes the proof.

7.2 Proof of Theorem 2

Recall that $\alpha=7\sqrt{2}/6\leq 1.65$ . Theorem 1 implies that using the step-size $h_{k}=2/(M+m)$ for $k=1,\ldots,K_{1}$ , we get

	$\displaystyle W_{2}(\nu_{K_{1}},\pi)$	$\displaystyle\leq\Big{(}1+\frac{2m}{M-m}\Big{)}^{-K_{1}}W_{2}(\nu_{0},\pi)+% \frac{\alpha M}{m}\Big{(}\frac{2p}{m+M}\Big{)}^{1/2}$		(65)
		$\displaystyle\leq\frac{3.5M}{m}\Big{(}\frac{p}{M+m}\Big{)}^{1/2}.$		(66)

Starting from this iteration $K_{1}$ , we use a decreasing step-size

\displaystyle h_{k+1}=\frac{2}{M+m+(\nicefrac{{2}}{{3}})m(k-K_{1})}.

(67)

Let us show by induction over $k$ that

\displaystyle W_{2}(\nu_{k},\pi)

\displaystyle\leq\frac{3.5M}{m}\bigg{(}\frac{p}{M+m+(\nicefrac{{2}}{{3}})m(k-K% _{1})}\bigg{)}^{1/2},\qquad\forall\,k\geq K_{1}.

(68)

For $k=K_{1}$ , this inequality is true in view of (66). Assume now that (68) is true for some $k$ . For $k+1$ , we have

$\displaystyle W_{2}(\nu_{k+1},\pi)$	$\displaystyle\leq(1-mh_{k+1})W_{2}(\nu_{k},\pi)+\alpha M\sqrt{p}\;h_{k+1}^{3/2}$	(69)
	$\displaystyle\leq(1-mh_{k+1})\frac{3.5M\sqrt{p}\,(h_{k+1}/2)^{1/2}}{m}+\alpha M% \sqrt{p}\;h_{k+1}^{3/2}$	(70)
	$\displaystyle\leq(1-\frac{1}{3}mh_{k+1})\frac{3.5M\sqrt{p}\,(h_{k+1}/2)^{1/2}}% {m}.$	(71)

One can check that

$\displaystyle(1-\frac{1}{3}mh_{k+1})(h_{k+1}/2)^{1/2}$	$\displaystyle=\frac{\sqrt{3}\,[m+3M+2m(k-K_{1})]}{[3m+3M+2m(k-K_{1})]^{3/2}}$	(72)
	$\displaystyle\leq\frac{\sqrt{3}\,[m+3M+2m(k-K_{1})]^{1/2}}{3m+3M+2m(k-K_{1})}$	(73)
	$\displaystyle\leq\frac{\sqrt{3}}{[3m+3M+2m(k+1-K_{1})]^{1/2}}.$	(74)

This completes the proof of the theorem.
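The induction step can also be checked numerically. The Python sketch below starts from the bound (66) at iteration $K_{1}$ , applies the recursion (69) with equality using the step-sizes (67), and verifies that the claimed bound (68) holds at every subsequent iteration. The function names and the values of $m$ , $M$ and $p$ are assumptions made for illustration.

```python
import math

ALPHA = 7 * math.sqrt(2) / 6   # constant alpha of Proposition 2

def step_size(j, m, M):
    """Step-size (67), where j = k - K_1 >= 0."""
    return 2.0 / (M + m + (2.0 / 3.0) * m * j)

def claimed_bound(j, m, M, p):
    """Right-hand side of (68) at iteration k = K_1 + j."""
    return 3.5 * M / m * math.sqrt(p / (M + m + (2.0 / 3.0) * m * j))

def check_induction(m, M, p, n_steps):
    """Iterate (69) with equality and test (68) after each step."""
    w = claimed_bound(0, m, M, p)          # bound (66) at iteration K_1
    ok = True
    for j in range(n_steps):
        h = step_size(j, m, M)
        w = (1 - m * h) * w + ALPHA * M * math.sqrt(p) * h ** 1.5
        # small relative slack to absorb floating-point rounding
        ok = ok and w <= claimed_bound(j + 1, m, M, p) * (1 + 1e-12)
    return ok

induction_ok = check_induction(m=1.0, M=4.0, p=10, n_steps=1000)
```

The check relies on the identity $\alpha\sqrt{2}/3.5=2/3$ , which is what turns (70) into (71).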

7.3 Proof of Theorem 3

Let us denote by $\nu_{k}(\cdot|\boldsymbol{x})$ the conditional distribution of $\vartheta^{\rm MLMC}_{k}$ given $\boldsymbol{\eta}=\boldsymbol{x}$ . In view of Theorem 2, we have

W_{2}\big{(}\nu_{k}(\cdot|\boldsymbol{x}),\pi_{1}(\cdot|\boldsymbol{x})\big{)}% \leq\frac{3.5M\sqrt{p}}{m\sqrt{M+m+(\nicefrac{{2}}{{3}})m(k-K_{1})}},\qquad% \forall\boldsymbol{x}\in H.

(75)

This readily yields

\int_{H}W_{2}\big{(}\nu_{k}(\cdot|\boldsymbol{x}),\pi_{1}(\cdot|\boldsymbol{x}% )\big{)}\,\pi_{0}(d\boldsymbol{x})\leq\frac{3.5M\sqrt{p}}{m\sqrt{M+m+(% \nicefrac{{2}}{{3}})m(k-K_{1})}}.

(76)

The last step is to apply the convexity of the Wasserstein distance, which means that for any probability measure $\pi_{0}$ , we have

\int_{H}W_{2}\big{(}\nu_{k}(\cdot|\boldsymbol{x}),\pi_{1}(\cdot|\boldsymbol{x}% )\big{)}\,\pi_{0}(d\boldsymbol{x})\geq W_{2}\bigg{(}\int_{H}\nu_{k}(\cdot|% \boldsymbol{x})\,\pi_{0}(d\boldsymbol{x}),\int_{H}\pi_{1}(\cdot|\boldsymbol{x}% )\,\pi_{0}(d\boldsymbol{x})\bigg{)}=W_{2}(\nu_{k},\pi).

7.4 Proof of Theorem 4

As explained in Section 3, the main new ingredient of the proof is Lemma 1, that has to be combined with Proposition 2. We postpone the proof of Lemma 1 to Section 7.7 and do it in a more general form (see Lemma 7).

In view of Proposition 2, we have

\displaystyle W_{2}(\nu_{k+1},\pi)^{2}

\displaystyle\leq\big{\{}(1-mh)W_{2}(\nu_{k},\pi)+\alpha M(h^{3}p)^{1/2}+h% \delta\sqrt{p}\big{\}}^{2}+\sigma^{2}h^{2}p.

(77)

We apply now Lemma 1 with $A=mh$ , $B=\sigma h\sqrt{p}$ and $C=\alpha M(h^{3}p)^{1/2}+h\delta\sqrt{p}$ , which implies that $W_{2}(\nu_{k},\pi)$ is less than or equal to

\displaystyle(1-mh)^{k}W_{2}(\nu_{0},\pi)+\frac{\alpha M(hp)^{1/2}+\delta\sqrt% {p}}{m}+\frac{\sigma^{2}h\sqrt{p}}{\alpha Mh^{1/2}+\delta+(mh)^{1/2}\,\sigma}.

(78)

This completes the proof of the theorem.
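As a numerical sanity check of the step from (77) to (78), the following sketch iterates the worst case of the recursion (equality in (77)) and compares it with the right-hand side of (78). All numerical values of $m$, $M$, $p$, $h$, $\alpha$, $\delta$, $\sigma$ and of the starting error are hypothetical and chosen only for illustration; the helper names are ours.

```python
import math

def lemma1_bound(k, x0, A, B, C):
    """Right-hand side of (78): (1-A)^k x0 + C/A + B^2 / (C + sqrt(A) B)."""
    return (1 - A) ** k * x0 + C / A + B ** 2 / (C + math.sqrt(A) * B)

def iterate_recursion(n_steps, x0, A, B, C):
    """Worst case of (77): x_{k+1}^2 = ((1-A) x_k + C)^2 + B^2."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(math.sqrt(((1 - A) * xs[-1] + C) ** 2 + B ** 2))
    return xs

# Hypothetical constants (assumptions, not values from the paper).
m, M, p, h = 1.0, 4.0, 10, 0.05
alpha, delta, sigma = 1.65, 0.1, 0.5
x0 = 3.0

A = m * h
B = sigma * h * math.sqrt(p)
C = alpha * M * math.sqrt(h ** 3 * p) + h * delta * math.sqrt(p)

xs = iterate_recursion(200, x0, A, B, C)
gaps = [lemma1_bound(k, x0, A, B, C) - x for k, x in enumerate(xs)]
print("smallest gap between bound and iterate:", min(gaps))
```

A nonnegative smallest gap is what the lemma predicts; the check is only illustrative and does not replace the proof.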

7.5 Proof of Theorem 5

Using the same construction and the same definitions as in the proof of Proposition 2, for $\boldsymbol{\Delta}_{k}=\boldsymbol{L}_{0}-\boldsymbol{\vartheta}_{k,% \boldsymbol{h}}$ , we have

$\displaystyle\boldsymbol{\Delta}_{k+1}-\boldsymbol{\Delta}_{k}$	$\displaystyle=h\boldsymbol{Y}_{k,\boldsymbol{h}}-\int_{I_{k}}\nabla f(% \boldsymbol{L}_{t})\,dt$	(79)
	$\displaystyle=-h\big{(}\underbrace{\nabla f(\boldsymbol{\vartheta}_{k,% \boldsymbol{h}}+\boldsymbol{\Delta}_{k})-\nabla f(\boldsymbol{\vartheta}_{k,% \boldsymbol{h}})}_{:=\boldsymbol{U}}\big{)}$	(80)
	$\displaystyle\qquad-\sqrt{2}\,\underbrace{\int_{0}^{h}\int_{0}^{t}\nabla^{2}f(% \boldsymbol{L}_{s})d\boldsymbol{W}\!_{s}\,dt}_{:=\boldsymbol{S}}+h\boldsymbol{% \zeta}_{k}$	(81)
	$\displaystyle\qquad-\underbrace{\int_{0}^{h}\big{(}\nabla f(\boldsymbol{L}_{t}% )-\nabla f(\boldsymbol{L}_{0})-\sqrt{2}\,\int_{0}^{t}\nabla^{2}f(\boldsymbol{L% }_{s})d\boldsymbol{W}\!_{s}\big{)}\,dt}_{:=\bar{\boldsymbol{V}}}.$	(82)

Using the following equalities of conditional expectations $\mathbf{E}[\boldsymbol{\zeta}_{k}|\boldsymbol{\Delta}_{k},\boldsymbol{U},\bar{% \boldsymbol{V}}]=\mathbf{E}[\boldsymbol{\zeta}_{k}|\boldsymbol{\vartheta}_{k,% \boldsymbol{h}},\boldsymbol{L}_{0},\boldsymbol{W}\!]=\mathbf{E}[\boldsymbol{% \zeta}_{k}|\boldsymbol{\vartheta}_{k,\boldsymbol{h}},\boldsymbol{L}_{0}]=% \mathbf{E}[\boldsymbol{\zeta}_{k}|\boldsymbol{\vartheta}_{k,\boldsymbol{h}}]$ and $\mathbf{E}[\boldsymbol{S}_{h}|\boldsymbol{\vartheta}_{k,\boldsymbol{h}},% \boldsymbol{L}_{0}]=0$ , we get

	$\displaystyle\\|\boldsymbol{\Delta}_{k+1}\\|_{L_{2}}^{2}$	$\displaystyle\leq\big{\\|}\boldsymbol{\Delta}_{k}-h\boldsymbol{U}-\bar{% \boldsymbol{V}}-\sqrt{2}\boldsymbol{S}_{h}+h\mathbf{E}[\boldsymbol{\zeta}_{k}\|% \boldsymbol{\vartheta}_{k,\boldsymbol{h}}]\big{\\|}_{L_{2}}^{2}+\sigma^{2}h^{2}p$		(83)
		$\displaystyle\leq\big{\{}\big{(}\\|\boldsymbol{\Delta}_{k}-h\boldsymbol{U}\\|_{L% _{2}}^{2}+2\\|\boldsymbol{S}_{h}\\|_{L_{2}}^{2}\big{)}^{1/2}+h\delta\sqrt{p}+\\|% \bar{\boldsymbol{V}}\\|_{L_{2}}\big{\}}^{2}+\sigma^{2}h^{2}p.$		(84)

In addition, we have

	$\displaystyle\\|\boldsymbol{S}_{h}\\|_{L_{2}}^{2}$	$\displaystyle=\Big{\\|}\int_{0}^{h}(h-s)\nabla^{2}f(\boldsymbol{L}_{s})\,d% \boldsymbol{W}\!_{s}\Big{\\|}_{L_{2}}^{2}$		(85)
		$\displaystyle=\int_{0}^{h}(h-s)^{2}\mathbf{E}[\\|\nabla^{2}f(\boldsymbol{L}_{s}% )\\|_{F}^{2}]\,ds\leq(\nicefrac{{1}}{{3}})\,M^{2}h^{3}p.$		(86)

Setting $x_{k}=\|\boldsymbol{\Delta}_{k}\|_{L_{2}}=W_{2}(\nu_{k},\pi)$ and using Lemma 2, this yields

\displaystyle x_{k+1}^{2}

\displaystyle\leq\big{\{}\big{(}(1-mh)^{2}x_{k}^{2}+(\nicefrac{{2}}{{3}})\,M^{% 2}h^{3}p\big{)}^{1/2}+h\delta\sqrt{p}+\|\bar{\boldsymbol{V}}\|_{L_{2}}\big{\}}% ^{2}+\sigma^{2}h^{2}p.

(87)

Let us define $A=mh$, $F=(\nicefrac{2}{3})\,M^{2}h^{3}p$, $G=\sigma^{2}h^{2}p$ and

C=h\delta\sqrt{p}+0.5M_{2}h^{2}p+0.5M^{3/2}h^{2}\sqrt{p}.

In view of Lemma 6 in Section 7.7, we have $h\delta\sqrt{p}+\|\bar{\boldsymbol{V}}\|_{L_{2}}\leq C$.

Then

x_{k+1}^{2}\leq\big{\{}\big{(}(1-A)^{2}x_{k}^{2}+F\big{)}^{1/2}+C\big{\}}^{2}+G.

One can deduce from this inequality that $x_{k+1}^{2}\leq\big{(}(1-A)x_{k}+C\big{)}^{2}+F+G+2C\sqrt{F}$ . Therefore, using (225) of Lemma 7 below, we get

	$\displaystyle x_{k}$	$\displaystyle\leq(1-A)^{k}x_{0}+\frac{C}{A}+\frac{F+G+2C\sqrt{F}}{C+\big{(}A(F% +G+2C\sqrt{F})\big{)}^{1/2}}$		(88)
		$\displaystyle\leq(1-A)^{k}x_{0}+(C/A)+2(F/A)^{1/2}+\frac{G}{C+\sqrt{AG}}.$		(89)

Replacing $A,C,F$ and $G$ by their respective expressions, we get the claim of the theorem.

7.6 Proof of Theorem 6

To ease notation, throughout this proof, we will write $\nu_{k}$ and $\nu_{k}^{\prime}$ instead of $\nu_{k}^{\rm LMCO}$ and $\nu_{k}^{\rm LMCO^{\prime}}$ , respectively.

Let $\boldsymbol{D}_{0}\sim\nu_{k}$ and $\boldsymbol{L}_{0}\sim\pi$ be two random variables such that $\|\boldsymbol{D}_{0}-\boldsymbol{L}_{0}\|_{L_{2}}=W_{2}(\nu_{k},\pi)$. Let $\boldsymbol{W}$ be a $p$-dimensional Brownian motion independent of $(\boldsymbol{D}_{0},\boldsymbol{L}_{0})$. We define $\boldsymbol{L}$ to be the Langevin diffusion process (48) driven by $\boldsymbol{W}$ and starting at $\boldsymbol{L}_{0}$, whereas $\boldsymbol{D}$ is the process starting at $\boldsymbol{D}_{0}$ and satisfying the stochastic differential equation

d\boldsymbol{D}_{t}=-[\nabla f(\boldsymbol{D}_{0})+\nabla^{2}f(\boldsymbol{D}_% {0})(\boldsymbol{D}_{t}-\boldsymbol{D}_{0})]\,dt+\sqrt{2}\,d\boldsymbol{W}\!_{% t},\quad t\geq 0.

(90)

This is an Ornstein-Uhlenbeck process. It can be expressed explicitly as a function of $\boldsymbol{D}_{0}$ and $\boldsymbol{W}$. The corresponding expression implies that $\boldsymbol{D}_{h}\sim\nu_{k+1}$ and, hence, $W_{2}(\nu_{k+1},\pi)\leq\|\boldsymbol{D}_{h}-\boldsymbol{L}_{h}\|_{L_{2}}$.

An important ingredient of our proof is the following version of the Gronwall lemma, the proof of which is postponed to Section 7.7.

Lemma 5.

Let $\boldsymbol{\alpha}:[0,T]\times\Omega\to\mathbb{R}^{p}$ be a continuous semi-martingale and $\mathbf{H}:[0,T]\times\Omega\to\mathbb{R}^{p\times p}$ be a random process with continuous paths in the space of all symmetric $p\times p$ matrices such that $\mathbf{H}_{s}\mathbf{H}_{t}=\mathbf{H}_{t}\mathbf{H}_{s}$ for every $s,t\in[0,T]$ . If $\boldsymbol{x}:[0,T]\times\Omega\to\mathbb{R}^{p}$ is a semi-martingale satisfying the identity

\displaystyle\boldsymbol{x}_{t}=\boldsymbol{\alpha}_{t}-\int_{0}^{t}\mathbf{H}% _{s}\boldsymbol{x}_{s}\,ds,\qquad\forall t\in[0,T],

(91)

then, for every $t\in[0,T]$ ,

\displaystyle\boldsymbol{x}_{t}=\exp\Big{\{}-\int_{0}^{t}\mathbf{H}_{s}\,ds% \Big{\}}\boldsymbol{\alpha}_{0}+\int_{0}^{t}\exp\Big{\{}-\int_{s}^{t}\mathbf{H% }_{u}\,du\Big{\}}d\boldsymbol{\alpha}_{s}.

(92)
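To illustrate formula (92) in the simplest setting, the sketch below takes a constant diagonal matrix $\mathbf{H}$ (so that the commutation condition holds trivially) and a deterministic path $\boldsymbol{\alpha}_{t}=\boldsymbol{\alpha}_{0}+t\boldsymbol{v}$. In that case (92) has a closed form, and we check numerically that it satisfies identity (91). The matrix, the vectors and the grid size are hypothetical choices, and the stochastic case is not covered by this check.

```python
import numpy as np

d = np.array([0.5, 2.0])      # H = diag(d): constant, symmetric, commuting
a0 = np.array([0.2, 0.1])     # alpha_0 (hypothetical)
v = np.array([1.0, -3.0])     # alpha_t = a0 + t * v (hypothetical)
ts = np.linspace(0.0, 1.0, 20001)[:, None]

# Formula (92) for this alpha: x_t = e^{-tH} a0 + int_0^t e^{-(t-s)H} v ds.
decay = np.exp(-ts * d)
x = decay * a0 + (1.0 - decay) / d * v

# Right-hand side of (91), with the time integral computed by the trapezoid rule.
Hx = x * d
steps = 0.5 * (Hx[1:] + Hx[:-1]) * np.diff(ts, axis=0)
integral = np.vstack([np.zeros(2), np.cumsum(steps, axis=0)])
residual = np.max(np.abs(x - (a0 + ts * v - integral)))
print("max residual of identity (91):", residual)
```

The residual is limited only by the quadrature error of the trapezoid rule.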

We denote $\boldsymbol{X}_{t}=\boldsymbol{L}_{t}-\boldsymbol{L}_{0}-(\boldsymbol{D}_{t}-% \boldsymbol{D}_{0})$ , where $\boldsymbol{D}_{t}$ is the random process defined in (90) and $\boldsymbol{L}_{t}$ is the Langevin diffusion driven by the same Wiener process $\boldsymbol{W}\!$ and with initial condition $\boldsymbol{L}_{0}\sim\pi$ . It is clear that

	$\displaystyle\boldsymbol{X}_{t}$	$\displaystyle=-\int_{0}^{t}\nabla f(\boldsymbol{L}_{s})\,ds+\int_{0}^{t}[% \nabla f(\boldsymbol{D}_{0})+\nabla^{2}f(\boldsymbol{D}_{0})(\boldsymbol{D}_{s% }-\boldsymbol{D}_{0})]\,ds$		(93)
		$\displaystyle=-\int_{0}^{t}\big{\{}\nabla f(\boldsymbol{L}_{s})-\nabla f(% \boldsymbol{D}_{0})-\nabla^{2}f(\boldsymbol{D}_{0})(\boldsymbol{L}_{s}-% \boldsymbol{L}_{0})\big{\}}\,ds-\int_{0}^{t}\nabla^{2}f(\boldsymbol{D}_{0})% \boldsymbol{X}_{s}\,ds.$		(94)

Using Lemma 5, we get

$\displaystyle\boldsymbol{X}_{t}$	$\displaystyle=-\int_{0}^{t}e^{-s\nabla^{2}f(\boldsymbol{D}_{0})}\big{\{}\nabla f% (\boldsymbol{L}_{s})-\nabla f(\boldsymbol{D}_{0})-\nabla^{2}f(\boldsymbol{D}_{% 0})(\boldsymbol{L}_{s}-\boldsymbol{L}_{0})\big{\}}\,ds$	(95)
	$\displaystyle=\int_{0}^{t}e^{-s\nabla^{2}f(\boldsymbol{D}_{0})}\,ds[\nabla f(% \boldsymbol{D}_{0})-\nabla f(\boldsymbol{L}_{0})]$	(96)
	$\displaystyle\qquad-\int_{0}^{t}e^{-s\nabla^{2}f(\boldsymbol{D}_{0})}\big{\{}% \nabla f(\boldsymbol{L}_{s})-\nabla f(\boldsymbol{L}_{0})-\nabla^{2}f(% \boldsymbol{L}_{0})(\boldsymbol{L}_{s}-\boldsymbol{L}_{0})\big{\}}\,ds$	(97)
	$\displaystyle\qquad-\int_{0}^{t}e^{-s\nabla^{2}f(\boldsymbol{D}_{0})}[\nabla^{% 2}f(\boldsymbol{D}_{0})-\nabla^{2}f(\boldsymbol{L}_{0})]\int_{0}^{s}\nabla f(% \boldsymbol{L}_{u})\,du\,ds$	(98)
	$\displaystyle\qquad+\sqrt{2}\int_{0}^{t}e^{-s\nabla^{2}f(\boldsymbol{D}_{0})}[% \nabla^{2}f(\boldsymbol{D}_{0})-\nabla^{2}f(\boldsymbol{L}_{0})]\boldsymbol{W}% \!_{s}\,ds.$	(99)

Let us set $\boldsymbol{\Delta}_{t}=\boldsymbol{L}_{t}-\boldsymbol{D}_{t}$ . We have $\boldsymbol{X}_{t}=\boldsymbol{\Delta}_{t}-\boldsymbol{\Delta}_{0}=A_{t}-B_{t}% -C_{t}+S_{t}$ , where $A_{t}$ , $B_{t}$ , $C_{t}$ and $S_{t}$ stand for the four integrals in (99). We now evaluate these terms separately. For the first one, using the notation $\mathbf{H}_{0}=\nabla^{2}f(\boldsymbol{D}_{0})$ and the identity $\nabla f(\boldsymbol{L}_{0})-\nabla f(\boldsymbol{D}_{0})=\int_{0}^{1}\nabla^{% 2}f(\boldsymbol{D}_{0}+x\boldsymbol{\Delta}_{0})\,dx\boldsymbol{\Delta}_{0}$ , we get

$\displaystyle\\|\boldsymbol{\Delta}_{0}+A_{t}\\|_{2}$	$\displaystyle\leq\\|\boldsymbol{\Delta}_{0}-t\big{(}\nabla f(\boldsymbol{L}_{0}% )-\nabla f(\boldsymbol{D}_{0})\big{)}\\|_{2}$	(100)
	$\displaystyle\qquad+\int_{0}^{t}\\|\mathbf{I}-e^{-s\mathbf{H}_{0}}\\|\,ds\big{\\|% }\nabla f(\boldsymbol{L}_{0})-\nabla f(\boldsymbol{D}_{0})\big{\\|}_{2}$	(101)
	$\displaystyle\leq(1-mt+0.5M^{2}t^{2})\\|\boldsymbol{\Delta}_{0}\\|_{2}.$	(102)

For the term $B_{t}$ with $t\leq h\leq m/M^{2}\leq 1/M$ , we can apply (195) to infer that

\displaystyle\|B_{t}\|_{L_{2}}\leq 0.88M_{2}t^{2}(p^{2}+2p)^{1/2}.

(103)

As for $C_{t}$ , in view of the inequality $\|\nabla^{2}f(\boldsymbol{L}_{0})-\nabla^{2}f(\boldsymbol{D}_{0})\|\leq M_{2}% \|\boldsymbol{\Delta}_{0}\|_{2}\wedge M\leq\sqrt{MM_{2}\|\boldsymbol{\Delta}_{% 0}\|_{2}}$ , we have

	$\displaystyle\\|C_{t}\\|_{2}$	$\displaystyle\leq\sqrt{MM_{2}\\|\boldsymbol{\Delta}_{0}\\|_{2}}\int_{0}^{t}\int_% {0}^{s}\\|\nabla f(\boldsymbol{L}_{u})\\|_{2}\,du\,ds$		(104)
		$\displaystyle\leq\mu\\|\boldsymbol{\Delta}_{0}\\|_{2}+(4\mu)^{-1}MM_{2}\bigg{(}% \int_{0}^{t}(t-u)\\|\nabla f(\boldsymbol{L}_{u})\\|_{2}\,du\bigg{)}^{2}.$		(105)

On the other hand, the fact that $\mathbf{E}[\|\nabla f(\boldsymbol{L}_{u})\|_{2}^{4}]\leq M^{2}(p^{2}+2p)$ yields

\displaystyle\bigg{(}\int_{0}^{t}(t-u)(\mathbf{E}[\|\nabla f(\boldsymbol{L}_{u% })\|_{2}^{4}])^{1/4}\,du\bigg{)}^{2}\leq\frac{Mt^{4}(p^{2}+2p)^{1/2}}{4}.

(106)

This implies the inequality

\displaystyle\|C_{t}\|_{L_{2}}

\displaystyle\leq\mu W_{2}(\nu_{k},\pi)+(16\mu)^{-1}M^{2}M_{2}t^{4}(p+1).

(107)

Finally, using the integration by parts formula for semi-martingales, one can easily write $S_{t}$ as a stochastic integral with respect to $\boldsymbol{W}\!$ and derive from that representation the inequality

	$\displaystyle\|S_{t}\|_{L_{2}}^{2}$	$\displaystyle\leq 2\mathbf{E}\bigg[\int_{0}^{t}\bigg\|\int_{u}^{t}e^{-s\mathbf{H}_{0}}\,ds\big(\nabla^{2}f(\boldsymbol{L}_{0})-\nabla^{2}f(\boldsymbol{D}_{0})\big)\bigg\|_{F}^{2}\,du\bigg]$		(108)
		$\displaystyle\leq 2p\,\mathbf{E}[(M_{2}\|\boldsymbol{\Delta}_{0}\|_{2}\wedge M)^{2}]\int_{0}^{t}(t-u)^{2}\,du\leq(\nicefrac{2}{3})M_{2}Mpt^{3}\|\boldsymbol{\Delta}_{0}\|_{L_{2}}.$		(109)

Putting all these pieces together, taking the expectation, using the Minkowski inequality, the equality $\mathbf{E}[(\boldsymbol{\Delta}_{0}+A_{h})^{\top}S_{h}]=0$ and the inequality $\sqrt{a^{2}+b}\leq a+b/(2a)$ , we get

$\displaystyle\|\boldsymbol{\Delta}_{h}\|_{L_{2}}$	$\displaystyle=\|\boldsymbol{\Delta}_{0}+A_{h}-B_{h}-C_{h}+S_{h}\|_{L_{2}}$	(110)
	$\displaystyle\leq\big(\|\boldsymbol{\Delta}_{0}+A_{h}\|_{L_{2}}^{2}+\|S_{h}\|_{L_{2}}^{2}\big)^{1/2}+\|B_{h}\|_{L_{2}}+\|C_{h}\|_{L_{2}}$	(111)
	$\displaystyle\leq\big(1-mh+0.5M^{2}h^{2}+\mu\big)\|\boldsymbol{\Delta}_{0}\|_{L_{2}}+\frac{M_{2}Mph^{3}}{3(1-mh+0.5M^{2}h^{2})}$	(112)
	$\displaystyle\qquad+0.88M_{2}h^{2}(p^{2}+2p)^{1/2}+\frac{M^{2}M_{2}h^{4}}{16\mu}(p+1).$	(113)

Let $\mu$ be any real number smaller than $0.5h(m-0.5M^{2}h)$ ; Eq. (113) and the inequality $p^{2}+2p\leq(p+1)^{2}$ yield

	$\displaystyle W_{2}(\nu_{k+1},\pi)$	$\displaystyle\leq(1-\mu)W_{2}(\nu_{k},\pi)+\frac{M_{2}Mph^{3}}{3(1-2\mu)}+0.88% M_{2}h^{2}(p+1)$		(114)
		$\displaystyle\qquad+\frac{M^{2}M_{2}h^{4}}{16\mu}(p+1).$		(115)

Since $h\leq m/M^{2}$ , we can choose $\mu=0.25mh$ so that $1-2\mu=1-0.5mh\geq 0.5$ and

$\displaystyle W_{2}(\nu_{k+1},\pi)$	$\displaystyle\leq(1-0.25mh)W_{2}(\nu_{k},\pi)+\frac{2M_{2}Mph^{3}}{3}+0.88M_{2% }h^{2}(p+1)$	(116)
	$\displaystyle\qquad+\frac{M^{2}M_{2}h^{3}}{4m}(p+1)$	(117)
	$\displaystyle\leq(1-0.25mh)W_{2}(\nu_{k},\pi)+1.8M_{2}h^{2}(p+1).$	(118)

This recursion implies the inequality

	$\displaystyle W_{2}(\nu_{k},\pi)$	$\displaystyle\leq(1-0.25mh)^{k}W_{2}(\nu_{0},\pi)+\frac{1.8M_{2}h(p+1)}{0.25m}$		(119)
		$\displaystyle=(1-0.25mh)^{k}W_{2}(\nu_{0},\pi)+\frac{7.2M_{2}h(p+1)}{m}.$		(120)

This completes the proof of claim (38) of the theorem.
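The arithmetic behind (116)-(120) can be checked on a concrete instance. The sketch below evaluates the one-step additive error of (116)-(117), compares it with the simplified constant of (118), and unrolls the recursion against the closed-form bound (120). The values of $m$, $M$, $M_{2}$, $p$ and the initial error are hypothetical; the step-size is chosen to satisfy $h\leq m/M^{2}$.

```python
# Hypothetical problem constants (assumptions), with h <= m / M^2.
m, M, M2, p = 1.0, 3.0, 2.0, 20
h = 0.9 * m / M ** 2
rho = 1 - 0.25 * m * h                      # contraction factor for mu = 0.25 m h

# One-step additive error in (116)-(117) and its simplification in (118).
err = (2 * M2 * M * p * h ** 3 / 3
       + 0.88 * M2 * h ** 2 * (p + 1)
       + M ** 2 * M2 * h ** 3 * (p + 1) / (4 * m))
simplified = 1.8 * M2 * h ** 2 * (p + 1)

# Unroll x_{k+1} = rho * x_k + err and compare with the bound (120).
x0 = 5.0
x = x0
worst_gap = float("inf")
for k in range(1, 501):
    x = rho * x + err
    bound = rho ** k * x0 + 7.2 * M2 * h * (p + 1) / m
    worst_gap = min(worst_gap, bound - x)

print("err <= simplified:", err <= simplified)
print("smallest gap to the bound (120):", worst_gap)
```

The constant $7.2$ is simply $1.8/0.25$, obtained by summing the geometric series with ratio $1-0.25mh$.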

To establish inequality (39), we follow the same steps as in the proof of (38), with a slightly different choice of the process $\boldsymbol{D}$ . More precisely, we define $\boldsymbol{D}$ by

\displaystyle\boldsymbol{D}_{t}-\boldsymbol{D}_{0}

\displaystyle=-(t\mathbf{I}_{p}-0.5t^{2}\nabla^{2}f(\boldsymbol{D}_{0}))\nabla f% (\boldsymbol{D}_{0})+\sqrt{2}\int_{0}^{t}(\mathbf{I}-(t-u)\nabla^{2}f(% \boldsymbol{D}_{0}))\,d\boldsymbol{W}\!_{u}.

(121)

One can check that the conditional distribution of $\boldsymbol{D}_{h}$ given $\boldsymbol{D}_{0}=\boldsymbol{x}$ coincides with the conditional distribution of $\boldsymbol{\vartheta}_{k+1,h}^{\rm LMCO^{\prime}}$ given $\boldsymbol{\vartheta}_{k,h}^{\rm LMCO^{\prime}}=\boldsymbol{x}$ . Therefore, if $\boldsymbol{D}_{0}\sim\nu^{\prime}_{k}$ , then $\boldsymbol{D}_{h}\sim\nu^{\prime}_{k+1}$ and, consequently, $W_{2}(\nu^{\prime}_{k+1},\pi)^{2}\leq\mathbf{E}[\|\boldsymbol{D}_{h}-% \boldsymbol{L}_{h}\|_{2}^{2}]$ .

To ease notation, we set $\mathbf{H}_{0}=\nabla^{2}f(\boldsymbol{D}_{0})$ . The process $\boldsymbol{D}$ satisfies the SDE

\displaystyle d\boldsymbol{D}_{t}

\displaystyle=-\big{[}(\mathbf{I}_{p}-t\nabla^{2}f(\boldsymbol{D}_{0}))\nabla f% (\boldsymbol{D}_{0})+\sqrt{2}\,\mathbf{H}_{0}\boldsymbol{W}\!_{t}\big{]}\,dt+% \sqrt{2}\,d\boldsymbol{W}\!_{t},

(122)

which implies that

	$\displaystyle d\boldsymbol{D}_{t}=$	$\displaystyle-\big{[}\nabla f(\boldsymbol{D}_{0})+\nabla^{2}f(\boldsymbol{D}_{% 0})(\boldsymbol{D}_{t}-\boldsymbol{D}_{0})\big{]}\,dt+\sqrt{2}\,d\boldsymbol{W% }\!_{t}$		(123)
		$\displaystyle-0.5t^{2}\mathbf{H}_{0}^{2}\nabla f(\boldsymbol{D}_{0})\,dt-\sqrt% {2}\,\mathbf{H}_{0}^{2}\int_{0}^{t}(t-u)\,d\boldsymbol{W}\!_{u}\,dt.$		(124)

Proceeding in the same way as for getting (99), we arrive at the decomposition $\boldsymbol{X}_{t}=\boldsymbol{\Delta}_{t}-\boldsymbol{\Delta}_{0}=A_{t}-B_{t}% -C_{t}+S_{t}-E_{t}-F_{t}$ , where $A_{t}$ , $B_{t}$ , $C_{t}$ and $S_{t}$ stand for the four integrals in (99) whereas $E_{t}$ and $F_{t}$ are

	$\displaystyle E_{t}$	$\displaystyle=0.5\int_{0}^{t}e^{-s\mathbf{H}_{0}}s^{2}\,ds\,\mathbf{H}_{0}^{2}% \nabla f(\boldsymbol{D}_{0})$		(125)
	$\displaystyle F_{t}$	$\displaystyle=\sqrt{2}\,\mathbf{H}_{0}^{2}\int_{0}^{t}e^{-s\mathbf{H}_{0}}\int% _{0}^{s}(s-u)\,d\boldsymbol{W}\!_{u}\,ds.$		(126)

Using the properties of the stochastic integral, we get

$\displaystyle\mathbf{E}[\\|F_{h}\\|_{2}^{2}]$	$\displaystyle=2\mathbf{E}\Big{[}\Big{\\|}\mathbf{H}_{0}^{2}\int_{0}^{h}e^{-s% \mathbf{H}_{0}}\int_{0}^{s}(s-u)\,d\boldsymbol{W}\!_{u}\,ds\Big{\\|}_{2}^{2}% \Big{]}$	(127)
	$\displaystyle=2\mathbf{E}\Big{[}\Big{\\|}\int_{0}^{h}\int_{u}^{h}\mathbf{H}_{0}% ^{2}e^{-s\mathbf{H}_{0}}(s-u)\,ds\,d\boldsymbol{W}\!_{u}\Big{\\|}_{2}^{2}\Big{]}$	(128)
	$\displaystyle=2\int_{0}^{h}\Big{\\|}\int_{u}^{h}\mathbf{H}_{0}^{2}e^{-s\mathbf{% H}_{0}}(s-u)\,ds\Big{\\|}_{F}^{2}\,du$	(129)
	$\displaystyle\leq 2M^{4}p\int_{0}^{h}\Big{(}\int_{u}^{h}(s-u)\,ds\Big{)}^{2}\,% du=\frac{M^{4}h^{5}p}{10}.$	(130)
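The constant $1/10$ in (130) comes from the elementary identity $\int_{0}^{h}\big(\int_{u}^{h}(s-u)\,ds\big)^{2}du=h^{5}/20$. A minimal numerical check by the midpoint rule, with a hypothetical value of $h$, reads as follows.

```python
def inner(u, h):
    """Closed form of int_u^h (s - u) ds."""
    return 0.5 * (h - u) ** 2

def outer_integral(h, n=20000):
    """Midpoint-rule approximation of int_0^h inner(u, h)^2 du."""
    du = h / n
    return sum(inner((i + 0.5) * du, h) ** 2 for i in range(n)) * du

h = 0.3                      # hypothetical step-size
approx = outer_integral(h)
exact = h ** 5 / 20
print(approx, exact)
```

Multiplying by the prefactor $2M^{4}p$ then gives $M^{4}h^{5}p/10$, as stated.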

On the other hand,

\displaystyle\|E_{h}\|_{2}

\displaystyle\leq 0.5M^{2}\int_{0}^{h}s^{2}\,ds\|\nabla f(\boldsymbol{D}_{0})% \|_{2}\leq\frac{M^{2}h^{3}}{6}\big{(}\|\nabla f(\boldsymbol{L}_{0})\|_{2}+M\|% \boldsymbol{\Delta}_{0}\|_{2}\big{)},

(131)

which, in view of Lemma 3, implies that

\displaystyle\|E_{h}\|_{L_{2}}\leq\frac{M^{2}h^{3}}{6}\big(\sqrt{Mp}+MW_{2}(\nu^{\prime}_{k},\pi)\big).

(132)

Proceeding as in (113) and using (106), we get

$\displaystyle{\\|\boldsymbol{\Delta}_{h}\\|}_{L_{2}}$	$\displaystyle=\\|\boldsymbol{\Delta}_{0}+A_{h}-B_{h}-C_{h}+S_{h}-E_{h}-F_{h}\\|_% {L_{2}}$	(133)
	$\displaystyle\leq\\|\boldsymbol{\Delta}_{0}+A_{h}+S_{h}-F_{h}\\|_{L_{2}}+\\|B_{h}% \\|_{L_{2}}+\\|C_{h}\\|_{L_{2}}+\\|E_{h}\\|_{L_{2}}$	(134)
	$\displaystyle\leq(\\|\boldsymbol{\Delta}_{0}+A_{h}\\|_{L_{2}}^{2}+\\|S_{h}-F_{h}% \\|_{L_{2}}^{2})^{1/2}+\\|B_{h}\\|_{L_{2}}+\\|C_{h}\\|_{L_{2}}+\\|E_{h}\\|_{L_{2}}.$	(135)

Using the last but one estimate in (109), in conjunction with (130), we get the inequalities

	$\displaystyle\|S_{h}\|_{L_{2}}^{2}$	$\displaystyle\leq(\nicefrac{2}{3})M_{2}Mh^{3}pW_{2}(\nu_{k}^{\prime},\pi)$		(136)
	$\displaystyle\big|\mathbf{E}[S_{h}^{\top}F_{h}]\big|$	$\displaystyle\leq(\nicefrac{1}{\sqrt{15}})M^{2}M_{2}h^{4}pW_{2}(\nu_{k}^{\prime},\pi),$		(137)

which, for $h\leq 3m/(4M^{2})$ , imply that $\|S_{h}-F_{h}\|_{L_{2}}^{2}$ is less than or equal to

	$\displaystyle(\nicefrac{{2}}{{3}})M_{2}Mh^{3}pW_{2}(\nu_{k}^{\prime},\pi)$	$\displaystyle+(\nicefrac{{2}}{{\sqrt{15}}})M^{2}M_{2}h^{4}pW_{2}(\nu_{k}^{% \prime},\pi)+(\nicefrac{{1}}{{10}})M^{4}h^{5}p$		(138)
		$\displaystyle\leq 1.06M_{2}Mh^{3}pW_{2}(\nu_{k}^{\prime},\pi)+0.1M^{4}h^{5}p.$		(139)

Injecting this bound, (102), (103), (107) and (132) in (135), we arrive at

	$\displaystyle\|\boldsymbol{\Delta}_{h}\|_{L_{2}}$	$\displaystyle\leq\big\{(1-mh+0.5M^{2}h^{2})^{2}W_{2}(\nu_{k}^{\prime},\pi)^{2}+1.06M_{2}Mh^{3}pW_{2}(\nu_{k}^{\prime},\pi)+0.1M^{4}h^{5}p\big\}^{1/2}$		(140)
		$\displaystyle\quad+0.88M_{2}h^{2}(p+1)+\Big{(}\mu+\frac{M^{3}h^{3}}{6}\Big{)}W% _{2}(\nu_{k}^{\prime},\pi)+\frac{M^{2}M_{2}h^{4}(p+1)}{16\mu}+\frac{M^{5/2}h^{% 3}\sqrt{p}}{6}.$		(141)

In view of the inequality $\sqrt{a^{2}+b+c}\leq\sqrt{a^{2}+c}+(\nicefrac{{b}}{{2a}})$ , the last display leads to

$\displaystyle W_{2}(\nu_{k+1}^{\prime},\pi)$	$\displaystyle\leq\big\{(1-mh+0.5M^{2}h^{2})^{2}W_{2}(\nu_{k}^{\prime},\pi)^{2}+0.1M^{4}h^{5}p\big\}^{1/2}$	(142)
	$\displaystyle\quad+\frac{0.53M_{2}Mh^{3}p}{1-mh+0.5M^{2}h^{2}}+0.88M_{2}h^{2}(% p+1)+\Big{(}\mu+\frac{M^{3}h^{3}}{6}\Big{)}W_{2}(\nu_{k}^{\prime},\pi)$	(143)
	$\displaystyle\quad+\frac{M^{2}M_{2}h^{4}(p+1)}{16\mu}+\frac{M^{5/2}h^{3}\sqrt{% p}}{6}.$	(144)

For $h\leq 3m/(4M^{2})$ and $\mu=0.25mh$ , we can use the inequality $1-mh+0.5M^{2}h^{2}\geq 17/32$ and simplify the last display as follows:

$\displaystyle W_{2}(\nu_{k+1}^{\prime},\pi)$	$\displaystyle\leq\big\{(1-mh+0.5M^{2}h^{2})^{2}W_{2}(\nu_{k}^{\prime},\pi)^{2}+0.1M^{4}h^{5}p\big\}^{1/2}$	(145)
	$\displaystyle\quad+\frac{0.3975M_{2}h^{2}(p+1)}{1-mh+0.5M^{2}h^{2}}+0.88M_{2}h% ^{2}(p+1)+\Big{(}\mu+\frac{M^{3}h^{3}}{6}\Big{)}W_{2}(\nu_{k}^{\prime},\pi)$	(146)
	$\displaystyle\quad+\frac{3M_{2}h^{2}(p+1)}{16}+\frac{M^{5/2}h^{3}\sqrt{p}}{6}$	(147)
	$\displaystyle\leq\big{\{}(1-mh+0.5M^{2}h^{2})^{2}W_{2}(\nu_{k}^{\prime},\pi)^{% 2}+0.1M^{4}h^{5}p\big{\}}^{1/2}$	(148)
	$\displaystyle\quad+\Big{(}0.25mh+\frac{M^{3}h^{3}}{6}\Big{)}W_{2}(\nu_{k}^{% \prime},\pi)+1.82M_{2}h^{2}(p+1)+\frac{M^{5/2}h^{3}\sqrt{p}}{6}.$	(149)

We apply Lemma 9 to the sequence $x_{k}=W_{2}(\nu_{k}^{\prime},\pi)$ with $A=mh-0.5M^{2}h^{2}$ and $D=0.25mh+M^{3}h^{3}/6$ . For $h\leq 3m/(4M^{2})$ we have $A-D=0.75mh-0.5M^{2}h^{2}-(Mh)^{3}/6\geq 0.25mh$ and $A+D\leq 1.25mh-(3/8)M^{2}h^{2}\leq 0.727$ . This yields

	$\displaystyle W_{2}(\nu_{k}^{\prime},\pi)$	$\displaystyle\leq(1-0.25mh)^{k}W_{2}(\nu^{\prime}_{0},\pi)+\frac{7.28M_{2}h(p+1)}{m}+\frac{2M^{5/2}h^{2}\sqrt{p}}{3m}+\frac{2\sqrt{0.1}\,M^{2}h^{2}\sqrt{p}}{\sqrt{1.273m}}$		(150)
		$\displaystyle\leq(1-0.25mh)^{k}W_{2}(\nu^{\prime}_{0},\pi)+\frac{7.28M_{2}h(p+% 1)}{m}+\frac{1.23M^{5/2}h^{2}\sqrt{p}}{m}.$		(151)

This completes the proof of (39) and that of the theorem.
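The numerical constants used when applying Lemma 9, namely $A-D\geq 0.25mh$, $A+D\leq 0.727$ and $1-mh+0.5M^{2}h^{2}\geq 17/32$ for $h\leq 3m/(4M^{2})$, can be spot-checked on a grid. The grid of values of $m$, of the condition number $M/m$ and of the step-size fraction below is a hypothetical choice; the extreme case is $M=m$ with $h=3m/(4M^{2})$.

```python
def constants_ok(m, M, h, tol=1e-12):
    """Check the three inequalities used with Lemma 9 for given (m, M, h)."""
    A = m * h - 0.5 * M ** 2 * h ** 2
    D = 0.25 * m * h + (M * h) ** 3 / 6
    rho = 1 - m * h + 0.5 * M ** 2 * h ** 2
    return (A - D >= 0.25 * m * h - tol) and (A + D <= 0.727) and (rho >= 17 / 32 - tol)

# Hypothetical grid: h = frac * 3m / (4 M^2) with M = kappa * m.
results = [
    constants_ok(m, kappa * m, frac * 3 * m / (4 * (kappa * m) ** 2))
    for m in (0.1, 0.5, 1.0, 2.0)
    for kappa in (1.0, 2.0, 10.0, 100.0)
    for frac in (0.01, 0.25, 0.5, 0.75, 1.0)
]
print("all inequalities hold on the grid:", all(results))
```

The small tolerance only absorbs floating-point rounding in the boundary case where $1-mh+0.5M^{2}h^{2}$ equals $17/32$ exactly.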

Proof of Proposition 1.

Let us denote $\mathbf{M}_{k}=\int_{0}^{h}e^{-s\mathbf{H}_{k}}\,ds\int_{0}^{1}\nabla^{2}f(% \boldsymbol{D}_{kh}+x\boldsymbol{\Delta}_{k})\,dx$ . From (99), we have $\boldsymbol{\Delta}_{k+1}=\boldsymbol{\Delta}_{k}+A_{k,h}+G_{k,h}$ with

	$\displaystyle A_{k,h}$	$\displaystyle=\int_{0}^{h}e^{-s\mathbf{H}_{k}}\,ds\big{(}\nabla f(\boldsymbol{% D}_{kh})-\nabla f(\boldsymbol{L}_{kh})\big{)}=-\mathbf{M}_{k}\boldsymbol{% \Delta}_{k},$		(152)
	$\displaystyle G_{k,h}$	$\displaystyle=\int_{0}^{h}e^{-s\mathbf{H}_{k}}\big{(}\nabla f(\boldsymbol{L}_{% kh})-\nabla f(\boldsymbol{L}_{s})+\mathbf{H}_{k}(\boldsymbol{L}_{s}-% \boldsymbol{L}_{kh})\big{)}\,ds.$		(153)

Using the fact that

\displaystyle\bigg{\|}\int_{0}^{1}\nabla^{2}f(\boldsymbol{D}_{kh}+x\boldsymbol% {\Delta}_{k})\,dx-\mathbf{H}_{k}\bigg{\|}

\displaystyle\leq\int_{0}^{1}\big{\|}\nabla^{2}f(\boldsymbol{D}_{kh}+x% \boldsymbol{\Delta}_{k})-\mathbf{H}_{k}\big{\|}\,dx\leq\frac{M_{2}}{2}\,\|% \boldsymbol{\Delta}_{k}\|_{2},

(154)

we get $\|\boldsymbol{\Delta}_{k}+A_{k,h}\|_{2}=\|(\mathbf{I}-\mathbf{M}_{k})% \boldsymbol{\Delta}_{k}\|_{2}\leq\frac{M_{2}}{2m}\,\|\boldsymbol{\Delta}_{k}\|% _{2}^{2}+e^{-mh}\|\boldsymbol{\Delta}_{k}\|_{2}$ . This further leads to the recursive inequality

\displaystyle\|\boldsymbol{\Delta}_{k+1}\|_{2}

\displaystyle\leq\frac{M_{2}}{2m}\,\|\boldsymbol{\Delta}_{k}\|_{2}^{2}+e^{-mh}% \|\boldsymbol{\Delta}_{k}\|_{2}+\|G_{k,h}\|_{2}.

(155)

In view of the Minkowski inequality, this yields

\displaystyle(\mathbf{E}[\|\boldsymbol{\Delta}_{k+1}\|_{2}^{q}])^{1/q}

\displaystyle\leq\frac{M_{2}}{2m}\,\mathbf{E}[\|\boldsymbol{\Delta}_{k}\|_{2}^% {2q}]^{1/q}+e^{-mh}\mathbf{E}[\|\boldsymbol{\Delta}_{k}\|_{2}^{2q}]^{1/2q}+% \mathbf{E}[\|G_{k,h}\|_{2}^{q}]^{1/q}.

(156)

We choose some $K\in\mathbb{N}$ and define the sequence $\{x_{0},\ldots,x_{K}\}$ by setting $x_{k}^{2^{K+1-k}}=\mathbf{E}[\|\boldsymbol{\Delta}_{k}\|_{2}^{2^{K+1-k}}]$ . Choosing in (156) $q=2^{K-k}$ , we get

\displaystyle x_{k+1}

\displaystyle\leq\frac{M_{2}}{2m}\,x_{k}^{2}+e^{-mh}x_{k}+\mathbf{E}[\|G_{k,h}% \|_{2}^{2^{K-k}}]^{2^{k-K}},\quad k=0,1,\ldots,K-1.

(157)

We are in a position to apply Lemma 8 to the sequence $\{x_{k}\}_{k=0,\ldots,K}$ . This yields

\displaystyle x_{K}

\displaystyle\leq\frac{2m}{M_{2}}\bigg{(}\frac{M_{2}x_{0}}{2m}+\frac{1}{2}e^{-% mh}\bigg{)}^{2^{K}}\exp\bigg{\{}2^{K}\frac{M_{2}\max_{k}\mathbf{E}[\|G_{k,h}\|% _{2}^{2^{K}}]^{2^{-K}}+me^{-mh}}{m(\frac{M_{2}x_{0}}{2m}+\frac{1}{2}e^{-mh})^{% 2^{K}}}\bigg{\}},

(158)

where $\max_{k}$ is a short notation for $\max_{k=0,1,\ldots,K-1}$ . It suffices now to upper bound the moments of $\|G_{k,h}\|_{2}$ . We have

$\displaystyle\mathbf{E}[\\|G_{k,h}\\|_{2}^{q}]^{1/q}$	$\displaystyle\leq M\int_{0}^{h}e^{-sm}\big{(}\mathbf{E}[\\|\boldsymbol{L}_{kh+s% }-\boldsymbol{L}_{kh}\\|_{2}^{q}]\big{)}^{1/q}\,ds$	(159)
	$\displaystyle\leq M\int_{0}^{h}e^{-sm}\Big{\{}\big{(}\mathbf{E}[\\|\int_{0}^{s}% \nabla f(\boldsymbol{L}_{kh+u})\,du\\|_{2}^{q}]\big{)}^{1/q}+\sqrt{2}\big{(}% \mathbf{E}[\\|\boldsymbol{W}\!_{s}\\|_{2}^{q}]\big{)}^{1/q}\Big{\}}\,ds$	(160)
	$\displaystyle\leq M\int_{0}^{h}e^{-sm}s\,ds\big(\mathbf{E}[\|\nabla f(\boldsymbol{L}_{0})\|_{2}^{q}]\big)^{1/q}+M\sqrt{2p+q-2}\int_{0}^{h}e^{-sm}\sqrt{s}\,ds$	(161)
	$\displaystyle\leq\frac{M}{m^{2}}\big{(}\mathbf{E}[\\|\nabla f(\boldsymbol{L}_{0% })\\|_{2}^{q}]\big{)}^{1/q}+\frac{M}{2m^{3/2}}\sqrt{(2p+q-2)\pi}.$	(162)

On the other hand, by integration by parts, for every $q\in 2\mathbb{N}$ , we have

$\displaystyle\mathbf{E}[\|\nabla f(\boldsymbol{L}_{0})\|_{2}^{q}]$	$\displaystyle=-\int_{\mathbb{R}^{p}}\|\nabla f(\boldsymbol{x})\|_{2}^{q-2}\,\nabla f(\boldsymbol{x})^{\top}\nabla\pi(\boldsymbol{x})\,d\boldsymbol{x}$	(163)
	$\displaystyle=\sum_{\ell=1}^{p}\int_{\mathbb{R}^{p}}\partial_{\ell}\Big{(}\\|% \nabla f(\boldsymbol{x})\\|_{2}^{q-2}\,\partial_{\ell}f(\boldsymbol{x})\Big{)}% \pi(\boldsymbol{x})\,d\boldsymbol{x}$	(164)
	$\displaystyle\leq M(p+q-2)\mathbf{E}[\\|\nabla f(\boldsymbol{L}_{0})\\|_{2}^{q-2% }].$	(165)

This yields $(\mathbf{E}[\|\nabla f(\boldsymbol{L}_{0})\|_{2}^{q}])^{1/q}\leq\sqrt{M(p+0.5q% -1)}$ . Combining all these estimates, we arrive at

\mathbf{E}[\|G_{k,h}\|_{2}^{q}]^{1/q}\leq\frac{1.6M^{3/2}\sqrt{2p+q-2}}{m^{2}}.

Combining this inequality with (158) and replacing $x_{K}$ by $(\mathbf{E}[\|\boldsymbol{\Delta}_{K}\|_{2}^{2}])^{1/2}$ , we get

\displaystyle(\mathbf{E}[\|\boldsymbol{\Delta}_{K}\|_{2}^{2}])^{1/2}

\displaystyle\leq\frac{2m}{M_{2}}\bigg{(}\frac{M_{2}x_{0}}{2m}+\frac{1}{2}e^{-% mh}\bigg{)}^{2^{K}}\exp\bigg{\{}2^{K}\frac{1.6M_{2}M^{3/2}\sqrt{2p+2^{K-1}-2}+% m^{3}e^{-mh}}{m^{3}(\frac{M_{2}x_{0}}{2m}+\frac{1}{2}e^{-mh})^{2^{K}}}\bigg{\}}.

(166)

This completes the proof of the proposition. ∎

7.7 Proofs of lemmas

Here we provide the proofs of Lemma 2, Lemma 3 and Lemma 4.

Proof of Lemma 2.

We start by recalling the following inequality (²⁴, Theorem 2.12), true for any $m$-strongly convex and $M$-gradient Lipschitz function $f$:

(\boldsymbol{y}-\boldsymbol{x})^{\top}\left(\nabla f(\boldsymbol{y})-\nabla f(% \boldsymbol{x})\right)\geq\frac{mM}{m+M}\|\boldsymbol{y}-\boldsymbol{x}\|^{2}_% {2}+\frac{1}{m+M}\left\|\nabla f(\boldsymbol{y})-\nabla f(\boldsymbol{x})% \right\|_{2}^{2},

(167)

for all vectors $\boldsymbol{x}$ and $\boldsymbol{y}$ from $\mathbb{R}^{p}$. This yields

$\displaystyle\\|\boldsymbol{y}-\boldsymbol{x}-h(\nabla f(\boldsymbol{y})-$	$\displaystyle\nabla f(\boldsymbol{x}))\\|_{2}^{2}$	(168)
	$\displaystyle=\\|\boldsymbol{y}-\boldsymbol{x}\\|^{2}_{2}-2h(\boldsymbol{y}-% \boldsymbol{x})^{\top}(\nabla f(\boldsymbol{y})-\nabla f(\boldsymbol{x}))+h^{2% }\\|\nabla f(\boldsymbol{y})-\nabla f(\boldsymbol{x})\\|_{2}^{2}$	(169)
	$\displaystyle\leq\left(1-\frac{2hmM}{m+M}\right)\\|\boldsymbol{y}-\boldsymbol{x% }\\|_{2}^{2}+h\left(h-\frac{2}{m+M}\right)\\|\nabla f(\boldsymbol{y})-\nabla f(% \boldsymbol{x})\\|_{2}^{2}.$	(170)

Since $f$ is $m$ -strongly convex, we have (²⁴, Theorem 2.1.9)

\|\nabla f(\boldsymbol{y})-\nabla f(\boldsymbol{x})\|_{2}\geq m\|\boldsymbol{y% }-\boldsymbol{x}\|_{2}.

(171)

In the case $h\leq\frac{2}{m+M}$ , applying the previous result to the second summand, we get

\|\boldsymbol{y}-\boldsymbol{x}-h(\nabla f(\boldsymbol{y})-\nabla f(% \boldsymbol{x}))\|_{2}^{2}\leq(1-hm)^{2}\|\boldsymbol{y}-\boldsymbol{x}\|^{2}.

(172)

In the case when $h\geq\frac{2}{m+M}$ , we use the Lipschitz continuity of $\nabla f$ , which leads to

\|\boldsymbol{y}-\boldsymbol{x}-h(\nabla f(\boldsymbol{y})-\nabla f(% \boldsymbol{x}))\|_{2}^{2}\leq(hM-1)^{2}\|\boldsymbol{y}-\boldsymbol{x}\|^{2}.

(173)

Summing up, for all $h\in(0,2/M)$ we have shown

\|\boldsymbol{y}-\boldsymbol{x}-h(\nabla f(\boldsymbol{y})-\nabla f(% \boldsymbol{x}))\|_{2}^{2}\leq\left\{(1-hm)^{2}\vee(hM-1)^{2}\right\}\|% \boldsymbol{y}-\boldsymbol{x}\|^{2}.

(174)

This completes the proof. ∎
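For a quadratic potential $f(\boldsymbol{x})=\tfrac{1}{2}\boldsymbol{x}^{\top}\mathbf{Q}\boldsymbol{x}$ with the spectrum of $\mathbf{Q}$ in $[m,M]$, inequality (174) can be checked directly. The sketch below does so for random pairs of points and a range of step-sizes in $(0,2/M)$. The quadratic test function, the dimension and the random seed are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
m, M, p = 0.5, 4.0, 6                      # hypothetical constants
eigs = np.linspace(m, M, p)
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
Hess = Q @ np.diag(eigs) @ Q.T             # Hessian of f(x) = 0.5 x^T Hess x

def grad(x):
    return Hess @ x

worst = 0.0
for h in np.linspace(0.01, 2 / M - 0.01, 25):
    rho = max(1 - h * m, h * M - 1)        # contraction factor in (174)
    for _ in range(50):
        x, y = rng.standard_normal(p), rng.standard_normal(p)
        lhs = np.linalg.norm(y - x - h * (grad(y) - grad(x)))
        worst = max(worst, lhs - rho * np.linalg.norm(y - x))

print("largest violation of (174):", worst)
```

In the quadratic case the bound is attained along the eigenvectors associated with $m$ or $M$, which is why the factor $(1-hm)\vee(hM-1)$ cannot be improved in general.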

Proof of Lemma 3.

We start the proof with the case $p=1$. The function $x\mapsto f^{\prime}(x)$, being Lipschitz continuous, is differentiable almost everywhere. Furthermore, it is clear that $|f^{\prime\prime}(x)|\leq M$ for every $x$ for which this second derivative exists. The result of (²⁹, Theorem 7.20) implies that

f^{\prime}(x)-f^{\prime}(0)=\int_{0}^{x}f^{\prime\prime}(y)\,dy.

(175)

Therefore, using the relation $f^{\prime}(x)\,\pi(x)=-\pi^{\prime}(x)$ , we get

$\displaystyle\int_{\mathbb{R}}f^{\prime}(x)^{2}\,\pi(x)\,dx$	$\displaystyle=f^{\prime}(0)\int_{\mathbb{R}}f^{\prime}(x)\,\pi(x)\,dx+\int_{% \mathbb{R}}\Big{(}\int_{0}^{x}f^{\prime\prime}(y)\,dy\Big{)}f^{\prime}(x)\,\pi% (x)\,dx$	(176)
	$\displaystyle=-f^{\prime}(0)\int_{\mathbb{R}}\pi^{\prime}(x)\,dx-\int_{\mathbb% {R}}\Big{(}\int_{0}^{x}f^{\prime\prime}(y)\,dy\Big{)}\pi^{\prime}(x)\,dx$	(177)
	$\displaystyle=-\int_{0}^{\infty}\int_{0}^{x}f^{\prime\prime}(y)\,\pi^{\prime}(% x)\,dy\,dx+\int_{-\infty}^{0}\int_{x}^{0}f^{\prime\prime}(y)\,\pi^{\prime}(x)% \,dy\,dx.$	(178)

In view of Fubini’s theorem, we arrive at

\displaystyle\int_{\mathbb{R}}f^{\prime}(x)^{2}\,\pi(x)\,dx

\displaystyle=\int_{0}^{\infty}f^{\prime\prime}(y)\,\pi(y)\,dy+\int_{-\infty}^% {0}f^{\prime\prime}(y)\,\pi(y)\,dy\leq M.

(179)

Now let us return to the multidimensional case:

\int_{\mathbb{R}^{p}}\|\nabla f(\boldsymbol{x})\|_{2}^{2}\,\pi(\boldsymbol{x})% \,d\boldsymbol{x}=\sum\limits_{k=1}^{p}\int_{\mathbb{R}^{p}}\left(\frac{% \partial f}{\partial x_{k}}(\boldsymbol{x})\right)^{2}\pi(\boldsymbol{x})\,d% \boldsymbol{x}.

(180)

We will show that each of the summands is at most $M$, so that the sum is at most $Mp$. Let us prove it for $k=1$; the proof is similar for $k>1$. Using Fubini’s theorem, we have

\int_{\mathbb{R}^{p}}\left(\frac{\partial f}{\partial x_{1}}(\boldsymbol{x})% \right)^{2}\pi(\boldsymbol{x})\,d\boldsymbol{x}=\int_{\mathbb{R}}\ldots\int_{% \mathbb{R}}\left(\frac{\partial f}{\partial x_{1}}(x_{1},x_{2},\ldots,x_{p})% \right)^{2}\pi(x_{1},x_{2},\ldots,x_{p})\,dx_{1}dx_{2}\ldots dx_{p}.

(181)

Let us fix the $(p-1)$ -tuple $(x_{2},x_{3},\ldots,x_{p})$ and define functions $g$ and $\eta$ as $g(t)=f(t,x_{2},\ldots,x_{p})$ and $\eta(t)=\pi(t,x_{2},\ldots,x_{p})$ , respectively. It is easy to verify that $\eta$ is an integrable log-concave function, with $g$ as its potential. The latter is also differentiable and its derivative is Lipschitz-continuous with constant $M$ . Thus we have

\int_{\mathbb{R}}\left(\frac{\partial f}{\partial x_{1}}(x_{1},x_{2},\ldots,x_% {p})\right)^{2}\pi(x_{1},x_{2},\ldots,x_{p})\,dx_{1}=\int_{\mathbb{R}}\left(g^% {\prime}(t)\right)^{2}\eta(t)dt.

(182)

From the definition, one can verify that $\int_{\mathbb{R}}\eta(t)dt=\pi_{1}(x_{2},\ldots,x_{p})$, where $\pi_{1}$ is the marginal density of all the coordinates except the first. Therefore,

	$\displaystyle\int_{\mathbb{R}}g^{\prime}(t)^{2}\eta(t)dt$	$\displaystyle=\pi_{1}(x_{2},\ldots,x_{p})\int_{\mathbb{R}}g^{\prime}(t)^{2}% \frac{\eta(t)}{\pi_{1}(x_{2},\ldots,x_{p})}dt$		(183)
		$\displaystyle\leq M\pi_{1}(x_{2},\ldots,x_{p})$		(184)

The last inequality is true due to (179). Returning to our initial integral, we obtain

\int_{\mathbb{R}^{p}}\left(\frac{\partial f}{\partial x_{1}}(\boldsymbol{x})% \right)^{2}\pi(\boldsymbol{x})\,d\boldsymbol{x}\leq M\int_{\mathbb{R}^{p-1}}% \pi_{1}(x_{2},\ldots,x_{p})dx_{2}\ldots dx_{p}=M.

(185)

This completes the proof. ∎
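In the Gaussian case the claim of Lemma 3 can be verified in closed form: if $f(\boldsymbol{x})=\tfrac{1}{2}\boldsymbol{x}^{\top}\mathbf{Q}\boldsymbol{x}$, then $\pi=\mathcal{N}(0,\mathbf{Q}^{-1})$ and $\mathbf{E}[\|\nabla f(\boldsymbol{X})\|_{2}^{2}]=\operatorname{tr}(\mathbf{Q})\leq Mp$. The sketch below computes this quantity exactly and by Monte Carlo. The choice of a Gaussian target, the spectrum, the sample size and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, M, p = 0.5, 4.0, 5                      # hypothetical constants
eigs = np.linspace(m, M, p)
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
Hess = Q @ np.diag(eigs) @ Q.T             # f(x) = 0.5 x^T Hess x
Sigma = np.linalg.inv(Hess)                # covariance of pi = N(0, Hess^{-1})

# Exact value: E||Hess X||^2 = trace(Hess Sigma Hess) = trace(Hess).
exact = np.trace(Hess @ Sigma @ Hess)

# Monte Carlo estimate with X ~ N(0, Sigma).
chol = np.linalg.cholesky(Sigma)
X = rng.standard_normal((100_000, p)) @ chol.T
mc = np.mean(np.sum((X @ Hess) ** 2, axis=1))

print("exact:", exact, "Monte Carlo:", mc, "bound M*p:", M * p)
```

The bound $Mp$ is attained when all eigenvalues of $\mathbf{Q}$ equal $M$.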

Proof of Lemma 4.

Since the process $\boldsymbol{L}$ is stationary, $\boldsymbol{V}(a)$ has the same distribution as $\boldsymbol{V}(0)$. For this reason, it suffices to prove the claim of the lemma for $a=0$ only. Using the Cauchy-Schwarz inequality and the Lipschitz continuity of $\nabla f$, we get

$\displaystyle\\|\boldsymbol{V}(0)\\|_{L_{2}}$	$\displaystyle=\Big{\\|}\int_{0}^{h}\big{(}\nabla f(\boldsymbol{L}_{t})-\nabla f% (\boldsymbol{L}_{0})\big{)}\,dt\Big{\\|}_{L_{2}}$	(186)
	$\displaystyle\leq\int_{0}^{h}\big{\\|}\nabla f(\boldsymbol{L}_{t})-\nabla f(% \boldsymbol{L}_{0})\big{\\|}_{L_{2}}\,dt$	(187)
	$\displaystyle\leq M\int_{0}^{h}\big{\\|}\boldsymbol{L}_{t}-\boldsymbol{L}_{0}% \big{\\|}_{L_{2}}\,dt.$	(188)

Combining this inequality with the definition of $\boldsymbol{L}_{t}$ , we arrive at

$\displaystyle\\|\boldsymbol{V}(0)\\|_{L_{2}}$	$\displaystyle\leq M\int_{0}^{h}\big{\\|}-\int_{0}^{t}\nabla f(\boldsymbol{L}_{s% })\,ds+\sqrt{2}\,\boldsymbol{W}\!_{t}\big{\\|}_{L_{2}}\,dt$	(189)
	$\displaystyle\leq M\int_{0}^{h}\big{\\|}\int_{0}^{t}\nabla f(\boldsymbol{L}_{s}% )\,ds\big{\\|}_{L_{2}}\,dt+M\int_{0}^{h}\big{\\|}\sqrt{2}\,\boldsymbol{W}\!_{t}% \big{\\|}_{L_{2}}\,dt$	(190)
	$\displaystyle\leq M\int_{0}^{h}\int_{0}^{t}\\|\nabla f(\boldsymbol{L}_{s})\\|_{L% _{2}}\,ds\,dt+M\int_{0}^{h}\sqrt{2pt}\,dt.$	(191)

In view of the stationarity of $\boldsymbol{L}_{t}$ , we have $\|\nabla f(\boldsymbol{L}_{s})\|_{L_{2}}=\|\nabla f(\boldsymbol{L}_{0})\|_{L_{% 2}}$ , which leads to

\displaystyle\|\boldsymbol{V}(0)\|_{L_{2}}

\displaystyle\leq(\nicefrac{{1}}{{2}})Mh^{2}\big{\|}\nabla f(\boldsymbol{L}_{0% })\big{\|}_{L_{2}}+(\nicefrac{{2}}{{3}})M\sqrt{2p}\;h^{3/2}.

(192)

To complete the proof, it suffices to apply Lemma 3. ∎
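For the reader's convenience, the last substitution can be written out explicitly. Assuming Lemma 3 is used in the form $\|\nabla f(\boldsymbol{L}_{0})\|_{L_{2}}\leq\sqrt{Mp}$ (the form in which it is applied in (132)), inequality (192) becomes:

```latex
\|\boldsymbol{V}(0)\|_{L_2}
  \le \tfrac{1}{2}\,M h^{2}\sqrt{Mp} + \tfrac{2}{3}\,M\sqrt{2p}\;h^{3/2}
  = \tfrac{1}{2}\,M^{3/2}h^{2}\sqrt{p} + \tfrac{2\sqrt{2}}{3}\,M h^{3/2}\sqrt{p}.
```

The first term is of order $h^{2}$ and the second of order $h^{3/2}$, so the latter dominates for small step-sizes.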

Lemma 6.

Let us denote

\begin{align}
\widetilde{\boldsymbol{V}} &= \int_{0}^{h}\big(\nabla f(\boldsymbol{L}_{t})-\nabla f(\boldsymbol{L}_{0})-\nabla^{2}f(\boldsymbol{L}_{0})(\boldsymbol{L}_{t}-\boldsymbol{L}_{0})\big)\,dt, \tag{193}\\
\bar{\boldsymbol{V}} &= \int_{0}^{h}\Big\{\nabla f(\boldsymbol{L}_{t})-\nabla f(\boldsymbol{L}_{0})-\sqrt{2}\,\int_{0}^{t}\nabla^{2}f(\boldsymbol{L}_{s})\,d\boldsymbol{W}\!_{s}\Big\}\,dt. \tag{194}
\end{align}

If $f$ satisfies Condition F and $h\leq 1/M$, then

\begin{align}
(\mathbf{E}[\|\widetilde{\boldsymbol{V}}\|_{2}^{2}])^{1/2} &\leq 0.877M_{2}h^{2}(p^{2}+2p)^{1/2}, \tag{195}\\
\|\bar{\boldsymbol{V}}\|_{L_{2}} &\leq (\nicefrac{1}{2})(M^{3/2}\sqrt{p}+M_{2}p)h^{2}. \tag{196}
\end{align}

Proof.

We first note that we have

\begin{align}
\|\widetilde{\boldsymbol{V}}\|_{2} &\leq \int_{0}^{h}\Big\|\int_{0}^{1}\big(\nabla^{2}\!f(\boldsymbol{L}_{0}+x(\boldsymbol{L}_{t}-\boldsymbol{L}_{0}))-\nabla^{2}\!f(\boldsymbol{L}_{0})\big)\,dx\,(\boldsymbol{L}_{t}-\boldsymbol{L}_{0})\Big\|_{2}\,dt \tag{197}\\
&\leq 0.5M_{2}\int_{0}^{h}\|\boldsymbol{L}_{t}-\boldsymbol{L}_{0}\|_{2}^{2}\,dt. \tag{198}
\end{align}

In view of (46), this implies that $(\mathbf{E}[\|\widetilde{\boldsymbol{V}}\|_{2}^{2}])^{1/2}\leq 0.5M_{2}\int_{0}^{h}(\mathbf{E}[\|\boldsymbol{L}_{t}-\boldsymbol{L}_{0}\|_{2}^{4}])^{1/2}\,dt$. Using the triangle inequality and integration by parts (the detailed computations are omitted to save space), we arrive at

\begin{align}
\mathbf{E}[\|\boldsymbol{L}_{t}-\boldsymbol{L}_{0}\|_{2}^{4}] &\leq \mathbf{E}\Big[\Big\|\int_{0}^{t}\nabla f(\boldsymbol{L}_{s})\,ds\Big\|_{2}^{4}\Big]+4\mathbf{E}[\|\boldsymbol{W}\!_{t}\|_{2}^{4}] \tag{199}\\
&\qquad+12\bigg(\mathbf{E}\Big[\Big\|\int_{0}^{t}\nabla f(\boldsymbol{L}_{s})\,ds\Big\|_{2}^{4}\Big]\,\mathbf{E}[\|\sqrt{2}\boldsymbol{W}\!_{t}\|_{2}^{4}]\bigg)^{1/2} \tag{200}\\
&\leq t^{4}M^{2}p(2+p)+12t^{3}Mp(2+p)+4t^{2}p(2+p) \tag{201}\\
&= p(2+p)t^{2}(t^{2}M^{2}+12tM+4). \tag{202}
\end{align}

Integrating this inequality, we get

\begin{align}
(\mathbf{E}[\|\widetilde{\boldsymbol{V}}\|_{2}^{2}])^{1/2} &\leq 0.5M_{2}(p^{2}+2p)^{1/2}\int_{0}^{h}t(t^{2}M^{2}+12tM+4)^{1/2}\,dt \tag{203}\\
&= \frac{0.5M_{2}(p^{2}+2p)^{1/2}}{M^{2}}\int_{0}^{Mh}t(t^{2}+12t+4)^{1/2}\,dt \tag{204}\\
&\leq 0.5M_{2}h^{2}(p^{2}+2p)^{1/2}\sup_{x\in(0,1]}\frac{1}{x^{2}}\int_{0}^{x}t(t^{2}+12t+4)^{1/2}\,dt \tag{205}\\
&= 0.5M_{2}h^{2}(p^{2}+2p)^{1/2}\int_{0}^{1}t(t^{2}+12t+4)^{1/2}\,dt \tag{206}\\
&\leq 0.877M_{2}h^{2}(p^{2}+2p)^{1/2}. \tag{207}
\end{align}

Here (205) uses the assumption $Mh\leq 1$, and (206) holds because the function $x\mapsto x^{-2}\int_{0}^{x}t(t^{2}+12t+4)^{1/2}\,dt$ is increasing, so that the supremum is attained at $x=1$.
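The numerical constant in the last display can be sanity-checked by quadrature. The sketch below evaluates $\frac{0.5}{x^{2}}\int_{0}^{x}t(t^{2}+12t+4)^{1/2}\,dt$ with a midpoint rule; the function name and the grid size are illustrative choices, not part of the paper. Under these assumptions, the value is roughly 0.876 at $x=1$ (the case $h\leq 1/M$) and roughly 1.156 at $x=2$, and it increases with $x$.

```python
import math


def scaled_integral(x, n=20000):
    """Return (0.5 / x^2) * int_0^x t * sqrt(t^2 + 12 t + 4) dt (midpoint rule)."""
    step = x / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * step
        total += t * math.sqrt(t * t + 12.0 * t + 4.0)
    return 0.5 * step * total / (x * x)


if __name__ == "__main__":
    for x in (0.25, 0.5, 1.0, 2.0):
        print(x, scaled_integral(x))
```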

This completes the proof of (195). To prove (196), we first assume that $f$ is three times continuously differentiable and apply the Itô formula:

$$
\nabla f(\boldsymbol{L}_{t})-\nabla f(\boldsymbol{L}_{0})=\int_{0}^{t}\nabla^{2}f(\boldsymbol{L}_{s})\,d\boldsymbol{L}_{s}+\int_{0}^{t}\Delta[\nabla f(\boldsymbol{L}_{s})]\,ds.
$$

Let us check that $\|\Delta[\nabla f(\boldsymbol{x})]\|_{2}=\|\nabla[\Delta f(\boldsymbol{x})]\|_{2}\leq M_{2}p$ for every $\boldsymbol{x}\in\mathbb{R}^{p}$. Indeed, let us introduce the function $g:\mathbb{R}^{p}\to\mathbb{R}$ defined by $g(\boldsymbol{x})=\Delta f(\boldsymbol{x})=\operatorname{tr}[\nabla^{2}f(\boldsymbol{x})]$. The third item of Condition F implies that $|g(\boldsymbol{x}+t\boldsymbol{u})-g(\boldsymbol{x})|\leq pM_{2}|t|$ for every $t\in\mathbb{R}$ and every unit vector $\boldsymbol{u}\in\mathbb{R}^{p}$. Therefore, dividing by $|t|$ and letting $t$ go to zero, we get $|\boldsymbol{u}^{\top}\nabla g(\boldsymbol{x})|\leq pM_{2}$ for every unit vector $\boldsymbol{u}$. Choosing $\boldsymbol{u}$ proportional to $\nabla g(\boldsymbol{x})$, we get the inequality $\|\nabla g(\boldsymbol{x})\|_{2}=\|\nabla[\Delta f(\boldsymbol{x})]\|_{2}\leq pM_{2}$. This leads to

\begin{align}
\|\bar{\boldsymbol{V}}\|_{L_{2}} &\leq \int_{0}^{h}\int_{0}^{t}\big\|\nabla^{2}f(\boldsymbol{L}_{s})\nabla f(\boldsymbol{L}_{s})-\Delta[\nabla f(\boldsymbol{L}_{s})]\big\|_{L_{2}}\,ds\,dt \tag{208}\\
&\leq \int_{0}^{h}\int_{0}^{t}\big(M\big\|\nabla f(\boldsymbol{L}_{s})\big\|_{L_{2}}+M_{2}p\big)\,ds\,dt \tag{209}\\
&\leq (\nicefrac{1}{2})(M^{3/2}\sqrt{p}+M_{2}p)h^{2}, \tag{210}
\end{align}

where the last step uses the stationarity of $\boldsymbol{L}$ and the bound $\|\nabla f(\boldsymbol{L}_{0})\|_{L_{2}}\leq\sqrt{Mp}$.

This completes the proof of the lemma in the case of three times continuously differentiable functions $f$. If $f$ is twice differentiable with a second-order derivative satisfying the Lipschitz condition, then we can choose an arbitrarily small $\delta>0$ and apply the previous result to the smoothed function $f_{\delta}=f*\varphi_{\delta}$. Here, $\varphi_{\delta}$ denotes the density of the Gaussian distribution $\mathcal{N}_{p}(0,\delta^{2}\mathbf{I}_{p})$ and “$*$” is the convolution operator. The formula $\nabla^{2}f_{\delta}=(\nabla^{2}f)*\varphi_{\delta}$ implies that $f_{\delta}$ satisfies the required smoothness assumptions with the same constants $M$ and $M_{2}$ as the function $f$. Thus, defining $\bar{\boldsymbol{V}}_{\delta}$ in the same way as $\bar{\boldsymbol{V}}$ with $f_{\delta}$ instead of $f$, we get

\begin{align}
\|\bar{\boldsymbol{V}}_{\delta}\|_{L_{2}} \leq (\nicefrac{1}{2})(M^{3/2}\sqrt{p}+M_{2}p)h^{2}. \tag{211}
\end{align}

On the other hand, setting $g_{\delta}=f-f_{\delta}$ , we get

\begin{align}
\|\bar{\boldsymbol{V}}_{\delta}-\bar{\boldsymbol{V}}\|_{L_{2}} &\leq \int_{0}^{h}\Big\|\nabla g_{\delta}(\boldsymbol{L}_{t})-\nabla g_{\delta}(\boldsymbol{L}_{0})-\sqrt{2}\,\int_{0}^{t}\nabla^{2}g_{\delta}(\boldsymbol{L}_{s})\,d\boldsymbol{W}\!_{s}\Big\|_{L_{2}}\,dt \tag{212}\\
&\leq \int_{0}^{h}\big\|\nabla g_{\delta}(\boldsymbol{L}_{t})-\nabla g_{\delta}(\boldsymbol{L}_{0})\big\|_{L_{2}}\,dt \tag{213}\\
&\qquad+\sqrt{2p}\,\int_{0}^{h}\bigg(\int_{0}^{t}\mathbf{E}\|\nabla^{2}g_{\delta}(\boldsymbol{L}_{s})\|^{2}\,ds\bigg)^{1/2}\,dt. \tag{214}
\end{align}

Using the Lipschitz continuity of $\nabla f$ and $\nabla^{2}f$ , one easily checks that

\begin{align}
\|\nabla g_{\delta}(\boldsymbol{x})\|_{2} &\leq \int_{\mathbb{R}^{p}}\|\nabla f(\boldsymbol{x}-\boldsymbol{y})-\nabla f(\boldsymbol{x})\|_{2}\,\varphi_{\delta}(\boldsymbol{y})\,d\boldsymbol{y} \tag{215}\\
&\leq M\int_{\mathbb{R}^{p}}\|\boldsymbol{y}\|_{2}\,\varphi_{\delta}(\boldsymbol{y})\,d\boldsymbol{y}\leq M\delta\sqrt{p}, \tag{216}\\
\|\nabla^{2}g_{\delta}(\boldsymbol{x})\| &\leq \int_{\mathbb{R}^{p}}\|\nabla^{2}f(\boldsymbol{x}-\boldsymbol{y})-\nabla^{2}f(\boldsymbol{x})\|\,\varphi_{\delta}(\boldsymbol{y})\,d\boldsymbol{y} \tag{217}\\
&\leq M_{2}\int_{\mathbb{R}^{p}}\|\boldsymbol{y}\|_{2}\,\varphi_{\delta}(\boldsymbol{y})\,d\boldsymbol{y}\leq M_{2}\delta\sqrt{p}. \tag{218}
\end{align}

This implies that the limit, when $\delta$ tends to zero, of $\|\bar{\boldsymbol{V}}_{\delta}-\bar{\boldsymbol{V}}\|_{L_{2}}$ is equal to zero. As a consequence,

\begin{align}
\|\bar{\boldsymbol{V}}\|_{L_{2}} &\leq \lim_{\delta\to 0}\big(\|\bar{\boldsymbol{V}}_{\delta}\|_{L_{2}}+\|\bar{\boldsymbol{V}}_{\delta}-\bar{\boldsymbol{V}}\|_{L_{2}}\big) \tag{219}\\
&\leq (\nicefrac{1}{2})(M^{3/2}\sqrt{p}+M_{2}p)h^{2}+\lim_{\delta\to 0}\|\bar{\boldsymbol{V}}_{\delta}-\bar{\boldsymbol{V}}\|_{L_{2}} \tag{220}\\
&\leq (\nicefrac{1}{2})(M^{3/2}\sqrt{p}+M_{2}p)h^{2}. \tag{221}
\end{align}

This completes the proof of the lemma. ∎

Lemma 7.

Let $A$ , $B$ and $C$ be given non-negative numbers such that $A\in(0,1)$ . Assume that the sequence of non-negative numbers $\{x_{k}\}_{k\in\mathbb{N}}$ satisfies the recursive inequality

\begin{align}
x_{k+1}^{2} \leq [(1-A)x_{k}+C]^{2}+B^{2} \tag{222}
\end{align}

for every integer $k\geq 0$. Let us denote

\begin{align}
E &= \frac{(1-A)C+\big\{C^{2}+(2A-A^{2})B^{2}\big\}^{1/2}}{2A-A^{2}}\geq\frac{(1-A)C}{A(2-A)}+\frac{B}{\sqrt{A(2-A)}}, \tag{223}\\
D &= \big\{[(1-A)E+C]^{2}+B^{2}\big\}^{1/2}-(1-A)E\leq C+\frac{B^{2}A}{C+\sqrt{A(2-A)}\,B}. \tag{224}
\end{align}

Then

\begin{align}
x_{k} \leq (1-A)^{k}x_{0}+\frac{D}{A}\leq(1-A)^{k}x_{0}+\frac{C}{A}+\frac{B^{2}}{C+\sqrt{A(2-A)}\,B} \tag{225}
\end{align}

for all integers $k\geq 0$ .

Proof.

We will repeatedly use the fact that $D=EA$ . Let us introduce the sequence $y_{k}$ defined as follows: $y_{0}=x_{0}+E$ and

\begin{align}
y_{k+1}=(1-A)y_{k}+D,\quad k=0,1,2,\ldots \tag{226}
\end{align}

We will first show that $y_{k}\geq x_{k}\vee E$ for every $k\geq 0$ . This can be done by mathematical induction. For $k=0$ , this claim directly follows from the definition of $y_{0}$ . Assume that for some $k$ , we have $x_{k}\leq y_{k}$ and $y_{k}\geq E$ . Then, for $k+1$ , we have

\begin{align}
x_{k+1} &\leq \big([(1-A)x_{k}+C]^{2}+B^{2}\big)^{1/2} \tag{227}\\
&\leq \big([(1-A)y_{k}+C]^{2}+B^{2}\big)^{1/2} \tag{228}\\
&= (1-A)y_{k}+\big([(1-A)y_{k}+C]^{2}+B^{2}\big)^{1/2}-(1-A)y_{k} \tag{229}\\
&\leq (1-A)y_{k}+\big([(1-A)E+C]^{2}+B^{2}\big)^{1/2}-(1-A)E=y_{k+1} \tag{230}
\end{align}

and, since $D=EA$ , $y_{k+1}=(1-A)y_{k}+D\geq(1-A)E+EA=E$ . Thus, we have checked that the sequence $x_{k}$ is dominated by the sequence $y_{k}$ . It remains to establish an upper bound on $y_{k}$ . This is an easy task since $y_{k}$ satisfies a first-order linear recurrence relation. We get

\begin{align}
y_{k} &= (1-A)^{k}y_{0}+\sum_{j=0}^{k-1}(1-A)^{j}D \tag{231}\\
&= (1-A)^{k}\Big(x_{0}+\frac{D}{A}\Big)+\frac{D}{A}\big(1-(1-A)^{k}\big) \tag{232}\\
&= (1-A)^{k}x_{0}+\frac{D}{A}, \tag{233}
\end{align}

where the second line uses $y_{0}=x_{0}+E$ and $E=D/A$.

This completes the proof of (225). ∎
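The bound (225) can be illustrated numerically. The following sketch iterates the recursion (222) with equality, which produces the largest sequence compatible with (222) because its right-hand side is increasing in $x_{k}\geq 0$, and compares it with the right-most expression in (225). The function names and the parameter values are illustrative assumptions, not taken from the paper.

```python
import math


def lemma7_bound(k, x0, A, B, C):
    """Right-most expression in (225)."""
    return (1 - A) ** k * x0 + C / A + B ** 2 / (C + math.sqrt(A * (2 - A)) * B)


def iterate_lemma7(x0, A, B, C, n_steps):
    """Iterate (222) with equality: x_{k+1} = sqrt(((1-A) x_k + C)^2 + B^2)."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(math.sqrt(((1 - A) * xs[-1] + C) ** 2 + B ** 2))
    return xs


if __name__ == "__main__":
    A, B, C, x0 = 0.1, 0.5, 0.2, 3.0  # illustrative values only
    xs = iterate_lemma7(x0, A, B, C, 200)
    slack = min(lemma7_bound(k, x0, A, B, C) - x for k, x in enumerate(xs))
    print("smallest slack between the bound and the sequence:", slack)
```

A non-negative slack for every $k$ is what the lemma predicts.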

Proof of Lemma 5.

Let us introduce the $\mathbb{R}^{p}$-valued random process $\boldsymbol{v}_{t}=-\exp\big\{\int_{0}^{t}\mathbf{H}_{u}\,du\big\}\int_{0}^{t}\mathbf{H}_{s}\boldsymbol{x}_{s}\,ds$. The time derivative of this process satisfies

$$
\boldsymbol{v}^{\prime}_{t}=-\exp\Big\{\int_{0}^{t}\mathbf{H}_{u}\,du\Big\}\mathbf{H}_{t}\boldsymbol{\alpha}_{t}.
$$

This implies that $\boldsymbol{v}_{t}=-\int_{0}^{t}\exp\big\{\int_{0}^{s}\mathbf{H}_{u}\,du\big\}\mathbf{H}_{s}\boldsymbol{\alpha}_{s}\,ds$. Using the definition of $\boldsymbol{v}_{t}$, we can check that $\int_{0}^{t}\mathbf{H}_{s}\boldsymbol{x}_{s}\,ds=-\exp\big\{-\int_{0}^{t}\mathbf{H}_{u}\,du\big\}\boldsymbol{v}_{t}=\int_{0}^{t}\exp\big\{-\int_{s}^{t}\mathbf{H}_{u}\,du\big\}\mathbf{H}_{s}\boldsymbol{\alpha}_{s}\,ds$. Substituting this in (91), we get

\begin{align}
\boldsymbol{x}_{t}=\boldsymbol{\alpha}_{t}-\int_{0}^{t}\exp\Big\{-\int_{s}^{t}\mathbf{H}_{u}\,du\Big\}\mathbf{H}_{s}\boldsymbol{\alpha}_{s}\,ds. \tag{234}
\end{align}

On the other hand, using the notation $\mathbf{M}_{t}=\exp\big\{\int_{0}^{t}\mathbf{H}_{u}\,du\big\}$ and the integration by parts formula for semimartingales, the second integral on the right-hand side of (92) can be rewritten as follows:

\begin{align}
\int_{0}^{t}\exp\Big\{-\int_{s}^{t}\mathbf{H}_{u}\,du\Big\}\,d\boldsymbol{\alpha}_{s} &= \mathbf{M}_{t}^{-1}\int_{0}^{t}\mathbf{M}_{s}\,d\boldsymbol{\alpha}_{s} \tag{235}\\
&= \mathbf{M}_{t}^{-1}\Big(\mathbf{M}_{t}\boldsymbol{\alpha}_{t}-\mathbf{M}_{0}\boldsymbol{\alpha}_{0}-\int_{0}^{t}d\mathbf{M}_{s}\,\boldsymbol{\alpha}_{s}\Big) \tag{236}\\
&= \boldsymbol{\alpha}_{t}-\exp\Big\{-\int_{0}^{t}\mathbf{H}_{u}\,du\Big\}\boldsymbol{\alpha}_{0} \tag{237}\\
&\qquad-\int_{0}^{t}\exp\Big\{-\int_{s}^{t}\mathbf{H}_{u}\,du\Big\}\mathbf{H}_{s}\boldsymbol{\alpha}_{s}\,ds. \tag{238}
\end{align}

Combining this equation with (234), we get the claim of the lemma. ∎

Lemma 8.

Let $A$ and $B$ be given positive numbers and $\{C_{k}\}_{k\in\mathbb{N}}$ be a given sequence of real numbers. Assume that the sequence $\{x_{k}\}_{k\in\mathbb{N}}$ satisfies the recursive inequality

\begin{align}
x_{k+1}\leq Ax_{k}^{2}+2Bx_{k}+C_{k},\qquad\forall k\in\mathbb{N}. \tag{239}
\end{align}

Then, for all $k\in\mathbb{N}$,

\begin{align}
x_{k}\leq\frac{1}{A}\big(Ax_{0}+B\big)^{2^{k}}\exp\bigg\{\sum_{j=0}^{k-1}2^{k-1-j}\,\frac{AC_{j}+B(1-B)}{(Ax_{0}+B)^{2^{j+1}}}\bigg\}. \tag{240}
\end{align}

Proof.

Let us introduce the sequences $\{y_{k}\}_{k\in\mathbb{N}}$ and $\{z_{k}\}_{k\in\mathbb{N}}$ defined by the relations $y_{0}=x_{0}$ ,

\begin{align}
y_{k+1} &= Ay_{k}^{2}+2By_{k}+C_{k}, \tag{241}\\
z_{k} &= (Ax_{0}+B)^{2^{k}}\exp\bigg\{\sum_{j=0}^{k-1}2^{k-1-j}\,\frac{AC_{j}+B(1-B)}{(Ax_{0}+B)^{2^{j+1}}}\bigg\}. \tag{242}
\end{align}

Using mathematical induction, one easily shows that the inequalities

\begin{align}
x_{k}\leq y_{k}\qquad\text{and}\qquad(Ax_{0}+B)^{2^{k}}\leq Ay_{k}+B\leq z_{k} \tag{243}
\end{align}

hold for every $k\in\mathbb{N}$. As a consequence, we get

\begin{align}
x_{k}\leq\frac{Ax_{k}+B}{A}\leq\frac{Ay_{k}+B}{A}\leq\frac{z_{k}}{A}. \tag{244}
\end{align}

This completes the proof of the lemma. ∎
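As a numerical illustration of (240), the sketch below iterates (239) with equality and compares each term with the stated bound. The function names and parameter values are illustrative assumptions, not taken from the paper; the values are chosen so that $x_{k}\geq 0$ and $AC_{j}+B(1-B)\geq 0$, and the number of steps is kept small because both sides grow doubly exponentially.

```python
import math


def lemma8_bound(k, x0, A, B, Cs):
    """Right-hand side of (240) for a finite list Cs = [C_0, C_1, ...]."""
    base = A * x0 + B
    exponent = sum(
        2 ** (k - 1 - j) * (A * Cs[j] + B * (1 - B)) / base ** (2 ** (j + 1))
        for j in range(k)
    )
    return base ** (2 ** k) * math.exp(exponent) / A


def iterate_lemma8(x0, A, B, Cs):
    """Iterate (239) with equality: x_{k+1} = A x_k^2 + 2 B x_k + C_k."""
    xs = [x0]
    for c in Cs:
        xs.append(A * xs[-1] ** 2 + 2 * B * xs[-1] + c)
    return xs


if __name__ == "__main__":
    A, B, x0 = 1.0, 0.5, 1.0  # illustrative values only
    Cs = [0.1] * 6
    for k, x in enumerate(iterate_lemma8(x0, A, B, Cs)):
        print(k, x, lemma8_bound(k, x0, A, B, Cs))
```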

Lemma 9.

Let $A,B,C,D$ be positive numbers satisfying $D<A<1$ and $\{x_{k}\}_{k\in\mathbb{N}}$ be a sequence of positive numbers satisfying the inequality

\begin{align}
x_{k+1}\leq\big((1-A)^{2}x_{k}^{2}+B^{2}\big)^{1/2}+C+Dx_{k}. \tag{245}
\end{align}

Then, for every $k\geq 0$, we have

\begin{align}
x_{k}\leq(1-A+D)^{k}x_{0}+\frac{C}{A-D}+\frac{B}{\sqrt{(A-D)(2-A-D)}}. \tag{246}
\end{align}

Proof.

We start by setting

$$
E=\frac{B}{\sqrt{(A-D)(2-A-D)}},\qquad F=C+(A-D)E
$$

and by defining a new sequence $\{y_{k}\}_{k\in\mathbb{N}}$ by $y_{0}=x_{0}+E$ and

$$
y_{k+1}=(1-A+D)y_{k}+F.
$$

Our goal is to prove that $y_{k}\geq x_{k}\vee E$ for every $k$ . This claim is clearly true for $k=0$ . Let us assume that it is true for the value $k$ and prove its validity for $k+1$ . Since the function $x\mapsto\sqrt{x^{2}+a^{2}}-x$ is decreasing, we have

\begin{align}
x_{k+1} &\leq \sqrt{(1-A)^{2}y_{k}^{2}+B^{2}}+C+Dy_{k} \tag{247}\\
&\leq (1-A+D)y_{k}+C+\sqrt{(1-A)^{2}y_{k}^{2}+B^{2}}-(1-A)y_{k} \tag{248}\\
&\leq (1-A+D)y_{k}+C+\sqrt{(1-A)^{2}E^{2}+B^{2}}-(1-A)E=y_{k+1}. \tag{249}
\end{align}

On the other hand,

\begin{align}
y_{k+1} &\geq (1-A+D)y_{k}+(A-D)E \tag{250}\\
&\geq (1-A+D)E+(A-D)E=E. \tag{251}
\end{align}

This implies, in particular, that $x_{k}\leq y_{k}$ for every $k\in\mathbb{N}$. Since $\{y_{k}\}$ satisfies a first-order linear recursion, we get $y_{k}=(1-A+D)^{k}y_{0}+F(1-(1-A+D)^{k})/(A-D)$. Since $y_{0}=x_{0}+E$ and $F/(A-D)=C/(A-D)+E$, this yields $y_{k}=(1-A+D)^{k}x_{0}+C(1-(1-A+D)^{k})/(A-D)+E\leq(1-A+D)^{k}x_{0}+C/(A-D)+E$, which is (246). ∎
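The bound (246) can be illustrated in the same spirit. The sketch below iterates (245) with equality and compares the result with the right-hand side of (246); the function names and parameter values are illustrative assumptions, chosen only to satisfy $D<A<1$.

```python
import math


def lemma9_bound(k, x0, A, B, C, D):
    """Right-hand side of (246)."""
    return ((1 - A + D) ** k * x0 + C / (A - D)
            + B / math.sqrt((A - D) * (2 - A - D)))


def iterate_lemma9(x0, A, B, C, D, n_steps):
    """Iterate (245) with equality."""
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        xs.append(math.sqrt((1 - A) ** 2 * x ** 2 + B ** 2) + C + D * x)
    return xs


if __name__ == "__main__":
    A, B, C, D, x0 = 0.2, 0.3, 0.1, 0.05, 2.0  # illustrative values only
    xs = iterate_lemma9(x0, A, B, C, D, 100)
    slack = min(lemma9_bound(k, x0, A, B, C, D) - x for k, x in enumerate(xs))
    print("smallest slack between the bound and the sequence:", slack)
```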

Acknowledgments

The work of AD was partially supported by the grant Investissements d’Avenir (ANR-11-IDEX-0003/Labex Ecodec/ANR-11-LABX-0047).

References

Alfonsi et al. [2014] Alfonsi, A., Jourdain, B., and Kohatsu-Higa, A. (2014). Pathwise optimal transport bounds between a one-dimensional diffusion and its Euler scheme. Ann. Appl. Probab., 24(3):1049–1080.
Alfonsi et al. [2015] Alfonsi, A., Jourdain, B., and Kohatsu-Higa, A. (2015). Optimal transport bounds between the time-marginals of a multidimensional diffusion and its Euler scheme. Electron. J. Probab., 20:31 pp.
Alquier et al. [2016] Alquier, P., Friel, N., Everitt, R., and Boland, A. (2016). Noisy Monte Carlo: convergence of Markov chains with approximate transition kernels. Statistics and Computing, 26(1):29–47.
Andrieu et al. [2016] Andrieu, C., Ridgway, J., and Whiteley, N. (2016). Sampling normalizing constants in high dimensions using inhomogeneous diffusions. ArXiv e-prints.
Bhattacharya [1978] Bhattacharya, R. N. (1978). Criteria for recurrence and existence of invariant measures for multidimensional diffusions. Ann. Probab., 6(4):541–553.
Brosse et al. [2017] Brosse, N., Durmus, A., and Moulines, É. (2017). Normalizing constants of log-concave densities. ArXiv e-prints.
Brosse et al. [2017] Brosse, N., Durmus, A., Moulines, É., and Pereyra, M. (2017). Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. In Kale, S. and Shamir, O., editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 319–342.
Bubeck et al. [2018] Bubeck, S., Eldan, R., and Lehec, J. (2018). Sampling from a log-concave distribution with projected Langevin Monte Carlo. Discrete & Computational Geometry, 59(4):757–783.
Chen et al. [2015] Chen, C., Ding, N., and Carin, L. (2015). On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Advances in Neural Information Processing Systems, pages 2278–2286.
Cheng and Bartlett [2018] Cheng, X. and Bartlett, P. (2018). Convergence of Langevin MCMC in KL-divergence. In Proceedings of ALT2018.
Cheng et al. [2017] Cheng, X., Chatterji, N. S., Bartlett, P. L., and Jordan, M. I. (2017). Underdamped Langevin MCMC: A non-asymptotic analysis. ArXiv e-prints.
Chong and Zak [2013] Chong, E. and Zak, S. (2013). An Introduction to Optimization. Wiley Series in Discrete Mathematics and Optimization. Wiley.
Dalalyan [2017a] Dalalyan, A. (2017a). Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Kale, S. and Shamir, O., editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 678–689.
Dalalyan [2017b] Dalalyan, A. S. (2017b). Theoretical guarantees for approximate sampling from a smooth and log-concave density. J. R. Stat. Soc. B, 79:651–676.
Durmus and Moulines [2016] Durmus, A. and Moulines, E. (2016). High-dimensional Bayesian inference via the Unadjusted Langevin Algorithm. ArXiv e-prints.
Durmus and Moulines [2017] Durmus, A. and Moulines, E. (2017). Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. Ann. Appl. Probab., 27(3):1551–1587.
Durmus et al. [2018] Durmus, A., Moulines, É., and Pereyra, M. (2018). Efficient Bayesian Computation by Proximal Markov Chain Monte Carlo: When Langevin Meets Moreau. SIAM Journal on Imaging Sciences, 11(1).
Griewank [1993] Griewank, A. (1993). Some bounds on the complexity of gradients, Jacobians, and Hessians. In Pardalos, P., editor, Complexity in Nonlinear Optimization, pages 128–161. World Scientific publishers.
Huggins and Zou [2017] Huggins, J. and Zou, J. (2017). Quantifying the accuracy of approximate diffusions and Markov chains. In Singh, A. and Zhu, J., editors, Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 382–391, Fort Lauderdale, FL, USA. PMLR.
Jarner and Hansen [2000] Jarner, S. F. and Hansen, E. (2000). Geometric ergodicity of Metropolis algorithms. Stochastic Process. Appl., 85(2):341–361.
Luu et al. [2017] Luu, T. D., Fadili, J., and Chesneau, C. (2017). Sampling from non-smooth distribution through Langevin diffusion. working paper or preprint.
Ma et al. [2018] Ma, Y.-A., Chen, Y., Jin, C., Flammarion, N., and Jordan, M. I. (2018). Sampling can be faster than optimization. arXiv preprint arXiv:1811.08413.
Nagapetyan et al. [2017] Nagapetyan, T., Duncan, A. B., Hasenclever, L., Vollmer, S. J., Szpruch, L., and Zygalakis, K. (2017). The True Cost of Stochastic Gradient Langevin Dynamics. ArXiv e-prints.
Nesterov [2004] Nesterov, Y. (2004). Introductory lectures on convex optimization, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA.
Raginsky et al. [2017] Raginsky, M., Rakhlin, A., and Telgarsky, M. (2017). Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Kale, S. and Shamir, O., editors, Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 1674–1703.
Roberts and Rosenthal [1998] Roberts, G. O. and Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. J. R. Stat. Soc. Ser. B Stat. Methodol., 60(1):255–268.
Roberts and Stramer [2002] Roberts, G. O. and Stramer, O. (2002). Langevin diffusions and Metropolis-Hastings algorithms. Methodol. Comput. Appl. Probab., 4(4):337–357 (2003).
Roberts and Tweedie [1996] Roberts, G. O. and Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363.
Rudin [1987] Rudin, W. (1987). Real and complex analysis. McGraw-Hill Book Co., New York, third edition.
Stramer and Tweedie [1999a] Stramer, O. and Tweedie, R. L. (1999a). Langevin-type models. I. Diffusions with given stationary distributions and their discretizations. Methodol. Comput. Appl. Probab., 1(3):283–306.
Stramer and Tweedie [1999b] Stramer, O. and Tweedie, R. L. (1999b). Langevin-type models. II. Self-targeting candidates for MCMC algorithms. Methodol. Comput. Appl. Probab., 1(3):307–328.
Teh et al. [2016] Teh, Y. W., Thiery, A. H., and Vollmer, S. J. (2016). Consistency and fluctuations for stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(7):1–33.
Vollmer and Zygalakis [2015] Vollmer, S. J. and Zygalakis, K. C. (2015). (Non-) asymptotic properties of Stochastic Gradient Langevin Dynamics. ArXiv e-prints.
Welling and Teh [2011] Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 681–688.
Wibisono et al. [2016] Wibisono, A., Wilson, A. C., and Jordan, M. I. (2016). A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47):E7351–E7358.
Xu et al. [2018] Xu, P., Chen, J., Zou, D., and Gu, Q. (2018). Global convergence of Langevin dynamics based algorithms for nonconvex optimization. Advances in Neural Information Processing Systems, pages 3126–3137.
