Non-geodesically-convex optimization in the Wasserstein space

Hoang Phuc Hau Luu, Hanlin Yu, Bernardo Williams, Petrus Mikkola, Marcelo Hartmann, Kai Puolamäki, Arto Klami
Department of Computer Science, University of Helsinki

Abstract

We study a class of optimization problems in the Wasserstein space (the space of probability measures) where the objective function is nonconvex along generalized geodesics. When the regularization term is the negative entropy, the optimization problem becomes a sampling problem: it minimizes the Kullback-Leibler divergence between a probability measure (the optimization variable) and a target probability measure whose logarithmic probability density is a nonconvex function. We derive multiple convergence insights for a novel semi Forward-Backward Euler scheme under several nonconvex (and possibly nonsmooth) regimes. Notably, the semi Forward-Backward Euler is just a slight modification of the Forward-Backward Euler, whose convergence is, to our knowledge, still unknown in our very general non-geodesically-convex setting.

1 Introduction

Sampling and optimization are intertwined. For example, the (overdamped) Langevin dynamics, typically considered a sampling algorithm, can be viewed as gradient descent in which a suitable amount of Gaussian noise is injected at each step. There are also deeper connections. In the limit of infinitesimal stepsize, the law of the Langevin dynamics is governed by the Fokker-Planck equation describing a diffusion over time of probability measures. In the seminal paper [35], Jordan, Kinderlehrer, and Otto reinterpreted the Fokker-Planck equation as the gradient flow of the relative entropy functional, a.k.a. Kullback-Leibler (KL) divergence, in the (Wasserstein) space of finite second-moment probability measures equipped with the Wasserstein metric. The discovery connects the two fields and encourages optimization in the Wasserstein space, even conceptually, as it directly gives insight into the sampling context. Studies in continuous-time dynamics [21, 12, 58, 29] seem natural and enjoy nice theoretical properties without discretization error. Another line of research studies discretization of Wasserstein gradient flow by either quantifying the discretization error between the continuous-time flow and the discrete-time flow [35, 59, 26, 24, 27] or viewing discrete-time flows as iterative optimization schemes in the Wasserstein space [57, 25, 62, 10] where the primary focus is on (geodesically) convex optimization problems.
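To make the "gradient descent plus Gaussian noise" reading of Langevin dynamics concrete, the following is a minimal one-dimensional sketch of the unadjusted Langevin algorithm. The double-well potential $F(x) = x^4/4 - x^2/2$ (so the target $\propto \exp(-F)$ is non-log-concave), the step size, and all function names are illustrative assumptions, not taken from the paper.

```python
import math
import random

def grad_F(x):
    """Gradient of the illustrative double-well potential F(x) = x**4/4 - x**2/2."""
    return x ** 3 - x

def ula(grad, x0, step, n_steps, rng):
    """Unadjusted Langevin algorithm: a gradient step on F plus N(0, 2*step) noise."""
    x, chain = x0, []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        # Gradient descent step on the potential, then inject Gaussian noise.
        x = x - step * grad(x) + math.sqrt(2.0 * step) * noise
        chain.append(x)
    return chain

rng = random.Random(0)
chain = ula(grad_F, x0=0.0, step=0.01, n_steps=100_000, rng=rng)
samples = chain[1_000:]  # discard a short burn-in
second_moment = sum(x * x for x in samples) / len(samples)
print(f"empirical E[x^2] under the ULA chain: {second_moment:.3f}")
```

Setting the noise term to zero recovers plain gradient descent on $F$, which is the sense in which the two procedures differ only by the injected noise.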

Nonconvex, nonsmooth optimization is challenging, even in Euclidean space, quoting Rockafellar [55]: “In fact the great watershed in optimization isn’t between linearity and nonlinearity, but convexity and nonconvexity.” The landscape of nonconvex problems is mostly underexplored in the Wasserstein space. In the sampling language, it amounts to sampling from a non-log-concave and possibly non-log-Lipschitz-smooth target distribution. Recently, Balasubramanian et al. [8] advocated the need for a sound theory for non-log-concave sampling and provided some guarantees for the unadjusted Langevin algorithm (ULA) in sampling from log-smooth (Lipschitz/Hölder smooth) densities. These results are preliminary for the ULA (and its stochastic/smoothing variants) with a specific class of densities (smooth). Theoretical understandings of other classes of algorithms and densities are needed.

We approach the subject through the lens of nonconvex optimization in the space of probability distributions and pose discretized Wasserstein gradient flows as iterative minimization algorithms. This allows us to, on the one hand, use and extend tools from classical nonconvex optimization and, on the other hand, derive more connections between sampling and optimization.

We study the following non-geodesically-convex optimization problem defined over the space $\mathcal{P}_2(X)$ of probability measures $\mu$ over $X=\mathbb{R}^d$ with finite second moment, i.e., $\int \|x\|^2 d\mu(x) < +\infty$,

$$\min_{\mu\in\mathcal{P}_2(X)} \mathcal{F}(\mu) := \mathcal{E}_F(\mu) + \mathscr{H}(\mu) := \mathcal{E}_{G-H}(\mu) + \mathscr{H}(\mu) \tag{1}$$

where $F: X\to\mathbb{R}$ is a nonconvex function which can be represented as a difference of two convex functions $G$ and $H$, $\mathcal{E}_F(\mu) := \int F(x) d\mu(x)$ is the potential energy, and $\mathscr{H}: \mathcal{P}_2(X)\to\mathbb{R}\cup\{+\infty\}$ plays the role of a regularizer, which is assumed to be convex along generalized geodesics. Informally, these are curves connecting two points and are “straight” from the view of a third point.

Why difference-of-convex structure?

Nonconvexity lies at the difference-of-convex (DC) structure $F=G-H$, where $G$ and $H$ are called the first and second DC components, respectively. $F$ being nonconvex implies $\mathcal{E}_F$ being non-geodesically-convex in general. First, it is well-known that the class of DC functions is very rich, and DC structures are present everywhere in real-world applications [52, 39, 40, 1, 23, 48, 50]. Weakly convex and Lipschitz smooth (L-smooth) functions are two subclasses of the class of DC functions. Furthermore, any continuous function can be approximated by a sequence of DC functions over a compact, convex domain [7]. Second, the class of DC functions still retains some structural information, making the extension of convex analysis possible [52]. Geometric characteristics of subdifferentials of DC components help define stationarity concepts that are more practical than analysis-flavored concepts like Fréchet or Clarke stationarity [22] where a structure is missing. Such structural information is crucial in the context of classical DC programming [52] and in our analysis in Wasserstein space using tools from optimal transport.
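As a worked instance of the claim that L-smooth and weakly convex functions are DC, the sketch below writes out one explicit decomposition. The constants $L$ and $\rho$ and this particular choice of components are illustrative; a DC decomposition is never unique.

```latex
% Assumption: \nabla F is L-Lipschitz on X = \mathbb{R}^d.
% Then \langle \nabla F(x) - \nabla F(y), x - y \rangle \ge -L\|x-y\|^2,
% so x \mapsto F(x) + \tfrac{L}{2}\|x\|^2 has a monotone gradient and is convex.
\[
F(x) \;=\; \underbrace{F(x) + \tfrac{L}{2}\|x\|^2}_{G(x)\ \text{(convex)}}
\;-\; \underbrace{\tfrac{L}{2}\|x\|^2}_{H(x)\ \text{(convex)}} .
\]
% For a \rho-weakly convex F, the function F + \tfrac{\rho}{2}\|x\|^2 is convex
% by definition, so the same decomposition holds with L replaced by \rho.
```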

Context

Many problems in machine learning and sampling fall into the spectrum of problem (1). The regularizer $\mathscr{H}$ can be the internal energy [3, Sect. 10.4.3]. Under McCann's condition, the internal energy is convex along generalized geodesics [3, Prop. 9.3.9]. In particular, the negative entropy, $\mathscr{H}(\mu)=\int \log(\mu(x)) d\mu(x)$ if $\mu$ is absolutely continuous w.r.t. Lebesgue measure, $+\infty$ otherwise, is a special case of internal energy satisfying McCann's condition. In the latter case, $\mathcal{F}(\mu)=\operatorname{KL}(\mu\|\mu^*)+\text{const}$ where $\mu^*(x)\propto\exp(-F(x))$, and the optimization problem reduces to a sampling problem with log-DC target distribution. In the context of infinitely wide two-layer neural networks and Maximum Mean Discrepancy [43, 6, 21], let $\mu^*$ be the optimal distribution over a network’s parameters and $k$ be a given kernel; the regularizer is then the interaction energy $\mathscr{H}(\mu)=\int\int k(x,y) d\mu(x) d\mu(y)$ and $F(x)=-2\int k(x,y) d\mu^*(y).$ 
In general, $\mathscr{H}$ is not convex along generalized geodesics and $F$ is nonconvex but not necessarily DC. However, when the kernel has Lipschitz gradient (as the case considered in [6]), we can adjust both $\mathscr{H}$ and $F$ as $\mathscr{H}(\mu)=\int\int \left(k(x,y)+\alpha\|x\|^2+\alpha\|y\|^2\right) d\mu(x) d\mu(y)$ and $F(x)=-2\int k(x,y) d\mu^*(y)-2\alpha\|x\|^2$ for some $\alpha>0$ making $\mathscr{H}$ generalized geodesically convex and $F$ concave (hence DC); see Appx. A.2.
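For the negative-entropy case mentioned in this paragraph, the identity $\mathcal{F}(\mu)=\operatorname{KL}(\mu\|\mu^*)+\text{const}$ can be checked in a few lines. The normalizing constant $Z$ is a symbol introduced here for illustration and is assumed to be finite.

```latex
% Assumption: Z := \int_X e^{-F(x)}\,dx < +\infty, so that \mu^*(x) = e^{-F(x)}/Z.
% For \mu absolutely continuous w.r.t. Lebesgue measure:
\begin{align*}
\operatorname{KL}(\mu\|\mu^*)
  &= \int \log\frac{\mu(x)}{\mu^*(x)}\,d\mu(x) \\
  &= \int \log(\mu(x))\,d\mu(x) + \int F(x)\,d\mu(x) + \log Z \\
  &= \mathscr{H}(\mu) + \mathcal{E}_F(\mu) + \log Z
   \;=\; \mathcal{F}(\mu) + \log Z .
\end{align*}
% Hence \mathcal{F}(\mu) = \operatorname{KL}(\mu\|\mu^*) - \log Z, and the constant is -\log Z.
```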

Our idea is to minimize (1) in the space of probability distributions by discretization of the gradient flow of $\mathcal{F}$, leveraging the JKO (Jordan, Kinderlehrer, and Otto) operator (2). In the previous work [62], this has been done with the Forward-Backward (FB) Euler discretization, but it lacks convergence analysis. Recently, Salim et al. [57] studied FB Euler, but their results do not apply here because $F$ is nonconvex and possibly nonsmooth. Further leveraging the DC structure of $F$ and inspired by classical DC programming literature [52], we subtly modify the FB Euler to give rise to a scheme named semi FB Euler that enjoys major theoretical advantages, as we can provide a wide range of convergence analysis. A detailed discussion is in Sect. 3.1.

Our contributions

To our knowledge, no prior work studies problem (1) when $F$ is DC. Therefore, most of the derived results in this paper are novel. We propose and analyze the semi FB Euler scheme (4) and provide the following hierarchical set of new insights:

  • Thm. 1

    We show that if $H$ is continuously differentiable, every cluster point of the sequence of distributions $\{\mu_n\}_{n\in\mathbb{N}}$ generated by semi FB Euler is a critical point of $\mathcal{F}$. Note that criticality is a notion from the DC programming literature [52] and it is a necessary condition for local optimality; see Sect. 3.3.

  • Thm. 2

    We provide a convergence rate of $O(N^{-1})$ in terms of the Wasserstein (sub)gradient mapping in the general non-smooth setting. Again, the notion of gradient mapping [30, 34, 47] comes from the context of proximal algorithms in Euclidean space and is applicable to nonconvex programs, where the distance to a global solution is, in general, not possible to work out.

  • Thm. 3

    Under the extra assumption that $H$ is continuously twice differentiable and has bounded Hessian, we provide a convergence rate of $O(N^{-\frac{1}{2}})$ in terms of distance of $0$ to the Fréchet subdifferential of $\mathcal{F}$. One can think of this as convergence rate to Fréchet stationarity, i.e., if $\mu^*$ is a Fréchet stationary point of $\mathcal{F}$, then, by definition, $0$ is in the Fréchet subdifferential of $\mathcal{F}$ at $\mu^*$. Fréchet stationarity is a relatively sharp necessary condition for local optimality.

  • Thm. 4, 5

    Under the assumptions of Thm. 3 and additionally $\mathcal{F}$ satisfying the Łojasiewicz-type inequality for some Łojasiewicz exponent $\theta\in[0,1)$, we show that $\{\mu_n\}_{n\in\mathbb{N}}$ is a Cauchy sequence under Wasserstein topology, and thanks to the completeness of the Wasserstein space, the whole sequence $\{\mu_n\}_{n\in\mathbb{N}}$ converges to some $\mu^*$. We show that $\mu^*$ is in fact a global minimizer to $\mathcal{F}$. Furthermore, we provide convergence rates of $\mu_n\to\mu^*$ in three different regimes ($W_2$ denotes the Wasserstein metric): (1) if $\theta=0$, $W_2(\mu_n,\mu^*)$ converges to $0$ after a finite number of steps; (2) if $\theta\in(0,1/2]$, both $\mathcal{F}(\mu_n)-\mathcal{F}(\mu^*)$ and $W_2(\mu_n,\mu^*)$ converge to $0$ exponentially fast; (3) if $\theta\in(1/2,1)$, both $\mathcal{F}(\mu_n)-\mathcal{F}(\mu^*)$ and $W_2(\mu_n,\mu^*)$ converge sublinearly to $0$ with rates $O\left(n^{-\frac{1}{2\theta-1}}\right)$ and $O\left(n^{-\frac{1-\theta}{2\theta-1}}\right)$, respectively. When $\mathscr{H}$ is the negative entropy, $\mathcal{F}(\mu_n)-\mathcal{F}(\mu^*)=\operatorname{KL}(\mu_n\|\mu^*)$; therefore, in the sampling context, we provide convergence guarantees in both Wasserstein and KL distances. See Sect. 4.3 for additional observations and implications.

2 Preliminaries

2.1 Notations and basic results in measure theory and functional analysis

Let $X=\mathbb{R}^d$. We denote by $\mathcal{B}(X)$ the Borel $\sigma$-algebra over $X$ and by $\mathscr{L}^d$ the Lebesgue measure on $X$. $\mathcal{P}(X)$ is the set of Borel probability measures on $X$. For $\mu\in\mathcal{P}(X)$, we denote its second-order moment by $\mathfrak{m}_2(\mu):=\int_X \|x\|^2 d\mu(x)$, where $\mathfrak{m}_2(\mu)$ can be infinity. $\mathcal{P}_2(X)\subset\mathcal{P}(X)$ denotes the set of probability measures with finite second-order moment. $\mathcal{P}_{2,\operatorname{abs}}(X)\subset\mathcal{P}_2(X)$ is the set of measures that are absolutely continuous w.r.t. $\mathscr{L}^d$. $\mu$-a.e. stands for almost everywhere w.r.t. $\mu$.

$C^p(X)$, $C_c^\infty(X)$, $C_b(X)$ are the classes of $p$-times continuously differentiable functions, infinitely differentiable functions with compact support, and bounded continuous functions, respectively.

From functional analysis [20], for each $p\geq 1$, $L^p(X,\mu)$ denotes the Banach space of measurable (where measurable is understood as Borel measurable from now on) functions $f$ such that $\int_X |f(x)|^p d\mu(x)<+\infty$. We shall consider an element of $L^p(X,\mu)$ as an equivalence class of functions that agree $\mu$-a.e. on $X$ rather than a sole function. The norm of $f\in L^p(X,\mu)$ is $\|f\|_{L^p(X,\mu)}=(\int_X |f(x)|^p d\mu(x))^{1/p}$. 
When $p=2$, $L^2(X,\mu)$ is actually a Hilbert space with the inner product $\langle f,g\rangle_{L^2(X,\mu)}=\int_X f(x)g(x) d\mu(x)$ which induces the mentioned norm. These results can be extended to vector-valued functions. In particular, we denote by $L^2(X,X,\mu)$ the Hilbert space of $\xi: X\to X$ for which $\|\xi\|\in L^2(X,\mu)$. Its norm is $\|\xi\|_{L^2(X,X,\mu)}:=(\int_X \|\xi(x)\|^2 d\mu(x))^{1/2}$.

We say that $f: X\to\mathbb{R}$ has quadratic growth if there exists $a>0$ such that $|f(x)|\leq a(\|x\|^2+1)$ for all $x\in X$. It is clear that if $f$ has quadratic growth and $\mu\in\mathcal{P}_2(X)$, then $f\in L^1(X,\mu).$

The pushforward of a measure $\mu\in\mathcal{P}(X)$ through a Borel map $T: X\to\mathbb{R}^m$, denoted by $T_\#\mu$, is defined by $(T_\#\mu)(A):=\mu(T^{-1}(A))$ for every Borel set $A\subset\mathbb{R}^m.$
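The pushforward definition can be illustrated on a finitely supported measure, where it amounts to moving the mass at each atom $x$ to $T(x)$. The sketch below is only an illustration: the dictionary representation of a measure, the map $T(x)=x^2$, and the helper name `pushforward` are assumptions made for this example.

```python
from collections import defaultdict

def pushforward(mu, T):
    """Push a finitely supported measure mu (dict: atom -> mass) through the map T."""
    nu = defaultdict(float)
    for x, mass in mu.items():
        nu[T(x)] += mass  # all mass sitting at x is moved to T(x)
    return dict(nu)

def T(x):
    """Illustrative (non-injective) Borel map."""
    return x * x

mu = {-1.0: 0.25, 0.0: 0.25, 1.0: 0.5}
nu = pushforward(mu, T)

# Check the defining identity (T_# mu)(A) = mu(T^{-1}(A)) on the set A = {1.0}.
A = {1.0}
mass_via_preimage = sum(mass for x, mass in mu.items() if T(x) in A)
mass_via_pushforward = sum(mass for y, mass in nu.items() if y in A)
print(nu, mass_via_preimage, mass_via_pushforward)
```

Because $T$ is not injective here, the atoms $-1$ and $1$ are merged into a single atom of $T_\#\mu$, which is exactly what the preimage in the definition accounts for.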

2.2 Optimal transport [3, 4, 61, 60]

Given $\mu,\nu\in\mathcal{P}(X)$, the principal problem in optimal transport is to find a transport map $T$ pushing $\mu$ to $\nu$, i.e., $T_\#\mu=\nu$, in the most cost-efficient way, i.e., minimizing $\|x-T(x)\|^2$ on $\mu$-average. Monge’s formulation for this problem is $\inf_{T: T_\#\mu=\nu}\int_X \|x-T(x)\|^2 d\mu(x)$, where the optimal solution, if it exists, is denoted by $T_\mu^\nu$ and called the optimal (Monge) map. Monge’s problem can be ill-posed, e.g., no such $T_\mu^\nu$ exists when $\mu$ is a Dirac mass and $\nu$ is absolutely continuous [4].

By relaxing Monge’s formulation, Kantorovich considers $\min_{\gamma\in\Gamma(\mu,\nu)}\int_{X\times X}\|x-y\|^2 d\gamma(x,y)$, where $\Gamma(\mu,\nu)$ denotes the set of probabilities over $X\times X$ whose marginals are $\mu$ and $\nu$, i.e., $\gamma\in\Gamma(\mu,\nu)$ iff ${\operatorname{proj}_1}_{\#}\gamma=\mu$, ${\operatorname{proj}_2}_{\#}\gamma=\nu$ where $\operatorname{proj}_1,\operatorname{proj}_2$ are the projections onto the first $X$ space and the second $X$ space, respectively. Such $\gamma$ is called a plan. Kantorovich’s formulation is well-posed because $\Gamma(\mu,\nu)$ is non-empty (at least $\mu\times\nu\in\Gamma(\mu,\nu)$) and the $\operatorname*{arg\,min}$ element actually exists (see [4, Sect. 2.2]). 
The set of optimal plans between $\mu$ and $\nu$ is denoted by $\Gamma_o(\mu,\nu).$ In terms of random variables, any pair $(X,Y)$ with $X\sim\mu$, $Y\sim\nu$ is called a coupling of $\mu$ and $\nu$, and it is called an optimal coupling if the joint law of $X$ and $Y$ is in $\Gamma_o(\mu,\nu)$.
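On tiny discrete examples the Kantorovich problem can be solved by brute force. For two uniform empirical measures with the same number of atoms, an optimal plan can be chosen to be a permutation (a consequence of the Birkhoff-von Neumann theorem), so it suffices to enumerate permutations. The point sets and function names below are hypothetical and chosen only for illustration; this approach does not scale beyond a handful of atoms.

```python
from itertools import permutations

def sq_dist(x, y):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def optimal_matching(xs, ys):
    """Brute-force Kantorovich cost between uniform measures on xs and ys (equal sizes)."""
    n = len(xs)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Plan that sends mass 1/n from xs[i] to ys[perm[i]].
        cost = sum(sq_dist(xs[i], ys[perm[i]]) for i in range(n)) / n
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm

xs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
ys = [(1.0, 1.0), (0.0, 2.0), (2.0, 0.0)]
optimal_cost, perm = optimal_matching(xs, ys)

# Cost of the (generally suboptimal) product plan mu x nu, which is always admissible.
product_cost = sum(sq_dist(x, y) for x in xs for y in ys) / (len(xs) * len(ys))
print(optimal_cost, perm, product_cost)
```

The product plan is the admissible plan used in the text to show that $\Gamma(\mu,\nu)$ is non-empty; comparing its cost with the optimal one shows how much an optimal coupling can save.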

In $\mathcal{P}_2(X)$, the $\min$ value in Kantorovich’s problem specifies a valid metric referred to as Wasserstein distance, $W_2(\mu,\nu)=(\int_{X\times X}\|x-y\|^2 d\gamma(x,y))^{1/2}$ for some, and thus all, $\gamma\in\Gamma_o(\mu,\nu)$. The metric space $(\mathcal{P}_2(X),W_2)$ is then called the Wasserstein space. 
In $\mathcal{P}_2(X)$, besides the convergence notion induced by the Wasserstein metric, there is a weaker notion of convergence called narrow convergence: we say a sequence $\{\mu_n\}_{n\in\mathbb{N}}\subset\mathcal{P}_2(X)$ converges narrowly to $\mu\in\mathcal{P}_2(X)$ if $\int_X \phi(x) d\mu_n(x)\to\int_X \phi(x) d\mu(x)$ for all $\phi\in C_b(X).$ Convergence in the Wasserstein metric implies narrow convergence but the converse is not necessarily true. The extra condition to make it true is $\mathfrak{m}_2(\mu_n)\to\mathfrak{m}_2(\mu)$. We denote Wasserstein and narrow convergence by $\xrightarrow{\text{Wass}}$ and $\xrightarrow{\text{narrow}}$, respectively.
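A standard way to see that narrow convergence is strictly weaker than Wasserstein convergence is to let a vanishing amount of mass escape to infinity. The sequence below is an illustrative example constructed for this purpose (it does not appear in the source); $e$ denotes an arbitrary unit vector.

```latex
% Illustrative sequence (e is any unit vector in X = \mathbb{R}^d):
\[
\mu_n := \Bigl(1-\tfrac{1}{n}\Bigr)\delta_0 + \tfrac{1}{n}\,\delta_{\sqrt{n}\,e}.
\]
% Narrow convergence to \delta_0: for every \phi \in C_b(X),
\[
\int_X \phi\,d\mu_n = \Bigl(1-\tfrac{1}{n}\Bigr)\phi(0) + \tfrac{1}{n}\,\phi(\sqrt{n}\,e)
\;\longrightarrow\; \phi(0) = \int_X \phi\,d\delta_0 .
\]
% No Wasserstein convergence: the only plan between \mu_n and \delta_0 is \mu_n \times \delta_0, so
\[
W_2^2(\mu_n,\delta_0) = \int_X \|x\|^2\,d\mu_n(x) = \tfrac{1}{n}\cdot n = 1
\quad\text{for all } n,
\qquad
\mathfrak{m}_2(\mu_n) = 1 \not\to 0 = \mathfrak{m}_2(\delta_0).
\]
```

The second moments fail to converge, which is precisely the extra condition identified in the text.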

If $\mu\in\mathcal{P}_{2,\operatorname{abs}}(X)$, $\nu\in\mathcal{P}_2(X)$, Monge’s formulation is well-posed and the unique ($\mu$-a.e.) solution exists, and in this case, it is safe to talk about (and use) the optimal transport map $T_\mu^\nu$. Moreover, there exists some convex function $f$ such that $T_\mu^\nu=\nabla f$ $\mu$-a.e. Kantorovich’s problem also has a unique solution $\gamma$ and it is given by $\gamma=(I,T_\mu^\nu)_\#\mu$ where $I$ is the identity map. This is known as Brenier's theorem or the polar factorization theorem [18].
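A case where the Brenier map is available in closed form is transport between one-dimensional Gaussians: the optimal map from $N(m_\mu,\sigma_\mu^2)$ to $N(m_\nu,\sigma_\nu^2)$ is the increasing affine map $T(x)=m_\nu+(\sigma_\nu/\sigma_\mu)(x-m_\mu)$, which is the derivative of the convex function $f(x)=m_\nu x+\frac{\sigma_\nu}{2\sigma_\mu}(x-m_\mu)^2$, and $W_2^2=(m_\mu-m_\nu)^2+(\sigma_\mu-\sigma_\nu)^2$. The following Monte Carlo sketch checks this numerically; the parameter values, sample size, and function names are illustrative assumptions.

```python
import math
import random

def brenier_map_gaussian_1d(m_mu, s_mu, m_nu, s_nu):
    """Increasing affine map pushing N(m_mu, s_mu^2) onto N(m_nu, s_nu^2)."""
    def T(x):
        return m_nu + (s_nu / s_mu) * (x - m_mu)
    return T

m_mu, s_mu, m_nu, s_nu = 0.0, 1.0, 2.0, 3.0
T = brenier_map_gaussian_1d(m_mu, s_mu, m_nu, s_nu)

rng = random.Random(0)
xs = [rng.gauss(m_mu, s_mu) for _ in range(100_000)]  # samples from mu
ys = [T(x) for x in xs]                               # samples from T_# mu

# Empirical mean and standard deviation of the pushforward samples.
mean_y = sum(ys) / len(ys)
std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / len(ys))

# Monte Carlo estimate of the transport cost of the plan (I, T)_# mu.
mc_cost = sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
w2_sq_closed_form = (m_mu - m_nu) ** 2 + (s_mu - s_nu) ** 2
print(mean_y, std_y, mc_cost, w2_sq_closed_form)
```

The pairs `(x, T(x))` are samples from the plan $\gamma=(I,T_\mu^\nu)_\#\mu$ described in the text, so their average squared distance estimates $W_2^2(\mu,\nu)$.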

2.3 Subdifferential calculus in the Wasserstein space

Apart from being a metric space, $(\mathcal{P}_2(X), W_2)$ also enjoys some pre-Riemannian structure that makes subdifferential calculus on it possible. Let us keep the picture of a manifold in mind. Firstly, the tangent space [3] of $\mathcal{P}_2(X)$ at $\mu$ is $\operatorname{Tan}_\mu \mathcal{P}_2(X) := \overline{\{\nabla\psi : \psi \in C_c^\infty(X)\}}^{L^2(X,X,\mu)}$, where the closure is w.r.t. the $L^2(X,X,\mu)$-topology. Intuitively, for $\psi \in C_c^\infty(X)$, $I + \epsilon\nabla\psi$ is an optimal transport map if $\epsilon > 0$ is small enough [36], so $\nabla\psi$ plays the role of a "tangent vector".

Let $\phi : \mathcal{P}_2(X) \to \mathbb{R}\cup\{+\infty\}$, we denote $\operatorname{dom}(\phi) = \{\mu \in \mathcal{P}_2(X) : \phi(\mu) < +\infty\}$. Let $\mu \in \operatorname{dom}(\phi)$, we say that a map $\xi \in L^2(X,X,\mu)$ belongs to the Fréchet subdifferential [15, 36] $\partial_F^-\phi(\mu)$ if $\phi(\nu) - \phi(\mu) \geq \sup_{\gamma\in\Gamma_o(\mu,\nu)} \int_{X\times X} \langle \xi(x), y - x\rangle\, d\gamma(x,y) + o(W_2(\mu,\nu))$ for all $\nu \in \mathcal{P}_2(X)$, where the little-o notation means $\lim_{s\to 0} o(s)/s = 0$.
If $\partial_F^-\phi(\mu) \neq \emptyset$, we say $\phi$ is Fréchet subdifferentiable at $\mu$. We also denote $\operatorname{dom}(\partial_F^-\phi) = \{\mu\in\mathcal{P}_2(X) : \partial_F^-\phi(\mu) \neq\emptyset\}$.

Similarly, we say that $\xi \in L^2(X,X,\mu)$ belongs to the (Fréchet) superdifferential $\partial_F^+\phi(\mu)$ of $\phi$ at $\mu$ if $-\xi \in \partial_F^-(-\phi)(\mu)$. In other words, $\partial_F^-(-\phi)(\mu) = -\partial_F^+\phi(\mu)$.

We say $\phi$ is Wasserstein differentiable [15, 36] at $\mu\in\operatorname{dom}(\phi)$ if $\partial_F^-\phi(\mu)\cap\partial_F^+\phi(\mu)\neq\emptyset$. We call an element of the intersection, denoted by $\nabla_W\phi(\mu)$, a Wasserstein gradient of $\phi$ at $\mu$, and it holds that $\phi(\nu)-\phi(\mu) = \int_{X\times X}\langle\nabla_W\phi(\mu)(x), y-x\rangle\,d\gamma(x,y) + o(W_2(\mu,\nu))$ for all $\nu\in\mathcal{P}_2(X)$ and any $\gamma\in\Gamma_o(\mu,\nu)$.
The Wasserstein gradient is not unique in general, but its parallel component in $\operatorname{Tan}_\mu\mathcal{P}_2(X)$ is unique, and this parallel component is again a valid Wasserstein gradient as the orthogonal component plays no role in the above definitions, i.e., if $\xi^\perp\in\operatorname{Tan}_\mu\mathcal{P}_2(X)^\perp$, it holds that $\int_{X\times X}\langle\xi^\perp(x), y-x\rangle\,d\gamma(x,y) = 0$ for any $\nu\in\mathcal{P}_2(X)$ and $\gamma\in\Gamma_o(\mu,\nu)$ [36, Prop. 2.5]. We may refer to this parallel component as the unique Wasserstein gradient of $\phi$ at $\mu$.

2.4 Optimization in the Wasserstein space

A function $\phi:\mathcal{P}_2(X)\to\mathbb{R}\cup\{+\infty\}$ is called proper if $\operatorname{dom}(\phi)\neq\emptyset$, while it is called lower semicontinuous (l.s.c) if for any sequence $\mu_n\xrightarrow{\text{Wass}}\mu$, it holds $\liminf_n\phi(\mu_n)\geq\phi(\mu)$.

We next recall (a simplified version of) generalized geodesic convexity.

Definition 1.

[57] Let $\phi:\mathcal{P}_2(X)\to\mathbb{R}\cup\{+\infty\}$. We say $\phi$ is convex along generalized geodesics if $\forall\mu,\pi\in\mathcal{P}_2(X)$, $\forall\nu\in\mathcal{P}_{2,\operatorname{abs}}(X)$, $\phi((tT_\nu^\mu+(1-t)T_\nu^\pi)_\#\nu)\leq t\phi(\mu)+(1-t)\phi(\pi)$, $\forall t\in[0,1]$.

The curve $t\mapsto(tT_\nu^\mu+(1-t)T_\nu^\pi)_\#\nu$ (called a generalized geodesic) interpolates from $\pi$ to $\mu$ as $t$ runs from $0$ to $1$. The definition says that $\phi$ is convex along these curves. If $\mu\in\mathcal{P}_{2,\operatorname{abs}}(X)$ and $\nu=\mu$, the curve is a geodesic in $(\mathcal{P}_2(X),W_2)$. If the definition is relaxed to the class of geodesics only, we say that $\phi$ is convex along geodesics.
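For one-dimensional Gaussians the generalized geodesic has a closed form, which gives a quick numerical check of Definition 1. The sketch below assumes the base measure $\nu = \mathcal{N}(0,1)$ and the endpoints $\mu = \mathcal{N}(m_\mu, s_\mu^2)$, $\pi = \mathcal{N}(m_\pi, s_\pi^2)$; these choices and all function names are illustrative assumptions. The optimal maps are then $T_\nu^\mu(x) = m_\mu + s_\mu x$ and $T_\nu^\pi(x) = m_\pi + s_\pi x$, so the curve stays Gaussian with linearly interpolated mean and standard deviation, and the negative entropy $-\tfrac12\log(2\pi e s^2)$ can be evaluated along it.

```python
import numpy as np

def neg_entropy_gaussian(s):
    """Negative entropy of N(m, s^2): integral of rho * log(rho)."""
    return -0.5 * np.log(2.0 * np.pi * np.e * s**2)

def generalized_geodesic(t, m_mu, s_mu, m_pi, s_pi):
    """Mean and standard deviation of (t*T_nu^mu + (1-t)*T_nu^pi)_# nu for nu = N(0, 1)."""
    return t * m_mu + (1.0 - t) * m_pi, t * s_mu + (1.0 - t) * s_pi

m_mu, s_mu = 2.0, 0.3   # endpoint reached at t = 1
m_pi, s_pi = -1.0, 4.0  # endpoint reached at t = 0

ts = np.linspace(0.0, 1.0, 101)
_, s_t = generalized_geodesic(ts, m_mu, s_mu, m_pi, s_pi)

lhs = neg_entropy_gaussian(s_t)
rhs = ts * neg_entropy_gaussian(s_mu) + (1.0 - ts) * neg_entropy_gaussian(s_pi)
convexity_gap = rhs - lhs  # nonnegative when the convexity inequality holds
```

In this special case the inequality reduces to the concavity of the logarithm; the general statement for the negative entropy is the one cited in the text.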

An important characterization of Fréchet subdifferential of a geodesically convex function is that we can drop the little-o notation in its definition in Sect. 2.3 [3, Sect 10.1.1]. As a convention, for a geodesically convex function $\phi$, the Fréchet subdifferential $\partial_F^-$ will be simply written as $\partial$.

First-order optimality conditions

Let $\phi:\mathcal{P}_2(X)\to\mathbb{R}\cup\{+\infty\}$ be a proper function. $\mu^*\in\mathcal{P}_2(X)$ is a global minimizer of $\phi$ if $\phi(\mu^*)\leq\phi(\mu)$, $\forall\mu\in\mathcal{P}_2(X)$. For local optimality, we shall use the Wasserstein metric to define neighborhoods. $\mu^*\in\mathcal{P}_2(X)$ is a local minimizer if there exists $r>0$ such that $\phi(\mu^*)\leq\phi(\mu)$ for all $\mu: W_2(\mu,\mu^*)<r$.
We shall denote $B(\mu^*,r):=\{\mu\in\mathcal{P}_2(X):W_2(\mu,\mu^*)<r\}$ the (open) Wasserstein ball centered at $\mu^*$ with radius $r$. If we replace $<$ by $\leq$ we obtain the notion of a closed Wasserstein ball.

We call $\mu^*$ a Fréchet stationary point of $\phi$ if $0\in\partial_F^-\phi(\mu^*)$. Fréchet stationarity is a necessary condition for local optimality. In other words, if $\mu^*$ is a local minimizer, it is a Fréchet stationary point (Lem. 5 in Appendix). In addition, if $\phi$ is Wasserstein differentiable at $\mu^*$, $\nabla_W\phi(\mu^*)(x)=0$ $\mu^*$-a.e. [36]. When $\phi$ is geodesically convex, Fréchet stationarity is a sufficient condition for global optimality (Lem. 6 in Appendix).

3 Semi Forward-Backward Euler for difference-of-convex structures

3.1 Wasserstein gradient flows: different types of discretizations

To present the idea of minimizing $\mathcal{F}$ by using discretizations of its gradient flow in a neat way, we first assume for a moment that $F$ is infinitely differentiable and $\mathscr{H}$ is the negative entropy.

We wish to minimize (1) in the space of probability distributions. A natural idea is to apply discretizations of the gradient flow of $\mathcal{F}$, where the gradient flow is defined (under some technical assumptions [35]) as the limit $\gamma\to0^+$ of the following scheme with some simple time-interpolation

$$\mu_{n+1}\in\operatorname{JKO}_{\gamma\mathcal{F}}(\mu_n),\quad\text{where }\operatorname{JKO}_{\gamma\mathcal{F}}(\mu):=\operatorname*{arg\,min}_{\nu\in\mathcal{P}_2(X)}\ \mathcal{F}(\nu)+\frac{1}{2\gamma}W_2^2(\mu,\nu).\tag{2}$$

Straightforwardly, given a fixed $\gamma>0$, (2) gives back a discretization of this flow known as Backward Euler. On the other hand, if $\mathcal{F}$ is Wasserstein differentiable (Sect. 2.3), the Forward Euler discretization reads [62] $\mu_{n+1}=(I-\gamma\nabla_W\mathcal{F}(\mu_n))_\#\mu_n$, which is interpreted as doing gradient descent in the space of probability distributions. These are optimization methods that work directly on the objective function $\mathcal{F}$ itself. However, the composite structure of $\mathcal{F}$ (a sum of several terms) can also be exploited. One such scheme is the unadjusted Langevin algorithm (ULA), which first takes a gradient step w.r.t. the potential part, then follows the heat flow corresponding to the entropy part [62]: $\nu_{n+1}=(I-\gamma\nabla F)_\#\mu_n$, and $\mu_{n+1}=\mathcal{N}(0,2\gamma I)*\nu_{n+1}$, where $*$ is the convolution.
This ULA is "viewed" in the space of distributions (Eulerian approach); a more familiar and equivalent form of the ULA from the particle perspective (Lagrangian approach) reads $x_{n+1}=x_n-\gamma\nabla F(x_n)+\sqrt{2\gamma}\,z_n$, where $z_n\sim\mathcal{N}(0,I)$. The ULA is known to be asymptotically biased even for a Gaussian target measure (Ornstein-Uhlenbeck process). To correct this bias, the Metropolis-Hastings accept-reject step [54] is sometimes introduced. The Metropolis-Hastings algorithm [44, 32] is a much more general framework that works with almost any proposal (e.g., a random walk), and its convergence analysis is based on the Markov kernel satisfying the detailed balance condition. This convergence framework is different from what is considered in this work: we are more interested in the underlying dynamics of the chain, so we do not pursue the Metropolis-Hastings algorithm further.
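The asymptotic bias of the ULA for a Gaussian target can be made concrete with a short calculation. The sketch below assumes the one-dimensional potential $F(x) = x^2/2$ (target $\mathcal{N}(0,1)$) and a zero-mean start, both illustrative choices. The particle update then becomes $x_{n+1} = (1-\gamma)x_n + \sqrt{2\gamma}\,z_n$, so the variance obeys the deterministic recursion $v_{n+1} = (1-\gamma)^2 v_n + 2\gamma$, whose fixed point is $1/(1-\gamma/2)$ rather than $1$.

```python
def ula_variance_limit(gamma, v0=1.0, n_steps=10_000):
    """Iterate the variance recursion of ULA for F(x) = x^2 / 2 (target N(0, 1))."""
    v = v0
    for _ in range(n_steps):
        # Var((1 - gamma) * x + sqrt(2 * gamma) * z) with z ~ N(0, 1) independent of x.
        v = (1.0 - gamma) ** 2 * v + 2.0 * gamma
    return v

def ula_variance_fixed_point(gamma):
    """Closed-form fixed point of the recursion, valid for 0 < gamma < 2."""
    return 1.0 / (1.0 - gamma / 2.0)

for gamma in (0.5, 0.1, 0.01):
    print(gamma, ula_variance_limit(gamma), ula_variance_fixed_point(gamma))
```

The limiting variance exceeds the target variance for every stepsize in $(0, 2)$ and only approaches it as $\gamma \to 0^+$.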

In optimization, for composite structures, Forward-Backward (FB) Euler and its variants are the methods of choice [51, 9]. The corresponding FB Euler for $\mathcal{F}$ takes a gradient step (forward) according to the potential, and a JKO step (backward) w.r.t. the negative entropy

$$\text{(FB Euler)}\quad\nu_{n+1}=(I-\gamma\nabla F)_\#\mu_n,\ \text{ and }\ \mu_{n+1}\in\operatorname{JKO}_{\gamma\mathscr{H}}(\nu_{n+1}).\tag{3}$$

This scheme appears in [62] without convergence analysis, and later on [57] derives non-asymptotic convergence guarantees under the assumption that $F$ is convex and Lipschitz smooth.

In this work, as $F$ is nonconvex and nonsmooth, the theory in [57] does not apply, and the convergence (if any) of (3) remains unknown. The DC structure of $F$ can be further exploited. In DC programming [52], the forward step should be applied to the concave part, while the backward step should be applied to the convex part. We hence propose the following semi FB Euler

$$\text{(semi FB Euler)}\quad\nu_{n+1}=(I+\gamma\nabla H)_\#\mu_n,\ \text{ and }\ \mu_{n+1}\in\operatorname{JKO}_{\gamma(\mathscr{H}+\mathcal{E}_G)}(\nu_{n+1})\tag{4}$$

for which we can provide convergence guarantees. Apparently, the difference between semi FB Euler and FB Euler is subtle: while FB Euler does forward on $\mathcal{E}_{G-H}=\mathcal{E}_G-\mathcal{E}_H$ and backward on $\mathscr{H}$, semi FB Euler does forward on $-\mathcal{E}_H$ and backward on $\mathscr{H}+\mathcal{E}_G$; recall that $\mathcal{F}=\mathcal{E}_G-\mathcal{E}_H+\mathscr{H}$.

Theoretically, semi FB Euler enjoys some advantages compared to FB Euler. Thanks to Brenier's theorem (Sect. 2.2), the pushing step in semi FB Euler is optimal since $H$ is convex; meanwhile, the pushing step in FB Euler is not optimal in general, and the corresponding optimal Monge map is not identifiable in general. The convergence of FB Euler is still an open question, even when $F$ is (DC) differentiable. In contrast, we can provide a solid theoretical guarantee for semi FB Euler, especially when $H$ is differentiable. Additionally, we also offer convergence guarantees when $H$ is nonsmooth.
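As a toy illustration of scheme (4), the sketch below runs semi FB Euler on one-dimensional Gaussians. It is not the paper's implementation and rests on several assumptions made only for this example: $G(x) = a x^2/2$ and $H(x) = b x^2/2$ with $a > b > 0$ (so $F = G - H$ is in fact convex and the target is $\mathcal{N}(0, 1/(a-b))$), $\mathscr{H}$ is the negative entropy, and the JKO step is solved within the Gaussian family, where the objective is $a(m'^2+s'^2)/2 - \log s' + \big((m-m')^2+(s-s')^2\big)/(2\gamma)$ up to a constant. Setting its derivatives to zero gives $m' = m/(1+a\gamma)$ and $(1+a\gamma)s'^2 - s s' - \gamma = 0$.

```python
import math

def push_forward_step(m, s, b, gamma):
    """Image of N(m, s^2) under x -> x + gamma * H'(x), with H(x) = b * x^2 / 2."""
    c = 1.0 + gamma * b
    return c * m, c * s

def jko_step_gaussian(m, s, a, gamma):
    """JKO step for negative entropy plus E_G, G(x) = a * x^2 / 2, restricted to Gaussians."""
    d = 1.0 + gamma * a
    m_new = m / d
    # Positive root of d * s'^2 - s * s' - gamma = 0.
    s_new = (s + math.sqrt(s * s + 4.0 * gamma * d)) / (2.0 * d)
    return m_new, s_new

def semi_fb_euler(m0, s0, a, b, gamma, n_steps):
    """Alternate the push-forward and JKO steps, tracking mean and standard deviation."""
    m, s = m0, s0
    for _ in range(n_steps):
        m, s = push_forward_step(m, s, b, gamma)
        m, s = jko_step_gaussian(m, s, a, gamma)
    return m, s

m, s = semi_fb_euler(m0=3.0, s0=0.2, a=2.0, b=1.0, gamma=0.1, n_steps=500)
print(m, s)  # mean and standard deviation after 500 iterations
```

In this toy setting the fixed point of the iteration satisfies $s^2 = 1/(a-b)$ for any stepsize, so the scheme does not show the stepsize-dependent bias of the ULA; nothing here says how the scheme behaves outside the Gaussian family.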

3.2 Problem setting

Our goal is to minimize the non-geodesically-convex functional $\mathcal{F}(\mu)=\mathcal{E}_F(\mu)+\mathscr{H}(\mu)$ over $\mathcal{P}_2(X)$, where $F=G-H$ is a DC function. We make Assumption 1 throughout the paper:

Assumption 1.
  • (i) The objective function $\mathcal{F}$ is bounded below.

  • (ii) $G,H:X\to\mathbb{R}$ are convex functions and have quadratic growth.

  • (iii) $\mathscr{H}:\mathcal{P}_2(X)\to\mathbb{R}\cup\{+\infty\}$ is proper, l.s.c, and convex along generalized geodesics in $(\mathcal{P}_2(X),W_2)$, and $\operatorname{dom}(\mathscr{H})\subset\mathcal{P}_{2,\operatorname{abs}}(X)$.

  • (iv) There exists $\gamma_0>0$ such that $\forall\gamma\in(0,\gamma_0)$, $\operatorname{JKO}_{\gamma(\mathcal{E}_G+\mathscr{H})}(\mu)\neq\emptyset$ for every $\mu\in\mathcal{P}_2(X)$.

Note that Assumption 1(iv) is a commonly used assumption to avoid technical complications when working with the JKO operator [3, 15, 57]. Assumption 1(ii) implies that $\mathcal{E}_G$ and $\mathcal{E}_H$ are continuous w.r.t. the Wasserstein topology [2, Prop. 2.4] ($G,H$ are continuous [46, Cor. 2.27] and have quadratic growth).

We only make Assumption 2 in the asymptotic convergence analysis (Thm. 1).

Assumption 2 (Compactness).

Every sublevel set of $\mathcal{F}$, $S_\lambda:=\{\mu\in\mathcal{P}_2(X):\mathcal{F}(\mu)\leq\lambda\}$, is compact with respect to the Wasserstein topology.

In Euclidean space, compactness of the sublevel sets of $f$ is usually enforced via a coercivity assumption: $f(x)\to+\infty$ whenever $\|x\|\to+\infty$, which holds for a wide class of functions to be minimized. A striking difference in the Wasserstein space is that closed Wasserstein balls are not compact in the Wasserstein topology (they are only compact in the narrow topology) [36, Prop. 4.2], so coercivity is not sufficient to induce (Wasserstein) compactness. Assumption 2 is meant to bypass these difficulties.

3.3 Optimality characterizations

First, it follows from Assumption 1(iii) that $\operatorname{dom}(\mathcal{F})\subset\mathcal{P}_{2,\operatorname{abs}}(X)$. By analogy to DC programming in Euclidean space, we call $\mu^*\in\operatorname{dom}(\mathcal{F})$ a critical point of $\mathcal{F}=\mathscr{H}+\mathcal{E}_G-\mathcal{E}_H$ if $\partial(\mathscr{H}+\mathcal{E}_G)(\mu^*)\cap\partial\mathcal{E}_H(\mu^*)\neq\emptyset$. Criticality is a necessary condition for local optimality (Lem. 7). Moreover, if either $\mathscr{H}+\mathcal{E}_G$ or $\mathcal{E}_H$ is Wasserstein differentiable at $\mu^*$, criticality becomes Fréchet stationarity (Lem. 8).

3.4 Semi FB Euler: a general setting

In this work, we allow $H$ to be non-differentiable, meaning that $\partial H$ (convex subdifferential [46]) contains multiple elements in general. We first pick a selector $S$ of $\partial H$, i.e., $S:X\to X$ such that $S(x)\in\partial H(x)$ for every $x\in X$. By the axiom of choice (Zermelo, 1904, see, e.g., [33]), such a selection always exists. However, an arbitrary selector can behave badly, e.g., it may fail to be Borel measurable. We shall first restrict ourselves to the class of Borel measurable selectors (see Appx. A.1 for an existence discussion).

Assumption 3 (Measurability).

The selector $S$ is Borel measurable.

We recall the semi FB scheme (4) but for nonsmooth $F$ as follows: start with an initial distribution $\mu_0\in\mathcal{P}_{2,\operatorname{abs}}(X)$; given a discretization stepsize $0<\gamma<\gamma_0$, we repeat the following two steps:

$$\nu_{n+1}=(I+\gamma S)_\#\mu_n\quad\triangleleft\text{ push forward step;}\qquad\mu_{n+1}=\operatorname{JKO}_{\gamma(\mathcal{E}_G+\mathscr{H})}(\nu_{n+1})\quad\triangleleft\text{ JKO step}.$$

Well-definedness and properties: Given $\mu_n \in \mathcal{P}_2(X)$, it follows from Lem. 4 that $\nu_{n+1} \in \mathcal{P}_2(X)$. The two generated sequences are then in $\mathcal{P}_2(X)$. Moreover, it follows from Assumption 1 that $\{\mu_n\}_{n\in\mathbb{N}}$ are in $\mathcal{P}_{2,\operatorname{abs}}(X)$, and so are $\{\nu_n\}_{n\in\mathbb{N}}$ by Lem. 9, noting that $I+\gamma S$ is a subgradient of the strongly convex function $x \mapsto (1/2)\|x\|^2 + \gamma H(x)$.
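To make the push forward step concrete, the following sketch applies the map $x \mapsto x + \gamma S(x)$ to a cloud of particles standing in for $\mu_n$. The choice $H(x) = \|x\|_1$ and the selector $S(x) = \operatorname{sign}(x)$ (with $\operatorname{sign}(0) = 0$) are illustrative assumptions and are not taken from the paper; the function names are hypothetical and the JKO step is not implemented here.

```python
import random


def sign_selector(x):
    """Hypothetical Borel selector S of the subdifferential of H(x) = ||x||_1.

    Coordinate-wise sign with sign(0) = 0, which lies in [-1, 1], the
    subdifferential of |.| at 0.
    """
    return [(v > 0) - (v < 0) for v in x]


def push_forward(particles, selector, gamma):
    """Apply the map I + gamma * S to every particle."""
    moved = []
    for x in particles:
        s = selector(x)
        moved.append([xi + gamma * si for xi, si in zip(x, s)])
    return moved


random.seed(0)
gamma = 0.1
# Particles approximating mu_n (here: a standard Gaussian in 2-D).
mu_n = [[random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)] for _ in range(1000)]
# Particles approximating nu_{n+1} = (I + gamma * S)_# mu_n.
nu_next = push_forward(mu_n, sign_selector, gamma)
```

Since $S$ is a selection of a monotone operator, $I + \gamma S$ is strongly monotone, so distinct particles remain distinct after the step.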

4 Convergence analysis

4.1 Asymptotic analysis

Lemma 1 (Descent lemma).

Under Assumptions 1 and 3, let $\{\mu_n\}_{n\in\mathbb{N}}$ be the sequence of distributions produced by semi FB Euler starting from some $\mu_0 \in \mathcal{P}_{2,\operatorname{abs}}(X)$ with $\gamma < \gamma_0$. Then it holds that $\mathcal{F}(\mu_{n+1}) \leq \mathcal{F}(\mu_n) - \frac{1}{\gamma}\int_X \|T_{\nu_{n+1}}^{\mu_n}(x) - T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^2 \, d\nu_{n+1}(x), \quad \forall n \in \mathbb{N}$.

Lem. 1 shows that the objective does not increase along semi FB Euler’s iterates. Proof of Lem. 1 is in Appx. A.3. By using Lem. 1, we establish asymptotic convergence for semi FB Euler as follows.
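As a sanity check of Lem. 1, the sketch below runs semi FB Euler in a one-dimensional toy case where every step has a closed form. The setup is an assumption made for illustration and is not taken from the paper: $G(x) = a x^2/2$ and $H(x) = b x^2/2$ with $a > b > 0$, $\mathscr{H}$ the negative entropy, $\mu_n = \mathcal{N}(0, \sigma_n^2)$, and the JKO minimization restricted to centered Gaussians, so that $W_2$ reduces to a difference of standard deviations. The stepsize is simply assumed to be admissible.

```python
import math


def semi_fb_step(sigma, a, b, gamma):
    """One semi FB Euler step on the standard deviation of N(0, sigma^2)."""
    # Push forward by x -> x + gamma * H'(x) = (1 + gamma * b) * x.
    tau = (1.0 + gamma * b) * sigma
    # JKO step restricted to centered Gaussians: minimize over s > 0
    #   a * s^2 / 2 - log(s) + (s - tau)^2 / (2 * gamma),
    # whose stationarity condition is (1 + gamma*a) s^2 - tau*s - gamma = 0.
    k = 1.0 + gamma * a
    return (tau + math.sqrt(tau * tau + 4.0 * gamma * k)) / (2.0 * k)


def objective(sigma, a, b):
    """F(N(0, sigma^2)): potential energy of (a - b) x^2 / 2 plus negative entropy."""
    return 0.5 * (a - b) * sigma**2 - 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)


a, b, gamma = 3.0, 1.0, 0.2
sigmas = [2.0]
for _ in range(50):
    sigmas.append(semi_fb_step(sigmas[-1], a, b, gamma))

# Slack in the descent inequality; in this Gaussian setting the integral
# in Lem. 1 equals (sigma_n - sigma_{n+1})^2.
descent_gaps = [
    objective(s0, a, b) - objective(s1, a, b) - (s0 - s1) ** 2 / gamma
    for s0, s1 in zip(sigmas, sigmas[1:])
]
```

In this toy case all entries of `descent_gaps` are nonnegative, and the fixed point of `semi_fb_step` satisfies $\sigma^2 = 1/(a-b)$, the variance of the target $\propto \exp(-F)$, so the recursion shows no asymptotic bias here.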

For the asymptotic convergence analysis, we need the following assumption on the second DC component $H$.

Assumption 4.

$H$ is continuously differentiable.

Theorem 1 (Asymptotic convergence).

Under Assumptions 1, 2, 4, let $\{\mu_n\}_{n\in\mathbb{N}}$ and $\{\nu_n\}_{n\in\mathbb{N}}$ be the sequences produced by semi FB Euler starting from some $\mu_0 \in \mathcal{P}_{2,\operatorname{abs}}(X)$ with $\gamma < \gamma_0$. Then,

  • (i)

    $\{\mu_n\}_{n\in\mathbb{N}}$ has a cluster point.

  • (ii)

    If $\sup_{n\in\mathbb{N}} \mathscr{H}(\nu_n) < +\infty$, every cluster point of $\{\mu_n\}_{n\in\mathbb{N}}$ is a critical point of $\mathcal{F}$.

Proof of Thm. 1 is in Appx. A.4. Thm. 1 does not ensure convergence of the whole sequence $\{\mu_n\}_{n\in\mathbb{N}}$; rather, it guarantees subsequential convergence to critical points of $\mathcal{F}$.

4.2 Non-asymptotic analysis

To measure how fast the algorithm converges, we need a convergence measure. For proximal-type algorithms in Euclidean space, the notion of gradient mapping $\mathcal{G}_\gamma(x_n)$ is usually used (see, e.g., [30, 47] and [34, Eq. (5)]), and one measures the rate at which $\|\mathcal{G}_\gamma(x_n)\|^2 \to 0$. By analogy with the Euclidean case, we define the Wasserstein (sub)gradient mapping as $\mathcal{G}_\gamma(\mu) := \frac{1}{\gamma}\left(I - T_\mu^{\operatorname{JKO}_{\gamma(\mathcal{E}_G+\mathscr{H})}((I+\gamma S)_{\#}\mu)}\right)$, and we measure the rate at which $\|\mathcal{G}_\gamma(\mu_n)\|^2_{L^2(X,X,\mu_n)} \to 0$.

Theorem 2 (Convergence rate: Wasserstein (sub)gradient mapping).

Under Assumptions 1, 3, let $\{\mu_n\}_{n\in\mathbb{N}}$ be the sequence of distributions produced by semi FB Euler starting from some $\mu_0 \in \mathcal{P}_{2,\operatorname{abs}}(X)$ with $\gamma < \gamma_0$. Then it holds that $\min_{n=\overline{1,N}} \|\mathcal{G}_\gamma(\mu_n)\|^2_{L^2(X,X,\mu_n)} = O(N^{-1})$.

Proof of Thm. 2 is in Appx. A.5. This theorem holds without requiring $G$ and $H$ to be differentiable.
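One way to see where the $O(N^{-1})$ rate can come from is to telescope Lem. 1. The following is only a sketch, under the extra assumption that $\mathcal{F}^* := \inf \mathcal{F} > -\infty$; the authors' actual argument is the one in Appx. A.5. It uses two standard facts: $\|I - T_\mu^{\mu'}\|^2_{L^2(X,X,\mu)} = W_2^2(\mu,\mu')$, and the cost of any coupling of $\mu_n$ and $\mu_{n+1}$, here $(T_{\nu_{n+1}}^{\mu_n}, T_{\nu_{n+1}}^{\mu_{n+1}})_{\#}\nu_{n+1}$, upper-bounds $W_2^2(\mu_n,\mu_{n+1})$.

```latex
\begin{align*}
\gamma^2\,\|\mathcal{G}_\gamma(\mu_n)\|^2_{L^2(X,X,\mu_n)}
  &= W_2^2(\mu_n,\mu_{n+1}) \\
  &\le \int_X \|T_{\nu_{n+1}}^{\mu_n}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^2\,d\nu_{n+1}(x) \\
  &\le \gamma\,\bigl(\mathcal{F}(\mu_n)-\mathcal{F}(\mu_{n+1})\bigr)
  && \text{(Lem. 1)}, \\
\min_{n=\overline{1,N}}\|\mathcal{G}_\gamma(\mu_n)\|^2_{L^2(X,X,\mu_n)}
  &\le \frac{1}{N}\sum_{n=1}^{N}\|\mathcal{G}_\gamma(\mu_n)\|^2_{L^2(X,X,\mu_n)}
   \le \frac{\mathcal{F}(\mu_1)-\mathcal{F}^*}{\gamma N}.
\end{align*}
```

The last inequality sums the first chain over $n = 1, \dots, N$ and uses $\mathcal{F}(\mu_{N+1}) \ge \mathcal{F}^*$.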

Next, if $H$ is twice differentiable with uniformly bounded Hessian, we can derive a stronger convergence guarantee based on Fréchet stationarity (see Sect. 2.4). In other words, we evaluate the rate at which $\operatorname{dist}(0, \partial_F^{-}\mathcal{F}(\mu_n)) := \inf_{\xi \in \partial_F^{-}\mathcal{F}(\mu_n)} \|\xi\|_{L^2(X,X,\mu_n)} \to 0$.

Assumption 5.

$H \in C^2(X)$ with uniformly bounded Hessian ($H$ is then $L_H$-smooth).

Theorem 3 (Convergence rate: Fréchet subdifferentials).

Under Assumptions 1, 5, let $\{\mu_n\}_{n\in\mathbb{N}}$ be the sequence of distributions produced by semi FB Euler starting from some $\mu_0 \in \mathcal{P}_{2,\operatorname{abs}}(X)$ with $\gamma < \gamma_0$. Then $\min_{n=\overline{1,N}} \operatorname{dist}(0, \partial_F^{-}\mathcal{F}(\mu_n)) = O\left(N^{-\frac{1}{2}}\right)$.

Proof of Thm. 3 is in Appx. A.6.

4.3 Fast convergence under isoperimetry and beyond

Fast convergence can be obtained under isoperimetry, e.g., the log-Sobolev inequality (LSI). There are certain connections between LSI in sampling and the Łojasiewicz condition in optimization. Since we are working with a non-Wasserstein-differentiable objective function, the Łojasiewicz condition is the right tool to employ. In nonconvex optimization in Euclidean space, analytic and subanalytic functions form a large class satisfying the Łojasiewicz condition [37, 14]. Subanalytic DC programs are studied in [38]. In the infinite-dimensional setting of the Wasserstein space, the Łojasiewicz condition should be regarded as a functional inequality [12].

Assumption 6 (Łojasiewicz condition in the Wasserstein space).

Assume that $\mathcal{F}^*$ is the optimal value of $\mathcal{F}$, and assume there exist $r_0 \in (\mathcal{F}^*, +\infty]$, $\theta \in [0,1)$, and $c > 0$ such that for all $\mu \in \mathcal{P}_2(X)$, $\mathcal{F}(\mu) - \mathcal{F}^* < r_0 \Rightarrow c\left(\mathcal{F}(\mu) - \mathcal{F}^*\right)^{\theta} \leq \|\xi\|_{L^2(X,X,\mu)}, ~ \forall \xi \in \partial_F^{-}\mathcal{F}(\mu)$, where the convention $0^0 = 0$ is used. $\theta \in [0,1)$ is called the Łojasiewicz exponent of $\mathcal{F}$ at optimality.

Remark 1.

If $\mathscr{H}$ is the negative entropy and $F \in C^2(X)$ has uniformly bounded Hessian, then $\mathcal{F}$ is Wasserstein differentiable at any $\mu \in \mathcal{P}_{2,\operatorname{abs}}(X)$ and [36, Prop. 2.12, E.g. 2.3] $\nabla_W \mathcal{F}(\mu) = \frac{\nabla\mu}{\mu} + \nabla F$, where, by abuse of notation, the probability density function of $\mu$ is still denoted by $\mu$. We have $\|\nabla_W \mathcal{F}(\mu)\|^2_{L^2(X,X,\mu)} = \int \left\|\frac{\nabla\mu(x)}{\mu(x)} + \nabla F(x)\right\|^2 d\mu(x) = \int \mu(x)\left\|\nabla\log\frac{\mu(x)}{\mu^*(x)}\right\|^2 dx$, where $\mu^* \propto \exp(-F)$. On the other hand, $\mathcal{F}(\mu) - \mathcal{F}^* = \operatorname{KL}(\mu\|\mu^*)$. The log-Sobolev inequality with parameter $\alpha > 0$ reads [49] $\operatorname{KL}(\mu\|\mu^*) \leq \frac{1}{2\alpha}\operatorname{FI}(\mu\|\mu^*) := \frac{1}{2\alpha}\int \mu(x)\left\|\nabla\log\frac{\mu(x)}{\mu^*(x)}\right\|^2 dx$, where $\operatorname{FI}(\mu\|\mu^*)$ is the relative Fisher information of $\mu$ with respect to $\mu^*$. Therefore, the log-Sobolev inequality is a special case of the Łojasiewicz condition with $r_0 = +\infty$, $c = \sqrt{2\alpha}$, and $\theta = 1/2$.
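The reduction in Rmk. 1 can be checked numerically in a case where both sides are available in closed form. The sketch below assumes, purely for illustration, a one-dimensional target $\mu^* = \mathcal{N}(0, 1/\alpha)$ (which satisfies the log-Sobolev inequality with parameter $\alpha$) and Gaussian $\mu = \mathcal{N}(m, s^2)$; the function names are hypothetical.

```python
import math


def kl_gauss(m, s, alpha):
    """KL( N(m, s^2) || N(0, 1/alpha) ) in closed form."""
    return -0.5 * math.log(alpha * s * s) + 0.5 * alpha * (s * s + m * m) - 0.5


def fisher_gauss(m, s, alpha):
    """Relative Fisher information FI( N(m, s^2) || N(0, 1/alpha) )."""
    return (alpha * s - 1.0 / s) ** 2 + (alpha * m) ** 2


def lojasiewicz_gap(m, s, alpha):
    """||grad_W F|| - c * (F - F*)^theta with c = sqrt(2*alpha), theta = 1/2."""
    c, theta = math.sqrt(2.0 * alpha), 0.5
    kl = max(kl_gauss(m, s, alpha), 0.0)  # guard against tiny negative round-off
    return math.sqrt(fisher_gauss(m, s, alpha)) - c * kl**theta


alpha = 2.0
grid = [(m, s) for m in (-3.0, -0.5, 0.0, 1.0, 4.0) for s in (0.1, 0.5, 1.0, 2.0, 5.0)]
gaps = [lojasiewicz_gap(m, s, alpha) for m, s in grid]
```

Every entry of `gaps` is nonnegative on this grid, which is the Łojasiewicz inequality with $\theta = 1/2$ and $c = \sqrt{2\alpha}$. In this Gaussian family the gap vanishes exactly when $s^2 = 1/\alpha$.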

Theorem 4.

Under Assumptions 1, 5 and Assumption 6 with parameters $(r_0, c, \theta)$, let $\{\mu_n\}_{n\in\mathbb{N}}$ be the sequence of distributions produced by semi FB Euler starting from some sufficiently warm $\mu_0 \in \mathcal{P}_{2,\operatorname{abs}}(X)$ such that $\mathcal{F}(\mu_0) < r_0$ and with stepsize $\gamma < \gamma_0$. Then

  • (i)

    if $\theta = 0$, $\mathcal{F}(\mu_n) - \mathcal{F}^*$ converges to $0$ in a finite number of steps;

  • (ii)

    if $\theta \in (0, 1/2]$, $\mathcal{F}(\mu_n) - \mathcal{F}^* = O\left(\left(\frac{M}{M+1}\right)^{n}\right)$ where $M = \frac{2(\gamma^2 L_H^2 + 1)}{c^2\gamma}$;

  • (iii)

    if $\theta \in (1/2, 1)$, $\mathcal{F}(\mu_n) - \mathcal{F}^*$ converges sublinearly to $0$, i.e., $\mathcal{F}(\mu_n) - \mathcal{F}^* = O\left(n^{-\frac{1}{2\theta-1}}\right)$.

Proof of Thm. 4 is in Appx. A.7.
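The linear rate in case (ii) can be illustrated on a toy problem. The sketch below is an assumption-laden example, not the paper's experiment: $G(x) = a x^2/2$, $H(x) = b x^2/2$ with $a > b > 0$, negative entropy, centered Gaussian iterates, and the JKO step restricted to centered Gaussians. The target $\mathcal{N}(0, 1/(a-b))$ then satisfies the log-Sobolev inequality with $\alpha = a - b$, so $\theta = 1/2$, $c = \sqrt{2\alpha}$, and $L_H = b$. Reading the $O(\cdot)$ statement as $\operatorname{KL}_n \le \operatorname{KL}_0 \, (M/(M+1))^n$ is our choice of constant.

```python
import math


def semi_fb_step(sigma, a, b, gamma):
    """Semi FB Euler step for N(0, sigma^2), JKO restricted to Gaussians."""
    tau = (1.0 + gamma * b) * sigma  # push forward by x -> (1 + gamma*b) x
    k = 1.0 + gamma * a
    # Positive root of (1 + gamma*a) s^2 - tau*s - gamma = 0.
    return (tau + math.sqrt(tau * tau + 4.0 * gamma * k)) / (2.0 * k)


def kl_to_target(sigma, alpha):
    """KL( N(0, sigma^2) || N(0, 1/alpha) )."""
    u = alpha * sigma * sigma
    return 0.5 * (u - 1.0 - math.log(u))


a, b, gamma = 3.0, 1.0, 0.2
alpha = a - b  # target is exp(-F) with F(x) = (a - b) x^2 / 2
c, L_H = math.sqrt(2.0 * alpha), b
M = 2.0 * (gamma**2 * L_H**2 + 1.0) / (c**2 * gamma)
rate = M / (M + 1.0)

sigma, kls = 2.0, []
for _ in range(30):
    kls.append(kl_to_target(sigma, alpha))
    sigma = semi_fb_step(sigma, a, b, gamma)

# Geometric envelope suggested by Thm. 4 (ii), with constant KL_0.
bounds = [kls[0] * rate**n for n in range(len(kls))]
```

In this run each `kls[n]` stays below `bounds[n]`; the observed decay is faster than the envelope, which is consistent with the theorem giving an upper bound rather than an exact rate.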

Remark 2.

In the usual sampling case, i.e., $\mathscr{H}$ is the negative entropy, and under the log-Sobolev condition, $r_0 = +\infty$. Therefore, $\mu_0$ can be chosen arbitrarily in $\mathcal{P}_{2,\operatorname{abs}}(X)$. In the general case, however, a good enough starting point (i.e., $\mathcal{F}(\mu_0) < r_0$) is needed to guarantee we are in the region where the Łojasiewicz condition comes into play. In the sampling case, $\mathcal{F}(\mu_n) - \mathcal{F}^* = \operatorname{KL}(\mu_n\|\mu^*)$, where $\mu^*(x) \propto \exp(-F(x))$ is the target distribution (see Rmk. 1), so Thm. 4 provides the convergence rate of $\{\mu_n\}_{n\in\mathbb{N}}$ to $\mu^*$ in terms of KL divergence, and this convergence is exponentially fast if $\theta \in (0, 1/2]$.

Theorem 5.

Under the same set of assumptions as in Thm. 4, the sequence $\{\mu_n\}_{n\in\mathbb{N}}$ is a Cauchy sequence under the Wasserstein topology. Furthermore, as the Wasserstein space $(\mathcal{P}_2(X), W_2)$ is complete [4, Thm. 2.2], every Cauchy sequence is convergent, i.e., there exists $\mu^* \in \mathcal{P}_2(X)$ such that $\mu_n \xrightarrow{\operatorname{Wass}} \mu^*$. The limit distribution $\mu^*$ is indeed the global minimizer of $\mathcal{F}$. In addition:

  • (i)

    if $\theta = 0$, $W_2(\mu_n, \mu^*)$ converges to $0$ in a finite number of steps;

  • (ii)

    if $\theta \in (0, 1/2]$, $W_2(\mu_n, \mu^*) = O\left(\left(\frac{M}{M+1}\right)^{n}\right)$, where $M = 1 + \frac{(2(\gamma^2 L_H^2 + 1))^{\frac{1}{2\theta}}}{(1-\theta)\,\gamma^{\frac{1-\theta}{\theta}}\,c^{\frac{1}{\theta}}}$;

  • (iii)

    if $\theta \in (1/2, 1)$, $W_2(\mu_n, \mu^*) = O\left(n^{-\frac{1-\theta}{2\theta-1}}\right)$.

Proof of Thm. 5 is in Appx. A.8. This theorem provides convergence to optimality in terms of Wasserstein distance.
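To get a feel for the constant in case (ii), the sketch below evaluates $M$ and $M/(M+1)$ from the formula above and compares them with a toy run. The toy setting is our assumption and not the paper's experiment: $G(x) = a x^2/2$, $H(x) = b x^2/2$ with $a > b > 0$, negative entropy, centered Gaussian iterates with the JKO step restricted to centered Gaussians, so that $W_2(\mu_n, \mu^*) = |\sigma_n - \sigma^*|$ and $\theta = 1/2$, $c = \sqrt{2(a-b)}$, $L_H = b$. The helper name `wasserstein_rate` is hypothetical.

```python
import math


def semi_fb_step(sigma, a, b, gamma):
    """Semi FB Euler step for N(0, sigma^2), JKO restricted to Gaussians."""
    tau = (1.0 + gamma * b) * sigma  # push forward by x -> (1 + gamma*b) x
    k = 1.0 + gamma * a
    # Positive root of (1 + gamma*a) s^2 - tau*s - gamma = 0.
    return (tau + math.sqrt(tau * tau + 4.0 * gamma * k)) / (2.0 * k)


def wasserstein_rate(theta, gamma, L_H, c):
    """Return (M, M / (M + 1)) from Thm. 5 (ii), for theta in (0, 1/2]."""
    num = (2.0 * (gamma**2 * L_H**2 + 1.0)) ** (1.0 / (2.0 * theta))
    den = (1.0 - theta) * gamma ** ((1.0 - theta) / theta) * c ** (1.0 / theta)
    M = 1.0 + num / den
    return M, M / (M + 1.0)


a, b, gamma = 3.0, 1.0, 0.2
alpha = a - b
sigma_star = 1.0 / math.sqrt(alpha)  # standard deviation of the target
M, rate = wasserstein_rate(theta=0.5, gamma=gamma, L_H=b, c=math.sqrt(2.0 * alpha))

sigma, w2 = 2.0, []
for _ in range(40):
    w2.append(abs(sigma - sigma_star))  # W2 between centered 1-D Gaussians
    sigma = semi_fb_step(sigma, a, b, gamma)
```

For these particular values the distances in `w2` decrease monotonically and stay below `w2[0] * rate**n`; the constant `w2[0]` is our choice, since the theorem only states an $O(\cdot)$ bound.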

5 Numerical illustrations

The JKO operator is a great theoretical tool to study Wasserstein gradient flows, e.g., it is the main recipe used in the seminal paper [35] that gives variational structure for the Fokker-Planck equation. However, the JKO operator is not quite scalable (at least for now). To learn the JKO, recent advances use the gradient of an input-convex neural network [5] to approximate the optimal Monge map pushing $\nu_{n+1}$ to $\mu_{n+1}$ [45]. This approach is inspired by Brenier's theorem, which asserts that an optimal Monge map has to be the (sub)gradient field of some convex function. We use this neural network approach to perform some numerical sampling experiments from non-log-concave distributions: the Gaussian mixture distribution and the distance-to-set-prior [53] relaxed von Mises–Fisher distribution. Both are log-DC and the latter has non-differentiable logarithmic probability density (see Appx. C). Fig. 1 presents the sampling results. Implementation details are in Appx. B and experiment details are in Appx. C.
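The idea of parametrizing a Monge map as the gradient of an input-convex network can be sketched in a few lines. The code below is a pure-Python toy and not the authors' implementation (which is described in Appx. B): the two-layer architecture, the layer sizes, the random weights, and all names are hypothetical, and a finite-difference gradient stands in for automatic differentiation. Convexity in the input comes from nonnegative weights on the hidden-to-hidden and output paths combined with a convex, nondecreasing activation.

```python
import math
import random


def softplus(t):
    """Convex, nondecreasing activation (numerically stable form)."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def make_icnn(dim, hidden, rng):
    """Random parameters for a tiny two-layer input-convex network."""
    def matrix(rows, cols, nonneg=False):
        draw = (lambda: abs(rng.gauss(0.0, 1.0))) if nonneg else (lambda: rng.gauss(0.0, 1.0))
        return [[draw() for _ in range(cols)] for _ in range(rows)]

    return {
        "A0": matrix(hidden, dim),
        "b0": [rng.gauss(0.0, 1.0) for _ in range(hidden)],
        # Weights acting on the previous hidden layer must be nonnegative.
        "Wz": matrix(hidden, hidden, nonneg=True),
        "A1": matrix(hidden, dim),
        "b1": [rng.gauss(0.0, 1.0) for _ in range(hidden)],
        "w_out": [abs(rng.gauss(0.0, 1.0)) for _ in range(hidden)],
    }


def icnn(params, x):
    """Scalar potential that is convex in the input x."""
    z1 = [softplus(dot(row, x) + b) for row, b in zip(params["A0"], params["b0"])]
    z2 = [
        softplus(dot(wz, z1) + dot(a, x) + b)
        for wz, a, b in zip(params["Wz"], params["A1"], params["b1"])
    ]
    return dot(params["w_out"], z2)


def grad_fd(params, x, eps=1e-5):
    """Central finite-difference gradient of the potential at x."""
    g = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        g.append((icnn(params, hi) - icnn(params, lo)) / (2.0 * eps))
    return g


rng = random.Random(0)
params = make_icnn(dim=2, hidden=8, rng=rng)
x = [0.3, -1.2]
y = grad_fd(params, x)  # candidate transport map evaluated at x
```

With trained rather than random weights, the map `x -> grad_fd(params, x)` would play the role of the approximate Monge map from $\nu_{n+1}$ to $\mu_{n+1}$; training is outside the scope of this sketch.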

Figure 1: (a) and (b): Mixture of Gaussians. (a) shows samples obtained from semi FB Euler at iteration $40$ and (b) shows the KL divergence along the training process: semi FB Euler, with sound theory, is as fast as FB Euler. (c) and (d): Relaxed von Mises-Fisher. (c) shows the true probability density, and (d) shows the sample histogram obtained from semi FB Euler. In this experiment, FB Euler fails to work, which is attributed to the high curvature of the relaxed von Mises-Fisher.

6 Related work

We first narrow down our discussion to FB Euler and its variants in the Wasserstein space. When $\mathscr{H}$ is the negative entropy, Wibisono [62] provides some insightful discussion on how FB Euler should be consistent (no asymptotic bias) because the backward step is adjoint to the forward step, hence preserves stationarity. However, no convergence theory is presented for FB Euler in the Wasserstein space in [62]. Recently, Salim et al. [57] provide a convergence guarantee for FB Euler within the following setting: $\mathscr{H}$ is convex along generalized geodesics, and $F$ is Lipschitz smooth and convex/strongly convex.

Some other papers have tangential relations to our work, mainly from the ULA (and its variants) literature. Durmus et al. [25] analyze the ULA from the convex optimization perspective. Vempala et al. [59] show that LSI and Hessian boundedness suffice for fast convergence of the ULA where "fast" is understood as fast to the biased target since ULA is a biased algorithm. Balasubramanian et al. [8] analyze the ULA under quite mild conditions: log-density is Lipschitz/Hölder smooth. Bernton [10] studies the proximal-ULA also under the convex assumption, where the difference to the ULA is the first step: gradient descent is replaced by the proximal operator. Similar to ULA, proximal-ULA is asymptotically biased. To address nonsmoothness, another line of research utilizes Moreau-Yosida envelopes to create smooth approximations of the ULA dynamics [28, 42]. This approach is also applicable to certain classes of non-log-concave distributions [42] and is more of a flavour of discretization error quantification.
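The asymptotic bias of the ULA mentioned above can be seen exactly in a one-dimensional Gaussian case. The sketch assumes, for illustration only, a target $\mathcal{N}(0, 1/\alpha)$, i.e., potential $F(x) = \alpha x^2/2$, and the ULA recursion $x_{k+1} = x_k - \gamma F'(x_k) + \sqrt{2\gamma}\,\xi_k$ with independent standard normal $\xi_k$. It tracks the variance of the law of $x_k$ instead of drawing samples; the function names are hypothetical.

```python
def ula_variance_step(v, alpha, gamma):
    """Variance of (1 - gamma*alpha) * x + sqrt(2*gamma) * xi when Var(x) = v."""
    return (1.0 - gamma * alpha) ** 2 * v + 2.0 * gamma


def ula_stationary_variance(alpha, gamma):
    """Fixed point of the variance recursion, valid for 0 < gamma < 2 / alpha."""
    return 1.0 / (alpha * (1.0 - 0.5 * gamma * alpha))


alpha, gamma = 2.0, 0.2
v = 4.0
for _ in range(200):
    v = ula_variance_step(v, alpha, gamma)

target_variance = 1.0 / alpha
bias = ula_stationary_variance(alpha, gamma) - target_variance
```

Here the recursion settles at variance $1/(\alpha(1 - \gamma\alpha/2))$ rather than $1/\alpha$, so the stationary law differs from the target for every fixed $\gamma > 0$ and the gap vanishes only as $\gamma \to 0$.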

7 Conclusion

We propose a new semi FB Euler scheme as a discretization of Wasserstein gradient flow and show that it has favourable theoretical guarantees that the commonly used FB Euler does not yet have if the objective function is not convex along generalized geodesics. Our theoretical analysis opens up interesting avenues for future work.

Acknowledgments and Disclosure of Funding

This work is supported by the Research Council of Finland Flagship programme: Finnish Center for Artificial Intelligence FCAI, and additionally by grants 345811, 348952, and 346376 (VILMA: Centre of Excellence: Virtual Laboratory for Molecular Level Atmospheric Transformations). The authors wish to thank the Finnish Computing Competence Infrastructure (FCCI) for supporting this project with computational and data storage resources. H.P.H. Luu specifically thanks Michel Ledoux, Luigi Ambrosio, and Alain Durmus for helpful information.

References

  • [1] Miju Ahn, Jong-Shi Pang, and Jack Xin. Difference-of-convex learning: directional stationarity, optimality, and sparsity. SIAM Journal on Optimization, 27(3):1637–1665, 2017.
  • [2] Luigi Ambrosio and Nicola Gigli. A user’s guide to optimal transport. Modelling and Optimisation of Flows on Networks: Cetraro, Italy 2009, Editors: Benedetto Piccoli, Michel Rascle, pages 1–155, 2013.
  • [3] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.
  • [4] Luigi Ambrosio and Giuseppe Savaré. Gradient flows of probability measures. Handbook of differential equations: evolutionary equations, 3:1–136, 2006.
  • [5] Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. In International Conference on Machine Learning, pages 146–155. PMLR, 2017.
  • [6] Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton. Maximum mean discrepancy gradient flow. Advances in Neural Information Processing Systems, 32, 2019.
  • [7] Miroslav Bačák and Jonathan M Borwein. On difference convexity of locally Lipschitz functions. Optimization, 60(8-9):961–978, 2011.
  • [8] Krishna Balasubramanian, Sinho Chewi, Murat A Erdogdu, Adil Salim, and Shunshi Zhang. Towards a theory of non-log-concave sampling: first-order stationarity guarantees for Langevin Monte Carlo. In Conference on Learning Theory, pages 2896–2923. PMLR, 2022.
  • [9] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
  • [10] Espen Bernton. Langevin Monte Carlo and JKO splitting. In Conference on learning theory, pages 1777–1798. PMLR, 2018.
  • [11] Patrick Billingsley. Convergence of probability measures. John Wiley & Sons, 2013.
  • [12] Adrien Blanchet and Jérôme Bolte. A family of functional inequalities: Łojasiewicz inequalities and displacement convex functions. Journal of Functional Analysis, 275(7):1650–1673, 2018.
  • [13] Vladimir Igorevich Bogachev and Maria Aparecida Soares Ruas. Measure theory, volume 2. Springer, 2007.
  • [14] Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
  • [15] Benoît Bonnet. A Pontryagin maximum principle in Wasserstein spaces for constrained optimal control problems. ESAIM: Control, Optimisation and Calculus of Variations, 25:52, 2019.
  • [16] Jonathan Borwein and Adrian Lewis. Convex analysis and nonlinear optimization: theory and examples. Springer, 2006.
  • [17] Glen E Bredon. Topology and geometry, volume 139. Springer Science & Business Media, 2013.
  • [18] Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on pure and applied mathematics, 44(4):375–417, 1991.
  • [19] brett1479 (https://math.stackexchange.com/users/62876/brett1479). Borel sigma algebra of one point compactification. Mathematics Stack Exchange. URL:https://math.stackexchange.com/q/3532983 (version: 2020-02-03).
  • [20] Haim Brézis. Functional analysis, Sobolev spaces and partial differential equations, volume 2. Springer, 2011.
  • [21] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018.
  • [22] Frank H Clarke. Optimization and nonsmooth analysis. SIAM, 1990.
  • [23] Ying Cui, Jong-Shi Pang, and Bodhisattva Sen. Composite difference-max programs for modern statistical estimation problems. SIAM Journal on Optimization, 28(4):3344–3374, 2018.
  • [24] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(3):651–676, 2017.
  • [25] Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
  • [26] Alain Durmus and Éric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551 – 1587, 2017.
  • [27] Alain Durmus and Éric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854 – 2882, 2019.
  • [28] Alain Durmus, Eric Moulines, and Marcelo Pereyra. Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018.
  • [29] Matthias Erbar. The heat equation on manifolds as a gradient flow in the Wasserstein space. In Annales de l’IHP Probabilités et statistiques, volume 46, pages 1–23, 2010.
  • [30] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305, 2016.
  • [31] Piotr Hajlasz. Is there a Borel measurable $f:\mathbb{R}^{d}\to\mathbb{R}^{d}$ such that $f(x)\in\partial\varphi(x)$ for all $x$? MathOverflow. URL:https://mathoverflow.net/q/453991 (version: 2023-12-02).
  • [32] W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. 1970.
  • [33] Horst Herrlich. Axiom of choice, volume 1876. Springer, 2006.
  • [34] Sashank J Reddi, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Advances in neural information processing systems, 29, 2016.
  • [35] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
  • [36] Nicolas Lanzetti, Saverio Bolognani, and Florian Dörfler. First-order conditions for optimization in the Wasserstein space. arXiv preprint arXiv:2209.12197, 2022.
  • [37] Stanisław Łojasiewicz. Ensembles semi-analytiques. IHES notes, page 220, 1965.
  • [38] Hoai An Le Thi, Van Ngai Huynh, and Tao Pham Dinh. Convergence analysis of difference-of-convex algorithm with subanalytic data. Journal of Optimization Theory and Applications, 179(1):103–126, 2018.
  • [39] Hoai An Le Thi, Van Ngai Huynh, Tao Pham Dinh, and Hoang Phuc Hau Luu. Stochastic difference-of-convex-functions algorithms for nonconvex programming. SIAM Journal on Optimization, 32(3):2263–2293, 2022.
  • [40] Hoai An Le Thi and Tao Pham Dinh. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of operations research, 133:23–46, 2005.
  • [41] John M. Lee. Smooth Manifolds. In Introduction to Smooth Manifolds, pages 1–31. Springer.
  • [42] Tung Duy Luu, Jalal Fadili, and Christophe Chesneau. Sampling from non-smooth distributions through Langevin diffusion. Methodology and Computing in Applied Probability, 23:1173–1201, 2021.
  • [43] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory, pages 2388–2464. PMLR, 2019.
  • [44] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953.
  • [45] Petr Mokrov, Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M Solomon, and Evgeny Burnaev. Large-scale Wasserstein gradient flows. Advances in Neural Information Processing Systems, 34:15243–15256, 2021.
  • [46] Boris Mordukhovich and Mau Nam Nguyen. An easy path to convex analysis and applications. Springer Nature, 2023.
  • [47] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
  • [48] Maher Nouiehed, Jong-Shi Pang, and Meisam Razaviyayn. On the pervasiveness of difference-convexity in optimization and statistics. Mathematical Programming, 174(1):195–222, 2019.
  • [49] Felix Otto and Cédric Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.
  • [50] Jong-Shi Pang, Meisam Razaviyayn, and Alberth Alvarado. Computing B-stationary points of nonsmooth DC programs. Mathematics of Operations Research, 42(1):95–118, 2017.
  • [51] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and trends® in Optimization, 1(3):127–239, 2014.
  • [52] Tao Pham Dinh and Hoai An Le Thi. Convex analysis approach to DC programming: theory, algorithms and applications. Acta mathematica vietnamica, 22(1):289–355, 1997.
  • [53] Rick Presman and Jason Xu. Distance-to-set priors and constrained Bayesian inference. In International Conference on Artificial Intelligence and Statistics, pages 2310–2326. PMLR, 2023.
  • [54] Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.
  • [55] R Tyrrell Rockafellar. Lagrange multipliers and optimality. SIAM review, 35(2):183–238, 1993.
  • [56] R Tyrrell Rockafellar. Convex analysis, volume 11. Princeton university press, 1997.
  • [57] Adil Salim, Anna Korba, and Giulia Luise. The Wasserstein proximal gradient algorithm. Advances in Neural Information Processing Systems, 33:12356–12366, 2020.
  • [58] Amirhossein Taghvaei and Prashant Mehta. Accelerated flow for probability distributions. In International conference on machine learning, pages 6076–6085. PMLR, 2019.
  • [59] Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. Advances in neural information processing systems, 32, 2019.
  • [60] Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.
  • [61] Cédric Villani. Topics in optimal transportation, volume 58. American Mathematical Soc., 2021.
  • [62] Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pages 2093–3027. PMLR, 2018.
  • [63] Hu Zhang and Yi-Shuai Niu. A Boosted-DCA with power-sum-DC decomposition for linearly constrained polynomial programs. Journal of Optimization Theory and Applications, pages 1–40, 2024.
  • [64] Xingyu Zhou. On the Fenchel duality between strong convexity and Lipschitz continuous gradient. arXiv preprint arXiv:1803.06573, 2018.

Appendix A Theory

Lemma 2 (Transfer lemma).

[2, Sect. 1] Let $T:\mathbb{R}^{m}\to\mathbb{R}^{n}$ be a measurable map and $\mu\in\mathcal{P}(\mathbb{R}^{m})$. Then $T_{\#}\mu\in\mathcal{P}(\mathbb{R}^{n})$ and $\int f(y)\,d(T_{\#}\mu)(y)=\int (f\circ T)(x)\,d\mu(x)$ for every measurable function $f:\mathbb{R}^{n}\to\mathbb{R}$, where the identity is understood as follows: one of the integrals exists (possibly equal to $\pm\infty$) if and only if the other one exists, and in that case they are equal. Consequently, for a bounded function $f$, both integrals exist as real numbers and are equal.
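The transfer identity can be checked by hand on a finitely supported measure, where both integrals are finite sums. The following is a minimal sketch under that assumption; the helper names `pushforward` and `integrate`, the map `T`, the test function `f`, and the atoms and weights are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def pushforward(atoms, weights, T):
    """Push the measure sum_i w_i * delta_{x_i} through the map T."""
    image = defaultdict(float)
    for x, w in zip(atoms, weights):
        image[T(x)] += w  # the mass at x moves to T(x); colliding images merge
    return dict(image)


def integrate(f, measure):
    """Integrate f against a finitely supported measure {point: mass}."""
    return sum(f(y) * w for y, w in measure.items())


# Hypothetical toy data: a probability measure on R^2 pushed to R.
atoms = [(0.0, 1.0), (1.0, 0.0), (2.0, -1.0), (0.5, 2.0)]
weights = [0.1, 0.2, 0.3, 0.4]


def T(x):
    return x[0] + x[1]


def f(y):
    return y * y


mu = dict(zip(atoms, weights))
T_mu = pushforward(atoms, weights, T)

lhs = integrate(f, T_mu)                # integral of f against T#mu
rhs = integrate(lambda x: f(T(x)), mu)  # integral of (f o T) against mu
print(lhs, rhs)
```

Three of the four atoms are mapped to the same point, so the pushforward has only two atoms, yet the two integrals agree up to floating-point rounding.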

Lemma 3.

[3, Rmk. 6.2.11] Let $\mu,\nu\in\mathcal{P}_{2,\operatorname{abs}}(X)$. Then $T_{\nu}^{\mu}\circ T_{\mu}^{\nu}=I$ $\mu$-a.e. and $T_{\mu}^{\nu}\circ T_{\nu}^{\mu}=I$ $\nu$-a.e.
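A case where the inverse relation of Lemma 3 is visible in closed form is a pair of one-dimensional Gaussians, for which the optimal map is the affine monotone rearrangement. The sketch below assumes this special case only; the helper `gaussian_ot_map` and the chosen means and standard deviations are hypothetical.

```python
def gaussian_ot_map(m_src, s_src, m_dst, s_dst):
    """Monotone (optimal) map from N(m_src, s_src^2) to N(m_dst, s_dst^2) on R."""
    return lambda x: m_dst + (s_dst / s_src) * (x - m_src)


# Hypothetical parameters: mu = N(0, 1), nu = N(3, 2^2).
T_mu_nu = gaussian_ot_map(0.0, 1.0, 3.0, 2.0)  # transports mu to nu
T_nu_mu = gaussian_ot_map(3.0, 2.0, 0.0, 1.0)  # transports nu to mu

points = [-2.0, -0.5, 0.0, 1.25, 4.0]
round_trip_mu = [T_nu_mu(T_mu_nu(x)) for x in points]  # should reproduce points
round_trip_nu = [T_mu_nu(T_nu_mu(y)) for y in points]  # should reproduce points
print(round_trip_mu, round_trip_nu)
```

In this affine setting the two compositions are the identity everywhere, which is stronger than the almost-everywhere statement of the lemma.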

Theorem 6 (Characterization of Fréchet subdifferential for geodesically convex functions).

[3, Section 10.1] Suppose $\phi:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ is proper, l.s.c., and convex along geodesics. Let $\mu\in\operatorname{dom}(\partial\phi)\cap\mathcal{P}_{2,\operatorname{abs}}(X)$. Then a vector $\xi\in L^{2}(X,X,\mu)$ belongs to the Fréchet subdifferential of $\phi$ at $\mu$ if and only if

\[ \phi(\nu)-\phi(\mu)\geq\int_{X}\langle\xi(x),T_{\mu}^{\nu}(x)-x\rangle\,d\mu(x)\quad\forall\nu\in\operatorname{dom}(\phi). \]
Lemma 4.

Let $H:X\to\mathbb{R}$ be a convex function having quadratic growth and let $\xi$ be a measurable selector of $\partial H$, i.e., $\xi(x)\in\partial H(x)$ for all $x\in X$. Then, for all $\mu\in\mathcal{P}_{2}(X)$,

\[ \int_{X}\|\xi(x)\|^{2}\,d\mu(x)<+\infty. \tag{5} \]

In other words, $\xi\in L^{2}(X,X,\mu)$ for all $\mu\in\mathcal{P}_{2}(X)$.

Proof.

Since $\xi(x)\in\partial H(x)$, by the tangent inequality for convex functions, we have

\[ H(y)\geq H(x)+\langle\xi(x),y-x\rangle,\quad\forall y\in X. \]

By picking $y=x+\eta\xi(x)$ for some $\eta>0$, we get

\[ H(x+\eta\xi(x))-H(x)\geq\langle\xi(x),\eta\xi(x)\rangle=\eta\|\xi(x)\|^{2}. \tag{6} \]

Since $H$ has quadratic growth, for some $a>0$,

\begin{align*}
H(x+\eta\xi(x))-H(x) &\leq |H(x+\eta\xi(x))|+|H(x)| \\
&\leq a\left(\|x+\eta\xi(x)\|^{2}+1\right)+a(\|x\|^{2}+1) \\
&\leq 2a+a\|x\|^{2}+a(\|x\|+\eta\|\xi(x)\|)^{2} \\
&\leq 2a+a\|x\|^{2}+2a(\|x\|^{2}+\eta^{2}\|\xi(x)\|^{2}) \\
&= 2a+3a\|x\|^{2}+2a\eta^{2}\|\xi(x)\|^{2}.
\end{align*}

Combining with (6), it holds that

\[ \eta(1-2a\eta)\|\xi(x)\|^{2}\leq 2a+3a\|x\|^{2}. \]

By choosing $0<\eta<1/(2a)$, we obtain

\[ \|\xi(x)\|^{2}\leq\dfrac{2a}{\eta(1-2a\eta)}+\dfrac{3a}{\eta(1-2a\eta)}\|x\|^{2}. \]

Therefore, $\|\xi(x)\|^{2}$ has quadratic growth and, as a consequence, (5) holds for any $\mu\in\mathcal{P}_{2}(X)$. ∎
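The final bound of the proof can be sanity-checked numerically on a concrete nonsmooth example. The sketch below assumes the hypothetical one-dimensional choice $H(x)=|x|+x^{2}/2$, for which $|H(x)|\leq a(x^{2}+1)$ holds with $a=1$, together with the selector equal to $0$ at the kink and the step $\eta=1/(4a)$. These choices are illustrative and are not taken from the paper.

```python
def H(x):
    """Hypothetical convex, nonsmooth example with quadratic growth."""
    return abs(x) + 0.5 * x * x


def xi(x):
    """One selector of the subdifferential of H; at x = 0 it picks 0 from [-1, 1]."""
    if x > 0:
        return x + 1.0
    if x < 0:
        return x - 1.0
    return 0.0


a = 1.0                          # growth constant: |H(x)| <= a * (x^2 + 1)
eta = 1.0 / (4.0 * a)            # any value in (0, 1/(2a)) is admissible
c = eta * (1.0 - 2.0 * a * eta)  # positive factor eta * (1 - 2*a*eta)


def bound(x):
    """Right-hand side of the last display in the proof of Lemma 4."""
    return (2.0 * a + 3.0 * a * x * x) / c


grid = [k / 10.0 for k in range(-500, 501)]
growth_ok = all(abs(H(x)) <= a * (x * x + 1.0) for x in grid)
bound_ok = all(xi(x) ** 2 <= bound(x) for x in grid)
print(growth_ok, bound_ok)
```

Because the bound is quadratic in $x$, integrating it against any measure with finite second moment gives a finite value, which is exactly the content of (5).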

Lemma 5.

Let $\phi:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ be a proper function. Let $\mu^{*}$ be a local minimizer of $\phi$. Then $\mu^{*}$ is a Fréchet stationary point of $\phi$.

Proof.

There exists $r>0$ such that $\phi(\mu^{*})\leq\phi(\mu)$ for all $\mu\in\mathcal{P}_{2}(X)$ with $W_{2}(\mu,\mu^{*})<r$. It follows that

\[ \liminf_{\mu\xrightarrow{\text{Wass}}\mu^{*}}\dfrac{\phi(\mu)-\phi(\mu^{*})}{W_{2}(\mu,\mu^{*})}\geq 0, \]

so $0\in\partial_{F}^{-}\phi(\mu^{*})$, that is, $\mu^{*}$ is a Fréchet stationary point of $\phi$. ∎

Lemma 6.

Let $\phi:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ be a proper, l.s.c., geodesically convex function. Suppose that $\mu^{*}\in\mathcal{P}_{2,\operatorname{abs}}(X)$ is a Fréchet stationary point of $\phi$. Then $\mu^{*}$ is a global minimizer of $\phi$.

Proof.

By definition of Fréchet stationarity, $0\in\partial\phi(\mu^{*})$. By the characterization of the subdifferential of geodesically convex functions (Thm. 6), it holds that $\phi(\mu)\geq\phi(\mu^{*})$ for all $\mu\in\operatorname{dom}(\phi)$, that is, $\mu^{*}$ is a global minimizer of $\phi$. ∎

Lemma 7.

Under Assumption 1, let $\mu^{*}\in\operatorname{dom}(\mathcal{F})$ be a local minimizer of $\mathcal{F}$. Then $\mu^{*}$ is a critical point of $\mathcal{F}$, i.e., $\partial(\mathscr{H}+\mathcal{E}_{G})(\mu^{*})\cap\partial\mathcal{E}_{H}(\mu^{*})\neq\emptyset$.

Proof.

Since $\mu^{*}$ is a local minimizer of $\mathcal{F}$, there exists $r>0$ such that

\[ \mathcal{F}(\mu^{*})\leq\mathcal{F}(\mu),\quad\forall\mu\in\mathcal{P}_{2}(X):W_{2}(\mu,\mu^{*})<r. \tag{7} \]

Let $\xi$ be a measurable selector of $\partial H$. Thanks to Lemma 4, $\xi\in L^{2}(X,X,\mu^{*})$. According to [4, Prop. 4.13], $\xi\in\partial\mathcal{E}_{H}(\mu^{*})$. It follows from Thm. 6 that

\[ \mathcal{E}_{H}(\mu)\geq\mathcal{E}_{H}(\mu^{*})+\int_{X}\langle\xi(x),T_{\mu^{*}}^{\mu}(x)-x\rangle\,d\mu^{*}(x),\quad\forall\mu\in\mathcal{P}_{2}(X). \tag{8} \]

From (7) and (8), for $\mu\in B(\mu^{*},r)$,

\[ \mathscr{H}(\mu)+\mathcal{E}_{G}(\mu)\geq\mathscr{H}(\mu^{*})+\mathcal{E}_{G}(\mu^{*})+\int_{X}\langle\xi(x),T_{\mu^{*}}^{\mu}(x)-x\rangle\,d\mu^{*}(x). \]

Therefore, $\xi\in\partial(\mathscr{H}+\mathcal{E}_{G})(\mu^{*})$ since $\mathscr{H}+\mathcal{E}_{G}$ is geodesically convex. It follows that $\mu^{*}$ is a critical point of $\mathcal{F}$. ∎

Lemma 8.

Let $\mathcal{U},\mathcal{V}:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$. The following statements hold.

  • a.

    $\partial_{F}^{-}(\mathcal{U}+\mathcal{V})(\mu)\supset\partial_{F}^{-}\mathcal{U}(\mu)+\partial_{F}^{-}\mathcal{V}(\mu).$

  • b.

    If $\mathcal{V}$ is Wasserstein differentiable at $\mu$, then

    \[ \partial_{F}^{-}(\mathcal{U}+\mathcal{V})(\mu)=\partial_{F}^{-}\mathcal{U}(\mu)+\nabla_{W}\mathcal{V}(\mu). \tag{9} \]
Proof.

Item a. is immediate from the definition of the Fréchet subdifferential. For item b., from item a., we first see that

\[ \partial_{F}^{-}(\mathcal{U}+\mathcal{V})(\mu)\supset\partial_{F}^{-}\mathcal{U}(\mu)+\nabla_{W}\mathcal{V}(\mu). \tag{10} \]

On the other hand, we apply item a. to $\mathcal{U}+\mathcal{V}$ and $-\mathcal{V}$ to obtain

\[ \partial_{F}^{-}\mathcal{U}(\mu)\supset\partial_{F}^{-}(\mathcal{U}+\mathcal{V})(\mu)+\partial_{F}^{-}(-\mathcal{V})(\mu). \]

Since $-\nabla_{W}\mathcal{V}(\mu)\in\partial_{F}^{-}(-\mathcal{V})(\mu)$, it follows that

\[ \partial_{F}^{-}\mathcal{U}(\mu)\supset\partial_{F}^{-}(\mathcal{U}+\mathcal{V})(\mu)-\nabla_{W}\mathcal{V}(\mu). \tag{11} \]

From (10) and (11), we derive (9). ∎

Lemma 9.

Let $\mu\ll\mathscr{L}^{d}$ and let $g$ be a strongly convex function. Then $\nabla g_{\#}\mu\ll\mathscr{L}^{d}$.

Proof.

Let $\Omega=\{x:\nabla^{2}g(x)\text{ exists}\}$. Then $\mathscr{L}^{d}(\mathbb{R}^{d}\setminus\Omega)=0$ (Aleksandrov; see, e.g., [3, Thm. 5.5.4]). Since $g$ is strongly convex, $\nabla g$ is injective on $\Omega$ and $|\det\nabla^{2}g|>0$ on $\Omega$. By applying [3, Lemma 5.5.3], $\nabla g_{\#}\mu\ll\mathscr{L}^{d}$. ∎
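The injectivity step can be illustrated on a one-dimensional example. The sketch below assumes the hypothetical function $g(x)=x^{2}/2+|x|$, which is strongly convex with modulus $\alpha=1$ and differentiable away from $0$. It checks on a finite grid that $\nabla g$ is injective and expands distances by at least $\alpha$. This is a different, more elementary view than the route through [3, Lemma 5.5.3] taken in the proof: an expansive map has a Lipschitz inverse on its image, and Lipschitz maps send Lebesgue-null sets to null sets.

```python
def grad_g(x):
    """Gradient of the hypothetical g(x) = x^2/2 + |x|, defined for x != 0."""
    return x + (1.0 if x > 0 else -1.0)


alpha = 1.0  # strong convexity modulus of g

# Finite grid inside Omega = R minus {0}, where the second derivative exists.
omega = [k / 7.0 for k in range(-70, 71) if k != 0]

# Strong monotonicity: |grad g(x) - grad g(y)| >= alpha * |x - y| on Omega.
# The small slack only absorbs floating-point rounding.
expansive = all(
    abs(grad_g(x) - grad_g(y)) >= alpha * abs(x - y) - 1e-12
    for x in omega
    for y in omega
)

# Injectivity on the grid: distinct points have distinct images.
images = [grad_g(x) for x in omega]
injective = len(set(images)) == len(omega)
print(expansive, injective)
```

In this example the image of $\nabla g$ omits the interval $[-1,1]$, so the pushforward measure gives that interval zero mass, which is compatible with absolute continuity.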

Lemma 10.

[57] Let $\mathcal{G}:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ be proper and l.s.c. Suppose that $\mathcal{G}$ is convex along generalized geodesics. Let $\nu\in\mathcal{P}_{2,\operatorname{abs}}(X)$ and $\mu,\pi\in\mathcal{P}_{2}(X)$. If $\xi\in\partial\mathcal{G}(\mu)$, then

$$\int_{X}\langle\xi\circ T_{\nu}^{\mu}(x),T_{\nu}^{\pi}(x)-T_{\nu}^{\mu}(x)\rangle\,d\nu(x)\leq\mathcal{G}(\pi)-\mathcal{G}(\mu).$$
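As a sanity check of the inequality in Lemma 10, the following sketch evaluates both sides in a one-dimensional discrete analogue. Everything here is an illustrative assumption rather than part of the paper: the functional is a potential energy $\mathcal{G}(\mu)=\int V\,d\mu$ with the hypothetical convex potential $V(x)=x^{4}/4+x^{2}/2$, the subgradient is taken to be $\xi=V^{\prime}$, and the maps $T_{\nu}^{\mu}$, $T_{\nu}^{\pi}$ are emulated by pairing sorted samples (the monotone rearrangement in 1D), which replaces the absolutely continuous reference $\nu$ by a uniform grid of quantile levels.

```python
import random


def V(x: float) -> float:
    """Hypothetical convex potential (an assumption for this sketch)."""
    return 0.25 * x ** 4 + 0.5 * x ** 2


def dV(x: float) -> float:
    """Derivative of V, playing the role of the subgradient xi."""
    return x ** 3 + x


def lemma10_sides(mu_samples, pi_samples):
    """Return (lhs, rhs) of the Lemma 10 inequality for G(mu) = E_mu[V].

    Sorting both samples pairs equal quantile levels, which is how
    T_nu^mu and T_nu^pi act in one dimension.
    """
    if len(mu_samples) != len(pi_samples):
        raise ValueError("samples must have equal size")
    m = sorted(mu_samples)
    p = sorted(pi_samples)
    n = len(m)
    lhs = sum(dV(a) * (b - a) for a, b in zip(m, p)) / n
    rhs = sum(V(b) for b in p) / n - sum(V(a) for a in m) / n
    return lhs, rhs


if __name__ == "__main__":
    random.seed(0)
    mu = [random.gauss(1.0, 2.0) for _ in range(200)]
    pi = [random.gauss(-0.5, 0.7) for _ in range(200)]
    lhs, rhs = lemma10_sides(mu, pi)
    print("lhs <= rhs:", lhs <= rhs)
```

In this simplified setting the inequality reduces to the pointwise convexity inequality $V(b)\geq V(a)+V^{\prime}(a)(b-a)$, so the check illustrates the statement but does not exercise its full generality.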

A.1 Existence of a Borel measurable selector of the subdifferential of a convex function

Given a convex function $H:X\to\mathbb{R}$, we prove that there exists a Borel measurable selector $S$ of $\partial H$, i.e., $S(x)\in\partial H(x)$ for all $x$. Although this problem is of natural interest, we are not aware of a statement or proof of it in standard convex analysis textbooks. Credit goes to a fairly recent MathOverflow thread [31], on which the following detailed proof is based.

Firstly, we recall the Alexandroff compactification of a topological space $(X,\tau)$. From set theory, $X$ is strictly smaller than $2^{X}$, the set of all subsets of $X$, i.e., there is no bijection from $X$ to $2^{X}$. So $2^{X}$ cannot be contained in $X$, and hence there is an element, denoted $\infty$, that is not in $X$. We write $X^{\infty}=X\cup\{\infty\}$. The one-point Alexandroff compactification states that (1) there exists a topology $\tau^{\infty}$ on $X^{\infty}$ admitting $X$ as a topological subspace, i.e., the original topology $\tau$ on $X$ is inherited from $\tau^{\infty}$, and (2) $(X^{\infty},\tau^{\infty})$ is compact.

The topology $\tau^{\infty}$ can be described explicitly: its open sets are either open sets of $\tau$ or sets of the form $(X\setminus S)\cup\{\infty\}$, where $S$ is a closed compact subset of $X$.

In our case, $X=\mathbb{R}^{d}$ and $\tau$ is the standard Euclidean topology, so the Alexandroff compactification of $X$ is also metrizable [17, Thm. 12.12]. It is, in fact, homeomorphic to the sphere $\mathbb{S}^{d}$ whose topology is inherited from the ambient space $\mathbb{R}^{d+1}$. Moreover, the mentioned metric is the Riemannian metric of $\mathbb{S}^{d}$ [41, Thm. 13.29].

Secondly, since $\psi(x):=H(x)+(1/2)\|x\|^{2}$ is 1-strongly convex, its Fenchel conjugate $y\mapsto\psi^{*}(y)$, defined as

$$\psi^{*}(y)=\sup_{x\in\mathbb{R}^{d}}\{\langle x,y\rangle-\psi(x)\},$$

is 1-smooth [64, Thm. 1].

By [56, Cor. 23.5.1], $(\partial\psi)^{-1}=\nabla\psi^{*}$ in the sense that

$$y\in\partial\psi(\nabla\psi^{*}(y))\quad\forall y\in\mathbb{R}^{d}. \tag{12}$$

On the other hand, $\partial\psi$ is strongly monotone [46, Ex. 3.9] in the following sense:

$$\langle x_{2}-x_{1},y_{2}-y_{1}\rangle\geq\|x_{2}-x_{1}\|^{2} \tag{13}$$

for all $x_{1},x_{2}\in\mathbb{R}^{d}$ and $y_{1}\in\partial\psi(x_{1})$, $y_{2}\in\partial\psi(x_{2})$.

We see that $\nabla\psi^{*}$ is surjective. Indeed, we show that for every $x\in\mathbb{R}^{d}$ there exists $y\in\mathbb{R}^{d}$ such that $\nabla\psi^{*}(y)=x$; in fact, this relation holds for any $y\in\partial\psi(x)$. By contradiction, suppose that $\nabla\psi^{*}(y)\neq x$. From the strong monotonicity of $\partial\psi$ as in (13), $\partial\psi(\nabla\psi^{*}(y))\cap\partial\psi(x)=\emptyset$. However, from (12) and by the choice of $y$, it holds that $y\in\partial\psi(\nabla\psi^{*}(y))\cap\partial\psi(x)$. This is a contradiction.
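The relations above can be checked numerically on a simple one-dimensional instance. The sketch below assumes the illustrative choice $H(x)=|x|$ (not taken from the paper), for which $\psi(x)=|x|+x^{2}/2$ and $\nabla\psi^{*}$ is the soft-thresholding map. It verifies relation (12) on a grid, the 1-Lipschitz property of $\nabla\psi^{*}$ (1-smoothness of $\psi^{*}$), and shows that $\nabla\psi^{*}$ is surjective but not injective, which is why a restriction to a suitable set is needed later.

```python
def grad_psi_star(y: float) -> float:
    """Gradient of the conjugate of psi(x) = |x| + x**2 / 2 (soft-thresholding)."""
    if y > 1.0:
        return y - 1.0
    if y < -1.0:
        return y + 1.0
    return 0.0


def in_subdiff_psi(x: float, y: float, tol: float = 1e-12) -> bool:
    """Check whether y belongs to the subdifferential of psi at x, i.e. x + d|x|."""
    if x > 0:
        return abs(y - (x + 1.0)) <= tol
    if x < 0:
        return abs(y - (x - 1.0)) <= tol
    return -1.0 - tol <= y <= 1.0 + tol


def some_preimage(x: float) -> float:
    """Return one y with grad_psi_star(y) == x, witnessing surjectivity."""
    if x > 0:
        return x + 1.0
    if x < 0:
        return x - 1.0
    return 0.0


if __name__ == "__main__":
    grid = [i / 8 for i in range(-40, 41)]
    print("relation (12) holds on grid:",
          all(in_subdiff_psi(grad_psi_star(y), y) for y in grid))
    print("two preimages of 0:", grad_psi_star(-0.5), grad_psi_star(0.5))
```

The whole interval $[-1,1]$ is mapped to $0$ in this example, so $(\nabla\psi^{*})^{-1}$ is set-valued at $0$; any point of that interval is a valid subgradient of $\psi$ there.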

Thirdly, we recall a fundamental result on the compactness of the subdifferential of a convex function: if $C$ is compact, then $\partial\psi(C)$ is compact [56, Thm. 24.7].

Fourthly, we need the Federer-Morse theorem [13] as follows:

Theorem 7.

Let $Z$ be a compact metric space, $Y$ be a Hausdorff topological space, and $f:Z\to Y$ be a continuous mapping. Then, there exists a Borel set $B\subset Z$ such that $f(B)=f(Z)$ and $f$ is injective on $B$. Furthermore, $f^{-1}:f(Z)\to B$ is Borel.

Now we observe that $\nabla\psi^{*}(x)\to\infty$ as $x\to\infty$. Otherwise, there would exist a sequence $x_{k}\to\infty$ along which $\nabla\psi^{*}(x_{k})$ remains in a compact set $C$; by (12), $x_{k}\in\partial\psi(C)$, which is compact, a contradiction. We can then extend $\nabla\psi^{*}$ to the Alexandroff compactification $X^{\infty}=\mathbb{R}^{d}\cup\{\infty\}$ by simply setting $\nabla\psi^{*}(\infty)=\infty$. We shall show that this extension of $\nabla\psi^{*}$ is continuous from $(X^{\infty},\tau^{\infty})$ to $(X^{\infty},\tau^{\infty})$, i.e., $(\nabla\psi^{*})^{-1}(V)\in\tau^{\infty}$ for all $V\in\tau^{\infty}$. Recall that, by construction, open sets of $\tau^{\infty}$ are either open sets of $\tau$ or sets of the form $(X\setminus S)\cup\{\infty\}$, where $S$ is a compact subset of $X$.
The former type of open set is handled easily since $\nabla\psi^{*}$ is already continuous on $(X,\tau)$. For the latter type, let $U=(\nabla\psi^{*})^{-1}((X\setminus S)\cup\{\infty\})=(\nabla\psi^{*})^{-1}(X\setminus S)\cup\{\infty\}=(X\setminus(\nabla\psi^{*})^{-1}(S))\cup\{\infty\}$ for a compact set $S$. Proving that $U$ is open boils down to proving that $(\nabla\psi^{*})^{-1}(S)$ is compact. Indeed, it is closed since $S$ is closed and $\nabla\psi^{*}$ is continuous, and it is bounded, since otherwise we would contradict $\nabla\psi^{*}(x)\to\infty$ as $x\to\infty$.

We can now apply the Federer-Morse theorem with $Z=Y=(X^{\infty},\tau^{\infty})$, noting that $(X^{\infty},\tau^{\infty})$ is metrizable and that every metric space is Hausdorff, and with $f=\nabla\psi^{*}$: there exists a Borel set $B\subset X^{\infty}$ such that $\nabla\psi^{*}|_{B}:B\to X^{\infty}$ is a bijection and the inverse mapping $(\nabla\psi^{*}|_{B})^{-1}:X^{\infty}\to B$ is Borel measurable (here Borel sets and Borel measurability are with respect to $\tau^{\infty}$, not yet $\tau$). This is the Borel (w.r.t. $\tau^{\infty}$) selector of $\partial\psi$.

Finally, we need to convert Borel measurability w.r.t. $\tau^{\infty}$ to Borel measurability w.r.t. $\tau$. In terms of the mapping, $\infty$ is mapped to $\infty$ in both directions. So we only need to show that $(\nabla\psi^{*}|_{B})^{-1}:X\to X$ is Borel measurable w.r.t. $\tau$. Take any Borel set (w.r.t. $\tau$) $E\subset X$; then $(\nabla\psi^{*}|_{B})(B\cap E)$ is a Borel set w.r.t. $\tau^{\infty}$ and does not contain $\infty$. We shall prove that $(\nabla\psi^{*}|_{B})(B\cap E)$ is a Borel set w.r.t. $\tau$. This follows directly from the following claim, which is from another Mathematics Stack Exchange thread [19].

Claim [19]:

$$\sigma(\tau^{\infty})=\sigma(\tau\cup\{\infty\})=\sigma(\tau)\cup\{V\cup\{\infty\}:V\in\sigma(\tau)\}. \tag{14}$$

A sketch of the proof of the claim goes as follows. For the first equality in (14), first we have $\sigma(\tau\cup\{\infty\})\subset\sigma(\tau^{\infty})$ because (1) $\tau\subset\tau^{\infty}$ and (2) $\{\infty\}=X^{\infty}\setminus X\in\sigma(\tau^{\infty})$ as $X,X^{\infty}\in\tau^{\infty}$. On the other hand, $\sigma(\tau^{\infty})\subset\sigma(\tau\cup\{\infty\})$ because, again, of the construction of $\tau^{\infty}$: let $U\in\tau^{\infty}$; if $U\in\tau$ then $U\in\sigma(\tau\cup\{\infty\})$, otherwise $U=(X\setminus S)\cup\{\infty\}$ for some compact set $S\subset X$. As $X=\mathbb{R}^{d}$, $S$ is closed, so $(X\setminus S)\in\tau$, implying $U\in\sigma(\tau\cup\{\infty\})$.
For the second equality in (14), it is straightforward to verify that $\mathscr{G}:=\sigma(\tau)\cup\{V\cup\{\infty\}:V\in\sigma(\tau)\}$ is a sigma-algebra on $X^{\infty}$. Since $\tau\cup\{\infty\}\subset\mathscr{G}$, it holds that $\sigma(\tau\cup\{\infty\})\subset\mathscr{G}$. Conversely, as $\sigma(\tau)\subset\sigma(\tau\cup\{\infty\})$ and, consequently, $V\cup\{\infty\}\in\sigma(\tau\cup\{\infty\})$ for all $V\in\sigma(\tau)$, it holds that $\mathscr{G}\subset\sigma(\tau\cup\{\infty\})$.

We conclude that $(\nabla\psi^{*}|_{B})^{-1}:X\to X$ is a Borel (w.r.t. $\tau$) selector of $\partial\psi$. As a consequence, $S:=(\nabla\psi^{*}|_{B})^{-1}-I$ is a Borel measurable selector of $\partial H$.
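To make the final construction concrete, here is a minimal sketch for the illustrative one-dimensional choice $H(x)=|x|$, which is an assumption of this example and not taken from the paper. In that case $\nabla\psi^{*}$ is soft-thresholding, and one admissible Borel set on which it is injective with full range is $B=(-\infty,-1)\cup\{0\}\cup(1,\infty)$; the Federer-Morse theorem only guarantees that some such set exists, so this particular $B$ is a hypothetical choice. The resulting selector $S=(\nabla\psi^{*}|_{B})^{-1}-I$ is then checked against $\partial|\cdot|$.

```python
def grad_psi_star(y: float) -> float:
    """Soft-thresholding: gradient of the conjugate of psi(x) = |x| + x**2 / 2."""
    if y > 1.0:
        return y - 1.0
    if y < -1.0:
        return y + 1.0
    return 0.0


def in_B(y: float) -> bool:
    """Membership in the hypothetical set B = (-inf, -1) U {0} U (1, inf)."""
    return y == 0.0 or abs(y) > 1.0


def inverse_on_B(x: float) -> float:
    """Inverse of grad_psi_star restricted to B."""
    if x > 0:
        return x + 1.0
    if x < 0:
        return x - 1.0
    return 0.0


def selector_S(x: float) -> float:
    """S = (grad_psi_star restricted to B)^{-1} - I, a selector of d|x|."""
    return inverse_on_B(x) - x


def in_subdiff_abs(x: float, s: float, tol: float = 1e-12) -> bool:
    """Check whether s belongs to the subdifferential of |.| at x."""
    if x > 0:
        return abs(s - 1.0) <= tol
    if x < 0:
        return abs(s + 1.0) <= tol
    return -1.0 - tol <= s <= 1.0 + tol


if __name__ == "__main__":
    xs = [i / 10 for i in range(-50, 51)]
    print("S is a selector on the grid:",
          all(in_subdiff_abs(x, selector_S(x)) for x in xs))
```

With this choice of $B$ the selector is the sign function with $S(0)=0$; picking a different point of $[-1,1]$ as the preimage of $0$ would give another valid selector.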

A.2 DC structure of Maximum Mean Discrepancy

Let $k$ be a kernel whose gradient is Lipschitz continuous, i.e., for some $L>0$,

$$\|\nabla k(x,y)-\nabla k(x^{\prime},y^{\prime})\|\leq L\left(\|x-x^{\prime}\|^{2}+\|y-y^{\prime}\|^{2}\right)^{\frac{1}{2}},\quad\forall x,x^{\prime},y,y^{\prime}\in X,$$

which can be expressed equivalently as

$$\|\nabla_{x}k(x,y)-\nabla_{x}k(x^{\prime},y^{\prime})\|^{2}+\|\nabla_{y}k(x,y)-\nabla_{y}k(x^{\prime},y^{\prime})\|^{2}\leq L^{2}\left(\|x-x^{\prime}\|^{2}+\|y-y^{\prime}\|^{2}\right).$$

Let $\mu^{*}$ be some fixed distribution. Consider the following free-energy functional

$$\mathcal{F}(\mu)=\int_{X}\int_{X}k(x,y)\,d\mu(x)\,d\mu(y)-2\int_{X}\int_{X}k(x,y)\,d\mu^{*}(y)\,d\mu(x).$$

Let $\alpha>0$; we can rewrite $\mathcal{F}$ as follows:

$$\mathcal{F}(\mu)=\int_{X}\int_{X}\left[\alpha\|x\|^{2}+\alpha\|y\|^{2}+k(x,y)\right]d\mu(x)\,d\mu(y)-2\int_{X}\left[\alpha\|x\|^{2}+\int_{X}k(x,y)\,d\mu^{*}(y)\right]d\mu(x).$$

As $k$ is Lipschitz smooth w.r.t. $(x,y)$, $x\mapsto\int k(x,y)\,d\mu^{*}(y)$ is also Lipschitz smooth. Indeed,

$$\begin{aligned}
\left\|\nabla_{x}\int_{X}k(x,y)\,d\mu^{*}(y)-\nabla_{x}\int_{X}k(x^{\prime},y)\,d\mu^{*}(y)\right\|
&=\left\|\int_{X}\left(\nabla_{x}k(x,y)-\nabla_{x}k(x^{\prime},y)\right)d\mu^{*}(y)\right\|\\
&\leq\left(\int_{X}\left\|\nabla_{x}k(x,y)-\nabla_{x}k(x^{\prime},y)\right\|^{2}d\mu^{*}(y)\right)^{\frac{1}{2}}\\
&\leq L\left(\int_{X}\left\|x-x^{\prime}\right\|^{2}d\mu^{*}(y)\right)^{\frac{1}{2}}\\
&=L\|x-x^{\prime}\|.
\end{aligned}$$

Next, as a standard result, if $f$ is an $L$-smooth function, $x\mapsto(\alpha/2)\|x\|^{2}\pm f(x)$ are convex whenever $\alpha\geq L$. Therefore, for $\alpha\geq L$, $W(x,y):=\alpha\|x\|^{2}+\alpha\|y\|^{2}+k(x,y)$ is convex and $F(x)=-2\left[\alpha\|x\|^{2}+\int_{X}k(x,y)\,d\mu^{*}(y)\right]$ is concave. From [3, Prop. 9.3.5], the interaction energy corresponding to $W$ is generalized geodesically convex.
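The DC splitting above can be checked numerically on empirical measures. The following sketch makes several assumptions that are not in the paper: the kernel is the one-dimensional Gaussian kernel $k(x,y)=\exp(-(x-y)^{2}/2)$, for which $L=2$ bounds the Hessian norm, the value $\alpha=2$ is used, and $\mu$, $\mu^{*}$ are replaced by finite particle sets. It compares the original and rewritten forms of $\mathcal{F}$ and tests midpoint convexity of $W$ and of $-F$.

```python
import math
import random

ALPHA = 2.0  # assumed alpha >= L, with L = 2 for the Gaussian kernel below


def k(x: float, y: float) -> float:
    """1D Gaussian kernel with unit bandwidth (an illustrative choice)."""
    return math.exp(-0.5 * (x - y) ** 2)


def W(x: float, y: float, alpha: float = ALPHA) -> float:
    """Interaction potential alpha*x^2 + alpha*y^2 + k(x, y)."""
    return alpha * x * x + alpha * y * y + k(x, y)


def F_pot(x: float, target, alpha: float = ALPHA) -> float:
    """F(x) = -2 [alpha*x^2 + mean_j k(x, y_j)] for an empirical target."""
    return -2.0 * (alpha * x * x + sum(k(x, y) for y in target) / len(target))


def free_energy(particles, target) -> float:
    """Original form: double kernel mean minus twice the cross term."""
    n, m = len(particles), len(target)
    self_term = sum(k(a, b) for a in particles for b in particles) / (n * n)
    cross = sum(k(a, b) for a in particles for b in target) / (n * m)
    return self_term - 2.0 * cross


def free_energy_dc(particles, target, alpha: float = ALPHA) -> float:
    """Rewritten form: interaction energy of W plus potential energy of F."""
    n = len(particles)
    inter = sum(W(a, b, alpha) for a in particles for b in particles) / (n * n)
    pot = sum(F_pot(a, target, alpha) for a in particles) / n
    return inter + pot


def midpoint_convex(f, p, q, tol: float = 1e-12) -> bool:
    """Check f((p+q)/2) <= (f(p)+f(q))/2 for points given as tuples."""
    mid = tuple(0.5 * (a + b) for a, b in zip(p, q))
    return f(*mid) <= 0.5 * (f(*p) + f(*q)) + tol


if __name__ == "__main__":
    random.seed(1)
    target = [random.gauss(0.0, 1.0) for _ in range(20)]
    particles = [random.gauss(1.0, 2.0) for _ in range(15)]
    print(abs(free_energy(particles, target) - free_energy_dc(particles, target)))
```

Midpoint checks on random pairs are only a necessary test of convexity, not a proof; they are meant to catch sign or scaling mistakes in the decomposition. The kernel alone fails the same check, which is why the quadratic terms are added and subtracted.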

A.3 Proof of Lemma 1

Since $I+\gamma S$ is a subgradient selector of a convex function, the optimal transport between $\mu_{n}$ and $\nu_{n+1}$ is given by

$$T_{\mu_{n}}^{\nu_{n+1}}=I+\gamma S, \tag{15}$$

and the optimal transport between $\mu_{n+1}$ and $\nu_{n+1}$ satisfies [3, Lem. 10.1.2]

$$T_{\mu_{n+1}}^{\nu_{n+1}}\in I+\gamma\partial\left(\mathcal{E}_{G}+\mathscr{H}\right)(\mu_{n+1}). \tag{16}$$

Since $\mathcal{E}_{H}$ is convex along generalized geodesics [3, Prop. 9.3.2] and $S$ is a subgradient of $\mathcal{E}_{H}$ at $\mu_{n}$ [4, Proposition 4.13], by Lem. 10 it holds, for any $\nu\in\mathcal{P}_{2,\operatorname{abs}}(X)$,

$$\mathcal{E}_{H}(\mu_{n+1})\geq\mathcal{E}_{H}(\mu_{n})+\int_{X}\langle S\circ T_{\nu}^{\mu_{n}}(x),T_{\nu}^{\mu_{n+1}}(x)-T_{\nu}^{\mu_{n}}(x)\rangle\,d\nu(x).$$

By choosing $\nu=\nu_{n+1}$ (note that $\nu_{n+1}\in\mathcal{P}_{2,\operatorname{abs}}(X)$),

\begin{align*}
\mathcal{E}_{H}(\mu_{n+1}) &\geq \mathcal{E}_{H}(\mu_{n}) + \int_{X} \langle S\circ T_{\nu_{n+1}}^{\mu_{n}}(x),\, T_{\nu_{n+1}}^{\mu_{n+1}}(x) - T_{\nu_{n+1}}^{\mu_{n}}(x)\rangle \, d\nu_{n+1}(x) \\
&= \mathcal{E}_{H}(\mu_{n}) + \dfrac{1}{\gamma}\int_{X} \langle (T_{\mu_{n}}^{\nu_{n+1}} - I)\circ T_{\nu_{n+1}}^{\mu_{n}}(x),\, T_{\nu_{n+1}}^{\mu_{n+1}}(x) - T_{\nu_{n+1}}^{\mu_{n}}(x)\rangle \, d\nu_{n+1}(x) \\
&= \mathcal{E}_{H}(\mu_{n}) + \dfrac{1}{\gamma}\int_{X} \langle x - T_{\nu_{n+1}}^{\mu_{n}}(x),\, T_{\nu_{n+1}}^{\mu_{n+1}}(x) - T_{\nu_{n+1}}^{\mu_{n}}(x)\rangle \, d\nu_{n+1}(x) \tag{17}
\end{align*}

where the second equality uses (15) and the last one uses Lem. 3.

On the other hand, since $\mathcal{E}_{G}+\mathscr{H}$ is convex along generalized geodesics, by applying Lem. 10 for $\mathcal{E}_{G}+\mathscr{H}$ at $\mu_{n+1}$ with a subgradient $\gamma^{-1}(T_{\mu_{n+1}}^{\nu_{n+1}}-I)\in\partial(\mathcal{E}_{G}+\mathscr{H})(\mu_{n+1})$ (from (16)),

\begin{align*}
(\mathcal{E}_{G}+\mathscr{H})(\mu_{n}) &\geq (\mathcal{E}_{G}+\mathscr{H})(\mu_{n+1}) + \int_{X} \left\langle \dfrac{(T_{\mu_{n+1}}^{\nu_{n+1}}-I)}{\gamma}\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x),\, T_{\nu_{n+1}}^{\mu_{n}}(x) - T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\rangle d\nu_{n+1}(x) \\
&= (\mathcal{E}_{G}+\mathscr{H})(\mu_{n+1}) + \dfrac{1}{\gamma}\int_{X} \left\langle x - T_{\nu_{n+1}}^{\mu_{n+1}}(x),\, T_{\nu_{n+1}}^{\mu_{n}}(x) - T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\rangle d\nu_{n+1}(x) \tag{18}
\end{align*}

where the last equality uses Lem. 3.

By adding (17) and (18) side by side,

\[
\mathcal{F}(\mu_{n}) \geq \mathcal{F}(\mu_{n+1}) + \dfrac{1}{\gamma}\int_{X} \|T_{\nu_{n+1}}^{\mu_{n}}(x) - T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2} \, d\nu_{n+1}(x). \tag{19}
\]
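
For completeness, the cancellation behind (19) can be spelled out. This is a sketch that assumes the decomposition $\mathcal{F}=\mathcal{E}_{G}-\mathcal{E}_{H}+\mathscr{H}$ implicit in this proof; the shorthand $a$ and $b$ is introduced here only for readability and is not part of the original notation. Write $a(x):=T_{\nu_{n+1}}^{\mu_{n}}(x)$ and $b(x):=T_{\nu_{n+1}}^{\mu_{n+1}}(x)$. After adding (17) and (18) and moving the $\mathcal{E}_{H}$ terms to the opposite sides, the two integrands combine pointwise as

\begin{align*}
\langle x-a,\, b-a\rangle + \langle x-b,\, a-b\rangle &= \langle (x-a)-(x-b),\, b-a\rangle \\
&= \|b-a\|^{2},
\end{align*}

so integrating against $\nu_{n+1}$ and multiplying by $1/\gamma$ gives the last term of (19).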

A.4 Proof of Theorem 1

Under Assumption 4, $S=\nabla H$ and $S$ is continuous.

For item (i), Lemma 1 implies that $\mathcal{F}(\mu_{n})\leq\mathcal{F}(\mu_{0})$ for all $n\in\mathbb{N}$, so the whole sequence $\{\mu_{n}\}_{n\in\mathbb{N}}$ is contained in the sublevel set of $\mathcal{F}$ at level $\mathcal{F}(\mu_{0})$, which is compact under the Wasserstein topology by Assumption 2. Therefore, there exists a subsequence of $\{\mu_{n}\}_{n\in\mathbb{N}}$, denoted by $\{\mu_{n_{k}}\}_{k\in\mathbb{N}}$, such that $\mu_{n_{k}}\xrightarrow{\mathrm{Wass}}\mu^{*}$.

For item (ii), let $\mu^{*}\in\mathcal{P}_{2}(X)$ and $\mu_{n_{k}}\xrightarrow{\mathrm{Wass}}\mu^{*}$. It holds that

\begin{align*}
\liminf_{k\to\infty}\mathcal{F}(\mu_{n_{k}}) &= \liminf_{k\to\infty}\left(\mathscr{H}(\mu_{n_{k}}) + \mathcal{E}_{F}(\mu_{n_{k}})\right) \\
&= \liminf_{k\to\infty}\mathscr{H}(\mu_{n_{k}}) + \mathcal{E}_{F}(\mu^{*}) \\
&\geq \mathscr{H}(\mu^{*}) + \mathcal{E}_{F}(\mu^{*}),
\end{align*}

since $\mathscr{H}$ is l.s.c. and $\mathcal{E}_{F}$ is continuous w.r.t. the Wasserstein topology. Because $\mathcal{F}(\mu_{n_{k}})\leq\mathcal{F}(\mu_{0})$ for all $k$ (item (i)), the left-hand side is at most $\mathcal{F}(\mu_{0})$. Therefore, $\mathscr{H}(\mu^{*})<+\infty$, which further implies that $\mu^{*}\in\mathcal{P}_{2,\operatorname{abs}}(X)$.

We have

\begin{align*}
\int_{X} \|T_{\nu_{n+1}}^{\mu_{n}}(x) - T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2} \, d\nu_{n+1}(x) &= \int_{X} \|T_{\nu_{n+1}}^{\mu_{n}}(x) - T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2} \, d\big(T_{\mu_{n}}^{\nu_{n+1}}\big)_{\#}\mu_{n}(x) \\
&= \int_{X} \|x - T_{\nu_{n+1}}^{\mu_{n+1}}\circ T_{\mu_{n}}^{\nu_{n+1}}(x)\|^{2} \, d\mu_{n}(x). \tag{20}
\end{align*}

We observe that $T_{\nu_{n+1}}^{\mu_{n+1}}\circ T_{\mu_{n}}^{\nu_{n+1}}$ is a (possibly non-optimal) transport map pushing $\mu_{n}$ to $\mu_{n+1}$. By the optimality of $T_{\mu_{n}}^{\mu_{n+1}}$,

\[
\int_{X} \|x - T_{\nu_{n+1}}^{\mu_{n+1}}\circ T_{\mu_{n}}^{\nu_{n+1}}(x)\|^{2} \, d\mu_{n}(x) \geq \int_{X} \|x - T_{\mu_{n}}^{\mu_{n+1}}(x)\|^{2} \, d\mu_{n}(x) = W_{2}^{2}(\mu_{n},\mu_{n+1}). \tag{21}
\]
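
Inequality (21) can be illustrated numerically in one dimension, where the optimal map between two equal-size, uniformly weighted empirical measures is the monotone (sorted) matching. The sketch below is only an illustration under that assumption: the sample distributions, the helper names `w2_squared_1d` and `transport_cost`, and the random permutation standing in for the composite map are hypothetical choices, not part of the scheme analyzed here.

```python
import numpy as np


def w2_squared_1d(x, y):
    """Squared W2 distance between two equal-size, uniformly weighted
    empirical measures on the real line (monotone matching is optimal)."""
    return float(np.mean((np.sort(x) - np.sort(y)) ** 2))


def transport_cost(x, y, perm):
    """Quadratic cost of the map sending x[i] to y[perm[i]].

    Any permutation pushes the empirical measure of x onto that of y,
    so it is an admissible, but possibly non-optimal, transport.
    """
    return float(np.mean((x - y[perm]) ** 2))


rng = np.random.default_rng(0)
n = 200
x = rng.normal(0.0, 1.0, size=n)   # samples standing in for mu_n
y = rng.normal(1.0, 2.0, size=n)   # samples standing in for mu_{n+1}

# Monotone matching: x[i] is sent to the y-sample of the same rank.
ranks = np.argsort(np.argsort(x))
monotone_perm = np.argsort(y)[ranks]

optimal_cost = w2_squared_1d(x, y)
monotone_cost = transport_cost(x, y, monotone_perm)
arbitrary_cost = transport_cost(x, y, rng.permutation(n))

print(optimal_cost, monotone_cost, arbitrary_cost)
```

The monotone matching reproduces the squared Wasserstein distance, while an arbitrary admissible map can only cost at least as much, which is the content of (21).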

By Lem. 1 together with (20) and (21),

\[
\mathcal{F}(\mu_{n}) \geq \mathcal{F}(\mu_{n+1}) + \dfrac{1}{\gamma}W_{2}^{2}(\mu_{n},\mu_{n+1}). \tag{22}
\]

Since $\mathcal{F}$ is bounded below (Assumption 1), telescoping (22) gives

\[
\sum_{n=0}^{\infty} W_{2}^{2}(\mu_{n},\mu_{n+1}) < +\infty. \tag{23}
\]

In particular, $W_{2}(\mu_{n},\mu_{n+1})\to 0$. This together with $\mu_{n_{k}}\xrightarrow{\mathrm{Wass}}\mu^{*}$ implies $\mu_{n_{k}+1}\xrightarrow{\mathrm{Wass}}\mu^{*}$.
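
The telescoping step can be written out as follows. This is a sketch assuming $\mathcal{F}(\mu_{0})<+\infty$; the truncation index $N$ is introduced here only for the computation. Summing (22) over $n=0,\dots,N$,

\begin{align*}
\dfrac{1}{\gamma}\sum_{n=0}^{N} W_{2}^{2}(\mu_{n},\mu_{n+1}) &\leq \sum_{n=0}^{N}\big(\mathcal{F}(\mu_{n}) - \mathcal{F}(\mu_{n+1})\big) \\
&= \mathcal{F}(\mu_{0}) - \mathcal{F}(\mu_{N+1}) \\
&\leq \mathcal{F}(\mu_{0}) - \inf\mathcal{F} < +\infty.
\end{align*}

The bound is uniform in $N$, so letting $N\to\infty$ gives (23), and the terms of a convergent series tend to zero.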

Next, recalling that $\nu_{n_{k}+1}=(I+\gamma S)_{\#}\mu_{n_{k}}$, we show that

\[
\nu_{n_{k}+1}\xrightarrow{\mathrm{narrow}}\nu^{*}:=(I+\gamma S)_{\#}\mu^{*} \quad\text{as } k\to+\infty.
\]

Indeed, let $f$ be a continuous and bounded test function on $X$. By the transfer lemma (Lem. 2),

\begin{align*}
\lim_{k\to\infty}\int_{X} f(x)\, d\nu_{n_{k}+1}(x) &= \lim_{k\to\infty}\int_{X} f(x)\, d(I+\gamma S)_{\#}\mu_{n_{k}}(x) \\
&= \lim_{k\to\infty}\int_{X} f(x+\gamma S(x))\, d\mu_{n_{k}}(x) \tag{24} \\
&= \int_{X} f(x+\gamma S(x))\, d\mu^{*}(x) \\
&= \int_{X} f(x)\, d\nu^{*}(x), \tag{25}
\end{align*}

since $S$ is continuous. So $\nu_{n_{k}+1}\xrightarrow{\mathrm{narrow}}\nu^{*}$. We go one step further and prove that $\nu_{n_{k}+1}$ actually converges to $\nu^{*}$ in the Wasserstein metric. This boils down to showing convergence of second-order moments, i.e.,

\[
\mathfrak{m}_{2}(\nu_{n_{k}+1})\to\mathfrak{m}_{2}(\nu^{*}),
\]

which is equivalent to showing that

\[
\int_{X} \|x+\gamma S(x)\|^{2}\, d\mu_{n_{k}}(x) \to \int_{X} \|x+\gamma S(x)\|^{2}\, d\mu^{*}(x).
\]

This holds because $\psi(x):=\|x+\gamma S(x)\|^{2}$ has quadratic growth (which follows from Lem. 4) and $\mu_{n_{k}}\to\mu^{*}$ in the Wasserstein metric, so by [2, Prop. 2.4],

\[
\lim_{k\to\infty}\int_{X} \|x+\gamma S(x)\|^{2}\, d\mu_{n_{k}}(x) = \int_{X} \|x+\gamma S(x)\|^{2}\, d\mu^{*}(x).
\]

Therefore, $\nu_{n_{k}+1}\to\nu^{*}$ in the Wasserstein metric.
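
The change of variables used in (24) and in the second-moment identity can be checked on a toy discrete measure. The sketch below is illustrative only: the field `S` (taken as the gradient of the hypothetical potential $H(x)=\cos x$), the step size `gamma`, the atoms and weights, and the helper `pushforward` are all assumptions made for the example and do not come from the paper.

```python
import numpy as np

gamma = 0.5                      # hypothetical step size


def S(x):
    """Hypothetical continuous field, here the gradient of H(x) = cos(x)."""
    return -np.sin(x)


def T(x):
    """The forward map I + gamma * S."""
    return x + gamma * S(x)


def pushforward(atoms, weights, mapping):
    """Push sum_i w_i * delta_{x_i} through `mapping`, merging coinciding images."""
    images = mapping(atoms)
    new_atoms, inverse = np.unique(images, return_inverse=True)
    new_weights = np.bincount(inverse.ravel(), weights=weights)
    return new_atoms, new_weights


atoms = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])     # support of a toy mu
weights = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # masses, summing to 1
nu_atoms, nu_weights = pushforward(atoms, weights, T)

f = np.tanh                                        # bounded continuous test function
integral_nu = float(np.sum(nu_weights * f(nu_atoms)))   # integral of f against T_# mu
integral_mu = float(np.sum(weights * f(T(atoms))))      # integral of f(T(x)) against mu

m2_nu = float(np.sum(nu_weights * nu_atoms ** 2))       # second moment of T_# mu
m2_via_mu = float(np.sum(weights * T(atoms) ** 2))      # integral of |x + gamma S(x)|^2 against mu

print(integral_nu, integral_mu, m2_nu, m2_via_mu)
```

Both pairs of numbers agree up to rounding, which is why convergence of the second moments of $\nu_{n_{k}+1}$ reduces to convergence of $\int_{X}\|x+\gamma S(x)\|^{2}\,d\mu_{n_{k}}(x)$.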

To proceed further, we need the following theorem stating that the graph of the subdifferential of a geodesically convex function is closed under the product of Wasserstein and weak topologies.

Theorem 8 (Closedness of subdifferential graph).

[3, Lemma 10.1.3] Let $\phi$ be a geodesically convex functional satisfying $\operatorname{dom}(\partial\phi)\subset\mathcal{P}_{2,\operatorname{abs}}(X)$. Let $\{\mu_{n}\}_{n\in\mathbb{N}}$ be a sequence converging in the Wasserstein metric to $\mu\in\operatorname{dom}(\phi)$. Let $\xi_{n}\in\partial\phi(\mu_{n})$ satisfy

\[
\sup_{n\in\mathbb{N}}\int_{X} \|\xi_{n}(x)\|^{2}\, d\mu_{n}(x) < +\infty, \tag{26}
\]

and converge weakly to $\xi\in L^{2}(X,X,\mu)$ in the following sense:

\[
\lim_{n\to\infty}\int_{X} \zeta(x)\xi_{n}(x)\, d\mu_{n}(x) = \int_{X} \zeta(x)\xi(x)\, d\mu(x), \quad \forall\zeta\in C_{c}^{\infty}(X). \tag{27}
\]

Then $\xi\in\partial\phi(\mu)$.

As a side note, we need the notion of weak convergence in the above theorem because, unlike subdifferentials in flat Euclidean space, each $\xi_{n}$ lives in its own space $L^{2}(X,X,\mu_{n})$.

Back to our proof, to verify condition (26), we show that

\[
\sup_{n\in\mathbb{N}}\int_{X} \left\|T_{\mu_{n}}^{\nu_{n}}(x) - x\right\|^{2}\, d\mu_{n}(x) < +\infty. \tag{28}
\]

We proceed as follows to prove (28). We first show that $\sup_{n\in\mathbb{N}}\mathfrak{m}_{2}(\mu_{n})<+\infty$. Suppose, for contradiction, that $\sup_{n\in\mathbb{N}}\mathfrak{m}_{2}(\mu_{n})=+\infty$. Then we can extract a subsequence $\{\mu_{n_{k}}\}_{k\in\mathbb{N}}$ such that

\[
\lim_{k\to\infty}\mathfrak{m}_{2}(\mu_{n_{k}}) = +\infty. \tag{29}
\]

By compactness assumption 2, there further exists a subsequence $\{\mu_{n_{k_i}}\}_{i\in\mathbb{N}}$ such that $\mu_{n_{k_i}}$ converges (in the Wasserstein metric) to some $\mu^{**}\in\mathcal{P}_2(X)$. On the other hand, we have the following inequality: for all $x, y\in X$,

$$\|x\|^2 \leq 2\|x-y\|^2 + 2\|y\|^2. \tag{30}$$

Applying (30) to a random pair $(X, Y)$ whose joint law is an optimal coupling of $(\mu_{n_{k_i}}, \mu^{**})$ and taking the expectation of both sides, we get

$$\mathfrak{m}_2(\mu_{n_{k_i}}) \leq 2W_2^2(\mu_{n_{k_i}}, \mu^{**}) + 2\mathfrak{m}_2(\mu^{**}).$$
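The moment inequality above can be checked numerically on empirical measures. The following is a minimal sketch, assuming one-dimensional samples with uniform weights (so that the optimal coupling simply pairs sorted samples); the helper names `second_moment` and `w2_squared_1d` and the chosen distributions are illustrative and not part of the paper.

```python
import numpy as np

def second_moment(x):
    """Empirical second moment m_2 of a 1-D sample with uniform weights."""
    return float(np.mean(x ** 2))

def w2_squared_1d(x, y):
    """Squared 2-Wasserstein distance between two equal-size 1-D empirical
    measures; in one dimension the optimal coupling pairs sorted samples."""
    xs, ys = np.sort(x), np.sort(y)
    return float(np.mean((xs - ys) ** 2))

rng = np.random.default_rng(0)
mu_samples = rng.normal(loc=3.0, scale=2.0, size=1000)   # stands in for mu_{n_{k_i}}
nu_samples = rng.standard_t(df=5, size=1000)             # stands in for mu^{**}

lhs = second_moment(mu_samples)
rhs = 2.0 * w2_squared_1d(mu_samples, nu_samples) + 2.0 * second_moment(nu_samples)
print(lhs <= rhs)
```

Because (30) holds pointwise for every matched pair of samples, the comparison holds for any pair of equal-size samples, not only for this seed.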

It follows that $\limsup_{i\to\infty}\mathfrak{m}_2(\mu_{n_{k_i}}) \leq 2\mathfrak{m}_2(\mu^{**})$, which contradicts (29). Therefore, $\sup_{n\in\mathbb{N}}\mathfrak{m}_2(\mu_n) < +\infty$. We next show that $\sup_{n\in\mathbb{N}}\mathfrak{m}_2(\nu_n) < +\infty$. Indeed, as $\|S\|^2$ has quadratic growth,

$$\|S(x)\|^2 \leq c(\|x\|^2 + 1)$$

for some $c > 0$. So

$$\begin{aligned}
\mathfrak{m}_2(\nu_{n+1}) &= \int_X \|x\|^2\, d\nu_{n+1}(x) \\
&= \int_X \|x\|^2\, d(I+\gamma S)_{\#}\mu_n(x) \\
&= \int_X \|x+\gamma S(x)\|^2\, d\mu_n(x) \\
&\leq 2\int_X \|x\|^2\, d\mu_n(x) + 2\gamma^2\int_X \|S(x)\|^2\, d\mu_n(x) \\
&\leq (2+2c\gamma^2)\,\mathfrak{m}_2(\mu_n) + 2\gamma^2 c,
\end{aligned}$$

implying that $\sup_{n\in\mathbb{N}}\mathfrak{m}_2(\nu_n) < +\infty$. This, in conjunction with $G$ having quadratic growth, implies that $\sup_{n\in\mathbb{N}}|\mathcal{E}_G(\nu_n)| < +\infty$. Furthermore, $\inf_n (\mathcal{E}_G + \mathscr{H})(\mu_n) > -\infty$; otherwise, the lower semicontinuity of $\mathscr{H}$ and the compactness of $\{\mu_n\}_{n\in\mathbb{N}}$ would yield a contradiction.
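The second-moment bound for the pushforward $(I+\gamma S)_{\#}\mu_n$ can be illustrated with samples. The sketch below is only a sanity check under stated assumptions: the map `S`, the dimension `d`, the constant `c`, and the step size `gamma` are hypothetical choices made so that $\|S(x)\|^2 \leq c(\|x\|^2+1)$ holds; they are not taken from the paper.

```python
import numpy as np

def S(x):
    """Hypothetical map with at most linear growth, so ||S||^2 grows at most quadratically."""
    return -np.tanh(x) - 0.5 * x

d = 2
c = 2.0 * d      # ||S(x)||^2 <= 2||tanh(x)||^2 + 0.5||x||^2 <= c (||x||^2 + 1)
gamma = 0.1      # illustrative step size

rng = np.random.default_rng(1)
x = rng.normal(scale=3.0, size=(5000, d))   # samples standing in for mu_n
y = x + gamma * S(x)                        # samples of (I + gamma S)_# mu_n

m2_mu = float(np.mean(np.sum(x ** 2, axis=1)))
m2_nu = float(np.mean(np.sum(y ** 2, axis=1)))
bound = (2.0 + 2.0 * c * gamma ** 2) * m2_mu + 2.0 * gamma ** 2 * c
print(m2_nu <= bound)
```

Since both inequalities in the chain hold pointwise, the empirical bound is satisfied for every sample, which mirrors why a uniform bound on $\mathfrak{m}_2(\mu_n)$ transfers to $\mathfrak{m}_2(\nu_n)$.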

Now, as $\gamma^{-1}(T_{\mu_{n+1}}^{\nu_{n+1}} - I) \in \partial(\mathcal{E}_G + \mathscr{H})(\mu_{n+1})$ and $\mathcal{E}_G + \mathscr{H}$ is geodesically convex, applying Thm. 6 gives

$$(\mathcal{E}_G + \mathscr{H})(\nu_{n+1}) \geq (\mathcal{E}_G + \mathscr{H})(\mu_{n+1}) + \dfrac{1}{\gamma}\int_X \|T_{\mu_{n+1}}^{\nu_{n+1}}(x) - x\|^2\, d\mu_{n+1}(x). \tag{31}$$

The finiteness in (28) then follows from $\sup_{n\in\mathbb{N}}|\mathcal{E}_G(\nu_n)| < +\infty$ and $\sup_{n\in\mathbb{N}} -(\mathcal{E}_G + \mathscr{H})(\mu_n) < +\infty$, as proved, and $\sup_{n\in\mathbb{N}}\mathscr{H}(\nu_n) < +\infty$, as assumed.
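For completeness, the way the three bounds combine can be written out explicitly. This is a worked rearrangement of (31) that uses only the quantities named above, and it introduces no new assumptions:

$$\begin{aligned}
\dfrac{1}{\gamma}\int_X \|T_{\mu_{n+1}}^{\nu_{n+1}}(x) - x\|^2\, d\mu_{n+1}(x)
&\leq (\mathcal{E}_G + \mathscr{H})(\nu_{n+1}) - (\mathcal{E}_G + \mathscr{H})(\mu_{n+1}) \\
&= \mathcal{E}_G(\nu_{n+1}) + \mathscr{H}(\nu_{n+1}) - (\mathcal{E}_G + \mathscr{H})(\mu_{n+1}) \\
&\leq \sup_{n\in\mathbb{N}}|\mathcal{E}_G(\nu_n)| + \sup_{n\in\mathbb{N}}\mathscr{H}(\nu_n) + \sup_{n\in\mathbb{N}}\left[-(\mathcal{E}_G + \mathscr{H})(\mu_n)\right].
\end{aligned}$$

The right-hand side is finite and independent of $n$, so multiplying by $\gamma$ and taking the supremum over $n$ yields (28).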

We next prove that there is a subsequence $\{\mu_{n_{k_j}}\}_{j\in\mathbb{N}}$ of $\{\mu_{n_k}\}_{k\in\mathbb{N}}$ such that

$$T_{\mu_{n_{k_j}+1}}^{\nu_{n_{k_j}+1}} - I \to T_{\mu^*}^{\nu^*} - I \quad \text{weakly.} \tag{32}$$

We consider the sequence of optimal plans between $\mu_n$ and $\nu_n$ defined by

$$\rho_n = (I, T_{\mu_n}^{\nu_n})_{\#}\mu_n, \quad \forall n\in\mathbb{N}.$$

We observe that

$$\mathfrak{m}_2(\rho_n) = \int_{X\times X}(\|x\|^2 + \|y\|^2)\, d\rho_n(x,y) = \int_X \|x\|^2\, d\mu_n(x) + \int_X \|y\|^2\, d\nu_n(y) < +\infty,$$

so $\rho_n \in \mathcal{P}_2(X\times X)$ for all $n\in\mathbb{N}$. Since $\sup_{n\in\mathbb{N}}\mathfrak{m}_2(\nu_n) < +\infty$ as proved, and as a Wasserstein ball is relatively compact under the narrow topology, $\{\nu_n\}_{n\in\mathbb{N}}$ is relatively compact under the narrow topology. The same property holds for $\{\mu_n\}_{n\in\mathbb{N}}$. According to Prokhorov's theorem [2, Theorem 1.3], $\{\mu_n\}_{n\in\mathbb{N}}$ and $\{\nu_n\}_{n\in\mathbb{N}}$ are tight. By [2, Remark 1.4], $\{\rho_n\}_{n\in\mathbb{N}}$ is also tight, hence relatively compact under the narrow topology in $\mathcal{P}_2(X\times X)$.
Consequently, $\{\rho_{n_k+1}\}_{k\in\mathbb{N}}$ admits a subsequence converging narrowly to some $\rho^* \in \mathcal{P}(X\times X)$, say

$$\rho_{n_{k_i}+1} \xrightarrow{\text{narrow}} \rho^* \text{ as } i\to\infty. \tag{33}$$

We can see that ${\operatorname{proj}_1}_{\#}\rho^* = \mu^*$ and ${\operatorname{proj}_2}_{\#}\rho^* = \nu^*$. Indeed, for any continuous and bounded function $f\in C_b(X)$, it holds that

$$\begin{aligned}
\int_X f(x)\, d\mu^*(x) &= \lim_{i\to\infty}\int_X f(x)\, d\mu_{n_{k_i}+1}(x) \\
&= \lim_{i\to\infty}\int_{X\times X} f(x)\, d\rho_{n_{k_i}+1}(x,y) \\
&= \int_{X\times X} f(x)\, d\rho^*(x,y) \\
&= \int_X f(x)\, d({\operatorname{proj}_1}_{\#}\rho^*)(x).
\end{aligned}$$

Therefore,

$$\int_X f(x)\, d\mu^*(x) = \int_X f(x)\, d({\operatorname{proj}_1}_{\#}\rho^*)(x) \quad \forall f\in C_b(X).$$

It then follows from [11, Thm. 1.2] that ${\operatorname{proj}_1}_{\#}\rho^* = \mu^*$. Similarly, ${\operatorname{proj}_2}_{\#}\rho^* = \nu^*$.
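The identity between integrals against a plan and integrals against its first marginal is easy to see in a discrete setting. The sketch below assumes a hypothetical finitely supported plan `rho` (a nonnegative matrix summing to one) and a hypothetical support `x_support`; it only illustrates the marginal computation, not the limiting argument.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical discrete plan rho on {x_0..x_3} x {y_0..y_2}: nonnegative entries summing to 1.
rho = rng.random((4, 3))
rho /= rho.sum()
x_support = np.array([-1.0, 0.0, 0.5, 2.0])

mu = rho.sum(axis=1)   # first marginal, i.e. (proj_1)_# rho
nu = rho.sum(axis=0)   # second marginal, i.e. (proj_2)_# rho

f = np.cos             # a bounded continuous test function of x only
lhs = float(np.sum(f(x_support)[:, None] * rho))   # integral of f(x) d rho(x, y)
rhs = float(np.sum(f(x_support) * mu))             # integral of f(x) d mu(x)
print(np.isclose(lhs, rhs))
```

Integrating a function of $x$ alone against the plan sums out the $y$ variable, which is exactly the row-sum that defines the first marginal.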

We further show that $\rho^* \in \mathcal{P}_2(X\times X)$, or equivalently,

$$\int_{X\times X}(\|x\|^2 + \|y\|^2)\, d\rho^*(x,y) < +\infty. \tag{34}$$

Let $C\in\mathbb{N}$. Since the truncated integrand $\min\{\|x\|^2 + \|y\|^2, C\}$ is continuous and bounded, the narrow convergence in (33) gives

$$\int_{X\times X}\min\{\|x\|^2 + \|y\|^2, C\}\, d\rho_{n_{k_i}+1}(x,y) \to \int_{X\times X}\min\{\|x\|^2 + \|y\|^2, C\}\, d\rho^*(x,y).$$

Furthermore,

$$\int_{X\times X}\min\{\|x\|^2 + \|y\|^2, C\}\, d\rho_{n_{k_i}+1}(x,y) \leq \sup_{n\in\mathbb{N}}\mathfrak{m}_2(\mu_n) + \sup_{n\in\mathbb{N}}\mathfrak{m}_2(\nu_n) := M < +\infty.$$

Passing to the limit, we get

$$\int_{X\times X}\min\{\|x\|^2 + \|y\|^2, C\}\, d\rho^*(x,y) \leq M$$

for all $C\in\mathbb{N}$. Sending $C$ to $\infty$ and applying the Monotone Convergence Theorem, we derive (34).
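The truncation argument can be visualized on samples: the truncated moments increase with $C$ and never exceed the full second moment. The following sketch assumes hypothetical heavy-tailed samples `pairs` standing in for a plan on $X\times X$ with $X=\mathbb{R}$; the truncation levels are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical samples (x_i, y_i) standing in for a plan rho on X x X, with X = R.
pairs = rng.standard_t(df=5, size=(20000, 2))
cost = np.sum(pairs ** 2, axis=1)   # ||x||^2 + ||y||^2 for each sample

levels = [1, 4, 16, 64, 256]        # truncation levels C
truncated = [float(np.mean(np.minimum(cost, C))) for C in levels]
full = float(np.mean(cost))         # untruncated empirical second moment

for C, value in zip(levels, truncated):
    print(C, value)
print("full", full)
```

In the proof, the uniform bound $M$ plays the role of `full`: every truncated integral against $\rho^*$ is bounded by $M$, and monotone convergence transfers the bound to the untruncated integral.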

Back to the main proof, since $\{\rho_{n_{k_i}+1}\}_{i\in\mathbb{N}}$ is a sequence of optimal plans, its limit $\rho^*$ is also optimal [2, Proposition 2.5]. Therefore,

$$\rho^* = (I, T_{\mu^*}^{\nu^*})_{\#}\mu^*.$$

Moreover, as

$$\mathfrak{m}_2(\rho_{n_{k_i}+1}) = \mathfrak{m}_2(\mu_{n_{k_i}+1}) + \mathfrak{m}_2(\nu_{n_{k_i}+1}) \to \mathfrak{m}_2(\mu^*) + \mathfrak{m}_2(\nu^*) = \mathfrak{m}_2(\rho^*),$$

we have

$$\rho_{n_{k_i}+1} \xrightarrow{\text{Wass}} \rho^* \text{ as } i\to\infty.$$
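The last step relies on the standard characterization of convergence in the Wasserstein metric, recalled here in the notation of this proof (the statement is a well-known fact about $\mathcal{P}_2$ and is not specific to this paper): for $\rho_i, \rho \in \mathcal{P}_2(X\times X)$,

$$W_2(\rho_i, \rho) \to 0 \quad\Longleftrightarrow\quad \rho_i \xrightarrow{\text{narrow}} \rho \ \text{ and } \ \mathfrak{m}_2(\rho_i) \to \mathfrak{m}_2(\rho).$$

Here the narrow convergence is (33), and the convergence of second moments is the display preceding the conclusion.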

Now take any test function $\zeta\in C_c^{\infty}(X)$; we show that

$$\lim_{j\to\infty}\int_X \zeta(x)\, T_{\mu_{n_{k_j}+1}}^{\nu_{n_{k_j}+1}}(x)\, d\mu_{n_{k_j}+1}(x) = \int_X \zeta(x)\, T_{\mu^*}^{\nu^*}(x)\, d\mu^*(x). \tag{35}$$

Indeed, the map $(x,y)\mapsto \zeta(x)\operatorname{proj}_i(y)$, where $\operatorname{proj}_i$ is the projection onto the $i$-th coordinate, is continuous and has at most quadratic growth, since $\zeta$ is bounded and $\operatorname{proj}_i$ is linear.

Since $\rho_{n_{k_j}+1} \xrightarrow{\text{Wass}} \rho^*$, it holds that, for each $i\in[d]$,

$$\begin{aligned}
&\lim_{j\to\infty}\int_X \zeta(x)\operatorname{proj}_i\left(T_{\mu_{n_{k_j}+1}}^{\nu_{n_{k_j}+1}}(x)\right) d\mu_{n_{k_j}+1}(x) \\
&\quad= \lim_{j\to\infty}\int_{X\times X} \zeta(x)\operatorname{proj}_i(y)\, d(I, T_{\mu_{n_{k_j}+1}}^{\nu_{n_{k_j}+1}})_{\#}\mu_{n_{k_j}+1}(x,y) \\
&\quad= \lim_{j\to\infty}\int_{X\times X} \zeta(x)\operatorname{proj}_i(y)\, d\rho_{n_{k_j}+1}(x,y) \\
&\quad= \int_{X\times X} \zeta(x)\operatorname{proj}_i(y)\, d\rho^*(x,y) \\
&\quad= \int_{X\times X} \zeta(x)\operatorname{proj}_i(y)\, d(I, T_{\mu^*}^{\nu^*})_{\#}\mu^*(x,y) \\
&\quad= \int_X \zeta(x)\operatorname{proj}_i(T_{\mu^*}^{\nu^*}(x))\, d\mu^*(x),
\end{aligned}$$

so (35) holds. Consequently, (32) also holds by noticing that

\[
\int_{X}x\zeta(x)\,d\mu_{n_{k_{j}}+1}(x)\to\int_{X}x\zeta(x)\,d\mu^{*}(x)\quad\text{as } j\to\infty.
\]

By the closedness of the graph of the subdifferential $\partial(\mathcal{E}_{G}+\mathscr{H})$ (Thm. 8), we obtain

\[
\dfrac{T_{\mu^{*}}^{\nu^{*}}-I}{\gamma}\in\partial(\mathcal{E}_{G}+\mathscr{H})(\mu^{*}).
\]

Therefore, $S\in\partial(\mathcal{E}_{G}+\mathscr{H})(\mu^{*})\cap\partial\mathcal{E}_{H}(\mu^{*})$, that is, $\mu^{*}$ is a critical point of $\mathcal{F}$.

A.5 Proof of Theorem 2

We see that

\begin{align*}
\|\mathcal{G}_{\gamma}(\mu_{n})\|_{L^{2}(X,X;\mu_{n})}^{2}
&=\dfrac{1}{\gamma^{2}}\int_{X}\left\|x-T_{\mu_{n}}^{\operatorname{JKO}_{\gamma(\mathcal{E}_{G}+\mathscr{H})}((I+\gamma S)_{\#}\mu_{n})}(x)\right\|^{2}d\mu_{n}(x)\\
&=\dfrac{1}{\gamma^{2}}W_{2}^{2}\left(\mu_{n},\operatorname{JKO}_{\gamma(\mathcal{E}_{G}+\mathscr{H})}((I+\gamma S)_{\#}\mu_{n})\right)\\
&=\dfrac{1}{\gamma^{2}}W_{2}^{2}(\mu_{n},\mu_{n+1}).
\end{align*}

On the other hand, it follows from (23) of the proof of Thm. 1 that

\[
\min_{n=\overline{1,N}}W_{2}^{2}(\mu_{n},\mu_{n+1})=O(N^{-1}).
\]
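Combining the two displays makes the conclusion explicit. The following worked step is a sketch, assuming the stepsize $\gamma$ is fixed and the minimum runs over the iterates $n=1,\dots,N$; it is presumably the rate claimed in Thm. 2, whose statement is not reproduced in this appendix.

```latex
\[
\min_{n=\overline{1,N}}\|\mathcal{G}_{\gamma}(\mu_{n})\|_{L^{2}(X,X;\mu_{n})}^{2}
=\frac{1}{\gamma^{2}}\min_{n=\overline{1,N}}W_{2}^{2}(\mu_{n},\mu_{n+1})
=O(N^{-1}).
\]
```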

A.6 Proof of Theorem 3

Since $H$ has a uniformly bounded Hessian, [36, Prop. 2.12] implies that $\mathcal{E}_{H}$ is Wasserstein differentiable with $\nabla_{W}\mathcal{E}_{H}(\mu)=\nabla H$ for all $\mu\in\mathcal{P}_{2}(X)$. According to Lem. 8 and (16),

\[
\partial_{F}^{-}\mathcal{F}(\mu_{n+1})=\partial(\mathcal{E}_{G}+\mathscr{H})(\mu_{n+1})-\nabla H\ni\dfrac{T_{\mu_{n+1}}^{\nu_{n+1}}-I}{\gamma}-\nabla H. \tag{36}
\]

We then have the following evaluations:

\begin{align*}
\operatorname{dist}\left(0,\partial_{F}^{-}\mathcal{F}(\mu_{n+1})\right)
&=\inf_{\xi\in\partial_{F}^{-}\mathcal{F}(\mu_{n+1})}\|\xi\|_{L^{2}(X,X,\mu_{n+1})}\\
&\leq\left\|\dfrac{T_{\mu_{n+1}}^{\nu_{n+1}}-I}{\gamma}-\nabla H\right\|_{L^{2}(X,X,\mu_{n+1})}\\
&=\left(\int_{X}\left\|\dfrac{T_{\mu_{n+1}}^{\nu_{n+1}}(x)-x}{\gamma}-\nabla H(x)\right\|^{2}d\mu_{n+1}(x)\right)^{\frac{1}{2}}\\
&=\dfrac{1}{\gamma}\left(\int_{X}\left\|T_{\mu_{n+1}}^{\nu_{n+1}}(x)-x-\gamma\nabla H(x)\right\|^{2}d\mu_{n+1}(x)\right)^{\frac{1}{2}}. \tag{37}
\end{align*}

By the transfer lemma (Lem. 2),

\begin{align*}
&\int_{X}\left\|T_{\mu_{n+1}}^{\nu_{n+1}}(x)-x-\gamma\nabla H(x)\right\|^{2}d\mu_{n+1}(x)\\
&=\int_{X}\left\|T_{\mu_{n+1}}^{\nu_{n+1}}(x)-x-\gamma\nabla H(x)\right\|^{2}d\left(T_{\nu_{n+1}}^{\mu_{n+1}}\right)_{\#}\nu_{n+1}(x)\\
&=\int_{X}\left\|T_{\mu_{n+1}}^{\nu_{n+1}}\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)-(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|^{2}d\nu_{n+1}(x)\\
&=\int_{X}\left\|x-(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|^{2}d\nu_{n+1}(x). \tag{38}
\end{align*}

On the other hand, by using the trivial identity

\[
\nabla H=\dfrac{(I+\gamma\nabla H)-I}{\gamma}
\]

we compute

\begin{align*}
&\int_{X}\left\|\nabla H\left(T_{\nu_{n+1}}^{\mu_{n}}(x)\right)-\nabla H\left(T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right)\right\|^{2}d\nu_{n+1}(x)\\
&=\int_{X}\left\|\dfrac{(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n}}(x)}{\gamma}-\dfrac{(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)}{\gamma}\right\|^{2}d\nu_{n+1}(x)\\
&=\dfrac{1}{\gamma^{2}}\int_{X}\left\|x-T_{\nu_{n+1}}^{\mu_{n}}(x)-(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)+T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|^{2}d\nu_{n+1}(x), \tag{39}
\end{align*}

where the last equality uses $(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n}}=I$ $\nu_{n+1}$-a.e.

Since the Hessian of $H$ is uniformly bounded, $\nabla H$ is Lipschitz, say $\|\nabla H(x)-\nabla H(y)\|\leq L_{H}\|x-y\|$ for all $x,y\in X$. We continue evaluating (38) as follows:

\begin{align*}
&\int_{X}\left\|x-(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|^{2}d\nu_{n+1}(x)\\
&\leq\int_{X}\left(\left\|x-T_{\nu_{n+1}}^{\mu_{n}}(x)-(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)+T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|+\left\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|\right)^{2}d\nu_{n+1}(x)\\
&\leq 2\int_{X}\left\|x-T_{\nu_{n+1}}^{\mu_{n}}(x)-(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)+T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|^{2}d\nu_{n+1}(x)\\
&\quad+2\int_{X}\left\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|^{2}d\nu_{n+1}(x)\\
&=2\gamma^{2}\int_{X}\left\|\nabla H\left(T_{\nu_{n+1}}^{\mu_{n}}(x)\right)-\nabla H\left(T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right)\right\|^{2}d\nu_{n+1}(x)+2\int_{X}\left\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|^{2}d\nu_{n+1}(x)\\
&\leq 2(\gamma^{2}L_{H}^{2}+1)\int_{X}\left\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|^{2}d\nu_{n+1}(x), \tag{40}
\end{align*}

where the equality in the third step uses (39).
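The remaining steps of the chain ending in (40) rest on two elementary facts, spelled out here as a short worked note. The shorthands $T_{n}$, $T_{n+1}$, $a$, and $b$ are introduced only for this note and are not the paper's notation.

```latex
% Shorthand (assumed for this note only):
%   T_n := T_{\nu_{n+1}}^{\mu_n},  T_{n+1} := T_{\nu_{n+1}}^{\mu_{n+1}}.
\begin{align*}
a(x) &:= x - T_{n}(x) - (I+\gamma\nabla H)\circ T_{n+1}(x) + T_{n+1}(x),\\
b(x) &:= T_{n}(x) - T_{n+1}(x),\\
a(x)+b(x) &= x-(I+\gamma\nabla H)\circ T_{n+1}(x).
\end{align*}
% First inequality: triangle inequality, \|a+b\| \le \|a\| + \|b\|.
% Second inequality: (s+t)^2 \le 2s^2 + 2t^2 for real s, t,
%   since 2s^2 + 2t^2 - (s+t)^2 = (s-t)^2 \ge 0.
% Last inequality: Lipschitz continuity of \nabla H,
\[
\|\nabla H(T_{n}(x))-\nabla H(T_{n+1}(x))\|^{2}\le L_{H}^{2}\,\|T_{n}(x)-T_{n+1}(x)\|^{2}.
\]
```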

From (37), (38), and (40), we derive

\[
\operatorname{dist}\left(0,\partial_{F}^{-}\mathcal{F}(\mu_{n+1})\right)\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{\gamma}\left(\int_{X}\left\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|^{2}d\nu_{n+1}(x)\right)^{\frac{1}{2}}. \tag{41}
\]
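As a sanity check, the bound can be tested numerically in a one-dimensional Gaussian toy case, where every optimal transport map is affine and increasing. The sketch below is illustrative only and makes several assumptions that are not from the paper: $H(x)=hx^{2}/2$ (so $\nabla H(x)=hx$ and $L_{H}=|h|$), $\mu_{n}=N(m_0,s_0^{2})$, and $\mu_{n+1}=N(m_1,s_1^{2})$ chosen arbitrarily in place of the actual JKO output. The function name and parameters are hypothetical. It compares the right-hand side of (37), rewritten via (38), with the right-hand side of (41).

```python
import math
import random


def check_bound(gamma=0.5, h=1.0, m0=0.3, s0=1.2, m1=-0.4, s1=0.7,
                n_samples=20000, seed=0):
    """Monte Carlo comparison of the right-hand sides of (37)/(38) and (41).

    Toy assumptions (not from the paper): H(x) = h * x**2 / 2, Gaussian
    mu_n = N(m0, s0^2), and an arbitrary Gaussian mu_{n+1} = N(m1, s1^2).
    """
    scale = 1.0 + gamma * h               # I + gamma * grad H is x -> scale * x
    if scale <= 0:
        raise ValueError("need 1 + gamma*h > 0 so the forward map is increasing")
    # nu_{n+1} = (I + gamma * grad H)_# mu_n is again Gaussian.
    m_nu, s_nu = scale * m0, scale * s0

    rng = random.Random(seed)
    lhs = 0.0  # integral in (38): ||y - (I + gamma grad H)(T_nu^{mu_{n+1}}(y))||^2
    rhs = 0.0  # integral in (41): ||T_nu^{mu_n}(y) - T_nu^{mu_{n+1}}(y)||^2
    for _ in range(n_samples):
        y = rng.gauss(m_nu, s_nu)                  # y ~ nu_{n+1}
        t_n = y / scale                            # T_{nu_{n+1}}^{mu_n}(y)
        t_n1 = m1 + (s1 / s_nu) * (y - m_nu)       # T_{nu_{n+1}}^{mu_{n+1}}(y)
        lhs += (y - scale * t_n1) ** 2
        rhs += (t_n - t_n1) ** 2
    lhs /= n_samples
    rhs /= n_samples

    const = 2.0 * (gamma ** 2 * h ** 2 + 1.0)      # 2 (gamma^2 L_H^2 + 1)
    bound_37 = math.sqrt(lhs) / gamma              # right-hand side of (37)
    bound_41 = math.sqrt(const * rhs) / gamma      # right-hand side of (41)
    return bound_37, bound_41


if __name__ == "__main__":
    b37, b41 = check_bound()
    print(f"(37) bound: {b37:.4f}  <=  (41) bound: {b41:.4f}")
```

In this toy case the two integrands differ exactly by the factor $(1+\gamma h)^{2}$, which never exceeds $2(\gamma^{2}h^{2}+1)$, so the comparison holds sample by sample; the check illustrates the inequality but says nothing about how tight it is in general.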

On the other hand, by telescoping Lem. 1, we obtain

\[
\sum_{n=1}^{\infty}\int_{X}\left\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|^{2}d\nu_{n+1}(x)<+\infty.
\]

Therefore,

$$\begin{aligned}
\sum_{n=0}^{N-1}\operatorname{dist}\left(0,\partial_{F}^{-}\mathcal{F}(\mu_{n+1})\right)
&\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{\gamma}\sum_{n=0}^{N-1}\left(\int_{X}\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}\,d\nu_{n+1}(x)\right)^{\frac{1}{2}}\\
&\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{\gamma}\left(N\left(\sum_{n=0}^{N-1}\int_{X}\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}\,d\nu_{n+1}(x)\right)\right)^{\frac{1}{2}}\\
&\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)N}}{\gamma}\left(\sum_{n=0}^{+\infty}\int_{X}\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}\,d\nu_{n+1}(x)\right)^{\frac{1}{2}}.
\end{aligned}$$

We derive

$$\min_{n=\overline{1,N}}\operatorname{dist}\left(0,\partial_{F}^{-}\mathcal{F}(\mu_{n})\right)=O\left(\dfrac{1}{\sqrt{N}}\right).$$
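The passage from the summed bound to the rate can be spelled out as follows. This is a sketch of the implicit step; the constant $C$ is shorthand introduced here, not notation from the paper.

```latex
% C := \sum_{n=0}^{+\infty} \int_X \|T_{\nu_{n+1}}^{\mu_n}(x) - T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^2 \, d\nu_{n+1}(x) < +\infty
\begin{align*}
\min_{1\le n\le N}\operatorname{dist}\left(0,\partial_F^-\mathcal{F}(\mu_n)\right)
&\le \frac{1}{N}\sum_{n=0}^{N-1}\operatorname{dist}\left(0,\partial_F^-\mathcal{F}(\mu_{n+1})\right)
 && \text{(minimum is at most the average)}\\
&\le \frac{1}{N}\cdot\frac{\sqrt{2(\gamma^2L_H^2+1)N}}{\gamma}\,C^{\frac{1}{2}}
 = \frac{\sqrt{2(\gamma^2L_H^2+1)\,C}}{\gamma\sqrt{N}}.
\end{align*}
```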

A.7 Proof of Theorem 4

Convergence in terms of objective values

Since $H\in C^{2}(X)$ has a uniformly bounded Hessian, recall from (36) that

$$\partial_{F}^{-}\mathcal{F}(\mu_{n+1})=\partial(\mathcal{E}_{G}+\mathscr{H})(\mu_{n+1})-\nabla H\ni\dfrac{T_{\mu_{n+1}}^{\nu_{n+1}}-I}{\gamma}-\nabla H.$$

Since $\mathcal{F}(\mu_{0})-\mathcal{F}^{*}<r_{0}$ and the sequence $\{\mathcal{F}(\mu_{n})\}_{n\in\mathbb{N}}$ is nonincreasing (Lem. 1), we have $\mathcal{F}(\mu_{n})-\mathcal{F}^{*}<r_{0}$ for all $n\in\mathbb{N}$. The Łojasiewicz condition then implies

$$\begin{aligned}
c\left(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*}\right)^{\theta}
&\leq\left\|\dfrac{T_{\mu_{n+1}}^{\nu_{n+1}}-I}{\gamma}-\nabla H\right\|_{L^{2}(X,X,\mu_{n+1})}\\
&=\left(\int_{X}\left\|\dfrac{T_{\mu_{n+1}}^{\nu_{n+1}}(x)-x}{\gamma}-\nabla H(x)\right\|^{2}d\mu_{n+1}(x)\right)^{\frac{1}{2}}\\
&\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{\gamma}\left(\int_{X}\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}\,d\nu_{n+1}(x)\right)^{\frac{1}{2}}
\end{aligned}\tag{42}$$

where the last inequality follows from the estimates in A.6 and $L_{H}$ is the Lipschitz constant of $\nabla H$. Combining with Lem. 1, we derive

$$c\left(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*}\right)^{\theta}\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{\sqrt{\gamma}}\left(\mathcal{F}(\mu_{n})-\mathcal{F}(\mu_{n+1})\right)^{\frac{1}{2}}$$

or

$$\left(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*}\right)^{2\theta}\leq\dfrac{2(\gamma^{2}L_{H}^{2}+1)}{c^{2}\gamma}\left((\mathcal{F}(\mu_{n})-\mathcal{F}^{*})-(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*})\right). \tag{43}$$

We then use the following lemma [63, Lem. 4].

Lemma 11.

Let $\{s_{k}\}_{k\in\mathbb{N}}$ be a nonincreasing and nonnegative real sequence. Assume that there exist $\alpha\geq 0$ and $\beta>0$ such that for all sufficiently large $k$,

$$s_{k+1}^{\alpha}\leq\beta(s_{k}-s_{k+1}). \tag{44}$$

Then

  • (i) if $\alpha=0$, the sequence $\{s_{k}\}_{k\in\mathbb{N}}$ converges to $0$ in a finite number of steps;

  • (ii) if $\alpha\in(0,1]$, the sequence $\{s_{k}\}_{k\in\mathbb{N}}$ converges linearly to $0$ with rate $\frac{\beta}{\beta+1}$;

  • (iii) if $\alpha>1$, the sequence $\{s_{k}\}_{k\in\mathbb{N}}$ converges sublinearly to $0$, i.e., there exists $\eta>0$ such that

    $$s_{k}\leq\eta k^{\frac{-1}{\alpha-1}}$$

    for sufficiently large $k$.

Compared to [63, Lem. 4], we have dropped the assumption $s_{k}\to 0$ in Lem. 11 because this assumption is redundant, i.e., it can be deduced from (44) together with the monotonicity and nonnegativity of $\{s_{k}\}_{k\in\mathbb{N}}$: the sequence converges, so $s_{k}-s_{k+1}\to 0$, and (44) then forces $s_{k}\to 0$.
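As an informal numerical illustration of Lem. 11 (not part of the proof), the following sketch builds the sequence that satisfies (44) with equality and lets one inspect its decay. The helper names `next_term` and `sequence`, the bisection solver, and the parameter values are assumptions made for this example only.

```python
def next_term(s_k, alpha, beta, iters=200):
    """Solve s**alpha = beta * (s_k - s) for s in [0, s_k] by bisection.

    This yields the next term of a sequence satisfying (44) with equality.
    Assumes alpha > 0, so that g(s) = s**alpha - beta*(s_k - s) is increasing.
    """
    lo, hi = 0.0, s_k
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid ** alpha - beta * (s_k - mid) > 0.0:
            hi = mid  # root lies to the left of mid
        else:
            lo = mid  # root lies at or to the right of mid
    return 0.5 * (lo + hi)


def sequence(s0, alpha, beta, n):
    """Return [s_0, ..., s_n] generated by repeated calls to next_term."""
    s = [s0]
    for _ in range(n):
        s.append(next_term(s[-1], alpha, beta))
    return s


if __name__ == "__main__":
    # alpha = 1: successive ratios equal beta / (beta + 1) (linear regime).
    lin = sequence(0.5, 1.0, 2.0, 10)
    print([lin[k + 1] / lin[k] for k in range(10)])
    # alpha = 2: k * s_k stays bounded (sublinear regime, exponent -1/(alpha-1) = -1).
    sub = sequence(0.5, 2.0, 1.0, 1000)
    print([k * sub[k] for k in (10, 100, 1000)])
```

Using equality in (44) is a deliberate choice: it produces the slowest sequence the hypothesis allows for a given start, so the observed decay reflects the worst case covered by the lemma.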

We now apply Lem. 11 with $s_{k}=\mathcal{F}(\mu_{k})-\mathcal{F}^{*}$, $\alpha=2\theta$, and $\beta=\frac{2(\gamma^{2}L_{H}^{2}+1)}{c^{2}\gamma}$, using (43), to derive the following:

  • (i) if $\theta=0$, $\mathcal{F}(\mu_{n})-\mathcal{F}^{*}$ converges to $0$ in a finite number of steps;

  • (ii) if $\theta\in(0,1/2]$, $\mathcal{F}(\mu_{n})-\mathcal{F}^{*}$ converges to $0$ linearly (exponentially fast) with rate

    $$\mathcal{F}(\mu_{n})-\mathcal{F}^{*}=O\left(\left(\dfrac{M}{M+1}\right)^{n}\right)\text{ where }M=\dfrac{2(\gamma^{2}L_{H}^{2}+1)}{c^{2}\gamma};$$

  • (iii) if $\theta\in(1/2,1)$, $\mathcal{F}(\mu_{n})-\mathcal{F}^{*}$ converges sublinearly to $0$, i.e.,

    $$\mathcal{F}(\mu_{n})-\mathcal{F}^{*}=O\left(n^{-\frac{1}{2\theta-1}}\right).$$
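The three regimes above can be summarized in a small lookup, sketched here for convenience. The function name `objective_rate`, its dictionary output, and the sample parameter values are hypothetical and chosen for this illustration; only the formulas for $M$, the factor $M/(M+1)$, and the exponent $-1/(2\theta-1)$ come from the text.

```python
def objective_rate(theta, gamma, L_H, c):
    """Classify the decay of F(mu_n) - F* from the Lojasiewicz exponent theta.

    gamma: step size, L_H: Lipschitz constant of grad H, c: Lojasiewicz constant.
    """
    if not 0.0 <= theta < 1.0:
        raise ValueError("theta must lie in [0, 1)")
    if gamma <= 0.0 or c <= 0.0:
        raise ValueError("gamma and c must be positive")
    # M is the constant beta obtained from (43).
    M = 2.0 * (gamma ** 2 * L_H ** 2 + 1.0) / (c ** 2 * gamma)
    if theta == 0.0:
        return {"regime": "finite", "M": M}
    if theta <= 0.5:
        # Linear rate O((M / (M + 1))**n).
        return {"regime": "linear", "M": M, "factor": M / (M + 1.0)}
    # Sublinear rate O(n**exponent) with exponent = -1 / (2*theta - 1).
    return {"regime": "sublinear", "M": M, "exponent": -1.0 / (2.0 * theta - 1.0)}


if __name__ == "__main__":
    print(objective_rate(0.5, 1.0, 1.0, 2.0))
    print(objective_rate(0.75, 1.0, 1.0, 2.0))
```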

A.8 Proof of Theorem 5

Cauchy sequence.

Replacing $n$ by $n-1$ in (A.7) and rearranging, we obtain

$$1\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{c\gamma}\left(\int_{X}\|T_{\nu_{n}}^{\mu_{n-1}}(x)-T_{\nu_{n}}^{\mu_{n}}(x)\|^{2}\,d\nu_{n}(x)\right)^{\frac{1}{2}}\left(\mathcal{F}(\mu_{n})-\mathcal{F}^{*}\right)^{-\theta}. \tag{45}$$

It follows from Lem. 1 and (45) that

$$\begin{aligned}
\int_{X}\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}\,d\nu_{n+1}(x)
&\leq\gamma\left(\mathcal{F}(\mu_{n})-\mathcal{F}(\mu_{n+1})\right)\\
&\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{c}\left(\int_{X}\|T_{\nu_{n}}^{\mu_{n-1}}(x)-T_{\nu_{n}}^{\mu_{n}}(x)\|^{2}\,d\nu_{n}(x)\right)^{\frac{1}{2}}\left(\mathcal{F}(\mu_{n})-\mathcal{F}^{*}\right)^{-\theta}\left(\mathcal{F}(\mu_{n})-\mathcal{F}(\mu_{n+1})\right).
\end{aligned}\tag{46}$$

Since the function $s:\mathbb{R}^{+}\to\mathbb{R}$, $s(t)=t^{1-\theta}$, is concave for $\theta\in[0,1)$, the tangent inequality holds:

$$s^{\prime}(a)(a-b)\leq s(a)-s(b).$$

Since $s^{\prime}(t)=(1-\theta)t^{-\theta}$, applying this inequality with $a=\mathcal{F}(\mu_{n})-\mathcal{F}^{*}$ and $b=\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*}$ gives

$$(1-\theta)\left(\mathcal{F}(\mu_{n})-\mathcal{F}^{*}\right)^{-\theta}\left(\mathcal{F}(\mu_{n})-\mathcal{F}(\mu_{n+1})\right)\leq\left(\mathcal{F}(\mu_{n})-\mathcal{F}^{*}\right)^{1-\theta}-\left(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*}\right)^{1-\theta}. \tag{47}$$

From (A.8) and (47)

$$\begin{aligned}
\int_{X}\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}\,d\nu_{n+1}(x)\leq{}&\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{(1-\theta)c}\left(\int_{X}\|T_{\nu_{n}}^{\mu_{n-1}}(x)-T_{\nu_{n}}^{\mu_{n}}(x)\|^{2}\,d\nu_{n}(x)\right)^{\frac{1}{2}}\\
&\times\left[\left(\mathcal{F}(\mu_{n})-\mathcal{F}^{*}\right)^{1-\theta}-\left(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*}\right)^{1-\theta}\right]
\end{aligned}$$

or equivalently,

$$\begin{aligned}
\dfrac{r_{n}}{\sqrt{r_{n-1}}}
&:=\dfrac{\int_{X}\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}\,d\nu_{n+1}(x)}{\left(\int_{X}\|T_{\nu_{n}}^{\mu_{n-1}}(x)-T_{\nu_{n}}^{\mu_{n}}(x)\|^{2}\,d\nu_{n}(x)\right)^{\frac{1}{2}}}\\
&\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{(1-\theta)c}\left[\left(\mathcal{F}(\mu_{n})-\mathcal{F}^{*}\right)^{1-\theta}-\left(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*}\right)^{1-\theta}\right]
\end{aligned}\tag{48}$$

where $r_{n}:=\int_{X}\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}\,d\nu_{n+1}(x).$

By telescoping (A.8) from $n=1$ to $+\infty$ we obtain

$$\sum_{n=1}^{+\infty}\dfrac{r_{n}}{\sqrt{r_{n-1}}}<+\infty.$$

On the other hand, the AM-GM inequality gives

$$\dfrac{r_{n}}{\sqrt{r_{n-1}}}+\sqrt{r_{n-1}}\geq 2\sqrt{r_{n}}, \tag{49}$$

so summing (49) over $n$ and rearranging yields $\sum_{n=0}^{+\infty}\sqrt{r_{n}}<+\infty$. From (A.4), (21) of the proof of Thm. 1, we have $r_{n}\geq W_{2}^{2}(\mu_{n},\mu_{n+1})$, so we obtain

$$\sum_{n=0}^{+\infty}W_{2}(\mu_{n},\mu_{n+1})<+\infty,$$

which implies that $\{\mu_{n}\}_{n\in\mathbb{N}}$ is a Cauchy sequence in the Wasserstein topology. Since the Wasserstein space $(\mathcal{P}_{2}(X),W_{2})$ is complete [4, Thm. 2.2], every Cauchy sequence converges, i.e., there exists $\mu^{*}\in\mathcal{P}_{2}(X)$ such that $\mu_{n}\xrightarrow{\text{Wass}}\mu^{*}$.
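The passage from summability of the consecutive distances to the Cauchy property is the standard triangle-inequality argument. It is written out here for convenience, with $m<p$ denoting arbitrary indices (notation introduced only for this remark):

```latex
W_{2}(\mu_{m},\mu_{p})
\;\le\; \sum_{n=m}^{p-1} W_{2}(\mu_{n},\mu_{n+1})
\;\le\; \sum_{n=m}^{\infty} W_{2}(\mu_{n},\mu_{n+1})
\;\xrightarrow[m\to\infty]{}\; 0 .
```

The right-hand side is the tail of a convergent series, so it vanishes uniformly in $p$.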

We now prove that $\mu^{*}$ is actually an optimal solution of $\mathcal{F}$ by showing $\mathcal{F}(\mu^{*})=\mathcal{F}^{*}$. First, since $G$ and $H$ have quadratic growth, it holds that

$$\mathcal{E}_{G}(\mu_{n})\to\mathcal{E}_{G}(\mu^{*}),\qquad\mathcal{E}_{H}(\mu_{n})\to\mathcal{E}_{H}(\mu^{*}).$$

On the other hand,

$$\begin{aligned}
\mathcal{F}^{*}=\lim_{n\to\infty}\mathcal{F}(\mu_{n}) &=\liminf_{n\to\infty}\mathcal{F}(\mu_{n})=\liminf_{n\to\infty}\mathscr{H}(\mu_{n})+\mathcal{E}_{G}(\mu^{*})-\mathcal{E}_{H}(\mu^{*})\\
&\geq\mathscr{H}(\mu^{*})+\mathcal{E}_{G}(\mu^{*})-\mathcal{E}_{H}(\mu^{*})=\mathcal{F}(\mu^{*})
\end{aligned}$$

since $\mathscr{H}$ is lower semicontinuous (l.s.c.). Equality must hold, i.e., $\mathcal{F}^{*}=\mathcal{F}(\mu^{*})$, because $\mathcal{F}^{*}$ is the optimal value and hence $\mathcal{F}(\mu^{*})\geq\mathcal{F}^{*}$.

Convergence rate of $\{\mu_{n}\}_{n\in\mathbb{N}}$.

(i) If $\theta=0$

From item (i) of Thm. 4, there exists $n_{0}\in\mathbb{N}$ such that $\mathcal{F}(\mu_{n})=\mathcal{F}^{*}$ for all $n\geq n_{0}$. It then follows from (22) that $\mu_{n_{0}}=\mu_{n_{0}+1}=\mu_{n_{0}+2}=\ldots$, which further implies that $\mu_{n}=\mu^{*}$ for all $n\geq n_{0}$.

(ii) If $\theta\in(0,1/2]$

Let $s_{i}=\sum_{n=i}^{\infty}\sqrt{r_{n}}$. We have

$$s_{i}\geq\sum_{n=i}^{\infty}W_{2}(\mu_{n},\mu_{n+1})\geq W_{2}(\mu_{i},\mu^{*}) \tag{50}$$

where the last inequality follows from the triangle inequality

$$W_{2}(\mu_{i},\mu^{*})\leq\sum_{n=i}^{N-1}W_{2}(\mu_{n},\mu_{n+1})+W_{2}(\mu_{N},\mu^{*})$$

by letting $N\to\infty$ and noting that $\mu_{N}\xrightarrow{\text{Wass}}\mu^{*}$.

From (A.8) and (49),

$$2\sqrt{r_{n}}\leq\sqrt{r_{n-1}}+\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{(1-\theta)c}\left[(\mathcal{F}(\mu_{n})-\mathcal{F}^{*})^{1-\theta}-(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*})^{1-\theta}\right]. \tag{51}$$

Telescoping (51) from $n=i$ to $+\infty$ gives

$$s_{i}\leq\sqrt{r_{i-1}}+\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{(1-\theta)c}(\mathcal{F}(\mu_{i})-\mathcal{F}^{*})^{1-\theta}\leq\sqrt{r_{i-1}}+\dfrac{(2(\gamma^{2}L_{H}^{2}+1))^{\frac{1}{2\theta}}}{(1-\theta)\gamma^{\frac{1-\theta}{\theta}}c^{\frac{1}{\theta}}}\,r_{i-1}^{\frac{1-\theta}{2\theta}}, \tag{52}$$

where the last inequality uses (A.7). Since $r_{i}\to 0$ as $i\to\infty$, we have $r_{i}<1$ for $i$ sufficiently large. Because $\theta\in(0,1/2]$ gives $\frac{1-\theta}{2\theta}\geq\frac{1}{2}$, it follows that $r_{i-1}^{\frac{1-\theta}{2\theta}}\leq\sqrt{r_{i-1}}$ for such $i$, and (52) yields, for $i$ sufficiently large,

$$s_{i}\leq M\sqrt{r_{i-1}}=M(s_{i-1}-s_{i})$$

where

$$M=1+\dfrac{(2(\gamma^{2}L_{H}^{2}+1))^{\frac{1}{2\theta}}}{(1-\theta)\gamma^{\frac{1-\theta}{\theta}}c^{\frac{1}{\theta}}}. \tag{53}$$

Rewriting this as $s_{i}\leq\frac{M}{M+1}s_{i-1}$ and using (50), we derive $W_{2}(\mu_{i},\mu^{*})=O\left(\left(\frac{M}{M+1}\right)^{i}\right)$.
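For completeness, the rearrangement behind this linear rate can be written out as follows. Here $i_{0}$ is a symbol introduced only for this remark; it denotes an index from which the inequality $s_{i}\leq M(s_{i-1}-s_{i})$ holds.

```latex
s_{i} \le M\,(s_{i-1}-s_{i})
\;\Longleftrightarrow\;
(M+1)\,s_{i} \le M\,s_{i-1}
\;\Longrightarrow\;
s_{i} \le \left(\frac{M}{M+1}\right)^{i-i_{0}} s_{i_{0}}
\quad (i \ge i_{0}),
\qquad\text{hence}\qquad
W_{2}(\mu_{i},\mu^{*}) \le s_{i}
= O\!\left(\left(\frac{M}{M+1}\right)^{i}\right).
```

The last bound uses (50), and the factor $\left(\frac{M}{M+1}\right)^{-i_{0}}s_{i_{0}}$ is absorbed into the $O(\cdot)$ constant.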

(iii) If $\theta\in(1/2,1)$

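Since $\theta\in(1/2,1)$ gives $\frac{1-\theta}{2\theta}<\frac{1}{2}$, we have $\sqrt{r_{i-1}}\leq r_{i-1}^{\frac{1-\theta}{2\theta}}$ whenever $r_{i-1}<1$. Hence (52) implies, for all $i$ sufficiently large,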

$$s_{i}\leq M\,r_{i-1}^{\frac{1-\theta}{2\theta}}=M(s_{i-1}-s_{i})^{\frac{1-\theta}{\theta}}$$

where $M$ is the same as in (53).

Applying Lem. 11(iii), $s_{i}=O\left(i^{-\frac{1-\theta}{2\theta-1}}\right)$, which implies (by (50)) $W_{2}(\mu_{i},\mu^{*})=O\left(i^{-\frac{1-\theta}{2\theta-1}}\right)$.
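The sublinear rate can be checked numerically on the equality case of the recurrence $s_{i}\leq M(s_{i-1}-s_{i})^{\frac{1-\theta}{\theta}}$. The following is only a sanity check under assumed values ($\theta=3/4$, $M=2$, $s_{0}=1$, all chosen arbitrarily), with hypothetical helper names `next_s` and `simulate`; it is not part of the proof.

```python
def next_s(s_prev, M, q, iters=80):
    """Solve s + (s / M)**q = s_prev for s in [0, s_prev] by bisection.

    This is the equality case s_i = M * (s_{i-1} - s_i)**(1/q).
    """
    lo, hi = 0.0, s_prev
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid + (mid / M) ** q > s_prev:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def simulate(theta, M=2.0, s0=1.0, steps=2000):
    """Iterate the equality case of the recurrence for a given exponent theta."""
    q = theta / (1.0 - theta)  # inverse of the exponent (1 - theta) / theta
    seq = [s0]
    for _ in range(steps):
        seq.append(next_s(seq[-1], M, q))
    return seq


theta = 0.75
rate = (1.0 - theta) / (2.0 * theta - 1.0)  # predicted decay exponent, here 0.5
seq = simulate(theta)
# If s_i decays like i**(-rate), the rescaled values stay bounded away from 0 and infinity.
scaled = [s * i ** rate for i, s in enumerate(seq) if i >= 1]
```

In this assumed setting the rescaled sequence `scaled` remains within fixed positive bounds, which is consistent with the rate $O\left(i^{-\frac{1-\theta}{2\theta-1}}\right)$.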

Appendix B Implementation details

We present implementations of FB Euler and semi FB Euler when $\mathscr{H}$ is the negative entropy. The main recipe of these implementations is the deep learning approach to approximate the JKO operator presented in [45] (MIT license).

B.1 FB Euler

The push-forward step $\nu_{n+1}=(I-\gamma\nabla F)_{\#}\mu_{n}$ is rather straightforward: if $Z$ are samples from $\mu_{n}$, then $Z-\gamma\nabla F(Z)$ are samples from $\nu_{n+1}$.
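A minimal sketch of this sample-level push-forward is given below. The function names and the quadratic potential $F(z)=\frac{1}{2}\|z\|^{2}$ are assumptions made purely for illustration; they do not come from the original implementation.

```python
import random


def push_forward(samples, grad_F, gamma):
    """Map each sample z of mu_n to z - gamma * grad_F(z), a sample of nu_{n+1}."""
    return [[zi - gamma * gi for zi, gi in zip(z, grad_F(z))] for z in samples]


def grad_F_quadratic(z):
    """Gradient of the illustrative potential F(z) = 0.5 * ||z||^2 (an assumption)."""
    return list(z)


random.seed(0)
# Five two-dimensional samples standing in for draws from mu_n.
Z = [[random.gauss(0.0, 1.0) for _ in range(2)] for _ in range(5)]
# Corresponding samples from nu_{n+1}; for this F each point is scaled by 1 - gamma.
Xi = push_forward(Z, grad_F_quadratic, gamma=0.1)
```

No density evaluation is needed at this step; only the gradient of $F$ at the current samples is used.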

On the other hand, to move from $\nu_{n+1}$ to $\mu_{n+1}$ we have to work out the JKO operator. The idea goes as follows [45]. We wish to compute the optimal Monge map pushing $\nu_{n+1}$ to $\mu_{n+1}$. By Brenier's theorem, this map has to be a (sub)gradient field of some convex function. Therefore, one can "parametrize" $\mu\in\mathcal{P}_{2,\operatorname{abs}}(X)$ as $\mu=\nabla\psi_{\#}\nu_{n+1}$ for some convex function $\psi$. We can then write the objective function of the JKO operator as follows

$$\begin{aligned}
\mathscr{H}(\mu)+\dfrac{1}{2\gamma}W_{2}^{2}(\mu,\nu_{n+1}) &=\mathscr{H}(\mu)+\dfrac{1}{2\gamma}\int_{X}\|x-T_{\nu_{n+1}}^{\mu}(x)\|^{2}\,d\nu_{n+1}(x)\\
&=\mathscr{H}(\nabla\psi_{\#}\nu_{n+1})+\dfrac{1}{2\gamma}\int_{X}\|x-\nabla\psi(x)\|^{2}\,d\nu_{n+1}(x).
\end{aligned} \tag{54}$$

We next use the following result: for any diffeomorphism $T:X\to X$ and any $\rho\in\mathcal{P}_{2,\operatorname{abs}}(X)$,

$$-\mathscr{H}(T_{\#}\rho)=-\mathscr{H}(\rho)+\int_{X}\log|\det\nabla T(x)|\,d\rho(x),$$

so (B.1) can be written as (up to a constant that does not depend on $\psi$)

$$-\int_{X}\log\det\nabla^{2}\psi(x)\,d\nu_{n+1}(x)+\dfrac{1}{2\gamma}\int_{X}\|x-\nabla\psi(x)\|^{2}\,d\nu_{n+1}(x).$$
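As a sanity check of the change-of-variables identity for the negative entropy used above, consider a one-dimensional example that is not in the original: $\rho=\mathcal{N}(0,1)$ and the linear map $T(x)=ax$ with $a>0$ (both chosen here purely for illustration), so that $T_{\#}\rho=\mathcal{N}(0,a^{2})$ and $\nabla T\equiv a$.

```latex
\mathscr{H}(\rho) = \int \rho\log\rho \,dx = -\tfrac{1}{2}\log(2\pi e),
\qquad
\mathscr{H}(T_{\#}\rho) = -\tfrac{1}{2}\log(2\pi e\,a^{2}) = \mathscr{H}(\rho) - \log a,
\\[4pt]
-\mathscr{H}(T_{\#}\rho)
= -\mathscr{H}(\rho) + \log a
= -\mathscr{H}(\rho) + \int \log|\det\nabla T(x)|\,d\rho(x).
```

With $T=\nabla\psi$ and $\psi(x)=\frac{a}{2}x^{2}$, the term $\log a$ is exactly $\log\det\nabla^{2}\psi$, which is the quantity appearing in the rewritten objective.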

We can now leverage a class of input convex neural networks (ICNNs) [5] $\psi_{\theta}(x)$ (where $\theta$ denotes the neural network's parameters and $x$ the input), for which $x\mapsto\psi_{\theta}(x)$ is convex. The JKO operator boils down to the following optimization problem

$$\min_{\theta}\;-\int_{X}\log\det\nabla^{2}_{x}\psi_{\theta}(x)\,d\nu_{n+1}(x)+\dfrac{1}{2\gamma}\int_{X}\|x-\nabla_{x}\psi_{\theta}(x)\|^{2}\,d\nu_{n+1}(x).$$

This problem can be solved effectively by standard deep learning optimizers such as Adam or stochastic gradient descent, since samples from $\nu_{n+1}$ can be drawn recursively. A detailed implementation of FB Euler is given in Alg. 1.
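To make the Monte Carlo objective concrete, the sketch below replaces the ICNN by a one-dimensional quadratic family $\psi(x)=\frac{a}{2}x^{2}+bx$ with $a>0$. This substitution is an assumption made for illustration only and is not the ICNN parametrization of [45]. For this family $\nabla\psi(x)=ax+b$ and $\nabla^{2}\psi\equiv a$, so the sampled objective has a closed-form minimizer that can be compared against nearby parameter values. The names `jko_loss` and `jko_minimizer` are hypothetical.

```python
import math
import random


def jko_loss(a, b, samples, gamma):
    """Monte Carlo JKO objective for psi(x) = 0.5 * a * x**2 + b * x with a > 0.

    grad psi(x) = a * x + b and the Hessian is the constant a, so the
    entropy term reduces to -log(a).
    """
    w2_hat = sum((x - (a * x + b)) ** 2 for x in samples) / len(samples)
    delta_h_hat = -math.log(a)
    return w2_hat / (2.0 * gamma) + delta_h_hat


def jko_minimizer(samples, gamma):
    """Closed-form minimizer of jko_loss over (a, b) for this quadratic family."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    # Stationarity in a: a**2 - a - gamma / var = 0, taking the positive root.
    a = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * gamma / var))
    # Stationarity in b: the map preserves the sample mean.
    b = (1.0 - a) * mean
    return a, b


random.seed(0)
nu_samples = [random.gauss(1.0, 2.0) for _ in range(1000)]  # stand-in for nu_{n+1}
gamma = 0.1
a_star, b_star = jko_minimizer(nu_samples, gamma)
best = jko_loss(a_star, b_star, nu_samples, gamma)
```

In practice the parameters would be updated by a stochastic optimizer rather than solved in closed form; the closed form is available here only because of the simplified family.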

Algorithm 1 FB Euler for sampling
Input: Initial measure $\mu_{0}\in\mathcal{P}_{2,\operatorname{abs}}(X)$, discretization stepsize $\gamma>0$, number of steps $K>0$, batch size $B$.
for $k=1$ to $K$ do
     for $i=1,2,\ldots$ do
         Draw a batch of samples $Z\sim\mu_{0}$ of size $B$;
         $\Xi\leftarrow(I-\gamma\nabla F)\circ\nabla_{x}\psi_{\theta_{k}}\circ(I-\gamma\nabla F)\circ\nabla_{x}\psi_{\theta_{k-1}}\circ\ldots\circ(I-\gamma\nabla F)(Z)$;
         $\widehat{W_{2}^{2}}\leftarrow\frac{1}{B}\sum_{\xi\in\Xi}\|\nabla_{x}\psi_{\theta}(\xi)-\xi\|^{2}$;
         $\widehat{\Delta\mathscr{H}}\leftarrow-\frac{1}{B}\sum_{\xi\in\Xi}\log\det\nabla_{x}^{2}\psi_{\theta}(\xi)$;
         $\widehat{\mathcal{L}}\leftarrow\frac{1}{2\gamma}\widehat{W_{2}^{2}}+\widehat{\Delta\mathscr{H}}$;
         Apply an optimization step (e.g., Adam) over $\theta$ using $\nabla_{\theta}\widehat{\mathcal{L}}$.
     end for
     $\theta_{k+1}\leftarrow\theta$.
end for
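The recursive sampling step that produces $\Xi$ can be sketched as a composition of maps applied to a draw from $\mu_{0}$. The helper name `sample_nu`, the zero-based list of already trained maps, and the toy choices $F(x)=\frac{1}{2}\|x\|^{2}$ and $\nabla\psi(x)=1.1x$ are all assumptions for illustration; in the algorithm the maps would be gradients of trained ICNNs.

```python
def sample_nu(z, grad_F, gamma, learned_maps):
    """Push one sample z ~ mu_0 through the alternating forward/backward maps.

    learned_maps[j] plays the role of grad_x psi for the j-th completed JKO step.
    The result is a sample of the measure obtained after len(learned_maps)
    full iterations followed by one more forward (push-forward) step.
    """
    def forward(x):
        return [xi - gamma * gi for xi, gi in zip(x, grad_F(x))]

    x = forward(z)
    for grad_psi in learned_maps:
        x = forward(grad_psi(x))
    return x


def grad_F(x):
    """Gradient of the illustrative potential F(x) = 0.5 * ||x||^2."""
    return list(x)


def scale(x):
    """Stand-in for a trained map grad psi(x) = 1.1 * x (gradient of a convex quadratic)."""
    return [1.1 * v for v in x]


out = sample_nu([1.0, -2.0], grad_F, gamma=0.1, learned_maps=[scale, scale])
```

Because every batch is regenerated from $\mu_{0}$, the cost of drawing samples grows linearly with the number of completed outer steps.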

B.2 Semi FB Euler

Following the same idea as in the implementation of FB Euler, a detailed implementation of semi FB Euler is presented in Alg. 2.

Algorithm 2 Semi FB Euler for sampling
Input: Initial measure $\mu_{0}\in\mathcal{P}_{2,\operatorname{abs}}(X)$, discretization step size $\gamma>0$, number of steps $K>0$, batch size $B$.
for $k=1$ to $K$ do
     for $i=1,2,\ldots$ do
         Draw a batch of samples $Z\sim\mu_{0}$ of size $B$;
         $\Xi\leftarrow(I+\gamma\nabla H)\circ\nabla_{x}\psi_{\theta_{k}}\circ(I+\gamma\nabla H)\circ\nabla_{x}\psi_{\theta_{k-1}}\circ\ldots\circ(I+\gamma\nabla H)(Z)$;
         $\widehat{W_{2}^{2}}\leftarrow\frac{1}{B}\sum_{\xi\in\Xi}\|\nabla_{x}\psi_{\theta}(\xi)-\xi\|^{2}$;
         $\widehat{\mathcal{U}}\leftarrow\frac{1}{B}\sum_{\xi\in\Xi}G(\nabla_{x}\psi_{\theta}(\xi))$;
         $\widehat{\Delta\mathscr{H}}\leftarrow-\frac{1}{B}\sum_{\xi\in\Xi}\log\det\nabla_{x}^{2}\psi_{\theta}(\xi)$;
         $\widehat{\mathcal{L}}\leftarrow\frac{1}{2\gamma}\widehat{W_{2}^{2}}+\widehat{\mathcal{U}}+\widehat{\Delta\mathscr{H}}$;
         Apply an optimization step (e.g., Adam) over $\theta$ using $\nabla_{\theta}\widehat{\mathcal{L}}$.
     end for
     $\theta_{k+1}\leftarrow\theta$.
end for
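The two places where Alg. 2 differs from Alg. 1, namely the forward map $I+\gamma\nabla H$ and the extra potential term $\widehat{\mathcal{U}}$, can be sketched in one dimension as follows. As before, the quadratic family $\psi(x)=\frac{a}{2}x^{2}+bx$ replaces the ICNN, and the choices $G(x)=x^{2}$ and $H(x)=\frac{1}{2}x^{2}$ are assumptions made only so that the numbers can be checked by hand.

```python
import math


def forward_step(samples, grad_H, gamma):
    """Push samples through I + gamma * grad_H (note the plus sign)."""
    return [x + gamma * grad_H(x) for x in samples]


def semi_fb_loss(a, b, samples, gamma, G):
    """Monte Carlo loss of Alg. 2 for psi(x) = 0.5 * a * x**2 + b * x, a > 0, in 1-D."""
    n = len(samples)
    mapped = [a * x + b for x in samples]  # grad psi evaluated at each sample
    w2_hat = sum((t - x) ** 2 for t, x in zip(mapped, samples)) / n
    u_hat = sum(G(t) for t in mapped) / n  # potential energy of G under the pushed measure
    delta_h_hat = -math.log(a)             # -log det of the (constant) Hessian
    return w2_hat / (2.0 * gamma) + u_hat + delta_h_hat


# Illustrative choices (assumptions): G(x) = x**2 and H(x) = 0.5 * x**2.
xi = forward_step([-1.0, 0.0, 2.0], grad_H=lambda x: x, gamma=0.5)
loss_identity = semi_fb_loss(1.0, 0.0, xi, gamma=0.5, G=lambda t: t * t)
loss_no_G = semi_fb_loss(2.0, 0.0, [1.0], gamma=0.5, G=lambda t: 0.0)
```

With $G\equiv 0$ the loss reduces to the FB Euler loss of Alg. 1, which is what `loss_no_G` evaluates.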

Appendix C Numerical illustrations

We perform the numerical experiments on a high-performance computing cluster with GPU support. We use Python version 3.8.0 and allocate 8 GB of memory for the experiments. The running time is a couple of hours.

C.1 Gaussian mixture

Consider a target Gaussian mixture of the following form:

$$\pi(x)\propto\exp(-F(x)):=\sum_{i=1}^{K}\pi_{i}\exp\left(-\dfrac{\|x-x_{i}\|^{2}}{\sigma^{2}}\right).$$

We write

$$\begin{aligned}
F(x) &= -\log\left(\sum_{i=1}^{K}\pi_{i}\exp\left(-\dfrac{\|x-x_{i}\|^{2}}{\sigma^{2}}\right)\right)\\
&= -\log\left(\sum_{i=1}^{K}\pi_{i}\exp\left(-\dfrac{\|x\|^{2}+\|x_{i}\|^{2}-2\langle x,x_{i}\rangle}{\sigma^{2}}\right)\right)\\
&= -\log\left(\sum_{i=1}^{K}\pi_{i}\exp\left(-\dfrac{\|x\|^{2}}{\sigma^{2}}\right)\times\exp\left(-\dfrac{\|x_{i}\|^{2}-2\langle x,x_{i}\rangle}{\sigma^{2}}\right)\right)\\
&= \dfrac{\|x\|^{2}}{\sigma^{2}}-\underbrace{\log\left(\sum_{i=1}^{K}\pi_{i}\exp\left(-\dfrac{\|x_{i}\|^{2}-2\langle x,x_{i}\rangle}{\sigma^{2}}\right)\right)}_{\text{convex}}
\end{aligned}$$

which is DC. The second component is convex because (a) log-sum-exp is convex and (b) the composition of a convex function with an affine function is convex.
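As a sanity check, the decomposition above can be verified numerically. The following is a minimal sketch using only the Python standard library; the equal mixture weights, the sampling ranges, and the helper names (`log_sum_exp`, `F`, `G`, `H`) are illustrative assumptions and are not taken from the experiments.

```python
import math
import random

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sq_norm(x):
    return sum(c * c for c in x)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def F(x, weights, centers, sigma):
    """Negative log of the unnormalized mixture density."""
    return -log_sum_exp([
        math.log(w) - sq_norm([a - b for a, b in zip(x, c)]) / sigma ** 2
        for w, c in zip(weights, centers)
    ])

def G(x, sigma):
    """First DC component: the convex quadratic ||x||^2 / sigma^2."""
    return sq_norm(x) / sigma ** 2

def H(x, weights, centers, sigma):
    """Second DC component: log-sum-exp of functions affine in x."""
    return log_sum_exp([
        math.log(w) - (sq_norm(c) - 2.0 * dot(x, c)) / sigma ** 2
        for w, c in zip(weights, centers)
    ])

rng = random.Random(0)
K, sigma = 5, 1.0
weights = [1.0 / K] * K  # assumption: equal weights
centers = [[rng.uniform(-5, 5), rng.uniform(-5, 5)] for _ in range(K)]  # assumption: range

max_gap = 0.0        # largest observed |F - (G - H)|
midpoint_ok = True   # midpoint convexity of H on random pairs
for _ in range(1000):
    x = [rng.uniform(-8, 8), rng.uniform(-8, 8)]
    y = [rng.uniform(-8, 8), rng.uniform(-8, 8)]
    gap = abs(F(x, weights, centers, sigma)
              - (G(x, sigma) - H(x, weights, centers, sigma)))
    max_gap = max(max_gap, gap)
    mid = [(a + b) / 2.0 for a, b in zip(x, y)]
    h_mid = H(mid, weights, centers, sigma)
    h_avg = 0.5 * (H(x, weights, centers, sigma) + H(y, weights, centers, sigma))
    midpoint_ok = midpoint_ok and h_mid <= h_avg + 1e-9
```

The midpoint test is only a necessary condition for convexity on sampled pairs; it is a sanity check and does not replace the argument in the text.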

Experiment details

We set $K=5$ and randomly generate $x_{1},x_{2},\ldots,x_{5}\in\mathbb{R}^{2}$. We set $\sigma=1$. The initial distribution is $\mu_{0}=\mathcal{N}(0,16I)$. We use $\gamma=0.1$ for both FB Euler and semi FB Euler. We train both algorithms for $40$ iterations using the Adam optimizer with a batch size of $512$; the first $20$ iterations use a learning rate of $5\times 10^{-3}$, while the latter $20$ iterations use $2\times 10^{-3}$.

We run the above experiment $5$ times, where $x_{1},x_{2},\ldots,x_{5}$ are randomly generated each time. Fig. 1 (b) reports the mean and $2\times\text{std}$ curves of the KL divergence along the training process.
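The settings listed above can be collected in a small configuration sketch. The constant and function names are hypothetical, and the 0-based iteration index is an assumption; only the numerical values come from the text.

```python
import random

GAMMA = 0.1            # step size for both FB Euler and semi FB Euler
NUM_ITERATIONS = 40
BATCH_SIZE = 512
INIT_STD = 4.0         # N(0, 16 I) has standard deviation 4 per coordinate

def learning_rate(iteration):
    """Piecewise-constant Adam learning rate; `iteration` is 0-based (assumption)."""
    return 5e-3 if iteration < 20 else 2e-3

def sample_initial_batch(rng, batch_size=BATCH_SIZE, dim=2):
    """Draw a batch from the initial distribution mu_0 = N(0, 16 I)."""
    return [[rng.gauss(0.0, INIT_STD) for _ in range(dim)]
            for _ in range(batch_size)]

rng = random.Random(0)
batch = sample_initial_batch(rng)
schedule = [learning_rate(k) for k in range(NUM_ITERATIONS)]
```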

C.2 Distance-to-set prior

Let $\pi$ be the original prior and $\Theta$ be the constraint set that we want to impose. The distance-to-set prior [53] is defined, for some $\rho>0$, by

$$\tilde{\pi}(\theta)\propto\pi(\theta)\exp\left(-\dfrac{\rho}{2}d(\theta,\Theta)^{2}\right),$$

which exponentially penalizes $\theta$ for deviating from the constraint set.

Given data $y$ and using this distance-to-set prior, the posterior reads

$$\bar{\pi}(\theta|y)\propto L(\theta|y)\pi(\theta)\exp\left(-\dfrac{\rho}{2}d(\theta,\Theta)^{2}\right),$$

where $L(\theta|y)$ is the likelihood.

The structure of $\bar{\pi}(\theta|y)$ depends on three separate components: the original prior, the likelihood, and the constraint set $\Theta$. In the ideal case, $\pi(\theta)$ and $L(\theta|y)$ are given in nice forms (e.g., log-concave), and $\Theta$ is a convex set. Indeed, if $\Theta$ is a convex set, $\theta\mapsto d(\theta,\Theta)^{2}$ is convex, making the whole posterior log-concave. If $\Theta$ is additionally closed, $d(\theta,\Theta)^{2}$ is $L$-smooth.

However, whenever $\Theta$ is nonconvex, the function $\theta\mapsto d(\theta,\Theta)^{2}$ is not continuously differentiable. This follows from the Motzkin-Bunt theorem [16, Thm. 9.2.5], which asserts that any Chebyshev set (a set $S\subset X$ is called Chebyshev if every point in $X$ has a unique nearest point in $S$) has to be closed and convex.
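A one-dimensional illustration of this loss of differentiability, under the assumption of a hypothetical two-point constraint set $\{-1,1\}\subset\mathbb{R}$ (not a set used in the experiments): the origin has two nearest points, and the one-sided slopes of the squared distance disagree there.

```python
def sq_dist_to_set(theta, points):
    """Squared distance from a scalar theta to a finite set of scalars."""
    return min((theta - p) ** 2 for p in points)

points = [-1.0, 1.0]  # assumption: a toy nonconvex set in R
h = 1e-6

# One-sided finite-difference slopes of theta -> d(theta, points)^2 at theta = 0.
left_slope = (sq_dist_to_set(0.0, points) - sq_dist_to_set(-h, points)) / h
right_slope = (sq_dist_to_set(h, points) - sq_dist_to_set(0.0, points)) / h
```

Here `left_slope` is close to $2$ and `right_slope` is close to $-2$, so the squared distance has a kink at the point where the nearest point is not unique.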

On the other hand, $d(\theta,\Theta)^{2}$ is always DC regardless of the geometric structure of $\Theta$:

$$\begin{aligned}
d(\theta,\Theta)^{2} &= \inf_{x\in\Theta}\|\theta-x\|^{2}\\
&= \inf_{x\in\Theta}\left(\|\theta\|^{2}+\|x\|^{2}-2\langle x,\theta\rangle\right)\\
&= \|\theta\|^{2}+\inf_{x\in\Theta}\left(\|x\|^{2}-2\langle x,\theta\rangle\right)\\
&= \|\theta\|^{2}-\sup_{x\in\Theta}\left(-\|x\|^{2}+2\langle x,\theta\rangle\right).
\end{aligned}\tag{55}$$

Note that the supremum of an arbitrary family of affine functions is convex.

Therefore, the log-DC structure of the whole posterior only depends on whether the original prior and the likelihood are log-DC, which is likely to be the case.
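The decomposition (55) can be checked numerically on a finite, hence nonconvex, set. This is a minimal sketch in which the random point set and the helper names are illustrative assumptions.

```python
import random

def sq_norm(v):
    return sum(c * c for c in v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sq_dist(theta, points):
    """d(theta, Theta)^2 computed directly as a minimum over the finite set."""
    return min(sq_norm([a - b for a, b in zip(theta, x)]) for x in points)

def sup_part(theta, points):
    """sup over x in Theta of (-||x||^2 + 2<x, theta>): a max of affine functions."""
    return max(-sq_norm(x) + 2.0 * dot(x, theta) for x in points)

rng = random.Random(1)
points = [[rng.uniform(-3, 3), rng.uniform(-3, 3)] for _ in range(7)]  # assumption

max_gap = 0.0      # largest observed |d^2 - (||theta||^2 - sup_part)|
convex_ok = True   # midpoint convexity of sup_part on random pairs
for _ in range(1000):
    a = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
    b = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
    gap = abs(sq_dist(a, points) - (sq_norm(a) - sup_part(a, points)))
    max_gap = max(max_gap, gap)
    mid = [(u + v) / 2.0 for u, v in zip(a, b)]
    avg = 0.5 * (sup_part(a, points) + sup_part(b, points))
    convex_ok = convex_ok and sup_part(mid, points) <= avg + 1e-9
```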

Distance-to-set prior relaxed von Mises-Fisher. In directional statistics, the von Mises-Fisher distribution is a distribution over unit-length vectors (the unit sphere). It can be described as the restriction of a Gaussian distribution to a sphere. By using the distance-to-set prior, we can relax the spherical constraint as

$$\bar{\pi}(\theta)\propto\exp(-F(\theta)):=\exp\left(-\kappa\dfrac{(\theta-\mu)^{\top}(\theta-\mu)}{2}\right)\times\exp\left(-\dfrac{\rho}{2}\operatorname{dist}(\theta,S)^{2}\right)$$

where $S$ denotes the unit sphere in $\mathbb{R}^{d}$. By the DC structure (55) of the distance function, $F$ is DC with the following decomposition

$$\begin{aligned}
F(\theta) &= \left(\kappa\dfrac{\|\theta-\mu\|^{2}}{2}+\dfrac{\rho}{2}\|\theta\|^{2}\right)-\dfrac{\rho}{2}\sup_{x\in S}\left(-\|x\|^{2}+2\langle x,\theta\rangle\right)\\
&:= G(\theta)-H(\theta).
\end{aligned}$$

We also note that $H$ is not continuously differentiable because $S$ is nonconvex. Furthermore, $\rho\operatorname{proj}_{S}(\theta)\in\partial H(\theta)$, where $\operatorname{proj}_{S}(\theta)$ is the projection of $\theta$ onto $S$, which can be computed explicitly in this case.
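For the unit circle in $\mathbb{R}^{2}$, the pieces above have closed forms: the supremum equals $2\|\theta\|-1$ and the projection is $\theta/\|\theta\|$ for $\theta\neq 0$. The sketch below checks $F=G-H$ and the subgradient inequality for $\rho\operatorname{proj}_{S}(\theta)$ on random points. It assumes $S$ is the origin-centred unit circle and $\mu$ is the Gaussian centre from the formula for $F$; the tie-breaking choice at the origin and all function names are illustrative.

```python
import math
import random

KAPPA, RHO = 1.0, 100.0
MU = (1.0, 1.5)

def proj_circle(theta):
    """A nearest point on the unit circle; at the origin every point is nearest
    and (1, 0) is returned (arbitrary choice)."""
    n = math.hypot(theta[0], theta[1])
    if n == 0.0:
        return (1.0, 0.0)
    return (theta[0] / n, theta[1] / n)

def F(theta):
    dist = abs(math.hypot(theta[0], theta[1]) - 1.0)  # distance to the unit circle
    quad = (theta[0] - MU[0]) ** 2 + (theta[1] - MU[1]) ** 2
    return 0.5 * KAPPA * quad + 0.5 * RHO * dist ** 2

def G(theta):
    quad = (theta[0] - MU[0]) ** 2 + (theta[1] - MU[1]) ** 2
    return 0.5 * KAPPA * quad + 0.5 * RHO * (theta[0] ** 2 + theta[1] ** 2)

def H(theta):
    # sup over the unit circle of (-||x||^2 + 2<x, theta>) equals 2||theta|| - 1.
    return 0.5 * RHO * (2.0 * math.hypot(theta[0], theta[1]) - 1.0)

def subgrad_H(theta):
    p = proj_circle(theta)
    return (RHO * p[0], RHO * p[1])

rng = random.Random(2)
max_gap = 0.0       # largest observed |F - (G - H)|
subgrad_ok = True   # H(phi) >= H(theta) + <g, phi - theta> on random pairs
for _ in range(1000):
    theta = (rng.uniform(-5, 5), rng.uniform(-5, 5))
    phi = (rng.uniform(-5, 5), rng.uniform(-5, 5))
    max_gap = max(max_gap, abs(F(theta) - (G(theta) - H(theta))))
    g = subgrad_H(theta)
    lower = H(theta) + g[0] * (phi[0] - theta[0]) + g[1] * (phi[1] - theta[1])
    subgrad_ok = subgrad_ok and H(phi) >= lower - 1e-8
```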

Experiment details

We consider a unit circle in $\mathbb{R}^{2}$ with centre $\mu=(1,1.5)$. We set $\kappa=1$ and $\rho=100$. The initial distribution is $\mu_{0}=\mathcal{N}(0,16I)$. We use $\gamma=0.1$ for both FB Euler and semi FB Euler. We train both algorithms for $40$ iterations using the Adam optimizer with a batch size of $512$; the first $20$ iterations use a learning rate of $5\times 10^{-3}$, while the latter $20$ iterations use $2\times 10^{-3}$.