Non-geodesically-convex optimization in the Wasserstein space

Hoang Phuc Hau Luu, Hanlin Yu, Bernardo Williams, Petrus Mikkola, Marcelo Hartmann, Kai Puolamäki, Arto Klami
Department of Computer Science, University of Helsinki

Abstract

We study a class of optimization problems in the Wasserstein space (the space of probability measures) where the objective function is nonconvex along generalized geodesics. When the regularization term is the negative entropy, the optimization problem becomes a sampling problem: it minimizes the Kullback-Leibler divergence between a probability measure (the optimization variable) and a target probability measure whose logarithmic probability density is a nonconvex function. We derive multiple convergence insights for a novel semi Forward-Backward Euler scheme under several nonconvex (and possibly nonsmooth) regimes. Notably, the semi Forward-Backward Euler is just a slight modification of the Forward-Backward Euler, whose convergence is, to our knowledge, still unknown in our very general non-geodesically-convex setting.

1 Introduction

Sampling and optimization are intertwined. For example, the (overdamped) Langevin dynamics, typically considered a sampling algorithm, can be considered gradient descent optimization where a suitable amount of Gaussian noise is injected at each step. There are also deeper connections. At the limit of infinitesimal stepsize, the law of the Langevin dynamics is governed by the Fokker-Planck equation describing a diffusion over time of probability measures. In the seminal paper [35], Jordan, Kinderlehrer, and Otto reinterpreted the Fokker-Planck equation as the gradient flow of the functional relative entropy, a.k.a. Kullback-Leibler (KL) divergence, in the (Wasserstein) space of finite second-moment probability measures equipped with the Wasserstein metric. The discovery connects the two fields and encourages optimization in the Wasserstein space, even conceptually, as it directly gives insight into the sampling context. Studies in continuous-time dynamics [21, 12, 58, 29] seem natural and enjoy nice theoretical properties without discretization error. Another line of research studies discretization of Wasserstein gradient flow by either quantifying the discretization error between the continuous-time flow and the discrete-time flow [35, 59, 26, 24, 27] or viewing discrete-time flows as iterative optimization schemes in the Wasserstein space [57, 25, 62, 10] where the primary focus is on (geodesically) convex optimization problems.

Nonconvex, nonsmooth optimization is challenging, even in Euclidean space, quoting Rockafellar [55]: “In fact the great watershed in optimization isn’t between linearity and nonlinearity, but convexity and nonconvexity.” The landscape of nonconvex problems is mostly underexplored in the Wasserstein space. In the sampling language, it amounts to sampling from a non-log-concave and possibly non-log-Lipschitz-smooth target distribution. Recently, Balasubramanian et al. [8] advocated the need for a sound theory for non-log-concave sampling and provided some guarantees for the unadjusted Langevin algorithm (ULA) in sampling from log-smooth (Lipschitz/Hölder smooth) densities. These results are preliminary for the ULA (and its stochastic/smoothing variants) with a specific class of densities (smooth). Theoretical understandings of other classes of algorithms and densities are needed.

We approach the subject through the lens of nonconvex optimization in the space of probability distributions and pose discretized Wasserstein gradient flows as iterative minimization algorithms. This allows us to, on the one hand, use and extend tools from classical nonconvex optimization and, on the other hand, derive more connections between sampling and optimization.

We study the following non-geodesically-convex optimization problem defined over the space $\mathcal{P}_{2}(X)$ of probability measures $\mu$ over $X=\mathbb{R}^{d}$ with finite second moment, i.e., $\int{\|x\|^{2}}d\mu(x)<+\infty$ ,

$$\min_{\mu\in\mathcal{P}_{2}(X)}\mathcal{F}(\mu):=\mathcal{E}_{F}(\mu)+\mathscr{H}(\mu):=\mathcal{E}_{G-H}(\mu)+\mathscr{H}(\mu) \tag{1}$$

where $F:X\to\mathbb{R}$ is a nonconvex function which can be represented as a difference of two convex functions $G$ and $H$ , $\mathcal{E}_{F}(\mu):=\int{F(x)}d\mu(x)$ is the potential energy, and $\mathscr{H}:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ plays a role as the regularizer which is assumed to be a convex function along generalized geodesics. Informally, these are curves connecting two points and are “straight” from the view of a third point.

Why difference-of-convex structure?

Nonconvexity lies in the difference-of-convex (DC) structure $F=G-H$, where $G$ and $H$ are called the first and second DC components, respectively. $F$ being nonconvex implies that $\mathcal{E}_{F}$ is non-geodesically-convex in general. First, it is well-known that the class of DC functions is very rich, and DC structures are present everywhere in real-world applications [52, 39, 40, 1, 23, 48, 50]. Weakly convex and Lipschitz smooth (L-smooth) functions are two subclasses of the class of DC functions. Furthermore, any continuous function can be approximated by a sequence of DC functions over a compact, convex domain [7]. Second, the class of DC functions still retains some structural information, making the extension of convex analysis possible [52]. Geometric characteristics of subdifferentials of DC components help define stationarity concepts that are more practical than analysis-flavored concepts like Fréchet or Clarke stationarity [22] where a structure is missing. Such structural information is crucial in the context of classical DC programming [52] and in our analysis in Wasserstein space using tools from optimal transport.

Context

Many problems in machine learning and sampling fall into the spectrum of problem (1). The regularizer $\mathscr{H}$ can be the internal energy [3, Sect. 10.4.3]. Under McCann's condition, the internal energy is convex along generalized geodesics [3, Prop. 9.3.9]. In particular, the negative entropy, $\mathscr{H}(\mu)=\int\log(\mu(x))d\mu(x)$ if $\mu$ is absolutely continuous w.r.t. Lebesgue measure, $+\infty$ otherwise, is a special case of internal energy satisfying McCann's condition. In the latter case, $\mathcal{F}(\mu)=\operatorname{KL}(\mu\|\mu^{*})+\text{const}$ where $\mu^{*}(x)\propto\exp(-F(x))$, and the optimization problem reduces to a sampling problem with log-DC target distribution. In the context of infinitely wide two-layer neural networks and Maximum Mean Discrepancy [43, 6, 21], let $\mu^{*}$ be the optimal distribution over a network's parameters and $k$ be a given kernel; the regularizer is then the interaction energy $\mathscr{H}(\mu)=\int\int{k(x,y)}d\mu(x)d\mu(y)$ and $F(x)=-2\int{k(x,y)}d\mu^{*}(y).$ In general, $\mathscr{H}$ is not convex along generalized geodesics and $F$ is nonconvex but not necessarily DC. However, when the kernel has Lipschitz gradient (as in the case considered in [6]), we can adjust both $\mathscr{H}$ and $F$ as $\mathscr{H}(\mu)=\int\int\left(k(x,y)+\alpha\|x\|^{2}+\alpha\|y\|^{2}\right)d\mu(x)d\mu(y)$ and $F(x)=-2\int{k(x,y)}d\mu^{*}(y)-2\alpha\|x\|^{2}$ for some $\alpha>0$, making $\mathscr{H}$ generalized geodesically convex and $F$ concave (hence DC); see Appx. A.2.

Our idea is to minimize (1) in the space of probability distributions by discretizing the gradient flow of $\mathcal{F}$, leveraging the JKO (Jordan, Kinderlehrer, and Otto) operator (2). In the previous work [62], this has been done with the Forward-Backward (FB) Euler discretization, but without convergence analysis. Recently, Salim et al. [57] studied FB Euler, but their results do not apply here because $F$ is nonconvex and possibly nonsmooth. Further leveraging the DC structure of $F$ and inspired by the classical DC programming literature [52], we subtly modify FB Euler to obtain a scheme named semi FB Euler, for which we can provide a wide range of convergence analyses. A detailed discussion is in Sect. 3.1.

Our contributions

To our knowledge, no prior work studies problem (1) when $F$ is DC. Therefore, most of the derived results in this paper are novel. We propose and analyze the semi FB Euler scheme (4) and provide the following hierarchical set of new insights:

Thm. 1

We show that if $H$ is continuously differentiable, every cluster point of the sequence of distributions $\{\mu_{n}\}_{n\in\mathbb{N}}$ generated by semi FB Euler is a critical point of $\mathcal{F}$. Note that criticality is a notion from the DC programming literature [52] and it is a necessary condition for local optimality; see Sect. 3.3.
Thm. 2

We provide a convergence rate of $O(N^{-1})$ in terms of the Wasserstein (sub)gradient mapping in the general nonsmooth setting. The notion of gradient mapping [30, 34, 47] comes from the context of proximal algorithms in Euclidean space and is applicable to nonconvex programs, where the distance to a global solution is, in general, not possible to work out.
Thm. 3

Under the extra assumption that $H$ is continuously twice differentiable and has bounded Hessian, we provide a convergence rate of $O(N^{-\frac{1}{2}})$ in terms of distance of $0$ to the Fréchet subdifferential of $\mathcal{F}$ . One can think of this as convergence rate to Fréchet stationarity, i.e., if $\mu^{*}$ is a Fréchet stationary point of $\mathcal{F}$ , then, by definition, $0$ is in the Fréchet subdifferential of $\mathcal{F}$ at $\mu^{*}$ . Fréchet stationarity is a relatively sharp necessary condition for local optimality.
Thm. 4, 5

Under the assumptions of Thm. 3 and additionally $\mathcal{F}$ satisfying a Łojasiewicz-type inequality for some Łojasiewicz exponent $\theta\in[0,1)$, we show that $\{\mu_{n}\}_{n\in\mathbb{N}}$ is a Cauchy sequence under the Wasserstein topology, and thanks to the completeness of the Wasserstein space, the whole sequence $\{\mu_{n}\}_{n\in\mathbb{N}}$ converges to some $\mu^{*}$. We show that $\mu^{*}$ is in fact a global minimizer of $\mathcal{F}$. Furthermore, we provide convergence rates of $\mu_{n}\to\mu^{*}$ in three different regimes ($W_{2}$ denotes the Wasserstein metric): (1) if $\theta=0$, $W_{2}(\mu_{n},\mu^{*})$ converges to $0$ after a finite number of steps; (2) if $\theta\in(0,1/2]$, both $\mathcal{F}(\mu_{n})-\mathcal{F}(\mu^{*})$ and $W_{2}(\mu_{n},\mu^{*})$ converge to $0$ exponentially fast; (3) if $\theta\in(1/2,1)$, both $\mathcal{F}(\mu_{n})-\mathcal{F}(\mu^{*})$ and $W_{2}(\mu_{n},\mu^{*})$ converge sublinearly to $0$ with rates $O\left(n^{-\frac{1}{2\theta-1}}\right)$ and $O\left(n^{-\frac{1-\theta}{2\theta-1}}\right)$, respectively. When $\mathscr{H}$ is the negative entropy, $\mathcal{F}(\mu_{n})-\mathcal{F}(\mu^{*})=\operatorname{KL}(\mu_{n}\|\mu^{*})$; therefore, in the sampling context, we provide convergence guarantees in both the Wasserstein distance and the KL divergence. See Sect. 4.3 for additional observations and implications.

2 Preliminaries

2.1 Notations and basic results in measure theory and functional analysis

We denote by $X=\mathbb{R}^{d}$, $\mathcal{B}(X)$ the Borel $\sigma$-algebra over $X$, and $\mathscr{L}^{d}$ the Lebesgue measure on $X$. $\mathcal{P}(X)$ is the set of Borel probability measures on $X$. For $\mu\in\mathcal{P}(X)$, we denote its second-order moment by $\mathfrak{m}_{2}(\mu):=\int_{X}{\|x\|^{2}}d\mu(x)$, where $\mathfrak{m}_{2}(\mu)$ can be infinite. $\mathcal{P}_{2}(X)\subset\mathcal{P}(X)$ denotes the set of probability measures with finite second-order moment. $\mathcal{P}_{2,\operatorname{abs}}(X)\subset\mathcal{P}_{2}(X)$ is the set of measures that are absolutely continuous w.r.t. $\mathscr{L}^{d}$. $\mu$-a.e. stands for almost everywhere w.r.t. $\mu$.

$C^{p}(X),C^{\infty}_{c}(X),C_{b}(X)$ are the classes of $p$-times continuously differentiable functions, infinitely differentiable functions with compact support, and bounded continuous functions, respectively.

From functional analysis [20], for each $p\geq 1$, $L^{p}(X,\mu)$ denotes the Banach space of measurable (where measurable is understood as Borel measurable from now on) functions $f$ such that $\int_{X}{|f(x)|^{p}}d\mu(x)<+\infty$. We shall consider an element of $L^{p}(X,\mu)$ as an equivalence class of functions that agree $\mu$-a.e. on $X$ rather than a single function. The norm of $f\in L^{p}(X,\mu)$ is $\|f\|_{L^{p}(X,\mu)}=(\int_{X}{|f(x)|^{p}}d\mu(x))^{1/p}$. When $p=2$, $L^{2}(X,\mu)$ is a Hilbert space with the inner product $\langle f,g\rangle_{L^{2}(X,\mu)}=\int_{X}{f(x)g(x)}d\mu(x)$, which induces the mentioned norm. These results can be extended to vector-valued functions. In particular, we denote by $L^{2}(X,X,\mu)$ the Hilbert space of $\xi:X\to X$ such that $\|\xi\|\in L^{2}(X,\mu)$, with norm $\|\xi\|_{L^{2}(X,X,\mu)}:=(\int_{X}\|\xi(x)\|^{2}d\mu(x))^{1/2}$.

We say that $f:X\to\mathbb{R}$ has quadratic growth if there exists $a>0$ such that $|f(x)|\leq a(\|x\|^{2}+1)$ for all $x\in X$ . It is clear that if $f$ has quadratic growth and $\mu\in\mathcal{P}_{2}(X)$ , then $f\in L^{1}(X,\mu).$

The pushforward of a measure $\mu\in\mathcal{P}(X)$ through a Borel map $T:X\to\mathbb{R}^{m}$, denoted by $T_{\#}\mu$, is defined by $(T_{\#}\mu)(A):=\mu(T^{-1}(A))$ for every Borel set $A\subset\mathbb{R}^{m}.$
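As a small illustration of this definition, the following sketch pushes a discrete measure through a non-injective map. The atoms, weights, and the map $T(x)=x^{2}$ are illustrative choices, not taken from the paper.

```python
from collections import defaultdict

def pushforward(atoms, weights, T):
    """Push the discrete measure sum_i w_i * delta_{x_i} through the map T.

    Mass carried by atoms with the same image is accumulated, which is
    (T_# mu)(A) = mu(T^{-1}(A)) evaluated on singletons A = {y}.
    """
    image = defaultdict(float)
    for x, w in zip(atoms, weights):
        image[T(x)] += w
    return dict(image)

# Illustrative measure: mu = 0.2*delta_{-1} + 0.3*delta_{1} + 0.5*delta_{2}
atoms = [-1.0, 1.0, 2.0]
weights = [0.2, 0.3, 0.5]

# T(x) = x^2 sends both -1 and 1 to 1, so their masses add up.
nu = pushforward(atoms, weights, lambda x: x * x)
print(nu)
```

Here $T^{-1}(\{1\})=\{-1,1\}$, so the atom at $1$ of $T_{\#}\mu$ collects the mass of both preimages, and the total mass is preserved.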

2.2 Optimal transport [3, 4, 61, 60]

Given $\mu,\nu\in\mathcal{P}(X)$, the principal problem in optimal transport is to find a transport map $T$ pushing $\mu$ to $\nu$, i.e., $T_{\#}\mu=\nu$, in the most cost-efficient way, i.e., minimizing $\|x-T(x)\|^{2}$ on $\mu$-average. Monge's formulation for this problem is $\inf_{T:T_{\#}\mu=\nu}\int_{X}{\|x-T(x)\|^{2}}d\mu(x)$, where the optimal solution, if it exists, is denoted by $T_{\mu}^{\nu}$ and called the optimal (Monge) map. Monge's problem can be ill-posed, e.g., no such $T_{\mu}^{\nu}$ exists when $\mu$ is a Dirac mass and $\nu$ is absolutely continuous [4].

By relaxing Monge's formulation, Kantorovich considers $\min_{\gamma\in\Gamma(\mu,\nu)}\int_{X\times X}\|x-y\|^{2}d\gamma(x,y)$, where $\Gamma(\mu,\nu)$ denotes the set of probability measures over $X\times X$ whose marginals are $\mu$ and $\nu$, i.e., $\gamma\in\Gamma(\mu,\nu)$ iff ${\operatorname{proj}_{1}}_{\#}\gamma=\mu,{\operatorname{proj}_{2}}_{\#}\gamma=\nu$ where $\operatorname{proj}_{1},\operatorname{proj}_{2}$ are the projections onto the first $X$ space and the second $X$ space, respectively. Such $\gamma$ is called a plan. Kantorovich's formulation is well-posed because $\Gamma(\mu,\nu)$ is non-empty (at least $\mu\times\nu\in\Gamma(\mu,\nu)$) and a minimizer exists (see [4, Sect. 2.2]). The set of optimal plans between $\mu$ and $\nu$ is denoted by $\Gamma_{o}(\mu,\nu).$ In terms of random variables, any pair $(X,Y)$ where $X\sim\mu,Y\sim\nu$ is called a coupling of $\mu$ and $\nu$, and it is called an optimal coupling if the joint law of $X$ and $Y$ is in $\Gamma_{o}(\mu,\nu)$.

In $\mathcal{P}_{2}(X)$ , the $\min$ value in Kantorovich’s problem specifies a valid metric referred to as Wasserstein distance, $W_{2}(\mu,\nu)=(\int_{X\times X}\|x-y\|^{2}d\gamma(x,y))^{1/2}$ for some, and thus all, $\gamma\in\Gamma_{o}(\mu,\nu)$ . The metric space $(\mathcal{P}_{2}(X),W_{2})$ is then called the Wasserstein space. In $\mathcal{P}_{2}(X)$ , beside the convergence notion induced by the Wasserstein metric, there is a weaker notion of convergence called narrow convergence: we say a sequence $\{\mu_{n}\}_{n\in\mathbb{N}}\subset\mathcal{P}_{2}(X)$ converges narrowly to $\mu\in\mathcal{P}_{2}(X)$ if $\int_{X}{\phi(x)}d\mu_{n}(x)\to\int_{X}{\phi(x)}d\mu(x)$ for all $\phi\in C_{b}(X).$ Convergence in the Wasserstein metric implies narrow convergence but the converse is not necessarily true. The extra condition to make it true is $\mathfrak{m}_{2}(\mu_{n})\to\mathfrak{m}_{2}(\mu)$ . We denote Wasserstein and narrow convergence by $\xrightarrow{\operatorname{\text{Wass}}}$ and $\xrightarrow{\operatorname{\text{narrow}}}$ , respectively.
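The gap between narrow and Wasserstein convergence can be seen on a standard example, sketched here under our own choice of sequence: $\mu_{n}=(1-\frac{1}{n})\delta_{0}+\frac{1}{n}\delta_{n}$ on the real line. For any $\phi\in C_{b}(X)$ one has $|\int\phi\,d\mu_{n}-\phi(0)|\leq 2\|\phi\|_{\infty}/n$, so $\mu_{n}$ converges narrowly to $\delta_{0}$, yet $\mathfrak{m}_{2}(\mu_{n})=n$ does not converge to $\mathfrak{m}_{2}(\delta_{0})=0$. The helper names below are hypothetical.

```python
import math

def mu_n(n):
    """Atoms and weights of mu_n = (1 - 1/n) * delta_0 + (1/n) * delta_n."""
    return [0.0, float(n)], [1.0 - 1.0 / n, 1.0 / n]

def integrate(phi, atoms, weights):
    """Integral of phi against the discrete measure sum_i w_i * delta_{x_i}."""
    return sum(w * phi(x) for x, w in zip(atoms, weights))

def w2_to_dirac_at_zero(atoms, weights):
    """W_2(mu, delta_0): the only coupling with a Dirac mass is the product
    measure, so W_2^2 equals the second moment of mu."""
    return math.sqrt(integrate(lambda x: x * x, atoms, weights))

for n in (10, 100, 1000):
    atoms, weights = mu_n(n)
    # arctan is one bounded continuous test function; its integral tends to arctan(0) = 0.
    gap = abs(integrate(math.atan, atoms, weights) - math.atan(0.0))
    print(n, gap, w2_to_dirac_at_zero(atoms, weights))
```

The printed gap shrinks like $1/n$ while $W_{2}(\mu_{n},\delta_{0})=\sqrt{n}$ grows, which is consistent with the role of the second-moment condition stated above.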

If $\mu\in\mathcal{P}_{2,\operatorname{abs}}(X),\nu\in\mathcal{P}_{2}(X)$, Monge's formulation is well-posed and the unique ($\mu$-a.e.) solution exists, and in this case, it is safe to talk about (and use) the optimal transport map $T_{\mu}^{\nu}$. Moreover, there exists some convex function $f$ such that $T_{\mu}^{\nu}=\nabla f$ $\mu$-a.e. Kantorovich's problem also has a unique solution $\gamma$ and it is given by $\gamma=(I,T_{\mu}^{\nu})_{\#}\mu$ where $I$ is the identity map. This is known as Brenier's theorem or the polar factorization theorem [18].
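On the real line, gradients of convex functions are exactly the nondecreasing maps, so the optimal coupling is the monotone one. The sketch below uses this for two uniform empirical measures with equally many atoms, where pairing sorted samples is optimal. Empirical measures are not absolutely continuous, so this is only an illustration of the monotone structure, not an instance of the theorem as stated; the sample values are arbitrary.

```python
import math

def w2_empirical_1d(xs, ys):
    """W_2 between two uniform empirical measures on the real line with the
    same number of atoms: pair the sorted samples (monotone rearrangement)."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many atoms")
    xs, ys = sorted(xs), sorted(ys)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

xs = [0.3, -1.2, 2.5, 0.0]
shifted = [x + 3.0 for x in xs]   # image under x -> x + 3, the derivative of x^2/2 + 3x
scaled = [2.0 * x for x in xs]    # image under x -> 2x, the derivative of x^2

print(w2_empirical_1d(xs, shifted))  # cost of the translation
print(w2_empirical_1d(xs, scaled))   # cost of the dilation
```

Both maps are increasing, hence derivatives of convex functions, and the sorted pairing recovers them: the first distance is the shift size and the second is the root mean square of the samples.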

2.3 Subdifferential calculus in the Wasserstein space

Apart from being a metric space, $(\mathcal{P}_{2}(X),W_{2})$ also enjoys some pre-Riemannian structure making subdifferential calculus on it possible. Let us have a picture of a manifold in mind. Firstly, the tangent space [3] of $\mathcal{P}_{2}(X)$ at $\mu$ is $\operatorname{Tan}_{\mu}\mathcal{P}_{2}(X):=\overline{\{\nabla\psi:\psi\in C_{c}^{\infty}(X)\}}^{L^{2}(X,X,\mu)}$, where the closure is w.r.t. the $L^{2}(X,X,\mu)$-topology. Intuitively, for $\psi\in C_{c}^{\infty}(X)$, $I+\epsilon\nabla\psi$ is an optimal transport map if $\epsilon>0$ is small enough [36], so $\nabla\psi$ plays the role of a "tangent vector".

Let $\phi:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$; we denote $\operatorname{dom}(\phi)=\{\mu\in\mathcal{P}_{2}(X):\phi(\mu)<+\infty\}$. Let $\mu\in\operatorname{dom}(\phi)$; we say that a map $\xi\in L^{2}(X,X,\mu)$ belongs to the Fréchet subdifferential [15, 36] $\partial_{F}^{-}\phi(\mu)$ if $\phi(\nu)-\phi(\mu)\geq\sup_{\gamma\in\Gamma_{o}(\mu,\nu)}\int_{X\times X}{\langle\xi(x),y-x\rangle}d\gamma(x,y)+o(W_{2}(\mu,\nu))$ for all $\nu\in\mathcal{P}_{2}(X)$, where the little-o notation means $\lim_{s\to 0}{o(s)/s}=0.$ If $\partial_{F}^{-}\phi(\mu)\neq\emptyset$, we say $\phi$ is Fréchet subdifferentiable at $\mu$. We also denote $\operatorname{dom}(\partial_{F}^{-}\phi)=\{\mu\in\mathcal{P}_{2}(X):\partial_{F}^{-}\phi(\mu)\neq\emptyset\}$.

Similarly, we say that $\xi\in L^{2}(X,X,\mu)$ belongs to the (Fréchet) superdifferential $\partial_{F}^{+}\phi(\mu)$ of $\phi$ at $\mu$ if $-\xi\in\partial_{F}^{-}(-\phi)(\mu)$ . In other words, $\partial_{F}^{-}(-\phi)(\mu)=-\partial_{F}^{+}\phi(\mu).$

We say $\phi$ is Wasserstein differentiable [15, 36] at $\mu\in\operatorname{dom}(\phi)$ if $\partial_{F}^{-}\phi(\mu)\cap\partial_{F}^{+}\phi(\mu)\neq\emptyset$. We call an element of the intersection, denoted by $\nabla_{W}\phi(\mu)$, a Wasserstein gradient of $\phi$ at $\mu$, and it holds that $\phi(\nu)-\phi(\mu)=\int_{X\times X}{\langle\nabla_{W}\phi(\mu)(x),y-x\rangle}d\gamma(x,y)+o(W_{2}(\mu,\nu))$ for all $\nu\in\mathcal{P}_{2}(X)$ and any $\gamma\in\Gamma_{o}(\mu,\nu).$ The Wasserstein gradient is not unique in general, but its parallel component in $\operatorname{Tan}_{\mu}\mathcal{P}_{2}(X)$ is unique, and this parallel component is again a valid Wasserstein gradient as the orthogonal component plays no role in the above definitions, i.e., if $\xi^{\perp}\in\operatorname{Tan}_{\mu}\mathcal{P}_{2}(X)^{\perp}$, it holds that $\int_{X\times X}\langle\xi^{\perp}(x),y-x\rangle d\gamma(x,y)=0$ for any $\nu\in\mathcal{P}_{2}(X)$ and $\gamma\in\Gamma_{o}(\mu,\nu)$ [36, Prop. 2.5]. We may refer to this parallel component as the unique Wasserstein gradient of $\phi$ at $\mu$.

2.4 Optimization in the Wasserstein space

A function $\phi:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ is called proper if $\operatorname{dom}(\phi)\neq\emptyset$ , while it is called lower semicontinuous (l.s.c) if for any sequence $\mu_{n}\xrightarrow{\operatorname{\text{Wass}}}\mu$ , it holds $\liminf_{n}\phi(\mu_{n})\geq\phi(\mu)$ .

We next recall (a simplified version of) generalized geodesic convexity.

Definition 1.

[57] Let $\phi:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$. We say $\phi$ is convex along generalized geodesics if $\forall\mu,\pi\in\mathcal{P}_{2}(X)$, $\forall\nu\in\mathcal{P}_{2,\operatorname{abs}}(X)$, $\phi((tT_{\nu}^{\mu}+(1-t)T_{\nu}^{\pi})_{\#}\nu)\leq t\phi(\mu)+(1-t)\phi(\pi)$, $\forall t\in[0,1]$.

The curve $t\mapsto(tT_{\nu}^{\mu}+(1-t)T_{\nu}^{\pi})_{\#}\nu$ (called a generalized geodesic) interpolates from $\pi$ to $\mu$ as $t$ runs from $0$ to $1$ . The definition says that $\phi$ is convex along these curves. If $\mu\in\mathcal{P}_{2,\operatorname{abs}}(X)$ and $\nu=\mu$ , the curve is a geodesic in $(\mathcal{P}_{2}(X),W_{2})$ . If the definition is relaxed to the class of geodesics only, we say that $\phi$ is convex along geodesics.
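To make the definition concrete, here is a minimal sketch restricted to one-dimensional Gaussians with base measure $\nu=\mathcal{N}(0,1)$, an illustrative setting of our own. There $T_{\nu}^{\mu}(x)=m_{\mu}+s_{\mu}x$ for $\mu=\mathcal{N}(m_{\mu},s_{\mu}^{2})$, so the generalized geodesic stays Gaussian with linearly interpolated mean and standard deviation, and the negative entropy of $\mathcal{N}(m,s^{2})$ is $-\log s-\frac{1}{2}\log(2\pi e)$. The code checks the convexity inequality on a grid of $t$.

```python
import math

def neg_entropy_gaussian(sigma):
    """Negative entropy of N(m, sigma^2): -log(sigma) - 0.5*log(2*pi*e)."""
    return -math.log(sigma) - 0.5 * math.log(2.0 * math.pi * math.e)

def generalized_geodesic(t, mu, pi):
    """Point at time t on the generalized geodesic with base nu = N(0, 1).

    mu and pi are (mean, standard deviation) pairs. The interpolated map
    t*T_nu^mu + (1-t)*T_nu^pi is affine, so it pushes nu to a Gaussian.
    """
    (m_mu, s_mu), (m_pi, s_pi) = mu, pi
    return (t * m_mu + (1 - t) * m_pi, t * s_mu + (1 - t) * s_pi)

mu, pi = (1.0, 0.2), (-2.0, 3.0)  # illustrative endpoints
for k in range(11):
    t = k / 10
    _, s_t = generalized_geodesic(t, mu, pi)
    lhs = neg_entropy_gaussian(s_t)
    rhs = t * neg_entropy_gaussian(mu[1]) + (1 - t) * neg_entropy_gaussian(pi[1])
    assert lhs <= rhs + 1e-12  # convexity along the curve
```

In this restricted family the inequality reduces to the concavity of the logarithm; the general statement for the negative entropy is the one cited from [3].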

An important characterization of Fréchet subdifferential of a geodesically convex function is that we can drop the little-o notation in its definition in Sect. 2.3 [3, Sect 10.1.1]. As a convention, for a geodesically convex function $\phi$ , the Fréchet subdifferential $\partial_{F}^{-}$ will be simply written as $\partial$ .

First-order optimality conditions

Let $\phi:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ be a proper function. $\mu^{*}\in\mathcal{P}_{2}(X)$ is a global minimizer of $\phi$ if $\phi(\mu^{*})\leq\phi(\mu),\forall\mu\in\mathcal{P}_{2}(X).$ For local optimality, we shall use the Wasserstein metric to define neighborhoods. $\mu^{*}\in\mathcal{P}_{2}(X)$ is a local minimizer if there exists $r>0$ such that $\phi(\mu^{*})\leq\phi(\mu)$ for all $\mu:W_{2}(\mu,\mu^{*})<r.$ We shall denote $B(\mu^{*},r):=\{\mu\in\mathcal{P}_{2}(X):W_{2}(\mu,\mu^{*})<r\}$ the (open) Wasserstein ball centered at $\mu^{*}$ with radius $r$ . If we replace $<$ by $\leq$ we obtain the notion of a closed Wasserstein ball.

We call $\mu^{*}$ a Fréchet stationary point of $\phi$ if $0\in\partial_{F}^{-}\phi(\mu^{*}).$ Fréchet stationarity is a necessary condition for local optimality. In other words, if $\mu^{*}$ is a local minimizer, it is a Fréchet stationary point (Lem. 5 in Appendix). In addition, if $\phi$ is Wasserstein differentiable at $\mu^{*}$ , $\nabla_{W}\phi(\mu^{*})(x)=0$ $\mu^{*}$ -a.e. [36]. When $\phi$ is geodesically convex, Fréchet stationarity is a sufficient condition for global optimality (Lem. 6 in Appendix).

3 Semi Forward-Backward Euler for difference-of-convex structures

3.1 Wasserstein gradient flows: different types of discretizations

To present the idea of minimizing $\mathcal{F}$ by using discretizations of its gradient flow in a neat way, we first assume for a moment that $F$ is infinitely differentiable and $\mathscr{H}$ is the negative entropy.

We wish to minimize (1) in the space of probability distributions. A natural idea is to apply discretizations of the gradient flow of $\mathcal{F}$ , where the gradient flow is defined (under some technical assumptions [35]) as the limit $\gamma\to 0^{+}$ of the following scheme with some simple time-interpolation

$$\mu_{n+1}\in\operatorname{JKO}_{\gamma\mathcal{F}}(\mu_{n}),\text{ where }\operatorname{JKO}_{\gamma\mathcal{F}}(\mu):=\operatorname*{arg\,min}_{\nu\in\mathcal{P}_{2}(X)}\mathcal{F}(\nu)+\dfrac{1}{2\gamma}W_{2}^{2}(\mu,\nu). \tag{2}$$

Straightforwardly, given a fixed $\gamma>0$, (2) gives back a discretization for this flow known as Backward Euler. On the other hand, if $\mathcal{F}$ is Wasserstein differentiable (Sect. 2.3), the Forward Euler discretization reads [62] $\mu_{n+1}=(I-\gamma\nabla_{W}\mathcal{F}(\mu_{n}))_{\#}\mu_{n}$, which is reinterpreted as doing gradient descent in the space of probability distributions. These are optimization methods that work directly on the objective function $\mathcal{F}$ itself. However, the composite structure of $\mathcal{F}$ (a sum of several terms) can also be exploited. One such scheme is the unadjusted Langevin algorithm (ULA), which first takes a gradient step w.r.t. the potential part, then follows the heat flow corresponding to the entropy part [62]: $\nu_{n+1}=(I-\gamma\nabla F)_{\#}\mu_{n},\text{ and }\mu_{n+1}=\mathcal{N}(0,2\gamma I)*\nu_{n+1}$, where $*$ is the convolution. This is the ULA "viewed" in the space of distributions (Eulerian approach); a more familiar and equivalent form of the ULA from the particle perspective (Lagrangian approach) reads $x_{n+1}=x_{n}-\gamma\nabla F(x_{n})+\sqrt{2\gamma}z_{n}$ where $z_{n}\sim\mathcal{N}(0,I)$. The ULA is known to be asymptotically biased even for a Gaussian target measure (Ornstein-Uhlenbeck process). To correct this bias, the Metropolis-Hastings accept-reject step [54] is sometimes introduced. The Metropolis-Hastings algorithm [44, 32] is a much more general framework that works with almost any proposal (e.g., a random walk), and its convergence analysis is based on the Markov kernel satisfying the detailed balance condition. This convergence framework is different from what is considered in this work: we are more interested in the underlying dynamics of the chain, and the Metropolis-Hastings algorithm is outside our scope.
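The asymptotic bias of the ULA for a Gaussian target can be checked by hand in one dimension. The sketch below assumes the illustrative potential $F(x)=x^{2}/2$ (target $\mathcal{N}(0,1)$): one ULA step maps $\mathcal{N}(0,v)$ to $\mathcal{N}(0,(1-\gamma)^{2}v+2\gamma)$, so it suffices to iterate this variance recursion, whose fixed point is $1/(1-\gamma/2)$ for $0<\gamma<2$. The function name and default values are hypothetical.

```python
def ula_variance_limit(gamma, v0=4.0, n_steps=10_000):
    """Iterate the variance recursion of ULA for F(x) = x^2 / 2.

    One ULA step x <- (1 - gamma) * x + sqrt(2 * gamma) * z with z ~ N(0, 1)
    maps N(0, v) to N(0, (1 - gamma)^2 * v + 2 * gamma).
    """
    v = v0
    for _ in range(n_steps):
        v = (1.0 - gamma) ** 2 * v + 2.0 * gamma
    return v

for gamma in (0.5, 0.1, 0.01):
    # Compare the iterated variance with the closed-form fixed point.
    print(gamma, ula_variance_limit(gamma), 1.0 / (1.0 - gamma / 2.0))
```

The limiting variance exceeds the target variance $1$ for every fixed $\gamma>0$ and only approaches it as $\gamma\to 0^{+}$.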

In optimization, for composite structure, Forward-Backward (FB) Euler and its variants are methods of choice [51, 9]. The corresponding FB Euler for $\mathcal{F}$ will take the gradient step (forward) according to the potential, and JKO step (backward) w.r.t. the negative entropy

$$\text{(FB Euler)}\quad\nu_{n+1}=(I-\gamma\nabla F)_{\#}\mu_{n},\text{ and }\mu_{n+1}\in\operatorname{JKO}_{\gamma\mathscr{H}}(\nu_{n+1}). \tag{3}$$

This scheme appears in [62] without convergence analysis; later, [57] derives non-asymptotic convergence guarantees under the assumption that $F$ is convex and Lipschitz smooth.

In this work, as $F$ is nonconvex and nonsmooth, the theory in [57] does not apply, and the convergence (if any) of (3) remains mysterious. The DC structure of $F$ can be further exploited. In DC programming [52], the forward step should be applied to the concave part, while the backward step should be applied to the convex part. We hence propose the following semi FB Euler

$$\text{(semi FB Euler)}\quad\nu_{n+1}=(I+\gamma\nabla H)_{\#}\mu_{n},\text{ and }\mu_{n+1}\in\operatorname{JKO}_{\gamma(\mathscr{H}+\mathcal{E}_{G})}(\nu_{n+1}) \tag{4}$$

for which we can provide convergence guarantees. Apparently, the difference between semi FB Euler and FB Euler is subtle: while FB Euler does forward on $\mathcal{E}_{G-H}=\mathcal{E}_{G}-\mathcal{E}_{H}$ and backward on $\mathscr{H}$ , semi FB Euler does forward on $-\mathcal{E}_{H}$ and backward on $\mathscr{H}+\mathcal{E}_{G}$ ; recall that $\mathcal{F}=\mathcal{E}_{G}-\mathcal{E}_{H}+\mathscr{H}$ .

Theoretically, semi FB Euler enjoys some advantages compared to FB Euler. Thanks to Brenier's theorem (Sect. 2.2), the pushing step in semi FB Euler is optimal since $H$ is convex; meanwhile, the pushing step in FB Euler is not optimal in general, and the corresponding optimal Monge map is not identifiable in general. The convergence of FB Euler is still an open question, even when $F$ is (DC) differentiable. In contrast, we can provide a solid theoretical guarantee for semi FB Euler, especially when $H$ is differentiable. Additionally, we also offer convergence guarantees when $H$ is nonsmooth.
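To see scheme (4) in action, consider a toy one-dimensional setting that is our own illustration and not an experiment from the paper: $G(x)=ax^{2}/2$, $H(x)=bx^{2}/2$ with $a>b>0$, and $\mathscr{H}$ the negative entropy, so the target is $\mathcal{N}(0,1/(a-b))$. Starting from a centered Gaussian $\mathcal{N}(0,\sigma^{2})$, the forward step is a linear scaling. For the backward step we assume the JKO minimizer stays in the centered Gaussian family, which reduces it to a scalar minimization over the standard deviation with $W_{2}^{2}(\mathcal{N}(0,s^{2}),\mathcal{N}(0,\tau^{2}))=(s-\tau)^{2}$.

```python
import math

def semi_fb_euler_step(sigma, gamma, a, b):
    """One semi FB Euler step on mu_n = N(0, sigma^2) in one dimension.

    Illustrative setting: G(x) = a*x^2/2, H(x) = b*x^2/2 with a > b > 0 and
    negative-entropy regularizer, so the target is N(0, 1/(a - b)).
    """
    # Forward step: I + gamma * H' is the linear map x -> (1 + gamma*b) * x.
    tau = (1.0 + gamma * b) * sigma
    # Backward (JKO) step, assumed to stay within centered Gaussians N(0, s^2):
    # minimize a*s^2/2 - log(s) + (s - tau)^2 / (2*gamma) over s > 0.
    # Stationarity gives (1 + gamma*a)*s^2 - tau*s - gamma = 0; keep the positive root.
    d = 1.0 + gamma * a
    return (tau + math.sqrt(tau * tau + 4.0 * gamma * d)) / (2.0 * d)

def run(sigma0, gamma, a, b, n_steps):
    """Iterate the scheme and return the final standard deviation."""
    sigma = sigma0
    for _ in range(n_steps):
        sigma = semi_fb_euler_step(sigma, gamma, a, b)
    return sigma

a, b, gamma = 2.0, 1.0, 0.1
sigma_final = run(sigma0=3.0, gamma=gamma, a=a, b=b, n_steps=500)
print(sigma_final ** 2, 1.0 / (a - b))  # iterate variance vs. target variance
```

In this toy case the fixed point of the iteration satisfies $\gamma(a-b)\sigma^{2}=\gamma$, i.e., $\sigma^{2}=1/(a-b)$ for any stepsize, so the scheme has no stepsize-dependent bias here. This is an observation about the example only, not a claim about the general setting.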

3.2 Problem setting

Our goal is to minimize the non-geodesically-convex functional $\mathcal{F}(\mu)=\mathcal{E}_{F}(\mu)+\mathscr{H}(\mu)$ over $\mathcal{P}_{2}(X)$ , where $F=G-H$ is a DC function. We make Assumption 1 throughout the paper:

Assumption 1.

(i)

The objective function $\mathcal{F}$ is bounded below.
(ii)

$G,H:X\to\mathbb{R}$ are convex functions and have quadratic growth.
(iii)

$\mathscr{H}:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ is proper, l.s.c, and convex along generalized geodesics in $(\mathcal{P}_{2}(X),W_{2})$, and $\operatorname{dom}(\mathscr{H})\subset\mathcal{P}_{2,\operatorname{abs}}(X).$
(iv)

There exists $\gamma_{0}>0$ such that $\forall\gamma\in(0,\gamma_{0})$ , $\operatorname{JKO}_{\gamma(\mathcal{E}_{G}+\mathscr{H})}(\mu)\neq\emptyset$ for every $\mu\in\mathcal{P}_{2}(X).$

Note that Assumption 1(iv) is a commonly-used assumption to simplify technical complication when working with the JKO operator [3, 15, 57]. Assumption 1(ii) implies $\mathcal{E}_{G}$ and $\mathcal{E}_{H}$ are continuous w.r.t. Wasserstein topology [2, Prop. 2.4] ( $G,H$ are continuous [46, Cor. 2.27] and have quadratic growth).

We only make Assumption 2 in the asymptotic convergence analysis (Thm. 1).

Assumption 2 (Compactness).

Every sublevel set of $\mathcal{F}$ , $S_{\lambda}:=\{\mu\in\mathcal{P}_{2}(X):\mathcal{F}(\mu)\leq\lambda\}$ , is compact with respect to the Wasserstein topology.

In Euclidean space, compactness of sublevel sets of $f$ is usually enforced via coercivity assumption: $f(x)\to+\infty$ whenever $\|x\|\to+\infty$ , which holds for a wide class of functions to be minimized. A striking difference in the Wasserstein space is that closed Wasserstein balls are not compact in the Wasserstein topology (only compact under narrow topology) [36, Prop. 4.2], making coercivity not sufficient to induce (Wasserstein) compactness. Assumption 2 is meant to simplify these difficulties.

3.3 Optimality characterizations

First, it follows from Assumption 1(iii) that $\operatorname{dom}(\mathcal{F})\subset\mathcal{P}_{2,\operatorname{abs}}(X).$ By analogy to DC programming in Euclidean space, we call $\mu^{*}\in\operatorname{dom}(\mathcal{F})$ a critical point of $\mathcal{F}=\mathscr{H}+\mathcal{E}_{G}-\mathcal{E}_{H}$ if $\partial(\mathscr{H}+\mathcal{E}_{G})(\mu^{*})\cap\partial\mathcal{E}_{H}(\mu^{*})\neq\emptyset.$ Criticality is a necessary condition for local optimality (Lem. 7). Moreover, if either $\mathscr{H}+\mathcal{E}_{G}$ or $\mathcal{E}_{H}$ is Wasserstein differentiable at $\mu^{*}$, criticality becomes Fréchet stationarity (Lem. 8).

3.4 Semi FB Euler: a general setting

In this work, we allow $H$ to be non-differentiable, meaning that $\partial H$ (convex subdifferential [46]) contains multiple elements in general. We first pick a selector $S$ of $\partial H$ , i.e., $S:X\to X$ , such that $S(x)\in\partial H(x)$ . By the axiom of choice (Zermelo, 1904, see, e.g., [33]), such selection always exists. However, an arbitrary selector can behave badly, e.g., not Borel measurable. We shall first restrict ourselves to the class of Borel measurable selectors (see Appx. A.1 for an existence discussion).

Assumption 3 (Measurability).

The selector $S$ is Borel measurable.
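A simple instance, chosen here for illustration, is $H(x)=|x|$ on the real line, whose subdifferential is $\{\operatorname{sign}(x)\}$ for $x\neq 0$ and $[-1,1]$ at $0$. The sign function with the value $0$ at the origin is a Borel measurable selector. The sketch checks the subgradient inequality on a grid and that $I+\gamma S$ is strictly increasing.

```python
def H(x):
    """Nonsmooth convex second DC component used for illustration."""
    return abs(x)

def S(x):
    """A Borel measurable selector of the subdifferential of |x|:
    sign(x) away from 0, and the choice 0 (an element of [-1, 1]) at x = 0."""
    return (x > 0) - (x < 0)

def push(x, gamma):
    """The map I + gamma * S used in the push-forward step."""
    return x + gamma * S(x)

grid = [k / 10 for k in range(-30, 31)]

# Subgradient inequality H(y) >= H(x) + S(x) * (y - x) on the grid.
assert all(H(y) >= H(x) + S(x) * (y - x) - 1e-12 for x in grid for y in grid)

# I + gamma * S is strictly increasing: it selects from the subdifferential
# of the strongly convex function x^2/2 + gamma*|x|.
images = [push(x, 0.5) for x in grid]
assert all(u < v for u, v in zip(images, images[1:]))
```

Any other value in $[-1,1]$ at the origin would give a different but equally valid measurable selector; the push-forward of an absolutely continuous measure is unaffected because a single point has Lebesgue measure zero.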

We recall the semi FB scheme (4) but for nonsmooth $F$ as follows: start with an initial distribution $\mu_{0}\in\mathcal{P}_{2,\operatorname{abs}}(X)$ , given a discretization stepsize $0<\gamma<\gamma_{0}$ , we repeat the following two steps:

$$\nu_{n+1}=(I+\gamma S)_{\#}\mu_{n}\quad\triangleleft\text{ push forward step;}\qquad\mu_{n+1}=\operatorname{JKO}_{\gamma(\mathcal{E}_{G}+\mathscr{H})}(\nu_{n+1})\quad\triangleleft\text{ JKO step}.$$

Well-definedness and properties: Given $\mu_{n}\in\mathcal{P}_{2}(X)$, it follows from Lem. 4 that $\nu_{n+1}\in\mathcal{P}_{2}(X)$. The two generated sequences are then in $\mathcal{P}_{2}(X)$. Moreover, it follows from Assumption 1 that $\{\mu_{n}\}_{n\in\mathbb{N}}$ is in $\mathcal{P}_{2,\operatorname{abs}}(X)$, and so is $\{\nu_{n}\}_{n\in\mathbb{N}}$ by Lem. 9, noting that $I+\gamma S$ is a subgradient selection of the strongly convex function $x\mapsto(1/2)\|x\|^{2}+\gamma H(x)$.

4 Convergence analysis

4.1 Asymptotic analysis

Lemma 1 (Descent lemma).

Under Assumptions 1 and 3, let $\{\mu_{n}\}_{n\in\mathbb{N}}$ be the sequence of distributions produced by semi FB Euler starting from some $\mu_{0}\in\mathcal{P}_{2,\operatorname{abs}}(X)$ with $\gamma<\gamma_{0}$. Then it holds that $\mathcal{F}(\mu_{n+1})\leq\mathcal{F}(\mu_{n})-\frac{1}{\gamma}\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x),\quad\forall n\in\mathbb{N}$.

Lem. 1 shows that the objective does not increase along semi FB Euler’s iterates. Proof of Lem. 1 is in Appx. A.3. By using Lem. 1, we establish asymptotic convergence for semi FB Euler as follows.

For the asymptotic convergence analysis, we need the following assumption on the second DC component $H$ .

Assumption 4.

$H$ is continuously differentiable.

Theorem 1 (Asymptotic convergence).

Under Assumptions 1, 2, 4, let $\{\mu_{n}\}_{n\in\mathbb{N}}$ and $\{\nu_{n}\}_{n\in\mathbb{N}}$ be sequences produced by semi FB Euler starting from some $\mu_{0}\in\mathcal{P}_{2,\operatorname{abs}}(X)$ with $\gamma<\gamma_{0}$ . Then,

(i)

$\{\mu_{n}\}_{n\in\mathbb{N}}$ has a cluster point.
(ii)

If $\sup_{n\in\mathbb{N}}\mathscr{H}(\nu_{n})<+\infty$ , every cluster point of $\{\mu_{n}\}_{n\in\mathbb{N}}$ is a critical point of $\mathcal{F}$ .

Proof of Thm. 1 is in Appx. A.4. Thm. 1 does not ensure convergence of the whole sequence $\{\mu_{n}\}_{n\in\mathbb{N}}$ ; rather, it guarantees subsequential convergence to critical points of $\mathcal{F}$ .

4.2 Non-asymptotic analysis

To quantify how fast the algorithm converges, we need a convergence measure. For proximal-type algorithms in Euclidean space, the gradient mapping $\mathcal{G}_{\gamma}(x_{n})$ is commonly used (see, e.g., [30, 47] and [34, Eq. (5)]), and one measures the rate at which $\|\mathcal{G}_{\gamma}(x_{n})\|^{2}\to 0$ . By analogy with the Euclidean case, we define the Wasserstein (sub)gradient mapping as $\mathcal{G}_{\gamma}(\mu):=\frac{1}{\gamma}\left(I-T_{\mu}^{\operatorname{JKO}_{\gamma(\mathcal{E}_{G}+\mathscr{H})}((I+\gamma S)_{\#}\mu)}\right)$ , and we measure the rate at which $\|\mathcal{G}_{\gamma}(\mu_{n})\|^{2}_{L^{2}(X,X,\mu_{n})}\to 0$ .

Theorem 2 (Convergence rate: Wasserstein (sub)gradient mapping).

Under Assumptions 1, 3, let $\{\mu_{n}\}_{n\in\mathbb{N}}$ be the sequence of distributions produced by semi FB Euler starting from some $\mu_{0}\in\mathcal{P}_{2,\operatorname{abs}}(X)$ with $\gamma<\gamma_{0}$ . Then it holds $\min_{n=\overline{1,N}}\|\mathcal{G}_{\gamma}(\mu_{n})\|^{2}_{L^{2}(X,X,\mu_{n})}=O(N^{-1})$ .

Proof of Thm. 2 is in Appx. A.5. This theorem holds without requiring $G$ and $H$ to be differentiable.
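As a rough numerical illustration of this rate, the sketch below evaluates the squared norm of the (sub)gradient mapping along a one-dimensional particle toy. It rests on assumptions that are not part of the theorem: the regularizer is set to zero so that the JKO step becomes a proximal push-forward, the potential is the hypothetical DC pair $G(x)=x^{2}$ , $H(x)=2|x|$ , and the iterates are indexed from $n=0$ . In this toy, telescoping the per-particle descent inequality gives $\min_{n<N}\|\mathcal{G}_{\gamma}(\mu_{n})\|^{2}\leq(\mathcal{E}_{F}(\mu_{0})-F^{*})/(\gamma N)$ with $F^{*}=-1$ .

```python
import numpy as np

def step(x, gamma):
    """Toy semi FB step for G(x) = x^2, H(x) = 2|x|, ASSUMING a zero regularizer."""
    s = np.where(x >= 0.0, 2.0, -2.0)             # selector of the subdifferential of H
    return (x + gamma * s) / (1.0 + 2.0 * gamma)  # prox of gamma*G applied after the forward step

def gradient_mapping_sq_norms(x0, gamma, n_steps):
    """Return ||G_gamma(mu_n)||^2 in L^2(mu_n) for n = 0, ..., n_steps - 1.

    The map x -> step(x, gamma) is nondecreasing, so in one dimension it is
    the optimal transport map from mu_n to mu_{n+1}.
    """
    x = np.asarray(x0, dtype=float)
    norms = []
    for _ in range(n_steps):
        x_next = step(x, gamma)
        norms.append(np.mean(((x - x_next) / gamma) ** 2))
        x = x_next
    return np.array(norms)

def potential_energy(x):
    """E_F(mu) for the empirical measure of the particles, F(x) = x^2 - 2|x|."""
    return np.mean(x ** 2 - 2.0 * np.abs(x))
```

The running minimum of the returned array can be compared against the $1/N$ envelope above; the toy actually decays much faster because its potential is strongly convex on each half-line.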

Next, if $H$ is twice differentiable with uniformly bounded Hessian, we can derive a stronger convergence guarantee based on Fréchet stationarity (see Sect. 2.4). Specifically, we evaluate the rate at which $\operatorname{dist}{(0,\partial_{F}^{-}\mathcal{F}(\mu_{n}))}:=\inf_{\xi\in\partial^{-}_{F}\mathcal{F}(\mu_{n})}\|\xi\|_{L^{2}(X,X,\mu_{n})}\to 0$ .

Assumption 5.

$H\in C^{2}(X)$ whose Hessian is bounded uniformly ( $H$ is then $L_{H}$ -smooth).

Theorem 3 (Convergence rate: Fréchet subdifferentials).

Under Assumptions 1, 5, let $\{\mu_{n}\}_{n\in\mathbb{N}}$ be the sequence of distributions produced by semi FB Euler starting from some $\mu_{0}\in\mathcal{P}_{2,\operatorname{abs}}(X)$ with $\gamma<\gamma_{0}$ , then $\min_{n=\overline{1,N}}{\operatorname{dist}{(0,\partial^{-}_{F}\mathcal{F}(\mu_{n}))}}=O\left(N^{-\frac{1}{2}}\right).$

Proof of Thm. 3 is in Appx. A.6.

4.3 Fast convergence under isoperimetry and beyond

Fast convergence can be obtained under isoperimetry, e.g., the log-Sobolev inequality (LSI). There are known connections between the LSI in sampling and the Łojasiewicz condition in optimization. Since our objective function is not Wasserstein differentiable in general, the Łojasiewicz condition is the appropriate tool. In nonconvex optimization in Euclidean space, analytic and subanalytic functions form a large class satisfying the Łojasiewicz condition [37, 14]. Subanalytic DC programs are studied in [38]. In the infinite-dimensional setting of the Wasserstein space, the Łojasiewicz condition should be regarded as a family of functional inequalities [12].

Assumption 6 (Łojasiewicz condition in the Wasserstein space).

Assume that $\mathcal{F}^{*}$ is the optimal value of $\mathcal{F}$ , and assume there exist $r_{0}\in(\mathcal{F}^{*},+\infty]$ , $\theta\in[0,1)$ , and $c>0$ such that for all $\mu\in\mathcal{P}_{2}(X)$ , $\mathcal{F}(\mu)-\mathcal{F}^{*}<r_{0}\Rightarrow c\left(\mathcal{F}(\mu)-\mathcal{F}^{*}\right)^{\theta}\leq\|\xi\|_{L^{2}(X,X,\mu)},\quad\forall\xi\in\partial_{F}^{-}\mathcal{F}(\mu)$ , where the convention $0^{0}=0$ is used. $\theta\in[0,1)$ is called the Łojasiewicz exponent of $\mathcal{F}$ at optimality.

Remark 1.

If $\mathscr{H}$ is the negative entropy and $F\in C^{2}(X)$ has uniformly bounded Hessian, then $\mathcal{F}$ is Wasserstein differentiable at any $\mu\in\mathcal{P}_{2,\operatorname{abs}}(X)$ and [36, Prop. 2.12, E.g. 2.3] $\nabla_{W}\mathcal{F}(\mu)=\frac{\nabla\mu}{\mu}+\nabla F$ , where, by abuse of notation, the probability density function of $\mu$ is still denoted by $\mu$ . We have $\|\nabla_{W}\mathcal{F}(\mu)\|^{2}_{L^{2}(X,X,\mu)}=\int{\left\|\frac{\nabla\mu(x)}{\mu(x)}+\nabla F(x)\right\|^{2}}d\mu(x)=\int\mu(x)\left\|\nabla\log\frac{\mu(x)}{\mu^{*}(x)}\right\|^{2}dx$ , where $\mu^{*}\propto\exp(-F)$ . On the other hand, $\mathcal{F}(\mu)-\mathcal{F}^{*}=\operatorname{KL}(\mu\|\mu^{*})$ . The log-Sobolev inequality with parameter $\alpha>0$ reads [49] $\operatorname{KL}(\mu\|\mu^{*})\leq\frac{1}{2\alpha}\operatorname{FI}(\mu\|\mu^{*}):=\frac{1}{2\alpha}\int\mu(x)\left\|\nabla\log\frac{\mu(x)}{\mu^{*}(x)}\right\|^{2}dx$ , where $\operatorname{FI}(\mu\|\mu^{*})$ is the relative Fisher information of $\mu$ with respect to $\mu^{*}$ . Therefore, the log-Sobolev inequality is a special case of the Łojasiewicz condition with $r_{0}=+\infty,c=\sqrt{2\alpha}$ , and $\theta=1/2.$
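The correspondence in this remark can be checked by hand in a case where everything is available in closed form. The sketch below assumes a one-dimensional Gaussian target $\mu^{*}=N(0,\sigma^{2})$ , which satisfies the log-Sobolev inequality with $\alpha=1/\sigma^{2}$ , and a Gaussian $\mu=N(m,s^{2})$ ; the function names and the choice of Gaussians are illustrative and not from the paper.

```python
import math

def kl_gauss(m, s, sigma):
    """KL( N(m, s^2) || N(0, sigma^2) ) in closed form."""
    return math.log(sigma / s) + (s ** 2 + m ** 2) / (2.0 * sigma ** 2) - 0.5

def fisher_gauss(m, s, sigma):
    """Relative Fisher information FI( N(m, s^2) || N(0, sigma^2) ) in closed form."""
    return (s / sigma ** 2 - 1.0 / s) ** 2 + m ** 2 / sigma ** 4

def lojasiewicz_gap(m, s, sigma):
    """Return ||grad_W F|| - c * (F - F*)^theta with c = sqrt(2 * alpha), theta = 1/2.

    Here alpha = 1 / sigma^2 is the log-Sobolev constant of N(0, sigma^2);
    the gap is nonnegative exactly when the Lojasiewicz inequality holds.
    """
    alpha = 1.0 / sigma ** 2
    c = math.sqrt(2.0 * alpha)
    kl = max(kl_gauss(m, s, sigma), 0.0)  # guard against tiny negative round-off
    return math.sqrt(fisher_gauss(m, s, sigma)) - c * math.sqrt(kl)
```

For these Gaussians the inequality reduces to $\log u\leq u-1$ with $u=\sigma^{2}/s^{2}$ , so the gap is nonnegative for every choice of $(m,s,\sigma)$ and vanishes when $s=\sigma$ .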

Theorem 4.

Suppose Assumptions 1, 5, and 6 hold, the latter with parameters $(r_{0},c,\theta)$ . Let $\{\mu_{n}\}_{n\in\mathbb{N}}$ be the sequence of distributions produced by semi FB Euler starting from a sufficiently warm initialization $\mu_{0}\in\mathcal{P}_{2,\operatorname{abs}}(X)$ such that $\mathcal{F}(\mu_{0})<r_{0}$ and with stepsize $\gamma<\gamma_{0}$ . Then

(i)

if $\theta=0$ , $\mathcal{F}(\mu_{n})-\mathcal{F}^{*}$ converges to $0$ in a finite number of steps;
(ii)

if $\theta\in(0,1/2]$ , $\mathcal{F}(\mu_{n})-\mathcal{F}^{*}=O\left(\left(\frac{M}{M+1}\right)^{n}\right)\text{ where }M=\frac{{2(\gamma^{2}L_{H}^{2}+1)}}{c^{2}{\gamma}};$
(iii)

if $\theta\in(1/2,1)$ , $\mathcal{F}(\mu_{n})-\mathcal{F}^{*}$ converges sublinearly to $0$ , i.e., $\mathcal{F}(\mu_{n})-\mathcal{F}^{*}=O\left(n^{-\frac{1}{2\theta-1}}\right).$

Proof of Thm. 4 is in Appx. A.7.
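To get a feel for the rates in cases (ii) and (iii), the following small helper evaluates the expressions stated in Thm. 4. It only computes the quantities appearing inside the $O(\cdot)$ bounds, so hidden constants are not captured; the function names and the numerical values in the usage note are hypothetical.

```python
def linear_rate_factor(gamma, L_H, c):
    """Contraction factor M / (M + 1) of case (ii), with M = 2 (gamma^2 L_H^2 + 1) / (c^2 gamma)."""
    M = 2.0 * (gamma ** 2 * L_H ** 2 + 1.0) / (c ** 2 * gamma)
    return M / (M + 1.0)

def sublinear_exponent(theta):
    """Decay exponent 1 / (2 theta - 1) of case (iii), valid for theta in (1/2, 1)."""
    if not 0.5 < theta < 1.0:
        raise ValueError("theta must lie in (1/2, 1)")
    return 1.0 / (2.0 * theta - 1.0)
```

For instance, with the illustrative values $\gamma=0.1$ , $L_{H}=1$ , and $c=\sqrt{2}$ (an LSI with $\alpha=1$ ), one gets $M=10.1$ and a factor of about $0.91$ per iteration. As a function of $\gamma$ alone, $M$ is smallest at $\gamma=1/L_{H}$ , although that stepsize is only usable if it satisfies $\gamma<\gamma_{0}$ .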

Remark 2.

In the usual sampling case, i.e., $\mathscr{H}$ is the negative entropy, and under the log-Sobolev condition, $r_{0}=+\infty$ . Therefore, $\mu_{0}$ can be arbitrary in $\mathcal{P}_{2,\operatorname{abs}}(X)$ . In the general case, however, a good enough starting point (i.e., $\mathcal{F}(\mu_{0})<r_{0}$ ) is needed to guarantee that we start in the region where the Łojasiewicz condition applies. In the sampling case, $\mathcal{F}(\mu_{n})-\mathcal{F}^{*}=\operatorname{KL}(\mu_{n}\|\mu^{*})$ where $\mu^{*}(x)\propto\exp(-F(x))$ is the target distribution (see Rmk. 1), so Thm. 4 provides the convergence rate of $\{\mu_{n}\}_{n\in\mathbb{N}}$ to $\mu^{*}$ in terms of KL divergence, and this convergence is exponentially fast if $\theta\in(0,1/2]$ .

Theorem 5.

Under the same set of assumptions as in Thm. 4, the sequence $\{\mu_{n}\}_{n\in\mathbb{N}}$ is a Cauchy sequence under Wasserstein topology. Furthermore, as the Wasserstein space $(\mathcal{P}_{2}(X),W_{2})$ is complete [4, Thm. 2.2], every Cauchy sequence is convergent, i.e., there exists $\mu^{*}\in\mathcal{P}_{2}(X)$ such that $\mu_{n}\xrightarrow{\operatorname{\text{Wass}}}\mu^{*}.$ The limit distribution $\mu^{*}$ is indeed the global minimizer of $\mathcal{F}$ . In addition:

(i)

if $\theta=0$ , $W_{2}(\mu_{n},\mu^{*})$ converges to $0$ in a finite number of steps;
(ii)

if $\theta\in(0,1/2]$ , $W_{2}(\mu_{n},\mu^{*})=O\left(\left(\frac{M}{M+1}\right)^{n}\right),\text{ where }M=1+\frac{(2(\gamma^{2}L_{H}^{2}+1))^{\frac{1}{2\theta}}}{(1-\theta)\gamma^{\frac{1-\theta}{\theta}}c^{\frac{1}{\theta}}}$ ;
(iii)

if $\theta\in(1/2,1)$ , $W_{2}(\mu_{n},\mu^{*})=O\left(n^{-\frac{1-\theta}{2\theta-1}}\right)$ .

Proof of Thm. 5 is in Appx. A.8. This theorem provides convergence to optimality in terms of Wasserstein distance.

5 Numerical illustrations

The JKO operator is a great theoretical tool to study Wasserstein gradient flows, e.g., it is the main ingredient used in the seminal paper [35] that gives a variational structure to the Fokker-Planck equation. However, the JKO operator is not quite scalable (at least for now). To learn the JKO, recent advances use the gradient of an input-convex neural network [5] to approximate the optimal Monge map pushing $\nu_{n+1}$ to $\mu_{n+1}$ [45]. This approach is inspired by Brenier's theorem, which asserts that an optimal Monge map has to be the (sub)gradient field of some convex function. We use this neural network approach to perform numerical sampling experiments on non-log-concave distributions: the Gaussian mixture distribution and the distance-to-set-prior [53] relaxed von Mises–Fisher distribution. Both are log-DC and the latter has a non-differentiable logarithmic probability density (see Appx. C). Fig. 1 presents the sampling results. Implementation details are in Appx. B and experiment details are in Appx. C.
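The idea of parameterizing a transport map as the gradient of an input-convex network can be sketched in a few lines. The block below is not the implementation used in the experiments (see Appx. B for that); it is a minimal two-layer input-convex network in NumPy with hypothetical layer sizes and parameter names, showing only how convexity in the input is enforced and how the gradient map is obtained. Training the parameters against the JKO objective is not shown.

```python
import numpy as np

def softplus(t):
    """Numerically stable softplus, a convex and nondecreasing activation."""
    return np.logaddexp(0.0, t)

def sigmoid(t):
    """Derivative of softplus, written in a numerically stable form."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))

def init_icnn(dim, hidden, rng):
    """Random parameters for a two-layer input-convex network (illustrative sizes)."""
    return {
        "A0": rng.normal(size=(hidden, dim)),        # first layer, unconstrained
        "b0": rng.normal(size=hidden),
        "Wz": np.abs(rng.normal(size=(1, hidden))),  # nonnegative weights on the hidden path
        "A1": rng.normal(size=(1, dim)),             # skip connection from the input
        "b1": rng.normal(size=1),
    }

def icnn(params, x):
    """Scalar output, convex in each row of x (x has shape (n, dim))."""
    z = softplus(x @ params["A0"].T + params["b0"])
    out = softplus(z @ params["Wz"].T + x @ params["A1"].T + params["b1"])
    return out[:, 0]

def icnn_grad(params, x):
    """Gradient of the network output with respect to x: a candidate transport map."""
    pre0 = x @ params["A0"].T + params["b0"]
    z = softplus(pre0)
    pre1 = z @ params["Wz"].T + x @ params["A1"].T + params["b1"]
    inner = (sigmoid(pre0) * params["Wz"]) @ params["A0"] + params["A1"]
    return sigmoid(pre1) * inner   # chain rule through both softplus layers
```

Convexity follows because the hidden path uses nonnegative weights and convex, nondecreasing activations, so the gradient map is monotone, which is the structural property that Brenier's theorem requires of an optimal Monge map.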

Figure 1: (a) and (b): Mixture of Gaussians. (a) shows samples obtained from semi FB Euler at iteration $40$ and (b) shows KL divergence along the training process: semi FB Euler with sound theory is as fast as FB Euler; (c) and (d): Relaxed von Mises-Fisher. (c) shows true probability density, and (d) shows the sample histogram obtained from semi FB Euler. In this experiment, FB Euler fails to work, attributed to the high curvature of the relaxed von Mises-Fisher.

6 Related work

We first narrow down our discussion to FB Euler and its variants in the Wasserstein space. When $\mathscr{H}$ is the negative entropy, Wibisono [62] provides an insightful discussion of why FB Euler should be consistent (no asymptotic bias): the backward step is adjoint to the forward step and hence preserves stationarity. However, no convergence theory is presented for FB Euler in the Wasserstein space in [62]. Recently, Salim et al. [57] provide convergence guarantees for FB Euler within the following setting: $\mathscr{H}$ is convex along generalized geodesics, and $F$ is Lipschitz smooth and convex/strongly convex.

Some other papers have tangential relations to our work, mainly from the ULA (and its variants) literature. Durmus et al. [25] analyze the ULA from the convex optimization perspective. Vempala et al. [59] show that LSI and Hessian boundedness suffice for fast convergence of the ULA where "fast" is understood as fast to the biased target since ULA is a biased algorithm. Balasubramanian et al. [8] analyze the ULA under quite mild conditions: log-density is Lipschitz/Hölder smooth. Bernton [10] studies the proximal-ULA also under the convex assumption, where the difference to the ULA is the first step: gradient descent is replaced by the proximal operator. Similar to ULA, proximal-ULA is asymptotically biased. To address nonsmoothness, another line of research utilizes Moreau-Yosida envelopes to create smooth approximations of the ULA dynamics [28, 42]. This approach is also applicable to certain classes of non-log-concave distributions [42] and is more of a flavour of discretization error quantification.

7 Conclusion

We propose a new semi FB Euler scheme as a discretization of Wasserstein gradient flow and show that it has favourable theoretical guarantees that the commonly used FB Euler does not yet have when the objective function is not convex along generalized geodesics. Our theoretical analysis opens up interesting avenues for future work.

Acknowledgments and Disclosure of Funding

This work is supported by the Research Council of Finland Flagship programme: Finnish Center for Artificial Intelligence FCAI, and additionally by grants 345811, 348952, and 346376 (VILMA: Centre of Excellence: Virtual Laboratory for Molecular Level Atmospheric Transformations). The authors wish to thank the Finnish Computing Competence Infrastructure (FCCI) for supporting this project with computational and data storage resources. H.P.H. Luu specifically thanks Michel Ledoux, Luigi Ambrosio, and Alain Durmus for helpful information.

References

[1] Miju Ahn, Jong-Shi Pang, and Jack Xin. Difference-of-convex learning: directional stationarity, optimality, and sparsity. SIAM Journal on Optimization, 27(3):1637–1665, 2017.
[2] Luigi Ambrosio and Nicola Gigli. A user’s guide to optimal transport. In Modelling and Optimisation of Flows on Networks: Cetraro, Italy 2009, Editors: Benedetto Piccoli, Michel Rascle, pages 1–155. Springer, 2013.
[3] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2005.
[4] Luigi Ambrosio and Giuseppe Savaré. Gradient flows of probability measures. Handbook of differential equations: evolutionary equations, 3:1–136, 2006.
[5] Brandon Amos, Lei Xu, and J Zico Kolter. Input convex neural networks. In International Conference on Machine Learning, pages 146–155. PMLR, 2017.
[6] Michael Arbel, Anna Korba, Adil Salim, and Arthur Gretton. Maximum mean discrepancy gradient flow. Advances in Neural Information Processing Systems, 32, 2019.
[7] Miroslav Bačák and Jonathan M Borwein. On difference convexity of locally Lipschitz functions. Optimization, 60(8-9):961–978, 2011.
[8] Krishna Balasubramanian, Sinho Chewi, Murat A Erdogdu, Adil Salim, and Shunshi Zhang. Towards a theory of non-log-concave sampling: first-order stationarity guarantees for Langevin Monte Carlo. In Conference on Learning Theory, pages 2896–2923. PMLR, 2022.
[9] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183–202, 2009.
[10] Espen Bernton. Langevin Monte Carlo and JKO splitting. In Conference on learning theory, pages 1777–1798. PMLR, 2018.
[11] Patrick Billingsley. Convergence of probability measures. John Wiley & Sons, 2013.
[12] Adrien Blanchet and Jérôme Bolte. A family of functional inequalities: Łojasiewicz inequalities and displacement convex functions. Journal of Functional Analysis, 275(7):1650–1673, 2018.
[13] Vladimir Igorevich Bogachev and Maria Aparecida Soares Ruas. Measure theory, volume 2. Springer, 2007.
[14] Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
[15] Benoît Bonnet. A Pontryagin maximum principle in Wasserstein spaces for constrained optimal control problems. ESAIM: Control, Optimisation and Calculus of Variations, 25:52, 2019.
[16] Jonathan Borwein and Adrian Lewis. Convex analysis and nonlinear optimization: theory and examples. Springer, 2006.
[17] Glen E Bredon. Topology and geometry, volume 139. Springer Science & Business Media, 2013.
[18] Yann Brenier. Polar factorization and monotone rearrangement of vector-valued functions. Communications on pure and applied mathematics, 44(4):375–417, 1991.
[19] brett1479 (https://math.stackexchange.com/users/62876/brett1479). Borel sigma algebra of one point compactification. Mathematics Stack Exchange. URL:https://math.stackexchange.com/q/3532983 (version: 2020-02-03).
[20] Haim Brézis. Functional analysis, Sobolev spaces and partial differential equations, volume 2. Springer, 2011.
[21] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018.
[22] Frank H Clarke. Optimization and nonsmooth analysis. SIAM, 1990.
[23] Ying Cui, Jong-Shi Pang, and Bodhisattva Sen. Composite difference-max programs for modern statistical estimation problems. SIAM Journal on Optimization, 28(4):3344–3374, 2018.
[24] Arnak S Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society Series B: Statistical Methodology, 79(3):651–676, 2017.
[25] Alain Durmus, Szymon Majewski, and Błażej Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
[26] Alain Durmus and Éric Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551 – 1587, 2017.
[27] Alain Durmus and Éric Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854 – 2882, 2019.
[28] Alain Durmus, Eric Moulines, and Marcelo Pereyra. Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. SIAM Journal on Imaging Sciences, 11(1):473–506, 2018.
[29] Matthias Erbar. The heat equation on manifolds as a gradient flow in the Wasserstein space. In Annales de l’IHP Probabilités et statistiques, volume 46, pages 1–23, 2010.
[30] Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1):267–305, 2016.
[31] Piotr Hajlasz. Is there a Borel measurable $f:\mathbb{R}^{d}\to\mathbb{R}^{d}$ such that $f(x)\in\partial\varphi(x)$ for all $x$ ? MathOverflow. URL:https://mathoverflow.net/q/453991 (version: 2023-12-02).
[32] W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. 1970.
[33] Horst Herrlich. Axiom of choice, volume 1876. Springer, 2006.
[34] Sashank J Reddi, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. Advances in neural information processing systems, 29, 2016.
[35] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM journal on mathematical analysis, 29(1):1–17, 1998.
[36] Nicolas Lanzetti, Saverio Bolognani, and Florian Dörfler. First-order conditions for optimization in the Wasserstein space. arXiv preprint arXiv:2209.12197, 2022.
[37] Stanisław Łojasiewicz. Ensembles semi-analytiques. IHES notes, page 220, 1965.
[38] Hoai An Le Thi, Van Ngai Huynh, and Tao Pham Dinh. Convergence analysis of difference-of-convex algorithm with subanalytic data. Journal of Optimization Theory and Applications, 179(1):103–126, 2018.
[39] Hoai An Le Thi, Van Ngai Huynh, Tao Pham Dinh, and Hoang Phuc Hau Luu. Stochastic difference-of-convex-functions algorithms for nonconvex programming. SIAM Journal on Optimization, 32(3):2263–2293, 2022.
[40] Hoai An Le Thi and Tao Pham Dinh. The DC (difference of convex functions) programming and DCA revisited with DC models of real world nonconvex optimization problems. Annals of operations research, 133:23–46, 2005.
[41] John M. Lee. Smooth Manifolds. In Introduction to Smooth Manifolds, pages 1–31. Springer.
[42] Tung Duy Luu, Jalal Fadili, and Christophe Chesneau. Sampling from non-smooth distributions through Langevin diffusion. Methodology and Computing in Applied Probability, 23:1173–1201, 2021.
[43] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Conference on Learning Theory, pages 2388–2464. PMLR, 2019.
[44] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953.
[45] Petr Mokrov, Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M Solomon, and Evgeny Burnaev. Large-scale Wasserstein gradient flows. Advances in Neural Information Processing Systems, 34:15243–15256, 2021.
[46] Boris Mordukhovich and Mau Nam Nguyen. An easy path to convex analysis and applications. Springer Nature, 2023.
[47] Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013.
[48] Maher Nouiehed, Jong-Shi Pang, and Meisam Razaviyayn. On the pervasiveness of difference-convexity in optimization and statistics. Mathematical Programming, 174(1):195–222, 2019.
[49] Felix Otto and Cédric Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.
[50] Jong-Shi Pang, Meisam Razaviyayn, and Alberth Alvarado. Computing B-stationary points of nonsmooth DC programs. Mathematics of Operations Research, 42(1):95–118, 2017.
[51] Neal Parikh, Stephen Boyd, et al. Proximal algorithms. Foundations and trends® in Optimization, 1(3):127–239, 2014.
[52] Tao Pham Dinh and Hoai An Le Thi. Convex analysis approach to DC programming: theory, algorithms and applications. Acta mathematica vietnamica, 22(1):289–355, 1997.
[53] Rick Presman and Jason Xu. Distance-to-set priors and constrained Bayesian inference. In International Conference on Artificial Intelligence and Statistics, pages 2310–2326. PMLR, 2023.
[54] Gareth O Roberts and Richard L Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, pages 341–363, 1996.
[55] R Tyrrell Rockafellar. Lagrange multipliers and optimality. SIAM review, 35(2):183–238, 1993.
[56] R Tyrrell Rockafellar. Convex analysis, volume 11. Princeton university press, 1997.
[57] Adil Salim, Anna Korba, and Giulia Luise. The Wasserstein proximal gradient algorithm. Advances in Neural Information Processing Systems, 33:12356–12366, 2020.
[58] Amirhossein Taghvaei and Prashant Mehta. Accelerated flow for probability distributions. In International conference on machine learning, pages 6076–6085. PMLR, 2019.
[59] Santosh Vempala and Andre Wibisono. Rapid convergence of the unadjusted Langevin algorithm: Isoperimetry suffices. Advances in neural information processing systems, 32, 2019.
[60] Cédric Villani. Optimal transport: old and new, volume 338. Springer, 2009.
[61] Cédric Villani. Topics in optimal transportation, volume 58. American Mathematical Soc., 2021.
[62] Andre Wibisono. Sampling as optimization in the space of measures: The Langevin dynamics as a composite optimization problem. In Conference on Learning Theory, pages 2093–3027. PMLR, 2018.
[63] Hu Zhang and Yi-Shuai Niu. A Boosted-DCA with power-sum-DC decomposition for linearly constrained polynomial programs. Journal of Optimization Theory and Applications, pages 1–40, 2024.
[64] Xingyu Zhou. On the Fenchel duality between strong convexity and Lipschitz continuous gradient. arXiv preprint arXiv:1803.06573, 2018.

Appendix A Theory

Lemma 2 (Transfer lemma).

[2, Sect. 1] Let $T:\mathbb{R}^{m}\to\mathbb{R}^{n}$ be a measurable map, and $\mu\in\mathcal{P}(\mathbb{R}^{m})$ , then $T_{\#}\mu\in\mathcal{P}(\mathbb{R}^{n})$ and $\int{f(y)}d(T_{\#}\mu)(y)=\int{(f\circ T)(x)}d\mu(x)$ for every measurable function $f:\mathbb{R}^{n}\to\mathbb{R}$ , where the identity is to be understood as follows: one of the integrals exists (potentially $\pm\infty$ ) iff the other one exists, and in that case they are equal. Consequently, for a bounded function $f$ , both integrals exist as real numbers and are equal.

Lemma 3.

[3, Rmk. 6.2.11] Let $\mu,\nu\in\mathcal{P}_{2,\operatorname{abs}}(X)$ , then $T_{\nu}^{\mu}\circ T_{\mu}^{\nu}=I$ $\mu$ -a.e. and $T_{\mu}^{\nu}\circ T_{\nu}^{\mu}=I$ $\nu$ -a.e.

Theorem 6 (Characterization of Fréchet subdifferential for geodesically convex functions).

[3, Section 10.1] Suppose $\phi:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ is proper, l.s.c, convex on geodesics. Let $\mu\in\operatorname{dom}(\partial\phi)\cap\mathcal{P}_{2,\operatorname{abs}}(X)$ , then a vector $\xi\in L^{2}(X,X,\mu)$ belongs to the Fréchet subdifferential of $\phi$ at $\mu$ if and only if

\displaystyle\phi(\nu)-\phi(\mu)\geq\int_{X}{\langle\xi(x),T_{\mu}^{\nu}(x)-x\rangle}d\mu(x)\quad\forall\nu\in\operatorname{dom}(\phi).

Lemma 4.

Let $H:X\to\mathbb{R}$ be a convex function having quadratic growth and $\xi$ be a measurable selector of $\partial H$ , i.e., $\xi(x)\in\partial H(x)$ for all $x\in X$ . Then, for all $\mu\in\mathcal{P}_{2}(X)$ ,

\displaystyle\int_{X}{\|\xi(x)\|^{2}}d\mu(x)<+\infty.

(5)

In other words, $\xi\in L^{2}(X,X,\mu)$ for all $\mu\in\mathcal{P}_{2}(X)$ .

Proof.

Since $\xi(x)\in\partial H(x)$ , by tangent inequality for convex functions, we have

\displaystyle H(y)\geq H(x)+\langle\xi(x),y-x\rangle,\quad\forall y\in X.

By picking $y=x+\eta\xi(x)$ for some $\eta>0$ , we get

\displaystyle H(x+\eta\xi(x))-H(x)\geq\langle\xi(x),\eta\xi(x)\rangle=\eta\|\xi(x)\|^{2}.

(6)

Since $H$ has quadratic growth, for some $a>0$ ,

\displaystyle\begin{aligned}
H(x+\eta\xi(x))-H(x)&\leq|H(x+\eta\xi(x))|+|H(x)|\\
&\leq a\left(\|x+\eta\xi(x)\|^{2}+1\right)+a(\|x\|^{2}+1)\\
&\leq 2a+a\|x\|^{2}+a(\|x\|+\eta\|\xi(x)\|)^{2}\\
&\leq 2a+a\|x\|^{2}+2a(\|x\|^{2}+\eta^{2}\|\xi(x)\|^{2})\\
&=2a+3a\|x\|^{2}+2a\eta^{2}\|\xi(x)\|^{2}.
\end{aligned}

Combining with (6), it holds

\displaystyle\eta(1-2a\eta)\|\xi(x)\|^{2}\leq 2a+3a\|x\|^{2}.

By choosing $0<\eta<1/(2a)$ , we obtain

\displaystyle\|\xi(x)\|^{2}\leq\dfrac{2a}{\eta(1-2a\eta)}+\dfrac{3a}{\eta(1-2a\eta)}\|x\|^{2}.

Therefore, $\|\xi(x)\|^{2}$ has quadratic growth and, as a consequence, (5) holds for any $\mu\in\mathcal{P}_{2}(X).$ ∎
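As a sanity check of the final bound, the sketch below evaluates it on a toy example that is not from the paper: $H(x)=x^{2}+|x|$ in one dimension, which satisfies $|H(x)|\leq a(x^{2}+1)$ with $a=3/2$ . The selector `xi` and the function names are illustrative. Choosing $\eta=1/(4a)$ maximizes $\eta(1-2a\eta)$ and turns the bound into $16a^{2}+24a^{2}\|x\|^{2}$ .

```python
import numpy as np

def subgradient_bound(x, a, eta):
    """Right-hand side of the bound on ||xi(x)||^2 from the proof, for 0 < eta < 1/(2a)."""
    if not 0.0 < eta < 1.0 / (2.0 * a):
        raise ValueError("eta must lie in (0, 1/(2a))")
    denom = eta * (1.0 - 2.0 * a * eta)
    return (2.0 * a + 3.0 * a * x ** 2) / denom

def xi(x):
    """A measurable selector of the subdifferential of the toy H(x) = x^2 + |x|."""
    return 2.0 * x + np.where(x >= 0.0, 1.0, -1.0)
```

With $a=3/2$ and $\eta=1/6$ the bound reads $36+54x^{2}$ , which dominates $\xi(x)^{2}=(2|x|+1)^{2}$ for every $x$ , as the lemma predicts.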

Lemma 5.

Let $\phi:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ be a proper function. Let $\mu^{*}$ be a local minimizer of $\phi$ , then $\mu^{*}$ is a Fréchet stationary point of $\phi$ .

Proof.

There exists $r>0$ such that $\phi(\mu^{*})\leq\phi(\mu)$ for all $\mu\in\mathcal{P}_{2}(X):W_{2}(\mu,\mu^{*})<r.$ It follows that

\displaystyle\liminf_{\mu\xrightarrow{\operatorname{\text{Wass}}}\mu^{*}}{\dfrac{\phi(\mu)-\phi(\mu^{*})}{W_{2}(\mu,\mu^{*})}}\geq 0,

so $0\in\partial_{F}^{-}\phi(\mu^{*})$ , or $\mu^{*}$ is a Fréchet stationary point of $\phi$ . ∎

Lemma 6.

Let $\phi:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ be a proper, l.s.c, geodesically convex function. Suppose that $\mu^{*}\in\mathcal{P}_{2,\operatorname{abs}}(X)$ is a Fréchet stationary point of $\phi$ . Then, $\mu^{*}$ is a global minimizer of $\phi$ .

Proof.

By definition of Fréchet stationarity, $0\in\partial\phi(\mu^{*})$ . By characterization of subdifferential of geodesically convex functions (Thm. 6), it holds $\phi(\mu)\geq\phi(\mu^{*})$ for all $\mu\in\operatorname{dom}(\phi)$ , or $\mu^{*}$ is a global minimizer of $\phi$ . ∎

Lemma 7.

Under Assumption 1, let $\mu^{*}\in\operatorname{dom}(\mathcal{F})$ be a local minimizer of $\mathcal{F}$ , then $\mu^{*}$ is a critical point of $\mathcal{F}$ , i.e., $\partial(\mathscr{H}+\mathcal{E}_{G})(\mu^{*})\cap\partial\mathcal{E}_{H}(\mu^{*})\neq\emptyset.$

Proof.

Since $\mu^{*}$ is a local minimizer of $\mathcal{F}$ , there exists $r>0$ such that

\displaystyle\mathcal{F}(\mu^{*})\leq\mathcal{F}(\mu),\quad\forall\mu\in\mathcal{P}_{2}(X):W_{2}(\mu,\mu^{*})<r.

(7)

Let $\xi$ be a measurable selector of $\partial H$ . Thanks to Lemma 4, $\xi\in L^{2}(X,X,\mu^{*})$ . According to [4, Prop. 4.13], $\xi\in\partial\mathcal{E}_{H}(\mu^{*})$ . It follows from Thm. 6 that

\displaystyle\mathcal{E}_{H}(\mu)\geq\mathcal{E}_{H}(\mu^{*})+\int_{X}{\langle\xi(x),T_{\mu^{*}}^{\mu}(x)-x\rangle}d\mu^{*}(x),\quad\forall\mu\in\mathcal{P}_{2}(X).

(8)

From (7) and (8), for $\mu\in B(\mu^{*},r)$ ,

\displaystyle\mathscr{H}(\mu)+\mathcal{E}_{G}(\mu)\geq\mathscr{H}(\mu^{*})+\mathcal{E}_{G}(\mu^{*})+\int_{X}{\langle\xi(x),T_{\mu^{*}}^{\mu}(x)-x\rangle}d\mu^{*}(x).

Therefore, $\xi\in\partial(\mathscr{H}+\mathcal{E}_{G})(\mu^{*})$ since $\mathscr{H}+\mathcal{E}_{G}$ is geodesically convex. It follows that $\mu^{*}$ is a critical point of $\mathcal{F}.$ ∎

Lemma 8.

Let $\mathcal{U},\mathcal{V}:\mathcal{P}_{2}(X)\to\mathbb{R}{\cup\{+\infty\}}$ . The following statements hold:

a.

$\partial_{F}^{-}(\mathcal{U}+\mathcal{V})(\mu)\supset\partial_{F}^{-}\mathcal{U}(\mu)+\partial_{F}^{-}\mathcal{V}(\mu).$
b.

If $\mathcal{V}$ is Wasserstein differentiable at $\mu$ , then

\displaystyle\partial_{F}^{-}(\mathcal{U}+\mathcal{V})(\mu)=\partial_{F}^{-}\mathcal{U}(\mu)+\nabla_{W}\mathcal{V}(\mu).

(9)

Proof.

Item a. is trivial from the definition of Fréchet subdifferential. For item b., from item a., we first see that,

\displaystyle\partial_{F}^{-}(\mathcal{U}+\mathcal{V})(\mu)\supset\partial_{F}^{-}\mathcal{U}(\mu)+\nabla_{W}\mathcal{V}(\mu).

(10)

On the other hand, we apply item a. for $\mathcal{U}+\mathcal{V}$ and $-\mathcal{V}$ to obtain

\displaystyle\partial_{F}^{-}\mathcal{U}(\mu)\supset\partial_{F}^{-}(\mathcal{U}+\mathcal{V})(\mu)+\partial_{F}^{-}(-\mathcal{V})(\mu).

Since $-\nabla_{W}\mathcal{V}(\mu)\in\partial_{F}^{-}(-\mathcal{V})(\mu)$ , it follows that

\displaystyle\partial_{F}^{-}\mathcal{U}(\mu)\supset\partial_{F}^{-}(\mathcal{U}+\mathcal{V})(\mu)-\nabla_{W}\mathcal{V}(\mu).

(11)

From (10) and (11), we derive (9). ∎

Lemma 9.

Let $\mu\ll\mathscr{L}^{d}$ , and let $g$ be a strongly convex function. Then $\nabla g_{\#}\mu\ll\mathscr{L}^{d}.$

Proof.

Let $\Omega=\{x:\nabla^{2}g(x)\text{ exists}\}$ , then $\mathscr{L}^{d}(\mathbb{R}^{d}\setminus\Omega)=0$ (Aleksandrov, see, e.g., [3, Thm. 5.5.4]). Since $g$ is strongly convex, $\nabla g$ is injective on $\Omega$ and $|\det\nabla^{2}g|>0$ on $\Omega$ . By applying Lemma 5.5.3 [3], $\nabla g_{\#}\mu\ll\mathscr{L}^{d}.$ ∎

Lemma 10.

[57] Let $\mathcal{G}:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ be proper and l.s.c. Suppose that $\mathcal{G}$ is convex along generalized geodesics. Let $\nu\in\mathcal{P}_{2,\operatorname{abs}}(X)$ , $\mu,\pi\in\mathcal{P}_{2}(X)$ . If $\xi\in\partial\mathcal{G}(\mu)$ , then

\displaystyle\int_{X}{\langle\xi\circ T_{\nu}^{\mu}(x),T_{\nu}^{\pi}(x)-T_{\nu}^{\mu}(x)\rangle}d\nu(x)\leq\mathcal{G}(\pi)-\mathcal{G}(\mu).

A.1 Existence of a Borel measurable selector of the subdifferential of a convex function

Given a convex function $H:X\to\mathbb{R}$ , we prove that there exists a Borel measurable selector $S(x)\in\partial H(x)$ . Although this problem is of natural interest, we are not aware of a statement or proof of this result in standard convex analysis textbooks. Credit goes to a recent MathOverflow thread [31], on which the following detailed proof is based.

Firstly, we recall the Alexandroff compactification of a topological space $(X,\tau)$ . From set theory, $X$ is strictly smaller than $2^{X}$ , the set of all subsets of $X$ , i.e., there is no bijection from $X$ to $2^{X}$ . So $2^{X}$ cannot be contained in $X$ , and hence there is an element, named $\infty$ , that is not in $X$ . We denote $X^{\infty}=X\cup\{\infty\}$ . The one-point (Alexandroff) compactification theorem states that (1) there exists a topology $\tau^{\infty}$ on $X^{\infty}$ admitting $X$ as a topological subspace, i.e., the original topology $\tau$ on $X$ is inherited from $\tau^{\infty}$ , and (2) $(X^{\infty},\tau^{\infty})$ is compact.

The topology $\tau^{\infty}$ can be specifically described as follows: open sets of $\tau^{\infty}$ are either open sets of $\tau$ or the complements of the form $(X\setminus S)\cup\{\infty\}$ where $S$ are closed compact subsets of $X$ .

In our case, $X=\mathbb{R}^{d}$ and $\tau$ is the standard Euclidean topology, so the Alexandroff compactification of $X$ is also metrizable [17, Thm. 12.12]. It is, in fact, homeomorphic to the sphere $\mathbb{S}^{d}$ whose topology is inherited from the ambient space $\mathbb{R}^{d+1}$ . Moreover, the mentioned metric is the Riemannian metric of $\mathbb{S}^{d}$ [41, Thm. 13.29].

Secondly, since $\psi(x):=H(x)+(1/2)\|x\|^{2}$ is 1-strongly convex, its Fenchel conjugate $y\mapsto\psi^{*}(y)$ defined as

\displaystyle\psi^{*}(y)=\sup_{x\in\mathbb{R}^{d}}\{\langle x,y\rangle-\psi(x)\}

is 1-smooth [64, Thm. 1].

By [56, Cor. 23.5.1], $(\partial\psi)^{-1}=\nabla\psi^{*}$ in the sense that

\displaystyle y\in\partial\psi(\nabla\psi^{*}(y))\quad\forall y\in\mathbb{R}^{% d}.

(12)

On the other hand, $\partial\psi$ is strongly monotone [46, Ex. 3.9] in the following sense,

\displaystyle\langle x_{2}-x_{1},y_{2}-y_{1}\rangle\geq\|x_{2}-x_{1}\|^{2}

(13)

for all $x_{1},x_{2}\in\mathbb{R}^{d}$ and $y_{1}\in\partial\psi(x_{1}),y_{2}\in\partial\psi(x_{2})$ .

We see that $\nabla\psi^{*}$ is surjective. Indeed, we show that for every $x\in\mathbb{R}^{d}$ , there exists $y\in\mathbb{R}^{d}$ such that $\nabla\psi^{*}(y)=x$ . We show that this relation holds for any $y\in\partial\psi(x)$ . By contradiction, suppose that $\nabla\psi^{*}(y)\neq x$ . From the strong monotonicity of $\partial\psi$ as in (13), $\partial\psi(\nabla\psi^{*}(y))\cap\partial\psi(x)=\emptyset.$ However, from (12) and by the choice of $y$ , it holds $y\in\partial\psi(\nabla\psi^{*}(y))\cap\partial\psi(x)$ . This is a contradiction.

Thirdly, we recall a fundamental result on the compactness of the subdifferential of a convex function: if $C$ is compact, then $\partial\psi(C)$ is compact [56, Thm. 24.7].

Fourthly, we need the Federer-Morse theorem [13] as follows:

Theorem 7.

Let $Z$ be a compact metric space, $Y$ be a Hausdorff topological space and $f:Z\to Y$ be a continuous mapping. Then, there exists a Borel set $B\subset Z$ such that $f(B)=f(Z)$ and $f$ is injective on $B$ . Furthermore, $f^{-1}:f(Z)\to B$ is Borel.

Now we observe that $\nabla\psi^{*}(x)\to\infty$ as $x\to\infty$ . Otherwise, by using the compactness of $\partial\psi$ and (12), we would get a contradiction immediately. We can then extend $\nabla\psi^{*}$ continuously to the Alexandroff compactification $X^{\infty}=\mathbb{R}^{d}\cup\{\infty\}$ by simply putting $\nabla\psi^{*}(\infty)=\infty.$ We shall show that this extension of $\nabla\psi^{*}$ is continuous from $(X^{\infty},\tau^{\infty})$ to $(X^{\infty},\tau^{\infty})$ , i.e., $(\nabla\psi^{*})^{-1}(V)\in\tau^{\infty}$ for all $V\in\tau^{\infty}$ . Recall that, by construction, open sets of $\tau^{\infty}$ are either open sets of $\tau$ or the complements of the form $(X\setminus S)\cup\{\infty\}$ where $S$ are compact subsets of $X$ . The former type of open sets is handled easily since $\nabla\psi^{*}$ is already continuous in $(X,\tau)$ . For the latter type, let $U=(\nabla\psi^{*})^{-1}((X\setminus S)\cup\{\infty\})=(\nabla\psi^{*})^{-1}(X\setminus S)\cup\{\infty\}=(X\setminus(\nabla\psi^{*})^{-1}(S))\cup\{\infty\}$ for a compact set $S$ . Proving $U$ open boils down to proving $(\nabla\psi^{*})^{-1}(S)$ compact. Indeed, it is closed since $S$ is closed and $\nabla\psi^{*}$ is continuous. It is also bounded; otherwise, this would contradict $\nabla\psi^{*}(x)\to\infty$ as $x\to\infty$ .

We can now apply the Federer-Morse theorem for $Z=Y=(X^{\infty},\tau^{\infty})$ by noting that $(X^{\infty},\tau^{\infty})$ is metrizable and a metric space is a Hausdorff space, and for $f=\nabla\psi^{*}$ : there exists a Borel set $B\subset X^{\infty}$ such that $\nabla\psi^{*}|_{B}:B\to X^{\infty}$ is a bijection and the inverse mapping $(\nabla\psi^{*}|_{B})^{-1}:X^{\infty}\to B$ is Borel measurable (here Borel set/measurability are with respect to $\tau^{\infty}$ , not yet $\tau$ ). This is the Borel (w.r.t. $\tau^{\infty}$ ) selector of $\partial\psi$ .

Finally, we need to convert Borel measurability w.r.t. $\tau^{\infty}$ to Borel measurability w.r.t. $\tau$ . In terms of mapping, $\infty$ is mapped to $\infty$ either way around. So we only need to show: $(\nabla\psi^{*}|_{B})^{-1}:X\to X$ is Borel measurable w.r.t. $\tau$ . Take any Borel set (w.r.t. $\tau$ ) $E\subset X$ ; then $(\nabla\psi^{*}|_{B})(B\cap E)$ is a Borel set w.r.t. $\tau^{\infty}$ and does not contain $\infty$ . We shall prove $(\nabla\psi^{*}|_{B})(B\cap E)$ is a Borel set w.r.t. $\tau$ . This follows directly from the following claim, which is from another Mathematics Stack Exchange thread [19].

Claim [19]:

\displaystyle\sigma(\tau^{\infty})=\sigma(\tau\cup\{\infty\})=\sigma(\tau)\cup% \{V\cup\{\infty\}:V\in\sigma(\tau)\}.

(14)

A sketch of the proof of the claim goes as follows. For the first equality in (14), first we have $\sigma(\tau\cup\{\infty\})\subset\sigma(\tau^{\infty})$ because (1) $\tau\subset\tau^{\infty}$ and (2) $\{\infty\}=X^{\infty}\setminus X\in\sigma(\tau^{\infty})$ as $X,X^{\infty}\in\tau^{\infty}$ . On the other hand, $\sigma(\tau^{\infty})\subset\sigma(\tau\cup\{\infty\})$ because, again, of the construction of $\tau^{\infty}$ : let $U\in\tau^{\infty}$ , if $U\in\tau$ then $U\in\sigma(\tau\cup\{\infty\})$ , otherwise $U=(X\setminus S)\cup\{\infty\}$ for some compact set $S\subset X$ . As $X=\mathbb{R}^{d}$ , $S$ is closed, so $(X\setminus S)\in\tau$ implying $U\in\sigma(\tau\cup\{\infty\}).$ For the second equality in (14), it is straightforward to verify that $\mathscr{G}:=\sigma(\tau)\cup\{V\cup\{\infty\}:V\in\sigma(\tau)\}$ is a sigma-algebra in $X^{\infty}$ . Since $\tau\cup\{\infty\}\subset\mathscr{G}$ , it holds $\sigma(\tau\cup\{\infty\})\subset\mathscr{G}$ . Conversely, as $\sigma(\tau)\subset\sigma(\tau\cup\{\infty\})$ and, consequently, $V\cup\{\infty\}\in\sigma(\tau\cup\{\infty\})$ for all $V\in\sigma(\tau)$ , it holds $\mathscr{G}\subset\sigma(\tau\cup\{\infty\}).$

We conclude that $(\nabla\psi^{*}|_{B})^{-1}:X\to X$ is a Borel (w.r.t. $\tau$ ) selector of $\partial\psi$ . As a consequence, $S:=(\nabla\psi^{*}|_{B})^{-1}-I$ is a Borel measurable selector of $\partial H$ .
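As a sanity check of the construction, consider the one-dimensional example $H(x)=|x|$ , which is our own illustrative choice and not taken from the paper. Here $\psi(x)=|x|+x^{2}/2$ , the map $\nabla\psi^{*}$ is soft-thresholding at level $1$ , and $x\mapsto\operatorname{sign}(x)$ is one Borel selector of $\partial H$ . The sketch below (all function names are hypothetical) checks relation (12), the 1-Lipschitz property of $\nabla\psi^{*}$ , and that $I+S$ is a right inverse of $\nabla\psi^{*}$ .

```python
import numpy as np

def grad_psi_conj(y):
    """Gradient of psi* for psi(x) = |x| + x^2/2: soft-thresholding at level 1."""
    return np.sign(y) * np.maximum(np.abs(y) - 1.0, 0.0)

def selector(x):
    """One Borel selector of the subdifferential of H(x) = |x| (value 0 at x = 0)."""
    return np.sign(x)

def in_subdiff_psi(y, x, tol=1e-12):
    """Check y in d psi(x) = x + d|x|(x), where d|x|(0) = [-1, 1]."""
    g = y - x
    return np.where(x == 0, np.abs(g) <= 1.0 + tol, np.abs(g - np.sign(x)) <= tol)

ys = np.linspace(-3.0, 3.0, 601)
xs = grad_psi_conj(ys)
relation_12_holds = bool(np.all(in_subdiff_psi(ys, xs)))  # y in d psi(grad psi*(y))

pts = np.linspace(-2.0, 2.0, 401)
right_inverse = np.allclose(grad_psi_conj(pts + selector(pts)), pts)
```

In this example the set $B$ of the Federer-Morse theorem can be taken as the image of $I+S$ , on which soft-thresholding is injective.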

A.2 DC structure of Maximum Mean Discrepancy

Let $k$ be a kernel whose gradient is Lipschitz continuous, i.e., for some $L>0$ ,

\displaystyle\|\nabla k(x,y)-\nabla k(x^{\prime},y^{\prime})\|\leq L\left(\|x-% x^{\prime}\|^{2}+\|y-y^{\prime}\|^{2}\right)^{\frac{1}{2}},\quad\forall x,x^{% \prime},y,y^{\prime}\in X,

which can be expressed equivalently as

\displaystyle\|\nabla_{x}k(x,y)-\nabla_{x}k(x^{\prime},y^{\prime})\|^{2}+\|% \nabla_{y}k(x,y)-\nabla_{y}k(x^{\prime},y^{\prime})\|^{2}\leq L^{2}\left(\|x-x% ^{\prime}\|^{2}+\|y-y^{\prime}\|^{2}\right).

Let $\mu^{*}$ be some fixed distribution. Consider the following free-energy functional

\displaystyle\mathcal{F}(\mu)=\int_{X}\int_{X}k(x,y)d\mu(x)d\mu(y)-2\int_{X}% \int_{X}k(x,y)d\mu^{*}(y)d\mu(x).

Let $\alpha>0$ , we can rewrite $\mathcal{F}$ as follows

\displaystyle\mathcal{F}(\mu)=\int_{X}\int_{X}\left[\alpha\|x\|^{2}+\alpha\|y% \|^{2}+k(x,y)\right]d\mu(x)d\mu(y)-2\int_{X}\left[\alpha\|x\|^{2}+\int_{X}k(x,% y)d\mu^{*}(y)\right]d\mu(x).

As $k$ is Lipschitz smooth w.r.t. $(x,y)$ , $x\mapsto\int{k(x,y)}d\mu^{*}(y)$ is also Lipschitz smooth. Indeed,

	$\displaystyle\left\|\nabla_{x}\int_{X}{k(x,y)}d\mu^{*}(y)-\nabla_{x}\int_{X}{k(x^{\prime},y)}d\mu^{*}(y)\right\|$
	$\displaystyle=\left\|\int_{X}{\left(\nabla_{x}k(x,y)-\nabla_{x}k(x^{\prime},y)\right)}d\mu^{*}(y)\right\|$
	$\displaystyle\leq\left(\int_{X}\left\|\nabla_{x}k(x,y)-\nabla_{x}k(x^{\prime},y)\right\|^{2}d\mu^{*}(y)\right)^{\frac{1}{2}}$
	$\displaystyle\leq L\left(\int_{X}\left\|x-x^{\prime}\right\|^{2}d\mu^{*}(y)\right)^{\frac{1}{2}}$
	$\displaystyle=L\|x-x^{\prime}\|.$

Next, as a standard result, if $f$ is an L-smooth function, $x\mapsto(\alpha/2)\|x\|^{2}\pm f(x)$ are convex whenever $\alpha\geq L$ . Therefore, for $\alpha\geq L$ , $W(x,y):=\alpha\|x\|^{2}+\alpha\|y\|^{2}+k(x,y)$ is convex and $F(x)=-2\left[\alpha\|x\|^{2}+\int_{X}k(x,y)d\mu^{*}(y)\right]$ is concave. From [3, Prop. 9.3.5], the interaction energy corresponding to $W$ is generalized geodesically convex.
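The convexification step can be checked numerically on a concrete kernel. The sketch below assumes the one-dimensional Gaussian kernel $k(x,y)=\exp(-(x-y)^{2}/2)$ , which is our illustrative choice; the helper names are hypothetical. It estimates $L$ as the largest spectral norm of the Hessian of $k$ on a grid and then inspects the smallest Hessian eigenvalue of $W$ .

```python
import numpy as np

def kernel_hessian(x, y):
    """Hessian w.r.t. (x, y) of the 1D Gaussian kernel k(x, y) = exp(-(x - y)^2 / 2)."""
    r = x - y
    c = (r * r - 1.0) * np.exp(-0.5 * r * r)
    return np.array([[c, -c], [-c, c]])

def min_eig_W(alpha, grid):
    """Smallest Hessian eigenvalue of W(x, y) = alpha x^2 + alpha y^2 + k(x, y) on a grid."""
    return min(
        np.linalg.eigvalsh(2.0 * alpha * np.eye(2) + kernel_hessian(x, y)).min()
        for x in grid
        for y in grid
    )

grid = np.linspace(-4.0, 4.0, 41)
# Lipschitz constant of grad k, estimated as the largest Hessian spectral norm on the grid
L = max(np.abs(np.linalg.eigvalsh(kernel_hessian(x, y))).max() for x in grid for y in grid)

convex_case = min_eig_W(L, grid)       # alpha = L: Hessian of W stays positive semidefinite
nonconvex_case = min_eig_W(0.5, grid)  # alpha too small: a negative eigenvalue appears
```

For this kernel the estimate gives $L=2$ , attained on the diagonal $x=y$ , so $W$ is convex for $\alpha\geq 2$ , in line with the sufficient condition $\alpha\geq L$ stated above.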

A.3 Proof of Lemma 1

Since $I+\gamma S$ is a subgradient selector of a convex function, the optimal transport between $\mu_{n}$ and $\nu_{n+1}$ is given by

\displaystyle T_{\mu_{n}}^{\nu_{n+1}}=I+\gamma S.

(15)

and between $\mu_{n+1}$ and $\nu_{n+1}$ [3, Lem. 10.1.2]

\displaystyle T_{\mu_{n+1}}^{\nu_{n+1}}\in I+\gamma\partial\left(\mathcal{E}_{% G}+\mathscr{H}\right)(\mu_{n+1}).

(16)

Since $\mathcal{E}_{H}$ is convex along generalized geodesics [3, Prop. 9.3.2] and $S$ is a subgradient of $\mathcal{E}_{H}$ at $\mu_{n}$ [4, Proposition 4.13], by Lem. 10 it holds, for any $\nu\in\mathcal{P}_{2,\operatorname{abs}}(X)$ ,

\displaystyle\mathcal{E}_{H}(\mu_{n+1})

\displaystyle\geq\mathcal{E}_{H}(\mu_{n})+\int_{X}{\langle S\circ T_{\nu}^{\mu% _{n}}(x),T_{\nu}^{\mu_{n+1}}(x)-T_{\nu}^{\mu_{n}}(x)\rangle}d\nu(x).

By choosing $\nu=\nu_{n+1}$ (note that $\nu_{n+1}\in\mathcal{P}_{2,\operatorname{abs}}(X)$ ),

$\displaystyle\mathcal{E}_{H}(\mu_{n+1})$	$\displaystyle\geq\mathcal{E}_{H}(\mu_{n})+\int_{X}{\langle S\circ T_{\nu_{n+1}% }^{\mu_{n}}(x),T_{\nu_{n+1}}^{\mu_{n+1}}(x)-T_{\nu_{n+1}}^{\mu_{n}}(x)\rangle}% d\nu_{n+1}(x)$
	$\displaystyle=\mathcal{E}_{H}(\mu_{n})+\dfrac{1}{\gamma}\int_{X}{\langle(T_{% \mu_{n}}^{\nu_{n+1}}-I)\circ T_{\nu_{n+1}}^{\mu_{n}}(x),T_{\nu_{n+1}}^{\mu_{n+% 1}}(x)-T_{\nu_{n+1}}^{\mu_{n}}(x)\rangle}d\nu_{n+1}(x)$
	$\displaystyle=\mathcal{E}_{H}(\mu_{n})+\dfrac{1}{\gamma}\int_{X}{\langle x-T_{% \nu_{n+1}}^{\mu_{n}}(x),T_{\nu_{n+1}}^{\mu_{n+1}}(x)-T_{\nu_{n+1}}^{\mu_{n}}(x% )\rangle}d\nu_{n+1}(x)$	(17)

where the first equality uses (15) and the second one uses Lem. 3.

On the other hand, since $\mathcal{E}_{G}+\mathscr{H}$ is convex along generalized geodesics, by applying Lem. 10 for $\mathcal{E}_{G}+\mathscr{H}$ at $\mu_{n+1}$ with a subgradient $\gamma^{-1}(T_{\mu_{n+1}}^{\nu_{n+1}}-I)\in\partial(\mathcal{E}_{G}+\mathscr{H% })(\mu_{n+1})$ (from (16)),

		$\displaystyle(\mathcal{E}_{G}+\mathscr{H})(\mu_{n})$
		$\displaystyle\geq(\mathcal{E}_{G}+\mathscr{H})(\mu_{n+1})+\int_{X}{\left% \langle\dfrac{(T_{\mu_{n+1}}^{\nu_{n+1}}-I)}{\gamma}\circ T_{\nu_{n+1}}^{\mu_{% n+1}}(x),T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\rangle}% d\nu_{n+1}(x)$
		$\displaystyle=(\mathcal{E}_{G}+\mathscr{H})(\mu_{n+1})+\dfrac{1}{\gamma}\int_{% X}{\left\langle x-T_{\nu_{n+1}}^{\mu_{n+1}}(x),T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{% \nu_{n+1}}^{\mu_{n+1}}(x)\right\rangle}d\nu_{n+1}(x)$		(18)

where the last equality uses Lem. 3.

By adding (17) and (18) side by side,

\displaystyle\mathcal{F}(\mu_{n})\geq\mathcal{F}(\mu_{n+1})+\dfrac{1}{\gamma}% \int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_% {n+1}(x).

(19)
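The descent inequality (19) can be checked in closed form on a toy problem. The sketch below is not the paper's experiment: it assumes $X=\mathbb{R}$ , $G(x)=ax^{2}/2$ , $H(x)=bx^{2}/2$ with $a>b\geq 0$ , and a Gaussian initialization, so that every iterate is Gaussian $N(m,s^{2})$ . Under this assumption, the JKO step is obtained by solving the first-order condition (16) among Gaussians, and, since optimal maps on the real line are monotone, the integral in (19) reduces to $(m_{n}-m_{n+1})^{2}+(s_{n}-s_{n+1})^{2}$ . All function names are hypothetical.

```python
import math

def objective(m, s, a, b):
    """F(mu) = E_G(mu) - E_H(mu) + negative entropy for mu = N(m, s^2),
    with G(x) = a x^2 / 2 and H(x) = b x^2 / 2 (toy problem, assumed here)."""
    return 0.5 * (a - b) * (m * m + s * s) - math.log(s) - 0.5 * math.log(2 * math.pi * math.e)

def semi_fb_step(m, s, a, b, gamma):
    """One semi Forward-Backward Euler step restricted to 1D Gaussians."""
    # Forward part: nu = (I + gamma H')_# mu scales mean and standard deviation.
    m_nu, s_nu = (1 + gamma * b) * m, (1 + gamma * b) * s
    # Backward (JKO) part for E_G + entropy: the new mean shrinks, and the new
    # standard deviation is the positive root of (1 + gamma a) t^2 - s_nu t - gamma = 0.
    m_new = m_nu / (1 + gamma * a)
    s_new = (s_nu + math.sqrt(s_nu ** 2 + 4 * gamma * (1 + gamma * a))) / (2 * (1 + gamma * a))
    return m_new, s_new

def run(m=3.0, s=2.0, a=2.0, b=1.0, gamma=0.5, steps=200):
    """Return the final (m, s) and, per step, (F decrease, right-hand side gap of (19))."""
    history = []
    for _ in range(steps):
        m_new, s_new = semi_fb_step(m, s, a, b, gamma)
        decrease = objective(m, s, a, b) - objective(m_new, s_new, a, b)
        gap = ((m - m_new) ** 2 + (s - s_new) ** 2) / gamma
        history.append((decrease, gap))
        m, s = m_new, s_new
    return m, s, history

m_final, s_final, history = run()
```

With the default values, the iterates approach the target $N(0,1/(a-b))$ , and the recorded decrease of $\mathcal{F}$ dominates the recorded gap at every step.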

A.4 Proof of Theorem 1

Under Assumption 4, $S=\nabla H$ and $S$ is continuous.

For item (i), Lemma 1 implies that $\mathcal{F}(\mu_{n})\leq\mathcal{F}(\mu_{0})$ for all $n\in\mathbb{N}$ , so the whole sequence $\{\mu_{n}\}_{n\in\mathbb{N}}$ is contained in the sublevel set of $\mathcal{F}$ at level $\mathcal{F}(\mu_{0})$ which is compact under Wasserstein topology by Assumption 2. Therefore, there exists a subsequence of $\{\mu_{n}\}_{n\in\mathbb{N}}$ , denoted by $\{\mu_{n_{k}}\}_{k\in\mathbb{N}}$ , such that $\mu_{n_{k}}\xrightarrow{\operatorname{\text{Wass}}}\mu^{*}$ .

For item (ii), let $\mu^{*}\in\mathcal{P}_{2}(X)$ and $\mu_{n_{k}}\xrightarrow{\operatorname{\text{Wass}}}\mu^{*}$ . It holds

	$\displaystyle\liminf_{k\to\infty}{\mathcal{F}(\mu_{n_{k}})}$	$\displaystyle=\liminf_{k\to\infty}{\left(\mathscr{H}(\mu_{n_{k}})+\mathcal{E}_% {F}(\mu_{n_{k}})\right)}$
		$\displaystyle=\liminf_{k\to\infty}{\mathscr{H}(\mu_{n_{k}})}+\mathcal{E}_{F}(% \mu^{*})$
		$\displaystyle\geq\mathscr{H}(\mu^{*})+\mathcal{E}_{F}(\mu^{*}),$

since $\mathscr{H}$ is l.s.c. and $\mathcal{E}_{F}$ is continuous w.r.t. Wasserstein topology. Therefore, $\mathscr{H}(\mu^{*})<+\infty$ , which further implies that $\mu^{*}\in\mathcal{P}_{2,\operatorname{abs}}(X)$ .

We have

	$\displaystyle\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x)$	$\displaystyle=\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d{T^{\nu_{n+1}}_{\mu_{n}}}_{\#}\mu_{n}(x)$
		$\displaystyle=\int_{X}{\|x-T_{\nu_{n+1}}^{\mu_{n+1}}\circ T^{\nu_{n+1}}_{\mu_{n}}(x)\|^{2}}d\mu_{n}(x).$		(20)

We observe that $T_{\nu_{n+1}}^{\mu_{n+1}}\circ T^{\nu_{n+1}}_{\mu_{n}}$ is a (possibly non-optimal) transport map pushing $\mu_{n}$ to $\mu_{n+1}$ . Hence, by the optimality of $T_{\mu_{n}}^{\mu_{n+1}}$ ,

\displaystyle\int_{X}{\|x-T_{\nu_{n+1}}^{\mu_{n+1}}\circ T^{\nu_{n+1}}_{\mu_{n% }}(x)\|^{2}}d\mu_{n}(x)\geq\int_{X}{\|x-T^{\mu_{n+1}}_{\mu_{n}}(x)\|^{2}}d\mu_% {n}(x)=W_{2}^{2}(\mu_{n},\mu_{n+1}).

(21)

By Lem. 1 and (20), (21),

\displaystyle\mathcal{F}(\mu_{n})\geq\mathcal{F}(\mu_{n+1})+\dfrac{1}{\gamma}W% _{2}^{2}(\mu_{n},\mu_{n+1}).

(22)

Since $\mathcal{F}$ is bounded below (Assumption 1), telescoping (22) gives us

\displaystyle\sum_{n=0}^{\infty}{W_{2}^{2}(\mu_{n},\mu_{n+1})}<+\infty.

(23)

In particular, $W_{2}(\mu_{n},\mu_{n+1})\to 0$ . This together with $\mu_{n_{k}}\xrightarrow{\operatorname{\text{Wass}}}\mu^{*}$ implies $\mu_{n_{k}+1}\xrightarrow{\operatorname{\text{Wass}}}\mu^{*}$ .

Next, recall that $\nu_{n_{k}+1}=(I+\gamma S)_{\#}\mu_{n_{k}}$ , we show that

\displaystyle\nu_{n_{k}+1}\xrightarrow{\operatorname{\text{narrow}}}\nu^{*}:=(% I+\gamma S)_{\#}\mu^{*}\text{ as }k\to+\infty.

To this end, let $f$ be a continuous and bounded test function on $X$ . By the transfer lemma (Lem. 2),

$\displaystyle\lim_{k\to\infty}\int_{X}{f(x)}d\nu_{n_{k}+1}(x)$	$\displaystyle=\lim_{k\to\infty}\int_{X}{f(x)}d(I+\gamma S)_{\#}\mu_{n_{k}}(x)$
	$\displaystyle=\lim_{k\to\infty}\int_{X}{f(x+\gamma S(x))}d\mu_{n_{k}}(x)$	(24)
	$\displaystyle=\int_{X}{f(x+\gamma S(x))}d\mu^{*}(x)$
	$\displaystyle=\int_{X}{f(x)}d\nu^{*}(x),$	(25)

since $S$ is continuous. So $\nu_{n_{k}+1}\xrightarrow{\operatorname{\text{narrow}}}\nu^{*}$ . We go one step further and prove that $\nu_{n_{k}+1}$ actually converges to $\nu^{*}$ in the Wasserstein metric. This boils down to showing convergence in second-order moments, i.e.,

\displaystyle\mathfrak{m}_{2}(\nu_{n_{k}+1})\to\mathfrak{m}_{2}(\nu^{*}),

which is equivalent to showing that

\displaystyle\int_{X}{\|x+\gamma S(x)\|^{2}}d\mu_{n_{k}}(x)\to\int_{X}{\|x+% \gamma S(x)\|^{2}}d\mu^{*}(x).

On the other hand, $\psi(x):=\|x+\gamma S(x)\|^{2}$ has quadratic growth (follows from Lem. 4) and $\mu_{n_{k}}\to\mu^{*}$ in the Wasserstein metric, so [2, Prop. 2.4]

\displaystyle\lim_{k\to\infty}{\int_{X}{\|x+\gamma S(x)\|^{2}}d\mu_{n_{k}}(x)}% =\int_{X}\|x+\gamma S(x)\|^{2}d\mu^{*}(x)

Therefore, $\nu_{n_{k}+1}\to\nu^{*}$ in Wasserstein metric.

To proceed further, we need the following theorem stating that the graph of the subdifferential of a geodesically convex function is closed under the product of Wasserstein and weak topologies.

Theorem 8 (Closedness of subdifferential graph).

[3, Lemma 10.1.3] Let $\phi$ be a geodesically convex functional satisfying $\operatorname{dom}(\partial\phi)\subset\mathcal{P}_{2,\operatorname{abs}}(X).$ Let $\{\mu_{n}\}_{n\in\mathbb{N}}$ be a sequence converging in Wasserstein metric to $\mu\in\operatorname{dom}(\phi)$ . Let $\xi_{n}\in\partial\phi(\mu_{n})$ be satisfying

\displaystyle\sup_{n\in\mathbb{N}}\int_{X}{\|\xi_{n}(x)\|^{2}}d\mu_{n}(x)<+\infty,

(26)

and converging weakly to $\xi\in L^{2}(X,X,\mu)$ in the following sense:

\displaystyle\lim_{n\to\infty}\int_{X}{\zeta(x)\xi_{n}(x)}d\mu_{n}(x)=\int_{X}% {\zeta(x)\xi(x)}d\mu(x),\quad\forall\zeta\in C_{c}^{\infty}(X).

(27)

Then $\xi\in\partial\phi(\mu).$

As a side note, we need the notion of weak convergence in the above theorem because – unlike subdifferentials in flat Euclidean space – each $\xi_{n}$ lives in its own $L^{2}(X,X,\mu_{n})$ space.

Back to our proof, for item (26), we show that

\displaystyle\sup_{n\in\mathbb{N}}\int_{X}{\left\|{T_{\mu_{n}}^{\nu_{n}}(x)-x}% \right\|^{2}}d\mu_{n}(x)<+\infty.

(28)

We proceed as follows to prove (28). We first show that $\sup_{n\in\mathbb{N}}{\mathfrak{m}_{2}(\mu_{n})}<+\infty$ . By contradiction, by assuming $\sup_{n\in\mathbb{N}}{\mathfrak{m}_{2}(\mu_{n})}=+\infty$ , we can extract a subsequence $\{\mu_{n_{k}}\}_{k\in\mathbb{N}}$ such that

\displaystyle\lim_{k\to\infty}{\mathfrak{m}_{2}(\mu_{n_{k}})}=+\infty.

(29)

By compactness assumption 2, there further exists a subsequence $\{\mu_{n_{k_{i}}}\}_{i\in\mathbb{N}}$ such that $\mu_{n_{k_{i}}}$ converges (in Wasserstein metric) to some $\mu^{**}\in\mathcal{P}_{2}(X)$ . On the other hand, we have the following inequality: for all $x,y\in X,$

\displaystyle\|x\|^{2}\leq 2\|x-y\|^{2}+2\|y\|^{2}.

(30)

By using (30) for $(X,Y)$ where $(X,Y)$ is the optimal coupling of $(\mu_{n_{k_{i}}},\mu^{**})$ and taking the expectation of both sides, we get

\displaystyle\mathfrak{m}_{2}(\mu_{n_{k_{i}}})\leq 2W_{2}^{2}(\mu_{n_{k_{i}}},% \mu^{**})+2\mathfrak{m}_{2}(\mu^{**}).

It follows that $\limsup_{i\to\infty}{\mathfrak{m}_{2}(\mu_{n_{k_{i}}})}\leq 2\mathfrak{m}_{2}(% \mu^{**})$ , which contradicts (29). Therefore, $\sup_{n\in\mathbb{N}}{\mathfrak{m}_{2}(\mu_{n})}<+\infty$ . We next show that $\sup_{n\in\mathbb{N}}{\mathfrak{m}_{2}(\nu_{n})}<+\infty$ . Indeed, as $\|S\|^{2}$ has quadratic growth,

\displaystyle\|S(x)\|^{2}\leq c(\|x\|^{2}+1)

for some $c>0$ . So

	$\displaystyle\mathfrak{m}_{2}(\nu_{n+1})$	$\displaystyle=\int_{X}{\|x\|^{2}}d\nu_{n+1}(x)$
		$\displaystyle=\int_{X}{\|x\|^{2}}d(I+\gamma S)_{\#}\mu_{n}(x)$
		$\displaystyle=\int_{X}{\|x+\gamma S(x)\|^{2}}d\mu_{n}(x)$
		$\displaystyle\leq 2\int_{X}{\|x\|^{2}}d\mu_{n}(x)+2\gamma^{2}\int_{X}{\|S(x)\|^{2}}d\mu_{n}(x)$
		$\displaystyle\leq(2+2c\gamma^{2})\mathfrak{m}_{2}(\mu_{n})+2\gamma^{2}c,$

implying that $\sup_{n\in\mathbb{N}}{\mathfrak{m}_{2}(\nu_{n})}<+\infty.$ This in conjunction with $G$ having quadratic growth implies that $\sup_{n\in\mathbb{N}}{|\mathcal{E}_{G}(\nu_{n})|}<+\infty$ . Furthermore, $\inf_{n}{(\mathcal{E}_{G}+\mathscr{H})(\mu_{n})}>-\infty$ otherwise by lower semicontinuity of $\mathscr{H}$ and compactness of $\{\mu_{n}\}_{n\in\mathbb{N}}$ we get a contradiction.

Now, as $\gamma^{-1}(T_{\mu_{n+1}}^{\nu_{n+1}}-I)\in\partial\left(\mathcal{E}_{G}+% \mathscr{H}\right)(\mu_{n+1})$ and $\mathcal{E}_{G}+\mathscr{H}$ is geodesically convex, by applying Thm. 6, it holds

\displaystyle(\mathcal{E}_{G}+\mathscr{H})(\nu_{n+1})\geq(\mathcal{E}_{G}+% \mathscr{H})(\mu_{n+1})+\dfrac{1}{\gamma}\int_{X}{\|T_{\mu_{n+1}}^{\nu_{n+1}}(% x)-x\|^{2}}d\mu_{n+1}(x).

(31)

The finiteness as in (28) then follows from $\sup_{n\in\mathbb{N}}{|\mathcal{E}_{G}(\nu_{n})|}<+\infty,\sup_{n\in\mathbb{N}% }-(\mathcal{E}_{G}+\mathscr{H})(\mu_{n})<+\infty$ as proved and $\sup_{n\in\mathbb{N}}\mathscr{H}(\nu_{n})<+\infty$ as assumed.

We next prove that there is a subsequence of $\{\mu_{n_{k}}\}_{k\in\mathbb{N}}$ such that

\displaystyle T^{\nu_{n_{k_{j}}+1}}_{\mu_{n_{k_{j}}+1}}-I\to T_{\mu^{*}}^{\nu^% {*}}-I\text{ {weakly}.}

(32)

We consider the sequence of optimal plans between $\mu_{n}$ and $\nu_{n}$ as follows

\displaystyle\rho_{{n}}=(I,T_{\mu_{n}}^{\nu_{n}})_{\#}\mu_{{n}},\quad\forall n% \in\mathbb{N}.

We observe that

\displaystyle\mathfrak{m}_{2}(\rho_{n})=\int_{X\times X}{(\|x\|^{2}+\|y\|^{2})% }d\rho_{n}(x,y)=\int_{X}{\|x\|^{2}}d\mu_{n}(x)+\int_{X}{\|y\|^{2}}d\nu_{n}(y)<% +\infty,

so $\rho_{n}\in\mathcal{P}_{2}(X\times X)$ for all $n\in\mathbb{N}$ . Since $\sup_{n\in\mathbb{N}}{\mathfrak{m}_{2}(\nu_{n})}<+\infty$ as proved, and as a Wasserstein ball is relatively compact under narrow topology, $\{\nu_{n}\}_{n\in\mathbb{N}}$ is relatively compact under narrow topology. The same property holds for $\{\mu_{n}\}_{n\in\mathbb{N}}$ . According to Prokhorov [2, Theorem 1.3], $\{\mu_{n}\}_{n\in\mathbb{N}}$ and $\{\nu_{n}\}_{n\in\mathbb{N}}$ are tight. By [2, Remark 1.4], $\{\rho_{n}\}_{n\in\mathbb{N}}$ is also tight, hence relatively compact under narrow topology in $\mathcal{P}_{2}(X\times X)$ . Consequently, $\{\rho_{n_{k}+1}\}_{k\in\mathbb{N}}$ admits a subsequence converging narrowly to some $\rho^{*}\in\mathcal{P}(X\times X)$ . Let’s say

\displaystyle\rho_{n_{k_{i}}+1}\xrightarrow{\operatorname{\text{narrow}}}\rho^% {*}\text{ as }i\to\infty.

(33)

We can see that ${\operatorname{proj}_{1}}_{\#}\rho^{*}=\mu^{*},{\operatorname{proj}_{2}}_{\#}% \rho^{*}=\nu^{*}$ . Indeed, take any continuous and bounded function $f\in C_{b}(X)$ , it holds

	$\displaystyle\int_{X}{f(x)}d\mu^{*}(x)$	$\displaystyle=\lim_{i\to\infty}\int_{X}f(x)d\mu_{n_{k_{i}}+1}(x)$
		$\displaystyle=\lim_{i\to\infty}\int_{X\times X}f(x)d\rho_{n_{k_{i}}+1}(x,y)$
		$\displaystyle=\int_{X\times X}f(x)d\rho^{*}(x,y)$
		$\displaystyle=\int_{X}f(x)d({\operatorname{proj}_{1}}_{\#}\rho^{*})(x).$

Therefore,

\displaystyle\int_{X}{f(x)}d\mu^{*}(x)=\int_{X}f(x)d({\operatorname{proj}_{1}}% _{\#}\rho^{*})(x)\quad\forall f\in C_{b}(X).

It then follows from [11, Thm. 1.2] that ${\operatorname{proj}_{1}}_{\#}\rho^{*}=\mu^{*}$ . Similarly, ${\operatorname{proj}_{2}}_{\#}\rho^{*}=\nu^{*}$ .

We further show that $\rho^{*}\in\mathcal{P}_{2}(X\times X)$ , or equivalently,

\displaystyle\int_{X\times X}{(\|x\|^{2}+\|y\|^{2})}d\rho^{*}(x,y)<+\infty.

(34)

Let $C\in\mathbb{N}$ , thanks to the narrow convergence in (33), we have

\displaystyle\int_{X\times X}{\min\{\|x\|^{2}+\|y\|^{2},C}\}d\rho_{n_{k_{i}}+1% }(x,y)\to\int_{X\times X}{\min\{\|x\|^{2}+\|y\|^{2},C}\}d\rho^{*}(x,y).

Furthermore,

\displaystyle\int_{X\times X}{\min\{\|x\|^{2}+\|y\|^{2},C}\}d\rho_{n_{k_{i}}+1% }(x,y)\leq\sup_{n\in\mathbb{N}}\mathfrak{m}_{2}(\mu_{n})+\sup_{n\in\mathbb{N}}% \mathfrak{m}_{2}(\nu_{n}):=M<+\infty.

Passing to the limit, we get

\displaystyle\int_{X\times X}{\min\{\|x\|^{2}+\|y\|^{2},C}\}d\rho^{*}(x,y)\leq M

for all $C\in\mathbb{N}$ . Sending $C$ to $\infty$ and applying Monotone Convergence Theorem we derive (34).

Back to the main proof, since $\{\rho_{n_{k_{i}}+1}\}_{i\in\mathbb{N}}$ is a sequence of optimal plans, its limit, $\rho^{*}$ is also optimal [2, Proposition 2.5]. Therefore,

\displaystyle\rho^{*}=(I,T_{\mu^{*}}^{\nu^{*}})_{\#}\mu^{*}.

Moreover, as

\displaystyle\mathfrak{m}_{2}(\rho_{n_{k_{i}}+1})=\mathfrak{m}_{2}(\mu_{n_{k_{% i}}+1})+\mathfrak{m}_{2}(\nu_{n_{k_{i}}+1})\to\mathfrak{m}_{2}(\mu^{*})+% \mathfrak{m}_{2}(\nu^{*})=\mathfrak{m}_{2}(\rho^{*}),

we have

\displaystyle\rho_{n_{k_{i}}+1}\xrightarrow{\operatorname{\text{Wass}}}\rho^{*% }\text{ as }i\to\infty.

Now let’s take any test function $\zeta\in C_{c}^{\infty}(X)$ , we show

\displaystyle\lim_{j\to\infty}\int_{X}{\zeta(x)T_{\mu_{n_{k_{j}}+1}}^{\nu_{n_{% k_{j}}+1}}(x)}d\mu_{n_{k_{j}}+1}(x)=\int_{X}{\zeta(x)T_{\mu^{*}}^{\nu^{*}}(x)}% d\mu^{*}(x).

(35)

Indeed, $(x,y)\mapsto\zeta(x)\operatorname{proj}_{i}(y)$ where $\operatorname{proj}_{i}$ is the projection into the $i$ -th coordinate is continuous and has quadratic growth since $\zeta(x)$ is bounded and $\operatorname{proj}_{i}(y)$ is linear.

Since $\rho_{n_{k_{j}}+1}\xrightarrow{\operatorname{\text{Wass}}}\rho^{*}$ , it holds: for each $i\in[d]$ ,

	$\displaystyle\lim_{j\to\infty}\int_{X}{\zeta(x)\operatorname{proj}_{i}\left(T_{\mu_{n_{k_{j}}+1}}^{\nu_{n_{k_{j}}+1}}(x)\right)}d\mu_{n_{k_{j}}+1}(x)$
	$\displaystyle=\lim_{j\to\infty}\int_{X\times X}{\zeta(x)\operatorname{proj}_{i}(y)}d(I,T_{\mu_{n_{k_{j}}+1}}^{\nu_{n_{k_{j}}+1}})_{\#}\mu_{n_{k_{j}}+1}(x,y)$
	$\displaystyle=\lim_{j\to\infty}\int_{X\times X}{\zeta(x)\operatorname{proj}_{i}(y)}d\rho_{n_{k_{j}}+1}(x,y)$
	$\displaystyle=\int_{X\times X}{\zeta(x)\operatorname{proj}_{i}(y)}d\rho^{*}(x,y)$
	$\displaystyle=\int_{X\times X}{\zeta(x)\operatorname{proj}_{i}(y)}d(I,T_{\mu^{*}}^{\nu^{*}})_{\#}\mu^{*}(x,y)$
	$\displaystyle=\int_{X}{\zeta(x)\operatorname{proj}_{i}(T_{\mu^{*}}^{\nu^{*}}(x))}d\mu^{*}(x),$

so (35) holds. Consequently, (32) also holds by noticing that

\displaystyle\int_{X}x\zeta(x)d\mu_{n_{k_{j}}+1}(x)\to\int_{X}{x\zeta(x)}d\mu^% {*}(x)\text{ as }j\to\infty.

By the closedness of subdifferential graph of $\partial(\mathcal{E}_{G}+\mathscr{H})$ (Thm. 8), we obtain

\displaystyle\dfrac{T_{\mu^{*}}^{\nu^{*}}-I}{\gamma}\in\partial(\mathcal{E}_{G% }+\mathscr{H})(\mu^{*}).

Therefore, $S\in\partial(\mathcal{E}_{G}+\mathscr{H})(\mu^{*})\cap\partial\mathcal{E}_{H}(% \mu^{*})$ , or $\mu^{*}$ is a critical point of $\mathcal{F}.$

A.5 Proof of Theorem 2

We see that

	$\displaystyle\|\mathcal{G}_{\gamma}(\mu_{n})\|_{L^{2}(X,X;\mu_{n})}^{2}$	$\displaystyle=\dfrac{1}{\gamma^{2}}\int_{X}{\left\|x-T_{\mu_{n}}^{\operatorname{JKO}_{\gamma(\mathcal{E}_{G}+\mathscr{H})}((I+\gamma S)_{\#}\mu_{n})}(x)\right\|^{2}}d\mu_{n}(x)$
		$\displaystyle=\dfrac{1}{\gamma^{2}}W_{2}^{2}(\mu_{n},\operatorname{JKO}_{\gamma(\mathcal{E}_{G}+\mathscr{H})}((I+\gamma S)_{\#}\mu_{n}))$
		$\displaystyle=\dfrac{1}{\gamma^{2}}W_{2}^{2}(\mu_{n},\mu_{n+1}).$

On the other hand, it follows from (23) of the proof of Thm. 1 that

\displaystyle\min_{n=\overline{1,N}}W_{2}^{2}(\mu_{n},\mu_{n+1})=O(N^{-1}).
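For completeness, the rate can be written with an explicit constant. The following is a sketch that uses only (22) and the lower boundedness of $\mathcal{F}$ (Assumption 1); the symbol $\inf\mathcal{F}$ for the lower bound is our notation.

```latex
\min_{1\le n\le N} W_2^2(\mu_n,\mu_{n+1})
\le \frac{1}{N}\sum_{n=1}^{N} W_2^2(\mu_n,\mu_{n+1})
\le \frac{\gamma}{N}\sum_{n=1}^{N}\bigl(\mathcal{F}(\mu_n)-\mathcal{F}(\mu_{n+1})\bigr)
\le \frac{\gamma\,\bigl(\mathcal{F}(\mu_0)-\inf\mathcal{F}\bigr)}{N}.
```

Dividing by $\gamma^{2}$ then bounds the smallest squared norm of $\mathcal{G}_{\gamma}(\mu_{n})$ over $n=1,\dots,N$ by $(\mathcal{F}(\mu_{0})-\inf\mathcal{F})/(\gamma N)$ .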

A.6 Proof of Theorem 3

Since $H$ has a uniformly bounded Hessian, by [36, Prop. 2.12], $\mathcal{E}_{H}$ is Wasserstein differentiable and $\nabla_{W}\mathcal{E}_{H}(\mu)=\nabla H$ for all $\mu\in\mathcal{P}_{2}(X).$ According to Lem. 8 and (16),

\displaystyle\partial_{F}^{-}\mathcal{F}(\mu_{n+1})=\partial(\mathcal{E}_{G}+% \mathscr{H})(\mu_{n+1})-\nabla H\ni\dfrac{T_{\mu_{n+1}}^{\nu_{n+1}}-I}{\gamma}% -\nabla H.

(36)

We then have the following evaluations:

$\displaystyle\operatorname{dist}{(0,\partial_{F}^{-}\mathcal{F}(\mu_{n+1}))}$	$\displaystyle=\inf_{\xi\in\partial^{-}_{F}\mathcal{F}(\mu_{n+1})}\|\xi\|_{L^{2}(X,X,\mu_{n+1})}$
	$\displaystyle\leq\left\|\dfrac{T_{\mu_{n+1}}^{\nu_{n+1}}-I}{\gamma}-\nabla H\right\|_{L^{2}(X,X,\mu_{n+1})}$
	$\displaystyle=\left(\int_{X}{\left\|\dfrac{T_{\mu_{n+1}}^{\nu_{n+1}}(x)-x}{\gamma}-\nabla H(x)\right\|^{2}}d\mu_{n+1}(x)\right)^{\frac{1}{2}}$
	$\displaystyle=\dfrac{1}{\gamma}\left(\int_{X}\|T_{\mu_{n+1}}^{\nu_{n+1}}(x)-x-\gamma\nabla H(x)\|^{2}d\mu_{n+1}(x)\right)^{\frac{1}{2}}.$	(37)

By the transfer lemma (Lem. 2),

		$\displaystyle\int_{X}{\left\|T_{\mu_{n+1}}^{\nu_{n+1}}(x)-x-\gamma\nabla H(x)\right\|^{2}}d\mu_{n+1}(x)$
		$\displaystyle=\int_{X}{\left\|T_{\mu_{n+1}}^{\nu_{n+1}}(x)-x-\gamma\nabla H(x)\right\|^{2}}d{T_{\nu_{n+1}}^{\mu_{n+1}}}_{\#}\nu_{n+1}(x)$
		$\displaystyle=\int_{X}{\left\|T_{\mu_{n+1}}^{\nu_{n+1}}\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)-(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|^{2}}d\nu_{n+1}(x)$
		$\displaystyle=\int_{X}{\left\|x-(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|^{2}}d\nu_{n+1}(x).$		(38)

On the other hand, by using the trivial identity

\displaystyle\nabla H=\dfrac{(I+\gamma\nabla H)-I}{\gamma}

we compute,

		$\displaystyle\int_{X}{\|\nabla H(T_{\nu_{n+1}}^{\mu_{n}}(x))-\nabla H(T_{\nu_{n+1}}^{\mu_{n+1}}(x))\|^{2}}d\nu_{n+1}(x)$
		$\displaystyle=\int_{X}{\left\|\dfrac{(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n}}(x)}{\gamma}-\dfrac{(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)}{\gamma}\right\|^{2}}d\nu_{n+1}(x)$
		$\displaystyle=\dfrac{1}{\gamma^{2}}\int_{X}{\left\|x-T_{\nu_{n+1}}^{\mu_{n}}(x)-{(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)+T_{\nu_{n+1}}^{\mu_{n+1}}(x)}\right\|^{2}}d\nu_{n+1}(x),$		(39)

where the last equality uses $(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n}}=I$ $\nu_{n+1}$ -a.e.

Since the Hessian of $H$ is uniformly bounded, $\nabla H$ is Lipschitz, say $\|\nabla H(x)-\nabla H(y)\|\leq L_{H}\|x-y\|$ for all $x,y\in X.$ We continue evaluating (38) as follows

		$\displaystyle\int_{X}{\|x-(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x)$
		$\displaystyle\leq\int_{X}{\left(\|x-T_{\nu_{n+1}}^{\mu_{n}}(x)-{(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)+T_{\nu_{n+1}}^{\mu_{n+1}}(x)}\|+\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|\right)^{2}}d\nu_{n+1}(x)$
		$\displaystyle\leq 2\int_{X}{\left\|x-T_{\nu_{n+1}}^{\mu_{n}}(x)-{(I+\gamma\nabla H)\circ T_{\nu_{n+1}}^{\mu_{n+1}}(x)+T_{\nu_{n+1}}^{\mu_{n+1}}(x)}\right\|^{2}}d\nu_{n+1}(x)$
		$\displaystyle\quad+2\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x)$
		$\displaystyle=2\gamma^{2}\int_{X}{\|\nabla H(T_{\nu_{n+1}}^{\mu_{n}}(x))-\nabla H(T_{\nu_{n+1}}^{\mu_{n+1}}(x))\|^{2}}d\nu_{n+1}(x)+2\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x)$
		$\displaystyle\leq 2(\gamma^{2}L_{H}^{2}+1)\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x)$		(40)

where the equality uses (39) and the last inequality uses the Lipschitz continuity of $\nabla H$ .

From (37), (38), and (40), we derive

\displaystyle\operatorname{dist}{(0,\partial_{F}^{-}\mathcal{F}(\mu_{n+1}))}% \leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{\gamma}\left(\int_{X}{\|T_{\nu_{n+% 1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x)\right)^{% \frac{1}{2}}.

(41)

On the other hand, by telescoping the inequality in Lem. 1, we obtain

\displaystyle\sum_{n=0}^{\infty}{\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x)}<+\infty.

Therefore,

	$\displaystyle\sum_{n=0}^{N-1}{\operatorname{dist}{(0,\partial_{F}^{-}\mathcal{F}(\mu_{n+1}))}}$
	$\displaystyle\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{\gamma}\sum_{n=0}^{N-1}{\left(\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x)\right)^{\frac{1}{2}}}$
	$\displaystyle\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{\gamma}\left(N\left(\sum_{n=0}^{N-1}{\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x)}\right)\right)^{\frac{1}{2}}$
	$\displaystyle\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)N}}{\gamma}\left(\sum_{n=0}^{+\infty}{\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x)}\right)^{\frac{1}{2}}.$

We derive

\displaystyle\min_{n=\overline{1,N}}{\operatorname{dist}{(0,\partial_{F}^{-}\mathcal{F}(\mu_{n}))}}=O\left(\dfrac{1}{\sqrt{N}}\right).
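The constant hidden in this rate can be made explicit. The following is a sketch combining the chain of inequalities above with the telescoped form of Lem. 1, where $\inf\mathcal{F}$ is our notation for the lower bound of $\mathcal{F}$ (Assumption 1): the minimum is bounded by the average, and the series is bounded by $\gamma(\mathcal{F}(\mu_{0})-\inf\mathcal{F})$ .

```latex
\min_{1\le n\le N}\operatorname{dist}\bigl(0,\partial_F^{-}\mathcal{F}(\mu_n)\bigr)
\le \frac{1}{N}\sum_{n=0}^{N-1}\operatorname{dist}\bigl(0,\partial_F^{-}\mathcal{F}(\mu_{n+1})\bigr)
\le \sqrt{\frac{2(\gamma^{2}L_H^{2}+1)\,\bigl(\mathcal{F}(\mu_0)-\inf\mathcal{F}\bigr)}{\gamma N}}.
```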

A.7 Proof of Theorem 4

Convergence in terms of objective values

Since $H\in C^{2}(X)$ has a uniformly bounded Hessian, we recall from (36) that

\displaystyle\partial_{F}^{-}\mathcal{F}(\mu_{n+1})=\partial(\mathcal{E}_{G}+% \mathscr{H})(\mu_{n+1})-\nabla H\ni\dfrac{T_{\mu_{n+1}}^{\nu_{n+1}}-I}{\gamma}% -\nabla H.

Since $\mathcal{F}(\mu_{0})-\mathcal{F}^{*}<r_{0}$ and the sequence $\{\mathcal{F}(\mu_{n})\}_{n\in\mathbb{N}}$ is nonincreasing (Lem. 1), $\mathcal{F}(\mu_{n})-\mathcal{F}^{*}<r_{0}$ for all $n\in\mathbb{N}$ . The Łojasiewicz condition then implies

$\displaystyle c(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*})^{\theta}$	$\displaystyle\leq\left\|\dfrac{T_{\mu_{n+1}}^{\nu_{n+1}}-I}{\gamma}-\nabla H\right\|_{L^{2}(X,X,\mu_{n+1})}$
	$\displaystyle=\left(\int_{X}{\left\|\dfrac{T_{\mu_{n+1}}^{\nu_{n+1}}(x)-x}{\gamma}-\nabla H(x)\right\|^{2}}d\mu_{n+1}(x)\right)^{\frac{1}{2}}$
	$\displaystyle\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{\gamma}\left(\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x)\right)^{\frac{1}{2}}$	(42)

where the last inequality follows from (38) and (40) and $L_{H}$ is the Lipschitz constant of $\nabla H$ . Combining with Lem. 1, we derive

\displaystyle c\left(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*}\right)^{\theta}% \leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{\sqrt{\gamma}}\left(\mathcal{F}(% \mu_{n})-\mathcal{F}(\mu_{n+1})\right)^{\frac{1}{2}}

\displaystyle\left(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*}\right)^{2\theta}\leq% \dfrac{{2(\gamma^{2}L_{H}^{2}+1)}}{c^{2}{\gamma}}\left((\mathcal{F}(\mu_{n})-% \mathcal{F}^{*})-(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*})\right).

(43)

We then use the following lemma [63, Lem. 4].

Lemma 11.

Let $\{s_{k}\}_{k\in\mathbb{N}}$ be a nonincreasing and nonnegative real sequence. Assume that there exist $\alpha\geq 0$ and $\beta>0$ such that for all sufficiently large $k$ ,

\displaystyle s_{k+1}^{\alpha}\leq\beta(s_{k}-s_{k+1}).

(44)

Then

(i) if $\alpha=0$, the sequence $\{s_{k}\}_{k\in\mathbb{N}}$ converges to $0$ in a finite number of steps;

(ii) if $\alpha\in(0,1]$, the sequence $\{s_{k}\}_{k\in\mathbb{N}}$ converges linearly to $0$ with rate $\frac{\beta}{\beta+1}$;

(iii) if $\alpha>1$, the sequence $\{s_{k}\}_{k\in\mathbb{N}}$ converges sublinearly to $0$, i.e., there exists $\eta>0$ such that

\displaystyle s_{k}\leq\eta k^{\frac{-1}{\alpha-1}}

for sufficiently large $k$.

Compared to [63, Lem. 4], we have dropped the assumption $s_{k}\to 0$ in Lem. 11 because it is redundant: it can be deduced from (44) and the nonnegativity of $\{s_{k}\}_{k\in\mathbb{N}}$.

We now apply Lem. 11 to $s_{k}=\mathcal{F}(\mu_{k})-\mathcal{F}^{*}$, with $\alpha=2\theta$ and $\beta=\frac{2(\gamma^{2}L_{H}^{2}+1)}{c^{2}\gamma}$ as given by (43), and obtain the following:

(i) if $\theta=0$, $\mathcal{F}(\mu_{n})-\mathcal{F}^{*}$ converges to $0$ in a finite number of steps;

(ii) if $\theta\in(0,1/2]$, $\mathcal{F}(\mu_{n})-\mathcal{F}^{*}$ converges to $0$ linearly (exponentially fast) with rate

\displaystyle\mathcal{F}(\mu_{n})-\mathcal{F}^{*}=O\left(\left(\dfrac{M}{M+1}\right)^{n}\right)\text{ where }M=\dfrac{{2(\gamma^{2}L_{H}^{2}+1)}}{c^{2}{\gamma}};

(iii) if $\theta\in(1/2,1)$, $\mathcal{F}(\mu_{n})-\mathcal{F}^{*}$ converges sublinearly to $0$, i.e.,

\displaystyle\mathcal{F}(\mu_{n})-\mathcal{F}^{*}=O\left(n^{-\frac{1}{2\theta-1}}\right).
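The two nontrivial regimes of Lem. 11 can be checked numerically on a worst-case sequence, i.e., one that satisfies (44) with equality at every step. The following is only a minimal sketch: the helper names `next_term` and `worst_case_sequence`, the bisection solver, and the values of $s_{0}$, $\alpha$, and $\beta$ are illustrative assumptions and are not part of the proof.

```python
def next_term(s, alpha, beta, iters=200):
    """Largest t in [0, s] with t**alpha <= beta * (s - t), found by bisection.

    For alpha > 0 the map t -> t**alpha - beta * (s - t) is increasing,
    so the feasible set is an interval [0, root].
    """
    lo, hi = 0.0, s
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid ** alpha <= beta * (s - mid):
            lo = mid
        else:
            hi = mid
    return lo


def worst_case_sequence(s0, alpha, beta, n):
    """Sequence s_0, ..., s_n satisfying (44) with equality at every step."""
    seq = [s0]
    for _ in range(n):
        seq.append(next_term(seq[-1], alpha, beta))
    return seq


# Linear regime (alpha = 1): s_{k+1} = beta / (beta + 1) * s_k.
lin = worst_case_sequence(1.0, alpha=1.0, beta=3.0, n=10)
# Sublinear regime (alpha = 2): s_k decays like k^{-1/(alpha-1)} = 1/k.
sub = worst_case_sequence(1.0, alpha=2.0, beta=1.0, n=1000)

print(lin[10], 0.75 ** 10)   # the two numbers should agree closely
print(1000 * sub[1000])      # stays bounded, consistent with the 1/k rate
```

In the setting of Thm. 4, $\alpha=2\theta$, so the first call corresponds to $\theta=1/2$ and the second to $\theta=1$, which lies on the boundary of the admissible range and is used here only to illustrate item (iii) of the lemma.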

A.8 Proof of Theorem 5

Cauchy sequence.

Replacing $n$ by $n-1$ in (42) and rearranging, we obtain

\displaystyle 1\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{c\gamma}\left(\int_% {X}{\|T_{\nu_{n}}^{\mu_{n-1}}(x)-T_{\nu_{n}}^{\mu_{n}}(x)\|^{2}}d\nu_{n}(x)% \right)^{\frac{1}{2}}\left(\mathcal{F}(\mu_{n})-\mathcal{F}^{*}\right)^{-% \theta}.

(45)

It follows from Lem. 1 and (45) that

		$\displaystyle\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x)\leq\gamma(\mathcal{F}(\mu_{n})-\mathcal{F}(\mu_{n+1}))$
		$\displaystyle\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{c}\left(\int_{X}{\|T_{\nu_{n}}^{\mu_{n-1}}(x)-T_{\nu_{n}}^{\mu_{n}}(x)\|^{2}}d\nu_{n}(x)\right)^{\frac{1}{2}}\left(\mathcal{F}(\mu_{n})-\mathcal{F}^{*}\right)^{-\theta}(\mathcal{F}(\mu_{n})-\mathcal{F}(\mu_{n+1})).$		(46)

Since the function $s:\mathbb{R}^{+}\to\mathbb{R}$, $s(t)=t^{1-\theta}$, is concave for $\theta\in[0,1)$, the tangent inequality holds:

\displaystyle s^{\prime}(a)(a-b)\leq s(a)-s(b).

Since $s^{\prime}(t)=(1-\theta)t^{-\theta}$, taking $a=\mathcal{F}(\mu_{n})-\mathcal{F}^{*}$ and $b=\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*}$ in the above inequality gives

\displaystyle(1-\theta)\left(\mathcal{F}(\mu_{n})-\mathcal{F}^{*}\right)^{-% \theta}(\mathcal{F}(\mu_{n})-\mathcal{F}(\mu_{n+1}))\leq(\mathcal{F}(\mu_{n})-% \mathcal{F}^{*})^{1-\theta}-(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*})^{1-\theta}.

(47)

From (46) and (47),

	$\displaystyle\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x)\leq$	$\displaystyle\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{(1-\theta)c}\left(\int_{X}{\|T_{\nu_{n}}^{\mu_{n-1}}(x)-T_{\nu_{n}}^{\mu_{n}}(x)\|^{2}}d\nu_{n}(x)\right)^{\frac{1}{2}}$
		$\displaystyle\times\left[(\mathcal{F}(\mu_{n})-\mathcal{F}^{*})^{1-\theta}-(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*})^{1-\theta}\right]$

or equivalently,

	$\displaystyle\dfrac{r_{n}}{\sqrt{r_{n-1}}}$	$\displaystyle:=\dfrac{\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x)}{\left(\int_{X}{\|T_{\nu_{n}}^{\mu_{n-1}}(x)-T_{\nu_{n}}^{\mu_{n}}(x)\|^{2}}d\nu_{n}(x)\right)^{\frac{1}{2}}}$
		$\displaystyle\leq\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{(1-\theta)c}\left[(\mathcal{F}(\mu_{n})-\mathcal{F}^{*})^{1-\theta}-(\mathcal{F}(\mu_{n+1})-\mathcal{F}^{*})^{1-\theta}\right]$		(48)

where $r_{n}:=\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2% }}d\nu_{n+1}(x).$

Telescoping (48) from $n=1$ to $+\infty$, we obtain

\displaystyle\sum_{n=1}^{+\infty}{\dfrac{r_{n}}{\sqrt{r_{n-1}}}}<+\infty.

On the other hand, by the AM-GM inequality,

\displaystyle\dfrac{r_{n}}{\sqrt{r_{n-1}}}+\sqrt{r_{n-1}}\geq 2\sqrt{r_{n}},

(49)

so summing over $n$ yields $\sum_{n=0}^{+\infty}{\sqrt{r_{n}}}<+\infty$. From (A.4), (21) of the proof of Thm. 1, we have $r_{n}\geq W_{2}^{2}(\mu_{n},\mu_{n+1})$, and therefore

\displaystyle\sum_{n=0}^{+\infty}{W_{2}(\mu_{n},\mu_{n+1})}<+\infty.

In particular, $\{\mu_{n}\}_{n\in\mathbb{N}}$ is a Cauchy sequence with respect to the Wasserstein metric. Since the Wasserstein space $(\mathcal{P}_{2}(X),W_{2})$ is complete [4, Thm. 2.2], every Cauchy sequence converges, i.e., there exists $\mu^{*}\in\mathcal{P}_{2}(X)$ such that $\mu_{n}\xrightarrow{\operatorname{\text{Wass}}}\mu^{*}.$

We now prove that $\mu^{*}$ is a minimizer of $\mathcal{F}$ by showing $\mathcal{F}(\mu^{*})=\mathcal{F}^{*}$. First, since $G$ and $H$ have quadratic growth, convergence in the Wasserstein metric gives

\displaystyle\mathcal{E}_{G}(\mu_{n})\to\mathcal{E}_{G}(\mu^{*}),\mathcal{E}_{% H}(\mu_{n})\to\mathcal{E}_{H}(\mu^{*}).

On the other hand,

	$\displaystyle\mathcal{F}^{*}=\lim_{n\to\infty}{\mathcal{F}(\mu_{n})}$	$\displaystyle=\liminf_{n\to\infty}\mathcal{F}(\mu_{n})=\liminf_{n\to\infty}{\mathscr{H}}(\mu_{n})+\mathcal{E}_{G}(\mu^{*})-\mathcal{E}_{H}(\mu^{*})$
		$\displaystyle\geq{\mathscr{H}}(\mu^{*})+\mathcal{E}_{G}(\mu^{*})-\mathcal{E}_{H}(\mu^{*})=\mathcal{F}(\mu^{*})$

since $\mathscr{H}$ is l.s.c. Equality must hold, i.e., $\mathcal{F}^{*}=\mathcal{F}(\mu^{*})$, because $\mathcal{F}^{*}$ is the optimal value of $\mathcal{F}$.

Convergence rate of $\{\mu_{n}\}_{n\in\mathbb{N}}$ .

(i) If $\theta=0$

From item (i) of Thm. 4, there exists $n_{0}\in\mathbb{N}$ such that $\mathcal{F}(\mu_{n})=\mathcal{F}^{*}$ for all $n\geq n_{0}$ . It then follows from (22) that $\mu_{n_{0}}=\mu_{n_{0}+1}=\mu_{n_{0}+2}=\ldots$ , which further implies that $\mu_{n}=\mu^{*}$ for all $n\geq n_{0}$ .

(ii) If $\theta\in(0,1/2]$

Let $s_{i}=\sum_{n=i}^{\infty}\sqrt{r_{n}}$ . We have

\displaystyle s_{i}\geq\sum_{n=i}^{\infty}{W_{2}(\mu_{n},\mu_{n+1})}\geq W_{2}% (\mu_{i},\mu^{*})

(50)

where the last inequality follows from the triangle inequality

\displaystyle W_{2}(\mu_{i},\mu^{*})\leq\sum_{n=i}^{N-1}{W_{2}(\mu_{n},\mu_{n+1})}+W_{2}(\mu_{N},\mu^{*})

by letting $N\to\infty$ and noting that $\mu_{N}\xrightarrow{\operatorname{\text{Wass}}}\mu^{*}$.

From (48) and (49),

\displaystyle 2\sqrt{r_{n}}\leq\sqrt{r_{n-1}}+\dfrac{\sqrt{2(\gamma^{2}L_{H}^{% 2}+1)}}{(1-\theta)c}\left[(\mathcal{F}(\mu_{n})-\mathcal{F}^{*})^{1-\theta}-(% \mathcal{F}(\mu_{n+1})-\mathcal{F}^{*})^{1-\theta}\right].

(51)

Summing (51) from $n=i$ to $+\infty$ and telescoping, we get

\displaystyle s_{i}\leq\sqrt{r_{i-1}}+\dfrac{\sqrt{2(\gamma^{2}L_{H}^{2}+1)}}{% (1-\theta)c}(\mathcal{F}(\mu_{i})-\mathcal{F}^{*})^{1-\theta}\leq\sqrt{r_{i-1}% }+\dfrac{(2(\gamma^{2}L_{H}^{2}+1))^{\frac{1}{2\theta}}}{(1-\theta)\gamma^{% \frac{1-\theta}{\theta}}c^{\frac{1}{\theta}}}r_{i-1}^{\frac{1-\theta}{2\theta}}.

(52)

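where the last inequality uses (42). Since $r_{i}\to 0$ as $i\to\infty$, $r_{i}<1$ for $i$ sufficiently large. Moreover, $\theta\in(0,1/2]$ gives $\frac{1-\theta}{2\theta}\geq\frac{1}{2}$, so $r_{i-1}^{\frac{1-\theta}{2\theta}}\leq\sqrt{r_{i-1}}$ whenever $r_{i-1}<1$. It follows from (52) that, for $i$ sufficiently large,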

\displaystyle s_{i}\leq M\sqrt{r_{i-1}}=M(s_{i-1}-s_{i})

where

\displaystyle M=1+\dfrac{(2(\gamma^{2}L_{H}^{2}+1))^{\frac{1}{2\theta}}}{(1-% \theta)\gamma^{\frac{1-\theta}{\theta}}c^{\frac{1}{\theta}}}.

(53)

Rewriting as $s_{i}\leq\frac{M}{M+1}s_{i-1}$ , we derive $W_{2}(\mu_{i},\mu^{*})=O\left(\left(\frac{M}{M+1}\right)^{i}\right).$

(iii) If $\theta\in(1/2,1)$

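Here $\frac{1-\theta}{2\theta}<\frac{1}{2}$, so $\sqrt{r_{i-1}}\leq r_{i-1}^{\frac{1-\theta}{2\theta}}$ whenever $r_{i-1}<1$. Hence (52) implies that, for all $i$ sufficiently large,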

\displaystyle s_{i}\leq Mr_{i-1}^{\frac{1-\theta}{2\theta}}=M(s_{i-1}-s_{i})^{% \frac{1-\theta}{\theta}}

where $M$ is the same as in (53).

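Raising both sides to the power $\frac{\theta}{1-\theta}>1$ and applying Lem. 11(iii) with $\alpha=\frac{\theta}{1-\theta}$, we get $s_{i}=O\left(i^{-\frac{1-\theta}{2\theta-1}}\right)$, which implies (by (50)) $W_{2}(\mu_{i},\mu^{*})=O\left(i^{-\frac{1-\theta}{2\theta-1}}\right)$.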

Appendix B Implementation details

We present implementations for FB Euler and semi FB Euler when $\mathscr{H}$ is the negative entropy. The main recipe of these implementations is the deep learning approach to approximate the JKO operator presented in [45] (MIT license).

B.1 FB Euler

The push forward step $\nu_{n+1}=(I-\gamma\nabla F)_{\#}\mu_{n}$ is rather straightforward: if $Z$ are samples from $\mu_{n}$ then $Z-\gamma\nabla F(Z)$ are samples from $\nu_{n+1}$ .
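As a minimal sketch of this forward step on samples, the snippet below pushes a batch through $I-\gamma\nabla F$. The potential $F(x)=\|x\|^{2}/2$ and the helper names `grad_F` and `forward_step` are assumptions chosen only for illustration; any other gradient function could be passed in.

```python
import random


def grad_F(x):
    """Gradient of the illustrative potential F(x) = ||x||^2 / 2 (an assumption)."""
    return list(x)


def forward_step(samples, grad, gamma):
    """Map each sample z ~ mu_n to z - gamma * grad(z), a sample from nu_{n+1}."""
    return [[zi - gamma * gi for zi, gi in zip(z, grad(z))] for z in samples]


random.seed(0)
# Samples from N(0, 16 I) in R^2, i.e., standard deviation 4 per coordinate.
Z = [[random.gauss(0.0, 4.0), random.gauss(0.0, 4.0)] for _ in range(1000)]
Xi = forward_step(Z, grad_F, gamma=0.1)
```

For this particular quadratic potential the map reduces to a scaling by $1-\gamma$, which makes the output easy to verify by hand.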

On the other hand, moving from $\nu_{n+1}$ to $\mu_{n+1}$ requires evaluating the JKO operator. The idea is as follows [45]. We wish to compute the optimal Monge map pushing $\nu_{n+1}$ to $\mu_{n+1}$. By Brenier's theorem, this map is the (sub)gradient field of some convex function. Therefore, one can "parametrize" $\mu\in\mathcal{P}_{2,\operatorname{abs}}(X)$ as $\mu=\nabla\psi_{\#}\nu_{n+1}$ for some convex function $\psi$. The objective function of the JKO operator can then be written as

	$\displaystyle\mathscr{H}(\mu)+\dfrac{1}{2\gamma}W_{2}^{2}(\mu,\nu_{n+1})$	$\displaystyle=\mathscr{H}(\mu)+\dfrac{1}{2\gamma}\int_{X}{\|x-T_{\nu_{n+1}}^{\mu}(x)\|^{2}}d\nu_{n+1}(x)$
		$\displaystyle=\mathscr{H}(\nabla\psi_{\#}\nu_{n+1})+\dfrac{1}{2\gamma}\int_{X}{\|x-\nabla\psi(x)\|^{2}}d\nu_{n+1}(x).$		(54)

We next use the following result: for any diffeomorphism $T:X\to X$ and any $\rho\in\mathcal{P}_{2,\operatorname{abs}}(X)$,

\displaystyle-\mathscr{H}(T_{\#}\rho)=-\mathscr{H}(\rho)+\int_{X}{\log|\det% \nabla T(x)|}d\rho(x),

so (54) can be written as (up to a constant that does not depend on $\psi$)

\displaystyle-\int_{X}{\log\det\nabla^{2}\psi(x)}d\nu_{n+1}(x)+\dfrac{1}{2% \gamma}\int_{X}{\|x-\nabla\psi(x)\|^{2}}d\nu_{n+1}(x).

We can now leverage input convex neural networks (ICNNs) [5] $\psi_{\theta}(x)$ (where $\theta$ denotes the network parameters and $x$ the input), for which $x\mapsto\psi_{\theta}(x)$ is convex. The JKO operator boils down to the following optimization problem

\displaystyle\min_{\theta}-\int_{X}{\log\det\nabla^{2}_{x}\psi_{\theta}(x)}d% \nu_{n+1}(x)+\dfrac{1}{2\gamma}\int_{X}{\|x-\nabla_{x}\psi_{\theta}(x)\|^{2}}d% \nu_{n+1}(x).

This problem can be solved effectively by standard deep learning optimizers like Adam or stochastic gradient descent as samples from $\nu_{n+1}$ can be drawn in a recursive manner. A detailed implementation of FB Euler is in Alg. 1.
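To make the structure of this objective concrete, the sketch below evaluates its Monte Carlo estimate when the ICNN is replaced by a much simpler convex potential, $\psi_{a}(x)=\frac{1}{2}\sum_{j}a_{j}x_{j}^{2}$ with $a_{j}>0$. This substitution, the function name `jko_loss`, and the sample values are assumptions made for illustration only; they are not the parametrization used in [45].

```python
import math


def jko_loss(a, samples, gamma):
    """Monte Carlo JKO objective for psi(x) = 0.5 * sum_j a[j] * x[j]**2, a[j] > 0."""
    # The Hessian of psi is diag(a), so -log det is the same for every sample.
    neg_log_det = -sum(math.log(aj) for aj in a)
    # grad psi(x) = a * x elementwise; average the squared displacement over the batch.
    w2 = sum(
        sum((xj - aj * xj) ** 2 for xj, aj in zip(x, a)) for x in samples
    ) / len(samples)
    return neg_log_det + w2 / (2.0 * gamma)


batch = [[1.0, 2.0], [3.0, -1.0]]          # hypothetical samples from nu_{n+1}
print(jko_loss([1.0, 1.0], batch, 0.1))    # identity map: both terms vanish
print(jko_loss([2.0, 2.0], batch, 0.1))    # expansion: entropy term drops, cost term grows
```

The two terms pull in opposite directions: enlarging $a_{j}$ lowers the $-\log\det$ term but increases the transport cost, and the stepsize $\gamma$ controls the balance between them.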

Algorithm 1 FB Euler for sampling

Input: Initial measure $\mu_{0}\in\mathcal{P}_{2,\operatorname{abs}}(X)$, discretization stepsize $\gamma>0$, number of steps $K>0$, batch size $B$.

for $k=1$ to $K$ do
  for $i=1,2,\ldots$ do
    Draw a batch of samples $Z\sim\mu_{0}$ of size $B$;
    $\Xi\leftarrow(I-\gamma\nabla F)\circ\nabla_{x}\psi_{\theta_{k}}\circ(I-\gamma\nabla F)\circ\nabla_{x}\psi_{\theta_{k-1}}\circ\ldots\circ(I-\gamma\nabla F)(Z)$;
    $\widehat{W_{2}^{2}}\leftarrow\frac{1}{B}\sum_{\xi\in\Xi}\|\nabla_{x}\psi_{\theta}(\xi)-\xi\|^{2}$;
    $\widehat{\Delta\mathscr{H}}\leftarrow-\frac{1}{B}\sum_{\xi\in\Xi}{\log\det\nabla_{x}^{2}\psi_{\theta}(\xi)}$;
    $\widehat{\mathcal{L}}\leftarrow\frac{1}{2\gamma}\widehat{W_{2}^{2}}+\widehat{\Delta\mathscr{H}}$;
    Apply an optimization step (e.g., Adam) over $\theta$ using $\nabla_{\theta}\widehat{\mathcal{L}}$
  end for
  $\theta_{k+1}\leftarrow\theta$.
end for

B.2 Semi FB Euler

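The implementation of semi FB Euler follows the same idea as that of FB Euler. The differences are that the forward step uses $I+\gamma\nabla H$ and that the JKO objective additionally contains the potential energy $\mathcal{E}_{G}$. The details are given in Alg. 2.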

Algorithm 2 Semi FB Euler for sampling

Input: Initial measure $\mu_{0}\in\mathcal{P}_{2,\operatorname{abs}}(X)$, discretization stepsize $\gamma>0$, number of steps $K>0$, batch size $B$.

for $k=1$ to $K$ do
  for $i=1,2,\ldots$ do
    Draw a batch of samples $Z\sim\mu_{0}$ of size $B$;
    $\Xi\leftarrow(I+\gamma\nabla H)\circ\nabla_{x}\psi_{\theta_{k}}\circ(I+\gamma\nabla H)\circ\nabla_{x}\psi_{\theta_{k-1}}\circ\ldots\circ(I+\gamma\nabla H)(Z)$;
    $\widehat{W_{2}^{2}}\leftarrow\frac{1}{B}\sum_{\xi\in\Xi}\|\nabla_{x}\psi_{\theta}(\xi)-\xi\|^{2}$;
    $\widehat{\mathcal{U}}\leftarrow\frac{1}{B}\sum_{\xi\in\Xi}{G(\nabla_{x}\psi_{\theta}(\xi))}$;
    $\widehat{\Delta\mathscr{H}}\leftarrow-\frac{1}{B}\sum_{\xi\in\Xi}{\log\det\nabla_{x}^{2}\psi_{\theta}(\xi)}$;
    $\widehat{\mathcal{L}}\leftarrow\frac{1}{2\gamma}\widehat{W_{2}^{2}}+\widehat{\mathcal{U}}+\widehat{\Delta\mathscr{H}}$;
    Apply an optimization step (e.g., Adam) over $\theta$ using $\nabla_{\theta}\widehat{\mathcal{L}}$
  end for
  $\theta_{k+1}\leftarrow\theta$.
end for

Appendix C Numerical illustrations

We perform the numerical experiments on a high-performance computing cluster with GPU support. We use Python version 3.8.0 and allocate 8 GB of memory for the experiments. The running time is a couple of hours.

C.1 Gaussian mixture

Consider a target Gaussian mixture of the following form:

\displaystyle\pi(x)\propto\exp(-F(x)):=\sum_{i=1}^{K}{\pi_{i}\exp\left(-\dfrac% {\|x-x_{i}\|^{2}}{\sigma^{2}}\right)}.

We write

	$\displaystyle F(x)$	$\displaystyle=-\log\left(\sum_{i=1}^{K}{\pi_{i}\exp\left(-\dfrac{\|x-x_{i}\|^{2}}{\sigma^{2}}\right)}\right)$
		$\displaystyle=-\log\left(\sum_{i=1}^{K}{\pi_{i}\exp\left(-\dfrac{\|x\|^{2}+\|x_{i}\|^{2}-2\langle x,x_{i}\rangle}{\sigma^{2}}\right)}\right)$
		$\displaystyle=-\log\left(\sum_{i=1}^{K}{\pi_{i}\exp\left(-\dfrac{\|x\|^{2}}{\sigma^{2}}\right)\times\exp\left(-\dfrac{\|x_{i}\|^{2}-2\langle x,x_{i}\rangle}{\sigma^{2}}\right)}\right)$
		$\displaystyle=\dfrac{\|x\|^{2}}{\sigma^{2}}-\underbrace{\log\left(\sum_{i=1}^{K}{\pi_{i}\exp\left(-\dfrac{\|x_{i}\|^{2}-2\langle x,x_{i}\rangle}{\sigma^{2}}\right)}\right)}_{\text{convex}}$

which is a difference of convex (DC) functions. The second component is convex because (a) log-sum-exp is convex and (b) the composition of a convex function with an affine map is convex.
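The decomposition above can be checked numerically. The sketch below evaluates $F$, the quadratic part $G(x)=\|x\|^{2}/\sigma^{2}$, and the log-sum-exp part $H$ at a test point and compares $F$ with $G-H$. The component means, weights, and test point are hypothetical values chosen for illustration, and the names `G` and `H` follow the DC notation $F=G-H$ used in the paper.

```python
import math


def F(x, centers, weights, sigma):
    """Negative log of the unnormalized mixture density."""
    total = sum(
        w * math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / sigma**2)
        for c, w in zip(centers, weights)
    )
    return -math.log(total)


def G(x, sigma):
    """Convex quadratic part ||x||^2 / sigma^2."""
    return sum(a * a for a in x) / sigma**2


def H(x, centers, weights, sigma):
    """Convex log-sum-exp part of the decomposition F = G - H."""
    total = sum(
        w * math.exp(
            -(sum(b * b for b in c) - 2.0 * sum(a * b for a, b in zip(x, c)))
            / sigma**2
        )
        for c, w in zip(centers, weights)
    )
    return math.log(total)


centers = [(1.0, 0.0), (-1.0, 0.5), (0.0, -2.0)]  # hypothetical component means
weights = [0.5, 0.3, 0.2]                          # hypothetical mixture weights
x = (0.3, -0.7)
# Difference between F and G - H; it should vanish up to rounding error.
gap = F(x, centers, weights, 1.0) - (G(x, 1.0) - H(x, centers, weights, 1.0))
print(gap)
```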

Experiment details

We set $K=5$ and randomly generate $x_{1},x_{2},\ldots,x_{5}\in\mathbb{R}^{2}$. We set $\sigma=1$. The initial distribution is $\mu_{0}=\mathcal{N}(0,16I)$. We use $\gamma=0.1$ for both FB Euler and semi FB Euler. We train both algorithms for $40$ iterations using the Adam optimizer with a batch size of $512$; the first $20$ iterations use a learning rate of $5\times 10^{-3}$ and the remaining $20$ iterations use $2\times 10^{-3}$.

We run the above experiment $5$ times where $x_{1},x_{2},\ldots,x_{5}$ are randomly generated each time. Fig. 1 (b) reports the mean and $2\times\text{std}$ curves of the KL divergence along the training process.

C.2 Distance-to-set prior

Let $\pi$ be the original prior and $\Theta$ the constraint set that we want to impose. The distance-to-set prior [53] is defined, for some $\rho>0$, by

\displaystyle\tilde{\pi}(\theta)\propto\pi(\theta)\exp\left(-\dfrac{\rho}{2}d(\theta,\Theta)^{2}\right),

which exponentially penalizes deviations of $\theta$ from the constraint set.

Given data $y$ and this distance-to-set prior, the posterior reads

\displaystyle\bar{\pi}(\theta|y)\propto L(\theta|y)\pi(\theta)\exp\left(-\dfrac{\rho}{2}d(\theta,\Theta)^{2}\right),

where $L(\theta|y)$ is the likelihood.

The structure of $\bar{\pi}(\theta|y)$ depends on three separate components: the original prior, the likelihood, and the constraint set $\Theta$. In the ideal case, $\pi(\theta)$ and $L(\theta|y)$ are given in nice forms (e.g., log-concave), and $\Theta$ is a convex set. Indeed, if $\Theta$ is convex, then $\theta\mapsto d(\theta,\Theta)^{2}$ is convex, which makes the whole posterior log-concave. If $\Theta$ is additionally closed, $d(\theta,\Theta)^{2}$ is L-smooth.

However, whenever $\Theta$ is nonconvex, the function $\theta\mapsto d(\theta,\Theta)^{2}$ is not continuously differentiable. This follows from the Motzkin-Bunt theorem [16, Thm. 9.2.5], which asserts that any Chebyshev set (a set $S\subset X$ is called Chebyshev if every point in $X$ has a unique nearest point in $S$) must be closed and convex.

On the other hand, $d(\theta,\Theta)^{2}$ is always DC regardless of the geometric structure of $\Theta$:

$\displaystyle d(\theta,\Theta)^{2}$	$\displaystyle=\inf_{x\in\Theta}\|\theta-x\|^{2}$
	$\displaystyle=\inf_{x\in\Theta}\left(\|\theta\|^{2}+\|x\|^{2}-2\langle x,\theta\rangle\right)$
	$\displaystyle=\|\theta\|^{2}+\inf_{x\in\Theta}\left(\|x\|^{2}-2\langle x,\theta\rangle\right)$
	$\displaystyle=\|\theta\|^{2}-\sup_{x\in\Theta}\left(-\|x\|^{2}+2\langle x,\theta\rangle\right).$	(55)

Note that the supremum of an arbitrary family of affine functions is convex.
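The identity (55) holds for any set, including nonconvex ones. As a small sanity check, the sketch below compares both sides for a finite (hence nonconvex) set of points. The set `points`, the test point `theta`, and the helper names are hypothetical and serve only as an illustration.

```python
def dist_sq(theta, points):
    """Squared Euclidean distance from theta to a finite set of points."""
    return min(sum((t - p) ** 2 for t, p in zip(theta, x)) for x in points)


def sup_affine(theta, points):
    """Convex part: sup over x in the set of -||x||^2 + 2 <x, theta>."""
    return max(
        -sum(p * p for p in x) + 2.0 * sum(p * t for p, t in zip(x, theta))
        for x in points
    )


points = [(0.0, 0.0), (2.0, 1.0), (-1.0, 3.0)]  # hypothetical nonconvex set
theta = (1.5, 0.2)
lhs = dist_sq(theta, points)
rhs = sum(t * t for t in theta) - sup_affine(theta, points)
print(lhs, rhs)  # both sides of the DC identity
```

Here `sup_affine` is a maximum of finitely many affine functions of `theta`, which is the simplest instance of the convexity argument above.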

Therefore, the log-DC structure of the whole posterior only depends on whether the original prior and the likelihood are log-DC, which is likely to be the case.

Distance-to-set prior relaxed von Mises-Fisher

In directional statistics, the von Mises-Fisher distribution is a distribution over unit-length vectors (the unit sphere). It can be described as the restriction of a Gaussian distribution to a sphere. By using the distance-to-set prior, we can relax the spherical constraint as

\displaystyle\bar{\pi}(\theta)\propto\exp(-F(\theta)):=\exp\left(-\kappa\dfrac% {(\theta-\mu)^{\top}(\theta-\mu)}{2}\right)\times\exp\left(-\dfrac{\rho}{2}% \operatorname{dist}(\theta,S)^{2}\right)

where $S$ denotes the unit sphere in $\mathbb{R}^{d}$. By the DC structure (55) of the squared distance function, $F$ is DC with the following decomposition

	$\displaystyle F(\theta)$	$\displaystyle=\left(\kappa\dfrac{\|\theta-\mu\|^{2}}{2}+\dfrac{\rho}{2}\|\theta\|^{2}\right)-\dfrac{\rho}{2}\sup_{x\in S}\left(-\|x\|^{2}+2\langle x,\theta\rangle\right)$
		$\displaystyle:=G(\theta)-H(\theta).$

We also note that $H$ is not continuously differentiable because $S$ is nonconvex. Furthermore, $\rho\operatorname{proj}_{S}(\theta)\in\partial H(\theta)$ where $\operatorname{proj}_{S}(\theta)$ is the projection of $\theta$ onto $S$ , which can be computed explicitly in this case.
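For illustration, assume that $S$ is the unit sphere centred at the origin. Then the supremum is attained at $x=\theta/\|\theta\|$ for $\theta\neq 0$, so that $H(\theta)=\rho(\|\theta\|-\frac{1}{2})$ and $\operatorname{proj}_{S}(\theta)=\theta/\|\theta\|$. The sketch below uses this closed form to check the subgradient inequality $H(\eta)\geq H(\theta)+\langle\rho\operatorname{proj}_{S}(\theta),\eta-\theta\rangle$; the function names and test points are hypothetical.

```python
import math


def proj_sphere(theta):
    """Projection onto the unit sphere centred at the origin (requires theta != 0)."""
    norm = math.sqrt(sum(t * t for t in theta))
    return [t / norm for t in theta]


def H(theta, rho):
    """Closed form of (rho/2) * sup_{x in S} (-||x||^2 + 2 <x, theta>) = rho * (||theta|| - 1/2)."""
    return rho * (math.sqrt(sum(t * t for t in theta)) - 0.5)


def subgradient_gap(theta, eta, rho):
    """H(eta) - H(theta) - <rho * proj_S(theta), eta - theta>, nonnegative by convexity."""
    g = [rho * p for p in proj_sphere(theta)]
    inner = sum(gi * (e - t) for gi, e, t in zip(g, eta, theta))
    return H(eta, rho) - H(theta, rho) - inner


print(subgradient_gap([3.0, 4.0], [-1.0, 0.5], rho=100.0))  # strictly positive
print(subgradient_gap([3.0, 4.0], [6.0, 8.0], rho=100.0))   # zero up to rounding (parallel vectors)
```

At $\theta=0$ every point of $S$ is a nearest point, which is where $H$ fails to be differentiable.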

Experiment details

We consider a unit circle in $\mathbb{R}^{2}$ with centre $\mu=(1,1.5)$. We set $\kappa=1$ and $\rho=100$. The initial distribution is $\mu_{0}=\mathcal{N}(0,16I)$. We use $\gamma=0.1$ for both FB Euler and semi FB Euler. We train both algorithms for $40$ iterations using the Adam optimizer with a batch size of $512$; the first $20$ iterations use a learning rate of $5\times 10^{-3}$ and the remaining $20$ iterations use $2\times 10^{-3}$.
