Generative Modeling with Phase Stochastic Bridges

Tianrong Chen¹, Jiatao Gu², Laurent Dinh², Evangelos A. Theodorou¹
Joshua Susskind², Shuangfei Zhai²
¹Georgia Tech, ²Apple
{tianrong.chen, evangelos.theodorou}@gatech.edu, {jgu32, l_dinh, jsusskind, szhai}@apple.com
Work done while Tianrong Chen was an intern at Apple.

Abstract

We introduce a novel generative modeling framework grounded in phase space dynamics, taking inspiration from the principles underlying Critically damped Langevin Dynamics and Bridge Matching. Leveraging insights from Stochastic Optimal Control, we construct a more favorable path measure in the phase space that is highly advantageous for efficient sampling. A distinctive feature of our approach is the early-stage data prediction capability within the context of propagating generative Ordinary Differential Equations or Stochastic Differential Equations. This early prediction, enabled by the model’s unique structural characteristics, sets the stage for more efficient data generation, leveraging additional velocity information along the trajectory. This innovation has spurred the exploration of a novel avenue for mitigating sampling complexity by quickly converging to realistic data samples. Our model yields comparable results in image generation and notably outperforms baseline methods, particularly when faced with a limited Number of Function Evaluations. Furthermore, our approach rivals the performance of diffusion models equipped with efficient sampling techniques, underscoring its potential in the realm of generative modeling. Code is available at https://github.com/apple/ml-agm.

1 Introduction

Diffusion Models (DMs; Song et al. (2020a); Ho et al. (2020)) are an instrumental technique in generative modeling. They formulate a particular Stochastic Differential Equation (SDE) linking the data distribution with a tractable prior distribution. Initially, a DM diffuses data towards the prior distribution via a predetermined linear SDE. To reverse the process, a neural network is trained to approximate the score function, for which an analytic regression target is available. Subsequently, the approximated score is used to conduct the time reversal (Anderson, 1982; Haussmann & Pardoux, 1986) of this diffusion process, ultimately generating data. Recently, Critically-damped Langevin Dynamics (CLD; Dockhorn et al. (2021)) extended the SDE framework of DMs into phase space (whereas DMs operate in position space) by introducing an auxiliary velocity variable, which is defined by tractable Gaussian distributions at the initial and terminal time steps. This augmentation induces a smoother trajectory in position space, as stochasticity is injected only into the velocity space. The distinctive structure of CLD is shown to enhance empirical performance and sample efficiency. However, despite the success of CLD, sampling remains inefficient due to unnecessary curvature of the dynamics (Fig. 1), since the process has to converge to equilibrium in order to sample from the tractable prior.

The remarkable accomplishments of DMs have also catalyzed recent advancements in generative modeling, leading to the development of Bridge Matching (BM; Peluchetti, 2021; Liu et al., 2022; 2023) and Flow Matching (FM; Lipman et al., 2022) models. These models leverage dynamic transport maps built on SDEs or ODEs. Unlike DMs, Bridge and Flow Matching relax the reliance on a forward diffusion process that converges to a prior distribution only asymptotically, over an infinite time horizon. Moreover, they exhibit a heightened degree of versatility, enabling the construction of transport maps between two arbitrary distributions by drawing upon insights from domains such as optimal transport (Pooladian et al., 2023), normalizing flow (Tong et al., 2023b), and optimal control (Liu et al., 2023).

In this paper, we focus on enhancing the sample efficiency of velocity-based generative modeling (e.g., CLD) by utilizing Stochastic Optimal Control (SOC) theory. Specifically, we leverage results on the stochastic bridge for linear momentum systems (Chen & Georgiou, 2015) to construct a path measure bridging the data and prior distributions. The resulting path exhibits straighter position and velocity trajectories than CLD (Fig. 1), making it more amenable to efficient sampling. Within the broader landscape of dynamic generative modeling (i.e., ODE/SDE-based generative models), a data point can often be represented as a linear combination of the scaled intermediate state of the dynamics and Gaussian noise. In our work, we re-establish this property, enabling the estimation of target data points by leveraging both state and velocity information. In DM and FM, the estimation of target data relies exclusively on position information, whereas our method additionally incorporates velocity information, which improves the estimate. Our model is also able to generate high-fidelity images at early time steps (Fig. 2). In addition, we propose a sampling technique which demonstrates competitive results with a small Number of Function Evaluations (NFEs), e.g., 5 to 10. Table 1 summarizes the design differences among the aforementioned models. In summary, our paper presents the following contributions:

1. We propose Acceleration Generative Modeling (AGM) which is built on the SOC theory, enabling the favorable trajectories for efficient sampling over 2nd-order momentum dynamics generative modeling such as CLD.

2. As a result of AGM structural characteristics, it becomes possible to estimate a realistic data point at an early time point, a concept we refer to as sampling-hop. This approach not only yields a significant reduction in sampling complexity but also offers a novel perspective on accelerating the sampling in generative modeling by leveraging additional information from the dynamics.

3. We achieve competitive results compared to DM approaches equipped with specifically designed fast sampling techniques on image datasets, particularly in small NFE settings.

Figure 1: Comparison of pixel-wise trajectories with CLD (Dockhorn et al., 2021). The left figures show the trajectories over time of 16 randomly sampled pixels, for position and velocity. Our model learns straighter trajectories, which is beneficial for reducing sampling complexity.

2 Preliminary

Notation: Let ${\mathbf{x}}_{t}\in\mathbb{R}^{d}$ and ${\mathbf{v}}_{t}\in\mathbb{R}^{d}$ denote the $d$-dimensional position and velocity variable of a particle ${\mathbf{m}}_{t}=[{\mathbf{x}}_{t},{\mathbf{v}}_{t}]^{\mathsf{T}}\in\mathbb{R}^{2d}$ at time $t$ . We denote the discretized time series as $0\leq t_{0}<\cdots<t_{n}<t_{N}<1$ . The Wiener Process is denoted as ${\mathbf{w}}_{t}$ . The identity matrix is denoted as ${\mathbf{I}}_{d}\in\mathbb{R}^{d\times d}$ . We define ${\bm{\Sigma}_{t}}$ as the covariance matrix of ${\mathbf{x}}_{t}$ and ${\mathbf{v}}_{t}$ at time step $t$ .

2.1 Dynamical Generative Modeling

The generative modeling approaches rooted in dynamical systems, including ODE and SDE, have garnered significant attention. Here, we present three noteworthy dynamical generative models: Diffusion Model (DM), Flow Matching (FM) and Bridge Matching (BM).

Table 1: Comparison between models in terms of boundary distributions $p_{0}$ and $p_{1}$ . Our AGM generalizes DM beyond Gaussian priors to phase space, similar to CLD. However, unlike CLD, AGM does not need to converge to the Gaussian at equilibrium, which causes curved trajectories (see Fig. 1); instead, the velocity distribution is the convolution of the data distribution with a Gaussian.

Models	DM/FM	CLD	AGM (ours)
$p_{0}(\cdot)$	$p_{\rm{data}}(x)$	$p_{\rm{data}}(x)\times\mathcal{N}(\mathbf{0},{\bm{I}}_{d})$	$\mathcal{N}(\mathbf{0},{\bm{\Sigma}}_{0}\times{\bm{I}}_{2d})$
$p_{1}(\cdot)$	${\mathcal{N}}({\mathbf{0}},{\bm{I}}_{d})$	$\mathcal{N}(\mathbf{0},{\bm{I}}_{d})\times\mathcal{N}(\mathbf{0},{\bm{I}}_{d})$	$p_{\rm{data}}(x)\times p_{\rm{data}}(x)*\mathcal{N}(\mathbf{0},\bm{\Sigma}_{1}\otimes{\bm{I}}_{2d})$

Diffusion Model: In the framework of DM, given ${\mathbf{x}}_{0}$ drawn from a data distribution $p_{\rm{data}}$ , the model constructs an SDE,

$$
{\textnormal{d}}{\mathbf{x}}_{t}=f_{t}({\mathbf{x}}_{t}){\textnormal{d}}t+g_{t}{\textnormal{d}}{\mathbf{w}}_{t},\quad{\mathbf{x}}_{0}\sim p_{\rm{data}}({\mathbf{x}}), \tag{1}
$$

whose terminal distribution at $t=1$ is approximately Gaussian, i.e. ${\mathbf{x}}_{1}\sim\mathcal{N}({\mathbf{0}},{\bm{I}}_{d})$ . This is achieved through the careful selection of the diffusion coefficient $g_{t}$ and the base drift $f_{t}({\mathbf{x}}_{t})$ . Notably, the time-reversal (Anderson, 1982) of (1) results in another SDE:

$$
{\textnormal{d}}{\mathbf{x}}_{t}=\left[f_{t}({\mathbf{x}}_{t})-g_{t}^{2}\nabla_{{\mathbf{x}}}\log p({\mathbf{x}}_{t},t)\right]{\textnormal{d}}t+g_{t}{\textnormal{d}}{\mathbf{w}}_{t},\quad{\mathbf{x}}_{1}\sim\mathcal{N}({\mathbf{0}},{\mathbf{I}}_{d}), \tag{2}
$$

where $p(\cdot,t)$ is the marginal density of (1) at time $t$ and $\nabla_{{\mathbf{x}}}\log p({\mathbf{x}}_{t},t)$ is known as the score function. SDE (2) can be regarded as the time-reversal of (1) in the sense that the path measure it induces is almost surely equivalent to the one induced by (1). As a consequence, the two SDEs share identical marginals over time. In practice, it is feasible to sample ${\mathbf{x}}_{t}$ analytically given $t$ and ${\mathbf{x}}_{0}$ . Additionally, we can leverage a neural network to learn the score function by regressing onto the scaled Stein score, i.e., by minimizing $\mathbb{E}_{{\mathbf{x}}_{t},t}\lVert{\mathbf{s}}_{t}^{\theta}({\mathbf{x}}_{t},t;\theta)-\nabla_{{\mathbf{x}}}\log p({\mathbf{x}}_{t},t|{\mathbf{x}}_{0})\rVert_{2}^{2}$ , for the purpose of propagating (2). The learned score can then be plugged into SDE (2) to generate samples from the target data distribution, starting from the prior distribution. Meanwhile, (2) also corresponds to an ODE which shares the same marginals over time:

$$
{\textnormal{d}}{\mathbf{x}}_{t}=\left[f_{t}({\mathbf{x}}_{t})-\frac{1}{2}g_{t}^{2}\nabla_{{\mathbf{x}}}\log p({\mathbf{x}}_{t},t)\right]{\textnormal{d}}t,\quad{\mathbf{x}}_{1}\sim\mathcal{N}({\mathbf{0}},{\mathbf{I}}_{d}), \tag{3}
$$

which motivates the popular samplers introduced in (Zhang & Chen, 2022; Zhang et al., 2022; Bao et al., 2022) to solve the ODE (3) efficiently.
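As a small illustration of how ODE (3) transports prior samples back to data, the sketch below integrates it with explicit Euler steps in a toy setting where the score is known in closed form. The setting is an assumption made for illustration and is not taken from the paper: one-dimensional Gaussian data with standard deviation `S0`, drift $f_{t}(x)=-\tfrac{1}{2}\beta x$ and constant $g_{t}^{2}=\beta$ (the names `S0` and `BETA` are hypothetical).

```python
import numpy as np

# Assumed toy setting (not from the paper): 1-D Gaussian data N(0, S0^2) and a
# variance-preserving forward SDE  dx = -0.5*BETA*x dt + sqrt(BETA) dw.
S0 = 0.5     # data standard deviation (hypothetical)
BETA = 4.0   # constant noise rate (hypothetical)


def marginal_var(t):
    """Variance of x_t under the forward SDE (1) for Gaussian data."""
    return 1.0 + (S0**2 - 1.0) * np.exp(-BETA * t)


def score(x, t):
    """Exact score of the Gaussian marginal p(x, t)."""
    return -x / marginal_var(t)


def probability_flow_drift(x, t):
    """Drift of ODE (3): f_t(x) - 0.5 * g_t^2 * score."""
    return -0.5 * BETA * x - 0.5 * BETA * score(x, t)


def integrate_backward(x1, n_steps=2000):
    """Explicit Euler integration of ODE (3) from t = 1 down to t = 0."""
    x = np.array(x1, dtype=float)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        x = x - dt * probability_flow_drift(x, t)
    return x


rng = np.random.default_rng(0)
x1 = rng.normal(0.0, np.sqrt(marginal_var(1.0)), size=10_000)
x0 = integrate_backward(x1)
print("std of generated samples:", x0.std(), "target:", S0)
```

In this Gaussian case the ODE simply rescales each sample by the ratio of marginal standard deviations, which makes the result easy to check; with a learned score the same loop applies, but the accuracy then depends on the network and on the step size.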

Bridge Matching and Flow Matching: An alternative approach to exploring the time-reversal of a forward noising process involves the concept of ’building bridges’ between two distinct distributions $p_{0}(\cdot)$ and $p_{1}(\cdot)$ . This method entails the learning of a mimicking diffusion process, commonly referred to as bridge matching, as elucidated in previous works (Peluchetti, 2021; Shi et al., 2022). Here we consider the SDE in the form of:

$$
{\textnormal{d}}{\mathbf{x}}_{t}={\mathbf{v}}_{t}({\mathbf{x}}_{t},t){\textnormal{d}}t+g_{t}{\textnormal{d}}{\mathbf{w}}_{t}\quad\text{s.t.}\quad(x_{0},x_{1})\sim\Pi_{0,1}({\mathbf{x}}_{0},{\mathbf{x}}_{1}):=p_{0}\times p_{1}, \tag{4}
$$

which is pinned down at an initial and a terminal point $x_{0},x_{1}$ that are independently sampled from predefined $p_{0}$ and $p_{1}$ . This is commonly known as the reciprocal projection of $x_{0}$ and $x_{1}$ in the literature (Shi et al., 2023; Peluchetti, 2023; Liu et al., 2022; Léonard et al., 2014). Such an SDE is constructed through a careful design of ${\mathbf{v}}_{t}$ . A widely adopted choice is ${\mathbf{v}}_{t}:=({\mathbf{x}}_{1}-{\mathbf{x}}_{t})/(1-t)$ , which induces the well-known Brownian Bridge (Liu et al., 2023; Somnath et al., 2023). Similar to the approach in DM, and owing to the linear structure of the dynamics, one can efficiently estimate this drift with a neural network parameterized by weights $\theta$ by regressing on $\mathbb{E}_{{\mathbf{x}}_{t},t}\lVert{\mathbf{v}}_{t}^{\theta}({\mathbf{x}}_{t},t;\theta)-{\mathbf{v}}_{t}({\mathbf{x}}_{t},t)\rVert_{2}^{2}$ given ${\mathbf{x}}_{1}$ and $t$ . As extensively discussed in previous studies (Liu et al., 2023; Shi et al., 2022), this bridge matching framework takes on the characteristics of FM (Lipman et al., 2022) when the diffusion coefficient $g_{t}$ tends to zero.
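The Brownian Bridge drift above can be checked numerically. The following is a minimal sketch, assuming a constant diffusion coefficient `g` and a plain Euler-Maruyama discretization; the function names and the endpoint values are hypothetical and chosen only for illustration.

```python
import numpy as np


def bridge_drift(x_t, x1, t):
    """Brownian-bridge drift v_t = (x_1 - x_t) / (1 - t)."""
    return (x1 - x_t) / (1.0 - t)


def simulate_bridge(x0, x1, g=1.0, n_steps=1000, seed=0):
    """Euler-Maruyama simulation of SDE (4) pinned at x0 and x1."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for i in range(n_steps):
        t = i * dt  # the last drift evaluation is at t = 1 - dt, so 1 - t > 0
        noise = rng.standard_normal(x.shape)
        x = x + bridge_drift(x, x1, t) * dt + g * np.sqrt(dt) * noise
        path.append(x.copy())
    return np.stack(path)


x0 = np.array([-1.0, 0.5])
x1 = np.array([2.0, -0.3])
path = simulate_bridge(x0, x1)
print("terminal error:", np.abs(path[-1] - x1).max())
```

Setting `g=0.0` removes the noise and the path reduces to the straight line between `x0` and `x1`, which is the flow matching limit mentioned above.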

Remark 1.

The practice of constraining a stochastic process to specific initial and terminal conditions is a well-established setup in SOC. For a gentle introduction to its connection with the Brownian Bridge and the Schrödinger Bridge, please see Appendix C. From this perspective, one can derive the Brownian Bridge, as elaborated in Appendix D.1. The SOC framework serves as the basis upon which we develop our algorithm.

3 Acceleration Generative Model

We apply SOC to characterize the twisted trajectory of momentum dynamics induced by CLD (Dockhorn et al., 2021). Flow matching, diffusion modeling, and bridge matching all allow an estimate of the target data point, denoted as ${\mathbf{x}}_{1}$ , to be constructed from the intermediate state of the dynamics, ${\mathbf{x}}_{t}$ . Our additional objective is to expedite the estimation of a plausible ${\mathbf{x}}_{1}$ by incorporating additional dynamics-related information, such as velocity, thereby shortening the required time integration.

In this section, we introduce the proposed method, termed as the Acceleration Generative Model (AGM), rooted in SOC theory. Building upon (Chen & Georgiou, 2015), we extend the framework by incorporating a time-varying diffusion coefficient and accommodating arbitrary boundary conditions, ultimately arriving at an analytical solution suited for the generative modeling. We demonstrate its efficacy in rectifying the trajectory of CLD, concurrently showcasing its aptitude for accurately estimating the target data at an early timestep $t_{i}$ , thereby enabling expeditious sampling.

As suggested by the BM approach, we need to formulate a trajectory that bridges two data points sampled from $p_{0}$ and $p_{1}$ respectively. Ideally, the intermediate trajectory should be smooth and close to linear, since this makes the dynamical system easier to simulate. To tackle this challenge and enhance the estimation of the data point ${\mathbf{x}}_{1}$ by incorporating velocity components, we cast the problem in an SOC framework formulated in the phase space, which reads:

Definition 2 (Stochastic Bridge problem of linear momentum system (Chen & Georgiou, 2015)).

$$
\begin{aligned}
&\min_{{\mathbf{a}}_{t}}\int_{\tau}^{1}\lVert{\mathbf{a}}_{t}\rVert_{2}^{2}{\textnormal{d}}t+({\mathbf{m}}_{1}-m_{1})^{\mathsf{T}}{\mathbf{R}}({\mathbf{m}}_{1}-m_{1})\\
&\text{s.t.}\quad\underbrace{\begin{bmatrix}{\textnormal{d}}{\mathbf{x}}_{t}\\ {\textnormal{d}}{\mathbf{v}}_{t}\end{bmatrix}}_{{\textnormal{d}}{\mathbf{m}}_{t}}=\underbrace{\begin{bmatrix}{\mathbf{v}}_{t}\\ {\mathbf{a}}_{t}({\mathbf{x}}_{t},{\mathbf{v}}_{t},t)\end{bmatrix}}_{{\mathbf{f}}({\mathbf{m}},t)}{\textnormal{d}}t+\underbrace{\begin{bmatrix}{\mathbf{0}}&{\mathbf{0}}\\ {\mathbf{0}}&g_{t}\end{bmatrix}}_{{\mathbf{g}}_{t}}{\textnormal{d}}{\mathbf{w}}_{t},\\
&{\mathbf{m}}_{\tau}:=\begin{bmatrix}{\mathbf{x}}_{\tau}\\ {\mathbf{v}}_{\tau}\end{bmatrix}=\begin{bmatrix}x_{\tau}\\ v_{\tau}\end{bmatrix},\quad{\mathbf{R}}=\begin{bmatrix}{\mathbf{r}}&{\mathbf{0}}\\ {\mathbf{0}}&{\mathbf{r}}\end{bmatrix}\otimes{\bm{I}}_{d},\quad x_{1}\sim p_{\rm{data}}.
\end{aligned}
\tag{5}
$$

In this context, the matrix ${\mathbf{R}}$ is the terminal cost matrix, which assesses the proximity between the propagated ${\mathbf{m}}_{1}$ and the ground truth $m_{1}$ at the terminal time $t=1$ . As the parameter ${\mathbf{r}}$ approaches positive infinity, the trajectory converges toward the state $x_{1}$ , and the problem becomes one of constrained dynamics in which the system is pinned at two predetermined boundaries, namely $m_{0}$ and $m_{1}$ . This configuration aligns with the construction of a feasible bridge as advocated by BM. This interpolation approach is a natural extension (Chen & Georgiou, 2015) of the well-established Brownian Bridge (Revuz & Yor, 2013), which has been employed in trajectory inference (Somnath et al., 2023; Tong et al., 2023a) and image inpainting tasks (Liu et al., 2023), and whose connection with diffusion has been discussed in Liu et al. (2023). The target velocity is not precisely defined within this problem, which leaves flexibility in the design space. We opt for the linear interpolation of the intermediate point and the target point, ${\mathbf{v}}_{1}=({\mathbf{x}}_{1}-{\mathbf{x}}_{t})/(1-t)$ , as the terminal velocity, which is also the optimal control in the original space (see Appendix D.1). This choice is made because it yields straight trajectories. Conceptually, the acceleration ${\mathbf{a}}_{t}$ continually guides the dynamics towards the linear interpolation of the two data points, mitigating the impact of the injected stochasticity. In contrast to previous bridge matching frameworks, the velocity's boundary condition in our approach varies over time, since it depends on the state ${\mathbf{x}}_{t}$ and $t$ . The velocity variable serves solely as an auxiliary component aimed at straightening the trajectories. The solution of this SOC problem is as follows.

Proposition 3 (Phase Space Brownian Bridge).

When ${\mathbf{r}}\rightarrow+\infty$ , the solution of the optimization problem (5) is

$$
{\mathbf{a}}^{*}({\mathbf{m}}_{t},t)=g_{t}^{2}P_{11}\left(\frac{{\mathbf{x}}_{1}-{\mathbf{x}}_{t}}{1-t}-{\mathbf{v}}_{t}\right)\quad\text{where}:\quad P_{11}=\frac{-4}{g_{t}^{2}(t-1)}. \tag{6}
$$

Proof.

Please see Appendix.D.2. ∎

Remark 4.

$P_{11}$ denotes the second diagonal component in the matrix $P_{t}$ , a solution derived from the Lyapunov equation (see Lemma.9), serving as an implicit representation of the optimality of the control. This value is dependent upon the uncontrolled dynamics, where ${\mathbf{a}}_{t}$ is set to the zero vector in (5), and will vary accordingly when uncontrolled dynamics change.
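To make eq. (6) concrete, the sketch below evaluates the optimal acceleration for a given state and target. It uses only the algebra of eq. (6): the factors $g_{t}^{2}$ cancel in $g_{t}^{2}P_{11}$, leaving the gain $4/(1-t)$. Because (6) is linear in ${\mathbf{x}}_{1}$, it can also be solved for ${\mathbf{x}}_{1}$ once the state and the acceleration are known, which hints at why a learned force term carries information about the target. The function names are hypothetical, and the inversion assumes the acceleration equals ${\mathbf{a}}^{*}$ exactly.

```python
import numpy as np


def gain(t):
    """g_t^2 * P_11 from eq. (6); the g_t^2 factors cancel, leaving 4 / (1 - t)."""
    return 4.0 / (1.0 - t)


def optimal_acceleration(x_t, v_t, x1, t):
    """Optimal control a*(m_t, t) of eq. (6)."""
    return gain(t) * ((x1 - x_t) / (1.0 - t) - v_t)


def solve_for_x1(x_t, v_t, a_t, t):
    """Algebraic inversion of eq. (6) for x_1, given the state and a*."""
    return x_t + (1.0 - t) * (a_t / gain(t) + v_t)


rng = np.random.default_rng(0)
x_t, v_t, x1 = rng.standard_normal((3, 4))
t = 0.3
a_star = optimal_acceleration(x_t, v_t, x1, t)
print("recovered x1 matches:", np.allclose(solve_for_x1(x_t, v_t, a_star, t), x1))
```

Note that the acceleration vanishes exactly when the current velocity equals the straight-line velocity $({\mathbf{x}}_{1}-{\mathbf{x}}_{t})/(1-t)$, consistent with the description of ${\mathbf{a}}_{t}$ as steering the dynamics towards the linear interpolation.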

3.1 Training

By plugging the optimal control (6) back into the dynamics (5), we obtain the desired SDE. As suggested by (Song et al., 2020b; Dockhorn et al., 2021), such an SDE has a corresponding probabilistic ODE which shares the same marginals over time and whose drift term has an additional score term $\nabla_{{\mathbf{v}}}\log p({\mathbf{m}}_{t},t)$ . Here we summarize the force term for the SDE and the ODE as:

$$
\begin{aligned}
&\begin{bmatrix}{\textnormal{d}}{\mathbf{x}}_{t}\\ {\textnormal{d}}{\mathbf{v}}_{t}\end{bmatrix}=\begin{bmatrix}{\mathbf{v}}_{t}\\ {\mathbf{F}}_{t}\end{bmatrix}{\textnormal{d}}t+\begin{bmatrix}{\mathbf{0}}&{\mathbf{0}}\\ {\mathbf{0}}&h_{t}\end{bmatrix}{\textnormal{d}}{\mathbf{w}}_{t}\quad\text{s.t.}\quad{\mathbf{m}}_{0}:=\begin{bmatrix}{\mathbf{x}}_{0}\\ {\mathbf{v}}_{0}\end{bmatrix}\sim\mathcal{N}({\bm{\mu}}_{0},{\bm{\Sigma}}_{0}),\\
&\text{Bridge Matching SDE}:\ {\mathbf{F}}_{t}:={\mathbf{F}}_{t}^{b}({\mathbf{m}}_{t},t)\equiv{\mathbf{a}}_{t}^{*}({\mathbf{m}}_{t},t),\quad h_{t}:=g_{t},\\
&\text{Probabilistic ODE}:\ {\mathbf{F}}_{t}:={\mathbf{F}}_{t}^{p}({\mathbf{m}}_{t},t)\equiv{\mathbf{a}}^{*}_{t}({\mathbf{m}}_{t},t)-\frac{1}{2}g_{t}^{2}\nabla_{{\mathbf{v}}}\log p({\mathbf{m}}_{t},t),\quad h_{t}:=0.
\end{aligned}
\tag{7}
$$

Henceforth, we refer to the dynamics associated with the Bridge Matching SDE as AGM-SDE, and to its ODE counterpart as AGM-ODE. Meanwhile, the linearity of the system implies that the intermediate state ${\mathbf{m}}_{t}$ and the closed-form solution of the score term are analytically available. In particular, the mean ${\bm{\mu}_{t}}$ and covariance matrix ${\bm{\Sigma}_{t}}$ of the intermediate marginal $p_{t}({\mathbf{m}}_{t}|{\mathbf{x}}_{1})=\mathcal{N}({\bm{\mu}_{t}},{\bm{\Sigma}_{t}})$ of such a system can be analytically computed with ${\bm{\Sigma}}_{t}=\begin{bmatrix}{\Sigma^{xx}_{t}}&{\Sigma^{xv}_{t}}\\ {\Sigma^{xv}_{t}}&{\Sigma^{vv}_{t}}\end{bmatrix}\otimes{\bm{I}}_{d}$ and ${\bm{\mu}_{t}}=\begin{bmatrix}{\mu^{x}_{t}}\\ {\mu^{v}_{t}}\end{bmatrix}$ , provided we have the boundary conditions ${\bm{\mu}}_{0}$ and ${\bm{\Sigma}}_{0}$ in place, as outlined in Särkkä & Solin (2019). Please see Appendix D.3 for details. To sample from such a multivariate Gaussian, one needs to factor the covariance matrix by Cholesky decomposition, and ${\mathbf{m}}_{t}$ is reparameterized as:

$$
{\mathbf{m}}_{t}={\bm{\mu}_{t}}+{\mathbf{L}}_{t}\epsilon={\bm{\mu}_{t}}+\begin{bmatrix}L_{t}^{xx}{\bm{\epsilon}_{0}}\\ L_{t}^{xv}{\bm{\epsilon}_{0}}+L_{t}^{vv}{\bm{\epsilon}_{1}}\end{bmatrix},\quad\nabla_{{\mathbf{v}}}\log p_{t}:=-\ell_{t}{\bm{\epsilon}_{1}} \tag{8}
$$

where ${\bm{\Sigma}_{t}}={\mathbf{L}}_{t}{\mathbf{L}}_{t}^{\mathsf{T}}$ , $\epsilon=\begin{bmatrix}{\bm{\epsilon}_{0}}\\ {\bm{\epsilon}_{1}}\end{bmatrix}\sim\mathcal{N}({\mathbf{0}},\mathbf{I}_{2d})$ and $\ell_{t}=\sqrt{\frac{{\Sigma^{xx}_{t}}}{{\Sigma^{xx}_{t}}{\Sigma^{vv}_{t}}-({\Sigma^{xv}_{t}})^{2}}}$ .
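The reparameterization in eq. (8) can be sketched in a few lines. Since ${\bm{\Sigma}}_{t}$ is a $2\times 2$ matrix Kronecker-multiplied with the identity, the Cholesky factor has a closed form and acts identically on every dimension. The example below uses made-up covariance entries (the actual $\Sigma_{t}$ comes from Appendix D.3, which is not reproduced here) and cross-checks the velocity score $-\ell_{t}{\bm{\epsilon}_{1}}$ against a direct evaluation of the Gaussian score. All function names are hypothetical.

```python
import numpy as np


def cholesky_2x2(s_xx, s_xv, s_vv):
    """Lower-triangular factor of [[s_xx, s_xv], [s_xv, s_vv]] in closed form."""
    l_xx = np.sqrt(s_xx)
    l_xv = s_xv / l_xx
    l_vv = np.sqrt(s_vv - l_xv**2)
    return l_xx, l_xv, l_vv


def sample_m_t(mu_x, mu_v, s_xx, s_xv, s_vv, rng):
    """Draw m_t = mu_t + L_t eps as in eq. (8); also return eps_1 and ell_t."""
    l_xx, l_xv, l_vv = cholesky_2x2(s_xx, s_xv, s_vv)
    eps0 = rng.standard_normal(np.shape(mu_x))
    eps1 = rng.standard_normal(np.shape(mu_x))
    x_t = mu_x + l_xx * eps0
    v_t = mu_v + l_xv * eps0 + l_vv * eps1
    ell_t = np.sqrt(s_xx / (s_xx * s_vv - s_xv**2))
    return x_t, v_t, eps1, ell_t


# Hypothetical covariance entries; the paper's Sigma_t comes from Appendix D.3.
s_xx, s_xv, s_vv = 0.8, -0.3, 1.5
rng = np.random.default_rng(0)
mu_x, mu_v = np.zeros(5), np.ones(5)
x_t, v_t, eps1, ell_t = sample_m_t(mu_x, mu_v, s_xx, s_xv, s_vv, rng)

# Cross-check the velocity score -ell_t * eps_1 against -[Sigma^{-1}(m - mu)]_v.
sigma = np.array([[s_xx, s_xv], [s_xv, s_vv]])
resid = np.stack([x_t - mu_x, v_t - mu_v])  # shape (2, d)
score_v = -np.linalg.solve(sigma, resid)[1]
print("score matches:", np.allclose(score_v, -ell_t * eps1))
```

The check works because $\ell_{t}=1/L_{t}^{vv}$: the velocity block of ${\bm{\Sigma}}_{t}^{-1}({\mathbf{m}}_{t}-{\bm{\mu}_{t}})$ equals ${\bm{\epsilon}_{1}}/L_{t}^{vv}$ for a lower-triangular factor.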

Parameterization: The Force term can be represented as a composite of the data point and Gaussian noise. Specifically,

$$
{\mathbf{a}}^{*}({\mathbf{m}}_{t},t)=4{\mathbf{x}}_{1}(1-t)^{2}-g_{t}^{2}P_{11}\left[\left(\frac{L_{t}^{xx}}{1-t}+L_{t}^{xv}\right){\bm{\epsilon}_{0}}+L_{t}^{vv}{\bm{\epsilon}_{1}}\right]. \tag{9}
$$

We express the force term as ${\mathbf{F}}_{t}^{\theta}={\mathbf{s}}^{\theta}_{t}\cdot{\mathbf{z}}_{t}$ . Here, ${\mathbf{z}}_{t}$ assumes the role of regulating the output of the network ${\mathbf{s}}^{\theta}_{t}$ , ensuring that the variance of the network output is normalized to unity. For the detailed formulation of the normalizer ${\mathbf{z}}_{t}$ , please refer to Appendix.D.8. In a manner similar to the BM approach, one can formulate the objective function for regressing the force term as follows:

$$
\min_{\theta}\mathbb{E}_{t\in[0,1]}\mathbb{E}_{{\mathbf{x}}_{1}\sim p_{\rm{data}}}\mathbb{E}_{{\mathbf{m}}_{t}\sim p_{t}({\mathbf{m}}_{t}|{\mathbf{x}}_{1})}\lambda(t)\left[\lVert{\mathbf{F}}_{t}^{\theta}({\mathbf{m}}_{t},t;\theta)-{\mathbf{F}}_{t}({\mathbf{m}}_{t},t)\rVert_{2}^{2}\right] \tag{10}
$$

where $\lambda(t)$ reweights the objective function across the time horizon. We defer the derivation of $\ell_{t}$ and the presentation of ${\mathbf{L}}_{t}$ , $\lambda(t)$ and ${\mathbf{a}}_{t}$ to Appendix D.

3.2 Sampling from AGM

Once the parameterized force term ${\mathbf{F}}_{t}^{\theta}$ is trained, we can simulate the dynamics to generate samples by plugging it back into the dynamics (7). Any SDE or ODE sampler can be used to propagate the learned system. Here we list our choice of sampler for AGM-SDE and AGM-ODE.

Stochastic Sampler: To simulate the SDE, prior works mostly rely on Euler-Maruyama (EM) (Kloeden et al., 1992) and related methods. We adopt the Symmetric Splitting Sampler (SSS) from Dockhorn et al. (2021) for AGM-SDE, because of its compelling performance on momentum systems.

Deterministic Sampler: It is imperative to acknowledge that this system is inherently underactuated because the force term is exclusively injected into the velocity component, while velocity serves as the driving factor for the position—a variable of primary interest in generative modeling context. More specifically, at time step $t_{i}$ , the impact of force does not immediately manifest in the position but rather takes effect at a subsequent time step, denoted as $t_{i+1}$ after discretizing time horizon. At time $t_{0}$ , it becomes undesirable to propagate the state ${\mathbf{x}}_{0}$ using an initially uncontrolled velocity over an extended time interval $\delta_{0}$ . The presence of this delay phenomenon can also exert an influence when the time interval $\delta_{t}$ is large, thereby impeding our ability to reduce the NFE during sampling. We propose the adoption of an Exponential Integrator (EI) approach, as elaborated in Zhang & Chen (2022). Empirical evidence suggests that this method aligns well with our model. We provide an illustrative example of how the AGM-ODE, in conjunction with the EI technique, can be employed to inject the learnt network into both velocity and position channels simultaneously:

$$
\begin{bmatrix}{\mathbf{x}}_{t_{i+1}}\\ {\mathbf{v}}_{t_{i+1}}\end{bmatrix}=\Phi(t_{i+1},t_{i})\begin{bmatrix}{\mathbf{x}}_{t_{i}}\\ {\mathbf{v}}_{t_{i}}\end{bmatrix}+\sum_{j=0}^{w}\begin{bmatrix}\int_{t_{i}}^{t_{i+1}}\left(t_{i+1}-\tau\right){\mathbf{z}}_{\tau}\cdot{\mathbf{M}}_{i,j}(\tau){\textnormal{d}}\tau\cdot{\mathbf{s}}^{\theta}_{t}({\mathbf{m}}_{t_{i-j}},t_{i-j})\\ \int_{t_{i}}^{t_{i+1}}{\mathbf{z}}_{\tau}\cdot{\mathbf{M}}_{i,j}(\tau){\textnormal{d}}\tau\cdot{\mathbf{s}}^{\theta}_{t}({\mathbf{m}}_{t_{i-j}},t_{i-j})\end{bmatrix}, \tag{11}
$$

$$
\text{where}\quad{\mathbf{M}}_{i,j}(\tau)=\prod_{k\neq j}\left(\frac{\tau-t_{i-k}}{t_{i-j}-t_{i-k}}\right)\quad\text{and}\quad\Phi(t,s)=\begin{bmatrix}1&t-s\\ 0&1\end{bmatrix}.
$$
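The building blocks of eq. (11) can be sketched as follows. The code is an illustration under simplifying assumptions, not the sampler used in the paper: it implements the transition matrix $\Phi$, the Lagrange coefficients ${\mathbf{M}}_{i,j}$, and a single-step update with $w=0$ in which the normalizer ${\mathbf{z}}_{\tau}$ is assumed to be constant and equal to one (the real ${\mathbf{z}}_{t}$ is given in Appendix D.8). With these assumptions the two integrals reduce to $\Delta t^{2}/2$ and $\Delta t$. The function names are hypothetical.

```python
import numpy as np


def transition(t, s):
    """Phi(t, s) for the double integrator dx = v dt, dv = F dt."""
    return np.array([[1.0, t - s], [0.0, 1.0]])


def lagrange_coeff(tau, nodes, j):
    """M_{i,j}(tau): Lagrange basis polynomial over nodes [t_i, t_{i-1}, ...]."""
    out = 1.0
    for k, t_k in enumerate(nodes):
        if k != j:
            out *= (tau - t_k) / (nodes[j] - t_k)
    return out


def ei_step(x, v, force, t_i, t_next):
    """One-step (w = 0) update of eq. (11), assuming z_tau = 1 on the interval."""
    dt = t_next - t_i
    phi = transition(t_next, t_i)
    x_new = phi[0, 0] * x + phi[0, 1] * v + 0.5 * dt**2 * force
    v_new = phi[1, 0] * x + phi[1, 1] * v + dt * force
    return x_new, v_new


x_new, v_new = ei_step(x=0.0, v=1.0, force=2.0, t_i=0.0, t_next=0.5)
print("position:", x_new, "velocity:", v_new)
```

The point of the update is visible in `x_new`: the force enters the position within the same step through the $\Delta t^{2}/2$ term, whereas a plain Euler step would move the position using the old velocity only.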

In Eq. 11, $\Phi(t,s)$ denotes the transition matrix of our system, while ${\mathbf{M}}_{i,j}(\tau)$ represents the $w$-order multistep coefficient (Hochbruck & Ostermann, 2010). For a comprehensive derivation of these terms, please refer to Appendix D.9. It is worth noting that mapping ${\mathbf{s}}^{\theta}_{t}$ into both the position and velocity channels significantly mitigates the errors introduced by discretization delays.

Sampling-hop: CLD (Dockhorn et al., 2021) focuses on estimating the score function w.r.t. velocity, which corresponds to estimating a scaled ${\bm{\epsilon}}_{1}$ in our notation. However, this information alone is not sufficient for estimating the data point ${\mathbf{x}}_{1}$ ; knowledge of ${\bm{\epsilon}_{0}}$ is also required. In our case, the training objective implicitly includes both ${\bm{\epsilon}_{0}}$ and ${\bm{\epsilon}_{1}}$ (see eq. 9), hence one can recover ${\mathbf{x}}_{1}$ by Proposition 5. We observe that when the network is given the additional velocity information, it is able to estimate the target data point during the early stages of the trajectory, as illustrated in Fig. 2. This estimation can be integrated into both AGM-SDE and AGM-ODE, and we name it sampling-hop. Specifically,

Proposition 5 (Sampling-Hop).

Given the state, velocity and trained force term ${\mathbf{F}}_{t}^{\theta}$ at time step $t$ in the sampling phase, the estimated data point $\tilde{{\mathbf{x}}}_{1}$ can be represented as

$$
\tilde{{\mathbf{x}}}_{1}^{SDE}=\frac{(1-t)({\mathbf{F}}_{t}^{\theta}+{\mathbf{v}}_{t})}{g_{t}^{2}P_{11}}+{\mathbf{x}}_{t},\quad\text{or}\quad\tilde{{\mathbf{x}}}^{ODE}_{1}=\frac{{\mathbf{F}}_{t}^{\theta}+g_{t}^{2}P_{11}(\alpha_{t}{\mathbf{x}}_{t}+\beta_{t}{\mathbf{v}}_{t})}{4(t-1)^{2}+g_{t}^{2}P_{11}(\alpha_{t}{\mu^{x}_{t}}+\beta_{t}{\mu^{v}_{t}})} \tag{12}
$$

for AGM-SDE and AGM-ODE dynamics respectively, and $\beta_{t}={L^{vv}_{t}}+\frac{1}{2P_{11}}$ , $\alpha_{t}=\frac{(\frac{L_{t}^{xx}}{1-t}+L_{t}^{xv})-\beta_{t}L^{xv}_{t}}{L^{xx}_{t}}$ .

Proof.

See Appendix.D.10 ∎

This property allows us to allocate the NFE budget selectively within the time interval $t\in[0,t_{i}]$ , where $t_{i}<t_{N}$ , effectively reducing the discretization error while maintaining the sampling quality. This insight paves the way for the efficient low-NFE sampling strategies used later. We summarize the training and sampling procedures of our method in Algorithm 1 and Algorithm 2 respectively.

Algorithm 1 Training
1: Input: data distribution $p_{\rm{data}}(\cdot)$
2: while not converged do
3:   $t\sim\mathcal{U}([0,1])$ , ${\mathbf{x}}_{1}\sim p_{\rm{data}}({\mathbf{x}}_{1})$
4:   Compute mean and covariance ${\bm{\mu}_{t}}$ and ${\bm{\Sigma}_{t}}$ (Appendix D.3).
5:   Sample ${\mathbf{m}}_{t}={\bm{\mu}_{t}}+{\mathbf{L}}_{t}{\bm{\epsilon}}$ (eq. 8).
6:   Compute target ${\mathbf{F}}_{t}$ (eq. 7) using the optimal acceleration (eq. 9).
7:   Compute loss $\mathbb{E}\left[\lambda\lVert{\mathbf{F}}_{t}^{\theta}-{\mathbf{F}}_{t}\rVert_{2}^{2}\right]$ (eq. 10).
8:   Take a gradient descent step on the parameters $\theta$ of ${\mathbf{F}}_{t}^{\theta}({\mathbf{m}}_{t},t;\theta)$ .
9: end while

Algorithm 2 Sampling
1: Input: trained ${\mathbf{F}}(\cdot,\cdot;\theta)$ , discretized time steps [ $t_{0}$ , $\cdots$ , $t_{i}$ ]. Choose the sampler from [SSS (SDE), EI (ODE)]. Choose prior mean and covariance ${\bm{\mu}}_{0}$ , ${\bm{\Sigma}}_{0}$ .
2: Sample ${\mathbf{m}}_{0}\sim p_{0}({\mathbf{m}};{\bm{\mu}}_{0},{\bm{\Sigma}}_{0})$ .
3: for n = $0$ to $i$ do
4:   Estimate ${\mathbf{F}}_{t_{n}}^{\theta}({\mathbf{m}}_{t_{n}},t_{n})$ .
5:   ${\mathbf{m}}_{t_{n+1}}=\textbf{Sampler}({\mathbf{m}}_{t_{n}},F_{t_{n}}^{\theta},t_{n})$
6:   Reconstruct $\hat{{\mathbf{x}}}_{1}$ using Proposition 5.
7: end for
8: Return $\hat{{\mathbf{x}}}_{1}$

4 Experimental Results

Architectures and Hyperparameters: We parameterize ${\mathbf{s}}_{t}^{\theta}(\cdot,\cdot;\theta)$ using a modified NCSN++ model as provided in Karras et al. (2022). We employ six input channels, accounting for both position and velocity variables, as opposed to the standard three channels used for CIFAR-10 (Krizhevsky et al., 2009), AFHQv2 (Choi et al., 2020) and ImageNet (Deng et al., 2009), which leads to a negligible increase in network parameters. For comparison with CLD on the toy dataset, we adopt the same ResNet-based architecture used in CLD. Throughout all of our experiments, we use a monotonically decreasing diffusion coefficient, given by $g(t)=3(1-t)$ . For the detailed experimental setup, please refer to Appendix E.

Evaluation: To assess the performance and the sampling speed of various algorithms, we employ the Fréchet Inception Distance score (FID; Heusel et al. (2017)) and the Number of Function Evaluations (NFE) as our metrics. For FID evaluation, we use the reference statistics of all datasets obtained from EDM (Karras et al., 2022) and evaluate on 50k generated samples. Additionally, we re-evaluate the FID of CLD and EDM using the same reference statistics to ensure consistency in our comparisons. All other reported values are taken directly from the respective referenced papers.

Selection of ${\bm{\Sigma}}_{0}$: The choice of the initial covariance ${\bm{\Sigma}}_{0}$ directly influences the path measure of the trajectory. In our case, we set ${\bm{\Sigma}}_{0}:=\bigl{[}\begin{smallmatrix}1&k\\ k&1\end{smallmatrix}\bigr{]}$ with hyperparameter $k$. We observe that trajectories tend to exhibit pronounced curvature under specific conditions, namely when $k$ is positive and the absolute value of the position is large. This behavior is particularly noticeable when dealing with images, where the data scale ranges from -1 to 1. We aim for favorable uncontrolled dynamics, as this can potentially lead to better controlled dynamics. Our strategy is to choose $k$ such that the marginal distribution of the uncontrolled dynamics at $t_{N}=1$ effectively covers the range of image data values while $k$ remains negative. We can express the marginal of the uncontrolled dynamics by leveraging the transition matrix $\Phi(1,0)$, which gives us ${\mathbf{x}}_{1}:={\mathbf{x}}_{0}+{\mathbf{v}}_{0}$. Figure 3 illustrates the standard deviation of ${\mathbf{x}}_{1}$ for various values of $k$. Based on our empirical observations, we choose $k=-0.2$ for all experiments, as it effectively covers the data range. The subsequent controlled dynamics (eq.7) are constructed on top of these uncontrolled dynamics.
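The dependence of the terminal spread on $k$ can be reproduced with a few lines. This is a small sketch under the assumption of a zero-mean Gaussian prior treated per dimension, so that the covariance of $({\mathbf{x}}_{1},{\mathbf{v}}_{1})$ under the uncontrolled, noise-free flow is $\Phi(1,0){\bm{\Sigma}}_{0}\Phi(1,0)^{\mathsf{T}}$; the helper name `x1_std` is hypothetical.

```python
import numpy as np

def x1_std(k):
    """Standard deviation of x_1 = x_0 + v_0 when Cov(x_0, v_0) = [[1, k], [k, 1]]."""
    sigma0 = np.array([[1.0, k], [k, 1.0]])
    phi = np.array([[1.0, 1.0], [0.0, 1.0]])   # Phi(1, 0) for the momentum system
    sigma1 = phi @ sigma0 @ phi.T              # pushed-forward covariance at t = 1
    return float(np.sqrt(sigma1[0, 0]))        # position block equals 2 + 2k

for k in (-0.8, -0.5, -0.2, 0.0, 0.5):
    print(f"k = {k:+.1f}  std(x_1) = {x1_std(k):.4f}")
```

Under these assumptions the variance of ${\mathbf{x}}_{1}$ is $2+2k$, so a negative $k$ shrinks the terminal spread relative to the uncorrelated case $k=0$.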

Table 2: FID $\downarrow$ comparison with CLD (Dockhorn et al., 2021) using the same SSS sampler on CIFAR-10.

NFE $\downarrow$	CLD-SDE	AGM-SDE
20	$>$ 100	7.9
50	19.93	3.21
150	2.99	2.68
1000	2.44	2.46

Stochastic Sampling: In experiments, we emphasize the advantages of using the AGM-SDE compared with CLD. Firstly, we show that our model exhibits superior performance when NFE is significantly lower than that of CLD, particularly in toy dataset scenarios. For evaluation, we utilized the multi-modal Mixture of Gaussian and Multi-Swiss-Roll datasets. The results obtained from the toy dataset, as shown in Fig.8, demonstrate that AGM-SDE is capable of generating data that closely aligns with the ground truth, while requiring NFE that is around one order of magnitude lower than CLD. Furthermore, our findings reveal that AGM-SDE outperforms CLD in the context of CIFAR-10 image generation tasks, especially when faced with limited NFE, as illustrated in Table 2.

Deterministic Sampling: We validate our algorithm on high-dimensional image generation with a deterministic sampler. We provide uncurated samples from CIFAR-10, AFHQv2 and ImageNet-64 with varying NFE in Appendix.H. For the quantitative evaluation, Table.3 and Table.4 summarize the FID together with the NFE used for sampling on CIFAR-10 and ImageNet-64, respectively. Notably, AGM-ODE achieves an FID of 2.46 with 50 NFE on CIFAR-10 and an FID of 10.55 with 20 NFE on unconditional ImageNet-64, which is comparable to existing dynamical generative models.

We underscore the effectiveness of sampling-hop, especially under a constrained NFE budget, in comparison to baselines. We validate it on the CIFAR-10 and AFHQv2 datasets. Fig.4 illustrates that AGM-ODE is able to generate plausible images even when NFE $=5$ and outperforms EDM (Karras et al., 2022) both visually and numerically on the AFHQv2 dataset when NFE is extremely small (NFE $<$ 15). We also compare with other fast sampling algorithms built upon DMs in Table.5 on the CIFAR-10 dataset, where AGM-ODE demonstrates competitive performance. Notably, AGM-ODE outperforms the baseline CLD with the same EI sampler by a large margin. We suspect that the improvement stems from the rectified trajectory, which is friendlier to the ODE solver.

Conditional Generation: We showcase the capability of AGM to generate conditional samples using an unconditional model (fig.5) by incorporating conditional information into the prior velocity variable ${\mathbf{v}}_{0}$. Instead of employing a randomly sampled ${\mathbf{v}}_{0}$, we use a linear combination of ${\mathbf{v}}_{0}$ and the desired velocity ${\mathbf{v}}_{1}=({\mathbf{x}}_{1}-{\mathbf{x}}_{t_{0}})/(1-t_{0})$, where ${\mathbf{x}}_{1}$ is the conditioning data. Thus, at $t_{0}$, the initial velocity is defined as ${\mathbf{v}}_{0}^{cond}:=(1-\xi){\mathbf{v}}_{0}+\xi{\mathbf{v}}_{1}$, with $\xi$ serving as a mixing coefficient. Fig.5 shows that AGM can generate conditional data without augmentation or additional fine-tuning. This property can be extended to the inpainting task as well; the details can be found in Appendix.F.
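The velocity mixing above amounts to one line of arithmetic. The sketch below is only an illustration of that formula; the function name `conditional_velocity`, the array shapes, and the example values are assumptions and do not come from the released code.

```python
import numpy as np

def conditional_velocity(v0, x_t0, x1_cond, t0, xi):
    """Return v0_cond = (1 - xi) * v0 + xi * v1, with v1 = (x1_cond - x_t0) / (1 - t0)."""
    v1 = (x1_cond - x_t0) / (1.0 - t0)   # velocity that reaches x1_cond on a straight line
    return (1.0 - xi) * v0 + xi * v1

rng = np.random.default_rng(0)
v0 = rng.standard_normal(4)        # randomly sampled prior velocity (illustrative shape)
x_t0 = rng.standard_normal(4)      # prior position at t0
x1_cond = np.ones(4)               # conditioning data (placeholder)
v_cond = conditional_velocity(v0, x_t0, x1_cond, t0=0.0, xi=0.5)
```

Setting $\xi=0$ recovers the unconditional prior velocity, while $\xi=1$ points the initial velocity directly at the conditioning data.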

Table 3: Unconditional CIFAR-10 generative performance

	Model Name	NFE $\downarrow$	FID $\downarrow$
ODE	EDM (Karras et al., 2022)	35	1.84
	CLD+EI (Zhang et al., 2022)	50	2.26
	FM-OT (Lipman et al., 2022)	142	6.35
	AGM-ODE(ours)	50	2.46
SDE	VP (Song et al., 2020b)	1000	2.66
SDE	VE (Song et al., 2020b)	1000	2.43
	CLD (Dockhorn et al., 2021)	1000	2.44
	AGM-SDE(ours)	1000	2.46

Table 4: Unconditional ImageNet-64 generative performance

Model	NFE $\downarrow$	FID $\downarrow$
FM-OT(Lipman et al., 2022)	138	14.45
MFM(Pooladian et al., 2023)	132	11.82
MFM(Pooladian et al., 2023)	40	12.97
AGM-ODE(ours)	40	10.10
AGM-ODE(ours)	30	10.07
AGM-ODE(ours)	20	10.55

Table 5: Comparison with fast sampling algorithms using the FID $\downarrow$ metric on CIFAR-10

Dynamics Order	Model Name	NFE=5	NFE=10	NFE=20
1st order dynamics	EDM (Karras et al., 2022)	$>$ 100	15.78	2.23
	VP+EI (Zhang & Chen, 2022)	15.37	4.17	3.03
	DDIM (Song et al., 2020a)	26.91	11.14	3.50
	Analytic-DPM (Bao et al., 2022)	51.47	14.06	6.74
2nd order dynamics	CLD+EI (Zhang et al., 2022)	N/A	13.41	3.39
2nd order dynamics	AGM-ODE (ours)	11.93	4.60	2.60

5 Conclusion and Limitation

In this paper, we introduce a novel Acceleration Generative Modeling (AGM) framework rooted in SOC theory. Within this framework, we devise more favorable, straight trajectories for the momentum system. Leveraging the intrinsic characteristics of the momentum system, we capitalize on additional velocity to expedite the sampling process by using the sampling-hop technique, significantly reducing the time required to converge to accurate predictions of realistic data points. Our experimental results, conducted on both toy and image datasets in unconditional generative tasks, demonstrate promising outcomes for fast sampling.

However, it is essential to acknowledge that our approach’s performance lags behind state-of-the-art methods in scenarios with sufficient NFE. This observation suggests avenues for enhancing AGM performance. Such improvements could be achieved by enhancing the training quality through the adoption of techniques proposed in Karras et al. (2022) including data augmentation, fine-tuned noise scheduling, and network preconditioning, among others.

References

Anderson (1982) Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
Bao et al. (2022) Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. arXiv preprint arXiv:2201.06503, 2022.
Bryson (1975) Arthur Earl Bryson. Applied optimal control: optimization, estimation and control. CRC Press, 1975.
Chen et al. (2023) Tianrong Chen, Guan-Horng Liu, Molei Tao, and Evangelos A Theodorou. Deep momentum multi-marginal Schrödinger bridge. arXiv preprint arXiv:2303.01751, 2023.
Chen & Georgiou (2015) Yongxin Chen and Tryphon Georgiou. Stochastic bridges of linear systems. IEEE Transactions on Automatic Control, 61(2):526–531, 2015.
Choi et al. (2020) Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. Stargan v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8188–8197, 2020.
De Bortoli et al. (2023) Valentin De Bortoli, Guan-Horng Liu, Tianrong Chen, Evangelos A Theodorou, and Weilie Nie. Augmented bridge matching. arXiv preprint arXiv:2311.06978, 2023.
Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee, 2009.
Dhariwal & Nichol (2021) Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. arXiv preprint arXiv:2105.05233, 2021.
Dockhorn et al. (2021) Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score-based generative modeling with critically-damped langevin diffusion. arXiv preprint arXiv:2112.07068, 2021.
Haussmann & Pardoux (1986) Ulrich G Haussmann and Etienne Pardoux. Time reversal of diffusions. The Annals of Probability, pp. 1188–1205, 1986.
Heng et al. (2021) Jeremy Heng, Valentin De Bortoli, Arnaud Doucet, and James Thornton. Simulating diffusion bridges with score matching. arXiv preprint arXiv:2111.07243, 2021.
Heusel et al. (2017) Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020.
Hochbruck & Ostermann (2010) Marlis Hochbruck and Alexander Ostermann. Exponential integrators. Acta Numerica, 19:209–286, 2010.
Inc. (2022) The MathWorks Inc. Matlab version: 9.13.0 (r2022b), 2022. URL https://www.mathworks.com.
Kappen (2008) HJ Kappen. Stochastic optimal control theory. ICML, Helsinki, Radbound University, Nijmegen, Netherlands, 2008.
Karras et al. (2022) Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
Kirk (2004) Donald E Kirk. Optimal control theory: an introduction. Courier Corporation, 2004.
Kloeden et al. (1992) Peter E Kloeden and Eckhard Platen. Stochastic differential equations. Springer, 1992.
Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Léonard et al. (2014) Christian Léonard, Sylvie Rœlly, and Jean-Claude Zambrini. Reciprocal processes. a measure-theoretical point of view. 2014.
Lipman et al. (2022) Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
Liu et al. (2023) Guan-Horng Liu, Arash Vahdat, De-An Huang, Evangelos A Theodorou, Weili Nie, and Anima Anandkumar. I2sb: Image-to-image Schrödinger bridge. arXiv preprint arXiv:2302.05872, 2023.
Liu et al. (2022) Xingchao Liu, Lemeng Wu, Mao Ye, and Qiang Liu. Let us build bridges: Understanding and extending diffusion generative models. arXiv preprint arXiv:2208.14699, 2022.
Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
O’Connell (2003) Neil O’Connell. Conditioned random walks and the rsk correspondence. Journal of Physics A: Mathematical and General, 36(12):3049, 2003.
Øksendal (2003) Bernt Øksendal. Stochastic differential equations. In Stochastic differential equations, pp. 65–84. Springer, 2003.
Pandey et al. (2023) Kushagra Pandey, Maja Rudolph, and Stephan Mandt. Efficient integrators for diffusion generative models. arXiv preprint arXiv:2310.07894, 2023.
Peluchetti (2021) Stefano Peluchetti. Non-denoising forward-time diffusions. 2021.
Peluchetti (2023) Stefano Peluchetti. Diffusion bridge mixture transports, Schrödinger bridge problems and generative modeling. arXiv preprint arXiv:2304.00917, 2023.
Pooladian et al. (2023) Aram-Alexandre Pooladian, Heli Ben-Hamu, Carles Domingo-Enrich, Brandon Amos, Yaron Lipman, and Ricky Chen. Multisample flow matching: Straightening flows with minibatch couplings. arXiv preprint arXiv:2304.14772, 2023.
Revuz & Yor (2013) Daniel Revuz and Marc Yor. Continuous martingales and Brownian motion, volume 293. Springer Science & Business Media, 2013.
Särkkä & Solin (2019) Simo Särkkä and Arno Solin. Applied stochastic differential equations, volume 10. Cambridge University Press, 2019.
Shi et al. (2022) Yuyang Shi, Valentin De Bortoli, George Deligiannidis, and Arnaud Doucet. Conditional simulation using diffusion schrödinger bridges. In Uncertainty in Artificial Intelligence, pp. 1792–1802. PMLR, 2022.
Shi et al. (2023) Yuyang Shi, Valentin De Bortoli, Andrew Campbell, and Arnaud Doucet. Diffusion Schrödinger bridge matching. arXiv preprint arXiv:2303.16852, 2023.
Somnath et al. (2023) Vignesh Ram Somnath, Matteo Pariset, Ya-** Hsieh, Maria Rodriguez Martinez, Andreas Krause, and Charlotte Bunne. Aligned diffusion Schrödinger bridges. arXiv preprint arXiv:2302.11419, 2023.
Song et al. (2020a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
Song et al. (2020b) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
Song et al. (2021) Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. arXiv e-prints, pp. arXiv–2101, 2021.
Stengel (1994) Robert F Stengel. Optimal control and estimation. Courier Corporation, 1994.
Tong et al. (2023a) Alexander Tong, Nikolay Malkin, Kilian Fatras, Lazar Atanackovic, Yanlei Zhang, Guillaume Huguet, Guy Wolf, and Yoshua Bengio. Simulation-free Schrödinger bridges via score and flow matching. arXiv preprint arXiv:2307.03672, 2023a.
Tong et al. (2023b) Alexander Tong, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Kilian Fatras, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. In ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems, 2023b.
Yong & Zhou (1999) Jiongmin Yong and Xun Yu Zhou. Stochastic controls: Hamiltonian systems and HJB equations, volume 43. Springer Science & Business Media, 1999.
Zhang & Chen (2022) Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902, 2022.
Zhang et al. (2022) Qinsheng Zhang, Molei Tao, and Yongxin Chen. gddim: Generalized denoising diffusion implicit models. arXiv preprint arXiv:2206.05564, 2022.
Zhang et al. (2023) Qinsheng Zhang, Jiaming Song, and Yongxin Chen. Improved order analysis and design of exponential integrator for diffusion models sampling. arXiv preprint arXiv:2308.02157, 2023.

Appendix A Supplementary Summary

We state the assumptions in Appendix.B. We provide the technical details of Section.3 in Appendix.D. The details of the experiments can be found in Appendix.E. Visualizations of generated samples can be found in Appendix.H.

Appendix B Assumptions

We will use the following assumptions to construct the proposed method. These assumptions are adopted from the stochastic analysis for SGM (Song et al., 2021; Yong & Zhou, 1999; Anderson, 1982):

(i) $p_{0}$ and $p_{1}$ have finite second-order moments.
(ii) $g_{t}$ is a continuous function, and $|g(t)|^{2}>0$ is uniformly lower-bounded w.r.t. $t$.
(iii) $\forall t\in[0,1]$, $\nabla_{\mathbf{v}}\log p_{t}({\mathbf{m}}_{t},t)$ is Lipschitz and has at most linear growth w.r.t. ${\mathbf{x}}$ and ${\mathbf{v}}$.

Assumptions (i) and (ii) are standard conditions in stochastic analysis that ensure the existence and uniqueness of solutions to the SDEs; hence they also appear in SGM analysis (Song et al., 2021).

Appendix C Stochastic Optimal Control (SOC) in the Wild

In this section, we provide a gentle introduction to Stochastic Optimal Control (SOC). Our work relies heavily on the prior work of Chen & Georgiou (2015), in which some technical details are omitted. Here we first clarify some core derivations that may help a broader audience understand Chen & Georgiou (2015) and our work.

C.1 Linear Quadratic Stochastic Optimal Control

SOC has wide applications in finance, robotics, and manufacturing. Here we focus on Linear Quadratic SOC, usually referred to as the Linear Quadratic Regulator because the dynamics are linear and the objective function is quadratic (Bryson, 1975; Stengel, 1994). The problem is stated as:

$$\begin{aligned}&\min_{{\mathbf{u}}_{t}}\int_{0}^{1}\frac{1}{2}\lVert{\mathbf{u}}_{t}\rVert_{2}^{2}{\textnormal{d}}t+{\mathbf{x}}_{1}^{\mathsf{T}}R{\mathbf{x}}_{1}\\ &s.t.\quad{\textnormal{d}}{\mathbf{x}}_{t}=[A(t){\mathbf{x}}_{t}+g_{t}{\mathbf{u}}_{t}]{\textnormal{d}}t+g_{t}{\textnormal{d}}w_{t},\quad{\mathbf{x}}_{0}=x_{0}.\end{aligned}\tag{13}$$

In this formulation, ${\mathbf{x}}_{t}$ denotes the state and ${\mathbf{u}}_{t}$ is the control variable. Conceptually, the SOC problem aims to design the controller ${\mathbf{u}}_{t}$ that drives the system from the point $x_{0}$ to $x_{1}\equiv 0$ with minimum effort. For a first-order system, the control is the optimal vector field ${\mathbf{v}}_{t}^{*}$; for a second-order system, it is the optimal acceleration ${\mathbf{a}}_{t}^{*}$. The stochasticity introduced by the Wiener process ${\textnormal{d}}w_{t}$ prevents the system from converging precisely to the Dirac mass at $x_{1}$. The terminal cost ${\mathbf{x}}_{1}^{\mathsf{T}}R{\mathbf{x}}_{1}$ is imposed to balance the objective of converging to $x_{1}$ against minimizing the overall control effort $\int\lVert{\mathbf{u}}_{t}\rVert_{2}^{2}{\textnormal{d}}t$.

One special case is $R\rightarrow\infty$. Intuitively, it means that the controlled dynamics should converge precisely to $x_{1}$. However, the stochastic trajectory connecting $x_{0}$ and $x_{1}$ is not unique in this case. Under this constraint (the process is pinned at $x_{0}$ and $x_{1}$ at the two boundaries), the SOC problem selects the solution with minimum control effort ${\mathbf{u}}_{t}$, which acts as a regularization of the trajectories; with this regularization in place, the optimal stochastic trajectory is unique. Such pinned-down SDEs are also connected to the well-known Doob $h$-transform. Readers unfamiliar with these topics may consult Heng et al. (2021) and O’Connell (2003).

The classical procedure to solve the SOC problem includes:

1. Write down the Hamilton–Jacobi–Bellman equation (HJB PDE), which explicitly represents the propagation of the value function over time.
2. Construct the Riccati/Lyapunov equation.
3. Solve the Riccati/Lyapunov equation and obtain the optimal control.

C.2 Value Function, Hamilton–Jacobi–Bellman Equation and Riccati Equation

We adopt the classical SOC notation for the value function. Specifically, a subscript on the value function $V$ denotes a partial derivative: $V_{t}$, $V_{x}$ and $V_{xx}$ denote the first-order derivatives of $V$ w.r.t. time $t$ and state ${\mathbf{x}}$, and the second-order derivative of $V$ w.r.t. ${\mathbf{x}}$, respectively. We first define the value function as:

$$V({\mathbf{x}}_{t},t)=\inf_{{\mathbf{u}}}\mathbb{E}\left[\int_{t}^{1}\frac{1}{2}\lVert{\mathbf{u}}_{\tau}\rVert_{2}^{2}{\textnormal{d}}\tau+{\mathbf{x}}_{1}^{\mathsf{T}}R{\mathbf{x}}_{1}\right]$$

and the dynamics are

$${\textnormal{d}}{\mathbf{x}}_{t}=(A{\mathbf{x}}_{t}+g_{t}{\mathbf{u}}_{t}){\textnormal{d}}t+g_{t}{\textnormal{d}}{\mathbf{w}}_{t}$$

Applying Bellman’s principle to the value function, one gets:

$$\begin{aligned}V(t,{\mathbf{x}}_{t})&=\inf_{{\mathbf{u}}}\mathbb{E}\left[V(t+{\textnormal{d}}t,{\mathbf{x}}_{t+{\textnormal{d}}t})+\int_{t}^{t+{\textnormal{d}}t}\frac{1}{2}\lVert{\mathbf{u}}_{\tau}\rVert_{2}^{2}{\textnormal{d}}\tau\right]\\ &=\inf_{{\mathbf{u}}}\mathbb{E}\left[\frac{1}{2}\lVert{\mathbf{u}}_{t}\rVert_{2}^{2}{\textnormal{d}}t+V(t,{\mathbf{x}}_{t})+V_{t}(t,{\mathbf{x}}_{t}){\textnormal{d}}t+V_{x}(t,{\mathbf{x}}_{t})^{\mathsf{T}}{\textnormal{d}}{\mathbf{x}}_{t}+\frac{1}{2}tr\left[V_{xx}gg^{\mathsf{T}}\right]{\textnormal{d}}t\right]\\ &=\inf_{{\mathbf{u}}}\mathbb{E}\left[\frac{1}{2}\lVert{\mathbf{u}}_{t}\rVert_{2}^{2}{\textnormal{d}}t+V(t,{\mathbf{x}}_{t})+V_{t}(t,{\mathbf{x}}_{t}){\textnormal{d}}t+V_{x}(t,{\mathbf{x}}_{t})^{\mathsf{T}}\left((A{\mathbf{x}}_{t}+g_{t}{\mathbf{u}}_{t}){\textnormal{d}}t+g_{t}{\textnormal{d}}{\mathbf{w}}_{t}\right)+\frac{1}{2}tr\left[V_{xx}gg^{\mathsf{T}}\right]{\textnormal{d}}t\right]\\ &=\inf_{{\mathbf{u}}}\left[\frac{1}{2}\lVert{\mathbf{u}}_{t}\rVert_{2}^{2}{\textnormal{d}}t+V(t,{\mathbf{x}}_{t})+V_{t}(t,{\mathbf{x}}_{t}){\textnormal{d}}t+V_{x}(t,{\mathbf{x}}_{t})^{\mathsf{T}}(A{\mathbf{x}}_{t}+g_{t}{\mathbf{u}}_{t}){\textnormal{d}}t+\frac{1}{2}tr\left[V_{xx}gg^{\mathsf{T}}\right]{\textnormal{d}}t\right]\end{aligned}$$

Here the second line is the second-order expansion of $V$, the third line plugs in the dynamics ${\textnormal{d}}{\mathbf{x}}_{t}$, and the last line uses $\mathbb{E}[{\textnormal{d}}{\mathbf{w}}_{t}]=0$.

One obtains:

$$V_{t}+\inf_{{\mathbf{u}}}\left[\frac{1}{2}\lVert{\mathbf{u}}_{t}\rVert_{2}^{2}+V_{x}^{\mathsf{T}}(A{\mathbf{x}}_{t}+g_{t}{\mathbf{u}}_{t})\right]+\frac{1}{2}tr\left[V_{xx}gg^{\mathsf{T}}\right]=0$$

The optimal control is obtained by minimizing the bracketed term over ${\mathbf{u}}_{t}$:

$${\mathbf{u}}_{t}^{*}=-g_{t}V_{x}$$

Plugging it back, one obtains the HJB PDE:

$$V_{t}-\frac{1}{2}V_{x}^{\mathsf{T}}gg^{\mathsf{T}}V_{x}+V_{x}^{\mathsf{T}}A{\mathbf{x}}_{t}+\frac{1}{2}tr\left[V_{xx}gg^{\mathsf{T}}\right]=0$$

We assume that there exists a matrix $Q(t)$ such that $V({\mathbf{x}},t)\equiv\frac{1}{2}{\mathbf{x}}^{\mathsf{T}}Q{\mathbf{x}}+\Xi(t)$. Substituting this ansatz into the HJB PDE, one can write:

$$-\dot{\Xi}-\frac{1}{2}{\mathbf{x}}^{\mathsf{T}}\dot{Q}{\mathbf{x}}=-\frac{1}{2}{\mathbf{x}}^{\mathsf{T}}Qgg^{\mathsf{T}}Q{\mathbf{x}}+{\mathbf{x}}^{\mathsf{T}}A^{\mathsf{T}}Q{\mathbf{x}}+\frac{1}{2}tr\left[Qgg^{\mathsf{T}}\right]\tag{14}$$

with boundary condition:

$$\Xi(1)=0,\quad Q(1)=R\tag{15}$$

Since ${\mathbf{x}}^{\mathsf{T}}A^{\mathsf{T}}Q{\mathbf{x}}={\mathbf{x}}^{\mathsf{T}}QA{\mathbf{x}}$, matching the quadratic terms in ${\mathbf{x}}$ yields the Riccati equation:

$$-\dot{Q}=A^{\mathsf{T}}Q+QA-Qgg^{\mathsf{T}}Q\tag{16}$$

Recall that the optimal control is ${\mathbf{u}}_{t}^{*}=-g_{t}V_{x}$ and $V:=\frac{1}{2}{\mathbf{x}}^{\mathsf{T}}Q{\mathbf{x}}+\Xi(t)$. The optimal control can therefore be expressed through the solution of the Riccati equation: ${\mathbf{u}}_{t}^{*}=-g^{\mathsf{T}}Q(t){\mathbf{x}}_{t}$.
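As a sanity check, the recipe can be worked out by hand in the simplest setting. The following is an illustrative scalar example under the assumptions $A=0$, a constant $g_{t}\equiv g$, and a finite scalar terminal weight $R>0$ (these simplifications are ours and are not used elsewhere in the paper):

```latex
% Scalar Riccati equation (eq. 16 with A = 0): -\dot{Q} = -g^2 Q^2, Q(1) = R.
\begin{aligned}
\frac{{\rm d}}{{\rm d}t}\left(\frac{1}{Q}\right) &= -\frac{\dot{Q}}{Q^{2}} = -g^{2}
\quad\Longrightarrow\quad
\frac{1}{Q(t)} = \frac{1}{R} + g^{2}(1-t),\\
u_{t}^{*} &= -g\,Q(t)\,x_{t} = -\frac{g\,x_{t}}{1/R + g^{2}(1-t)}
\;\xrightarrow{\;R\rightarrow\infty\;}\; -\frac{x_{t}}{g\,(1-t)},\\
g\,u_{t}^{*} &\xrightarrow{\;R\rightarrow\infty\;} -\frac{x_{t}}{1-t}.
\end{aligned}
```

In the limit $R\rightarrow\infty$ the controlled drift $g\,u_{t}^{*}$ is that of a Brownian bridge pinned at $x_{1}=0$, which matches the derivation in Appendix.D.1 for the special case $g=1$.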

C.3 Riccati Equation and Lyapunov Equation

Here we provide the connection between the Riccati equation and the Lyapunov equation in the current setup.

Lemma 6.

Define $P(t):=Q(t)^{-1}$, in which $Q(t)$ is the solution of the Riccati equation (eq.16). Then $P(t)$ solves the Lyapunov equation:

$$\dot{P}=AP+PA^{\mathsf{T}}-gg^{\mathsf{T}}\tag{17}$$

For notational consistency, we name the elements of the matrix $P$ as

$$P=\begin{bmatrix}P_{00}&P_{01}\\ P_{10}&P_{11}\end{bmatrix}$$

Proof.

Plugging $P(t):=Q(t)^{-1}$ into the Lyapunov equation, one gets:

$$\begin{aligned}\frac{{\textnormal{d}}}{{\textnormal{d}}t}\left(Q^{-1}\right)&=AQ^{-1}+Q^{-1}A^{\mathsf{T}}-gg^{\mathsf{T}}\\ \Leftrightarrow\quad-Q^{-1}\dot{Q}Q^{-1}&=AQ^{-1}+Q^{-1}A^{\mathsf{T}}-gg^{\mathsf{T}}\\ \Leftrightarrow\quad-\dot{Q}&=QA+A^{\mathsf{T}}Q-Qgg^{\mathsf{T}}Q\end{aligned}$$

∎

By Lemma.6, the optimal control can also be expressed through the solution of the Lyapunov equation: ${\mathbf{u}}_{t}^{*}=-g^{\mathsf{T}}P(t)^{-1}{\mathbf{x}}_{t}$, which is indeed the optimal control term used in Chen & Georgiou (2015) after adopting their notation, and is the same as the optimal control term used in Lemma.12 without the base dynamics compensation.
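Lemma.6 can also be checked numerically. The sketch below integrates the Riccati equation (eq.16) and the Lyapunov equation (eq.17) backward from $t=1$ with a hand-written classical RK4 scheme and compares $Q(t)P(t)$ with the identity. The momentum-system matrix $A$, the constant $g=1.5$, the finite terminal weight $R$, and the helper names are illustrative assumptions chosen only for this check.

```python
import numpy as np

A = np.array([[0.0, 1.0], [0.0, 0.0]])   # drift matrix of the momentum system
G = np.diag([0.0, 1.5 ** 2])             # g g^T with an illustrative constant g = 1.5
R = np.diag([2.0, 3.0])                  # illustrative finite terminal weight, Q(1) = R

def riccati_rhs(Q):
    # eq. 16: -dQ/dt = A^T Q + Q A - Q G Q
    return -(A.T @ Q + Q @ A - Q @ G @ Q)

def lyapunov_rhs(P):
    # eq. 17: dP/dt = A P + P A^T - G
    return A @ P + P @ A.T - G

def rk4_backward(rhs, y_terminal, t_end, n_steps):
    """Integrate dy/dt = rhs(y) from t = 1 back to t = t_end with classical RK4."""
    h = (t_end - 1.0) / n_steps          # negative step size (backward in time)
    y = y_terminal.copy()
    for _ in range(n_steps):
        k1 = rhs(y)
        k2 = rhs(y + 0.5 * h * k1)
        k3 = rhs(y + 0.5 * h * k2)
        k4 = rhs(y + h * k3)
        y = y + (h / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return y

Q_t = rk4_backward(riccati_rhs, R, t_end=0.5, n_steps=1000)
P_t = rk4_backward(lyapunov_rhs, np.linalg.inv(R), t_end=0.5, n_steps=1000)
residual = np.abs(Q_t @ P_t - np.eye(2)).max()
print("max |Q(t) P(t) - I| =", residual)
```

The Lyapunov equation is linear in $P$, which is why the remaining appendix works with $P$ rather than with the quadratic Riccati equation.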

C.4 SOC Connection with Schrödinger Bridge

The optimal control solution is also the solution of the Schrödinger Bridge when the terminal condition degenerates to a point mass (see the Brownian Bridge example in Appendix.D.1). It is also the solution of the Schrödinger Bridge when the optimal pairing is available (see Proposition 2 of De Bortoli et al. (2023)).

So in our case, we are not solving the momentum Schrödinger Bridge as in Chen et al. (2023) (see also fig.6), even though the problem formulation is similar. Specifically, AGM is a special case of the momentum Schrödinger Bridge in which the boundary conditions degenerate to Dirac distributions.

Appendix D Technical Details in Section.3

D.1 Brownian Bridge as the solution of Stochastic Optimal Control

We adopt the presentation from Kappen (2008). We consider the control problem over the horizon $[t,1]$, started from the state $x_{0}$:

$$\begin{aligned}&\min_{{\mathbf{u}}_{t}}\int_{t}^{1}\frac{1}{2}\lVert{\mathbf{u}}_{t}\rVert_{2}^{2}{\textnormal{d}}t+\frac{{\mathbf{r}}}{2}\lVert{\mathbf{x}}_{1}-x_{1}\rVert_{2}^{2}\\ &\text{s.t.}\quad{\textnormal{d}}{\mathbf{x}}_{t}={\mathbf{u}}_{t}{\textnormal{d}}t,\quad{\mathbf{x}}_{0}=x_{0}\end{aligned}$$

where ${\mathbf{r}}$ is the terminal cost coefficient. Following the Pontryagin Maximum Principle (PMP; Kirk (2004)) recipe, one can construct the Hamiltonian:

$$H(t,{\mathbf{x}},{\mathbf{u}},\gamma)=-\frac{1}{2}\lVert{\mathbf{u}}_{t}\rVert_{2}^{2}+\gamma{\mathbf{u}}_{t}$$

Setting

$$\frac{\partial H}{\partial{\mathbf{u}}_{t}}=0,$$

the optimized Hamiltonian is:

$$H(t,{\mathbf{x}},{\mathbf{u}},\gamma)^{*}=\frac{1}{2}\gamma^{2},\quad\text{where}\quad{\mathbf{u}}_{t}=\gamma$$

Then we solve the Hamiltonian equations of motion:

$$\begin{aligned}\frac{{\textnormal{d}}{\mathbf{x}}_{t}}{{\textnormal{d}}t}&=\frac{\partial H^{*}}{\partial\gamma}=\gamma\\ \frac{{\textnormal{d}}\gamma}{{\textnormal{d}}t}&=\frac{\partial H^{*}}{\partial{\mathbf{x}}}=0\\ \text{where}\quad{\mathbf{x}}_{0}&=x_{0}\quad\text{and}\quad\gamma_{1}=-{\mathbf{r}}\cdot({\mathbf{x}}_{1}-x_{1})\end{aligned}$$

One can notice that the solution for $\gamma_{t}$ is the constant $\gamma_{t}=\gamma=-{\mathbf{r}}\cdot({\mathbf{x}}_{1}-x_{1})$. The state therefore moves with constant velocity $\gamma$ over the remaining time $1-t$, and the terminal state is ${\mathbf{x}}_{1}={\mathbf{x}}_{0}+(1-t)\gamma$. Hence,

$$\begin{aligned}\gamma&=-{\mathbf{r}}({\mathbf{x}}_{1}-x_{1})=-{\mathbf{r}}({\mathbf{x}}_{0}+(1-t)\gamma-x_{1})\\ \rightarrow\quad{\mathbf{u}}^{*}_{t}&:=\gamma=\frac{{\mathbf{r}}(x_{1}-{\mathbf{x}}_{0})}{1+{\mathbf{r}}(1-t)}\end{aligned}$$

When ${\mathbf{r}}\rightarrow+\infty$, we arrive at the optimal control ${\mathbf{u}}_{t}^{*}=\frac{x_{1}-{\mathbf{x}}_{0}}{1-t}$. Due to certainty equivalence, this is also the optimal control law for

$${\textnormal{d}}{\mathbf{x}}_{t}={\mathbf{u}}_{t}{\textnormal{d}}t+{\textnormal{d}}{\mathbf{w}}_{t}$$

Plugging it back into the dynamics, with the starting state ${\mathbf{x}}_{0}$ replaced by the current state ${\mathbf{x}}_{t}$, we obtain the well-known Brownian Bridge:

$${\textnormal{d}}{\mathbf{x}}_{t}=\frac{x_{1}-{\mathbf{x}}_{t}}{1-t}{\textnormal{d}}t+{\textnormal{d}}{\mathbf{w}}_{t}$$

Remark 7.

If there is no stochasticity ${\textnormal{d}}{\mathbf{w}}_{t}$, one gets ${\mathbf{u}}_{t}:=\frac{x_{1}-{\mathbf{x}}_{t}}{1-t}=x_{1}-{\mathbf{x}}_{0}$, which is the vector field constructed by Lipman et al. (2022) during training.
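Both the Brownian Bridge and the noise-free case of Remark 7 are easy to simulate. The following is a minimal Euler–Maruyama sketch for a scalar state; the function name `simulate_bridge`, the step count, and the endpoints are illustrative assumptions.

```python
import numpy as np

def simulate_bridge(x0, x1, n_steps, rng, noise=True):
    """Euler-Maruyama for dx = (x1 - x) / (1 - t) dt + dw on a uniform grid over [0, 1]."""
    dt = 1.0 / n_steps
    x = float(x0)
    for n in range(n_steps):
        t = n * dt                               # current time, always strictly below 1
        drift = (x1 - x) / (1.0 - t)             # optimal control derived above
        dw = rng.standard_normal() * np.sqrt(dt) if noise else 0.0
        x = x + drift * dt + dw
    return x

rng = np.random.default_rng(0)
x_end_sde = simulate_bridge(-1.0, 2.0, 1000, rng)               # stochastic bridge
x_end_ode = simulate_bridge(-1.0, 2.0, 1000, rng, noise=False)  # Remark 7: straight line
print(x_end_sde, x_end_ode)
```

In this discretization the last drift step moves the state onto $x_{1}$, so the stochastic endpoint differs from $x_{1}$ only by the final noise increment of scale $\sqrt{{\rm d}t}$, while the noise-free rollout follows the straight line with constant velocity $x_{1}-{\mathbf{x}}_{0}$.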

D.2 Proof of Proposition.3

Proposition 8.

The solution of the stochastic bridge problem of the linear momentum system (Chen & Georgiou, 2015) is

$${\mathbf{a}}^{*}({\mathbf{m}}_{t},t)=g_{t}^{2}P_{11}\left(\frac{{\mathbf{x}}_{1}-{\mathbf{x}}_{t}}{1-t}-{\mathbf{v}}_{t}\right)\quad\text{where}:\quad P_{11}=\frac{-4}{g_{t}^{2}(t-1)}.\tag{18}$$

Proof.

From Lemma.12, the optimal control for this problem is

$${\mathbf{u}}^{*}_{t}=-{\mathbf{g}}{\mathbf{g}}^{\mathsf{T}}{\mathbf{P}}_{t}^{-1}\left({\mathbf{m}}_{t}-\Phi(t,1){\mathbf{m}}_{1}\right)$$

where the state transition function $\Phi$ can be obtained from Lemma.11, ${\mathbf{P}}_{t}$ is the solution of the Lyapunov equation, and ${\mathbf{P}}_{t}^{-1}$ can be found in Lemma.9.

Writing $P_{ij}$ for the entries of ${\mathbf{P}}_{t}^{-1}$ as given in Lemma.9, we have:

$$\begin{aligned}{\mathbf{u}}_{t}^{*}&=-{\mathbf{g}}{\mathbf{g}}^{\mathsf{T}}{\mathbf{P}}_{t}^{-1}\left({\mathbf{m}}_{t}-\Phi(t,1){\mathbf{m}}_{1}\right)\\ &=-{\mathbf{g}}{\mathbf{g}}^{\mathsf{T}}{\mathbf{P}}_{t}^{-1}{\mathbf{m}}_{t}+{\mathbf{g}}{\mathbf{g}}^{\mathsf{T}}{\mathbf{P}}_{t}^{-1}\Phi(t,1){\mathbf{m}}_{1}\\ &=-\begin{bmatrix}0&0\\ 0&g_{t}^{2}\end{bmatrix}{\mathbf{P}}_{t}^{-1}{\mathbf{m}}_{t}+{\mathbf{g}}{\mathbf{g}}^{\mathsf{T}}{\mathbf{P}}_{t}^{-1}\begin{bmatrix}1&t-1\\ 0&1\end{bmatrix}{\mathbf{m}}_{1}\\ &=-g_{t}^{2}\begin{bmatrix}0&0\\ P_{10}&P_{11}\end{bmatrix}{\mathbf{m}}_{t}+\begin{bmatrix}0&0\\ 0&g_{t}^{2}\end{bmatrix}\begin{bmatrix}P_{00}&P_{01}\\ P_{10}&P_{11}\end{bmatrix}\begin{bmatrix}1&t-1\\ 0&1\end{bmatrix}{\mathbf{m}}_{1}\\ &=-g_{t}^{2}\begin{bmatrix}0&0\\ P_{10}&P_{11}\end{bmatrix}{\mathbf{m}}_{t}+g_{t}^{2}\begin{bmatrix}0&0\\ P_{10}&P_{11}\end{bmatrix}\begin{bmatrix}1&t-1\\ 0&1\end{bmatrix}{\mathbf{m}}_{1}\\ &=-g_{t}^{2}\begin{bmatrix}0&0\\ P_{10}&P_{11}\end{bmatrix}{\mathbf{m}}_{t}+g_{t}^{2}\begin{bmatrix}0&0\\ P_{10}&P_{10}(t-1)+P_{11}\end{bmatrix}{\mathbf{m}}_{1}\\ &=\begin{bmatrix}0\\ g_{t}^{2}P_{10}({\mathbf{x}}_{1}-{\mathbf{x}}_{t})+g_{t}^{2}P_{10}(t-1)\cdot{\mathbf{v}}_{1}+g_{t}^{2}P_{11}({\mathbf{v}}_{1}-{\mathbf{v}}_{t})\end{bmatrix}\\ &=\begin{bmatrix}0\\ g_{t}^{2}P_{11}\left(\frac{{\mathbf{x}}_{1}-{\mathbf{x}}_{t}}{1-t}-{\mathbf{v}}_{t}\right)\end{bmatrix}\end{aligned}$$

where the last line plugs in ${\mathbf{v}}_{1}:=\frac{{\mathbf{x}}_{1}-{\mathbf{x}}_{t}}{1-t}$, for which the two $P_{10}$ terms cancel.

∎
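To make eq.18 concrete, the sketch below evaluates the optimal acceleration and rolls out the noise-free momentum dynamics ${\textnormal{d}}{\mathbf{x}}_{t}={\mathbf{v}}_{t}{\textnormal{d}}t$, ${\textnormal{d}}{\mathbf{v}}_{t}={\mathbf{a}}^{*}{\textnormal{d}}t$ with explicit Euler steps. It uses $g_{t}^{2}P_{11}=4/(1-t)$, which follows from eq.18; the scalar state, the omission of the noise term, and the names `optimal_acceleration` and `rollout` are simplifying assumptions for illustration only.

```python
import numpy as np

def optimal_acceleration(x, v, x1, t):
    """a* = g_t^2 P_11 * ((x1 - x) / (1 - t) - v), with g_t^2 P_11 = 4 / (1 - t) (eq. 18)."""
    tau = 1.0 - t
    return 4.0 / tau * ((x1 - x) / tau - v)

def rollout(x0, v0, x1, n_steps):
    """Explicit Euler on dx = v dt, dv = a* dt over [0, 1] (noise-free illustration)."""
    dt = 1.0 / n_steps
    x, v = float(x0), float(v0)
    for n in range(n_steps):
        t = n * dt
        a = optimal_acceleration(x, v, x1, t)
        x, v = x + v * dt, v + a * dt    # simultaneous update from the old (x, v)
    return x, v

# If the velocity already points straight at x1, no acceleration is needed.
a_aligned = optimal_acceleration(x=0.0, v=2.0, x1=2.0, t=0.0)

# A misaligned initial velocity is steered so that the position still reaches x1.
x_end, _ = rollout(x0=-1.0, v0=0.5, x1=2.0, n_steps=1000)
print(a_aligned, x_end)
```

The acceleration vanishes exactly when ${\mathbf{v}}_{t}$ equals the straight-line velocity $({\mathbf{x}}_{1}-{\mathbf{x}}_{t})/(1-t)$, which is the sense in which the controlled trajectories are pushed toward straight paths.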

Lemma 9.

The Lyapunov equation corresponding to the optimization problem showed in Lemma.12:

	$\displaystyle{\mathbf{u}}^{*}_{t}$	$\displaystyle\in\operatorname*{arg\,min}_{{\mathbf{u}}_{t}\in\mathcal{U}}% \mathbb{E}\left[\int_{0}^{T}\frac{1}{2}\lVert{\mathbf{u}}_{t}\rVert^{2}\right]% {\textnormal{d}}t+{\mathbf{x}}_{1}^{\mathsf{T}}{\mathbf{R}}{\mathbf{x}}_{1}$
	$\displaystyle s.t\quad$	$\displaystyle{\textnormal{d}}{\mathbf{m}}_{t}=\underbrace{\begin{bmatrix}0&1\\ 0&0\\ \end{bmatrix}}_{A}{\mathbf{m}}_{t}{\textnormal{d}}t+{\mathbf{u}}_{t}{% \textnormal{d}}t+{\mathbf{g}}{\textnormal{d}}{\mathbf{w}}_{t}$
		$\displaystyle{\mathbf{m}}_{0}=m_{0},\quad{\mathbf{m}}_{1}=m_{1}$

is depited as

\displaystyle\dot{{\mathbf{P}}}=A{\mathbf{P}}+{\mathbf{P}}A^{\mathsf{T}}-{\bm{% g}}{\bm{g}}^{T}.

(19)

When ${\bm{g}}=\begin{bmatrix}0\\ g\\ \end{bmatrix}$ , the solution for Lyapunov equation above, with terminal condition

\displaystyle{\mathbf{P}}_{1}={\mathbf{R}}^{-1}=\lim_{{\mathbf{r}}\rightarrow% \inf}\begin{bmatrix}{\mathbf{r}}&0\\ 0&{\mathbf{r}}\\ \end{bmatrix}^{-1}=\begin{bmatrix}0&0\\ 0&0\\ \end{bmatrix}

(20)

However, one does not need the force to converge exactly at ${\mathbf{v}}_{1}$ because we only care about the generated quality of ${\mathbf{x}}_{1}$ . Here we give a general case in which the ${\mathbf{r}}$ keeps a small value $\omega$ for the velocity channel:

\displaystyle{\mathbf{P}}_{1}={\mathbf{R}}^{-1}=\begin{bmatrix}0&0\\ 0&\omega\\ \end{bmatrix}

(21)

Then the solution is given by

\displaystyle{\mathbf{P}}_{t}=\begin{bmatrix}&\omega(t-1)^{2}-\frac{1}{3}g^{2}% (t-1)^{3}&\omega(t-1)-\frac{1}{2}g^{2}(t-1)^{2}\\ &\omega(t-1)-\frac{1}{2}g^{2}(t-1)^{2}&g^{2}(1-t)+\omega\\ \end{bmatrix}

and the inverse of ${\mathbf{P}}_{t}$ is,

\displaystyle{\mathbf{P}}_{t}^{-1}

\displaystyle=\frac{1}{g^{2}(-4\omega+g^{2}(t-1))(t-1)}\begin{bmatrix}&\frac{1% 2(\omega-g^{2}(t-1))}{(t-1)^{2}}&\frac{6(-2\omega+g^{2}(t-1))}{t-1}\\ &\frac{6(-2\omega+g^{2}(-1+t))}{t-1}&12\omega-4g^{2}(t-1)\\ \end{bmatrix}

Thus, the lower-row entries of ${\mathbf{P}}_{t}^{-1}$ , denoted $P_{10}$ and $P_{11}$ , are

	$\displaystyle P_{10}$	$\displaystyle=\frac{-12\omega+6g^{2}(t-1)}{g^{2}[-4\omega+g^{2}(t-1)](t-1)^{2}}=\frac{-12\omega}{g^{2}[-4\omega+g^{2}(t-1)](t-1)^{2}}+\frac{6}{[-4\omega+g^{2}(t-1)](t-1)}$
	$\displaystyle P_{11}$	$\displaystyle=\frac{12\omega-4g^{2}(t-1)}{g^{2}[-4\omega+g^{2}(t-1)](t-1)}=\frac{12\omega}{g^{2}[-4\omega+g^{2}(t-1)](t-1)}+\frac{-4}{[-4\omega+g^{2}(t-1)]}$

Proof.

One can plug the expression for ${\mathbf{P}}_{t}$ into the Lyapunov equation (eq.19) and verify that it is indeed the solution and satisfies the terminal condition.

Remark 10.

Here we provide a general form in which the terminal condition of the Lyapunov equation is not a zero matrix. This means that the velocity is not required to converge to the exact predefined ${\mathbf{v}}_{1}$ . Setting $\omega=0$ recovers the results shown in the paper.

∎
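The verification suggested in the proof can also be sketched numerically. The snippet below is a minimal check, not part of the original derivation: it evaluates the closed-form ${\mathbf{P}}_{t}$ and ${\mathbf{P}}_{t}^{-1}$ given above, compares a finite-difference time derivative against the right-hand side of the Lyapunov equation, and confirms the terminal condition. The function names (`P`, `P_inverse`, `max_residual`) and the step size `h` are illustrative assumptions.

```python
def P(t, g, omega):
    """Closed-form P_t from the text, as ((p00, p01), (p01, p11))."""
    s = t - 1.0
    p00 = omega * s**2 - g**2 * s**3 / 3.0
    p01 = omega * s - g**2 * s**2 / 2.0
    p11 = g**2 * (1.0 - t) + omega
    return ((p00, p01), (p01, p11))


def P_inverse(t, g, omega):
    """Closed-form inverse of P_t from the text."""
    s = t - 1.0
    pref = 1.0 / (g**2 * (-4.0 * omega + g**2 * s) * s)
    a = 12.0 * (omega - g**2 * s) / s**2
    b = 6.0 * (-2.0 * omega + g**2 * s) / s
    d = 12.0 * omega - 4.0 * g**2 * s
    return ((pref * a, pref * b), (pref * b, pref * d))


def lyapunov_rhs(Pm, g):
    """A P + P A^T - g g^T for A = [[0, 1], [0, 0]] and g = [0, g]^T."""
    (_, p01), (_, p11) = Pm
    return ((2.0 * p01, p11), (p11, -g**2))


def max_residual(t, g, omega, h=1e-5):
    """Largest |dP/dt - rhs| entry, with dP/dt from central differences."""
    hi, lo = P(t + h, g, omega), P(t - h, g, omega)
    rhs = lyapunov_rhs(P(t, g, omega), g)
    return max(
        abs((hi[i][j] - lo[i][j]) / (2.0 * h) - rhs[i][j])
        for i in range(2)
        for j in range(2)
    )


def matmul2(X, Y):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )
```

Calling `max_residual(0.3, 1.5, 0.2)` should return a value near floating-point noise, and `matmul2(P(t, g, omega), P_inverse(t, g, omega))` should be close to the identity for any $t<1$ .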

Lemma 11.

The state transition function $\Phi(t,s)$ of the following dynamics,

\displaystyle{\textnormal{d}}{\mathbf{m}}_{t}=\begin{bmatrix}0&1\\ 0&0\\ \end{bmatrix}{\mathbf{m}}_{t}{\textnormal{d}}t

is,

\displaystyle\Phi(t,s)=\begin{bmatrix}1&t-s\\ 0&1\\ \end{bmatrix}

Proof.

One can easily verify that such $\Phi$ satisfies $\partial\Phi/\partial t=\begin{bmatrix}0&1\\ 0&0\\ \end{bmatrix}\Phi$ . ∎
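As a short worked step (added here for illustration, not taken from the original proof), the same result follows from the matrix exponential: the drift matrix is nilpotent, so the series terminates after the linear term.

```latex
A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \qquad A^{2} = 0
\quad\Longrightarrow\quad
\Phi(t,s) = e^{A(t-s)} = I + A\,(t-s)
          = \begin{bmatrix} 1 & t-s \\ 0 & 1 \end{bmatrix}.
```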

Lemma 12 (Chen & Georgiou (2015)).

When $R\rightarrow\infty$ , the optimal control ${\mathbf{u}}^{*}_{t}$ of the following problem,

	$\displaystyle{\mathbf{u}}^{*}_{t}\>=\begin{bmatrix}{\mathbf{0}}\\ {\mathbf{a}}_{t}\end{bmatrix}$	$\displaystyle\in\operatorname*{arg\,min}_{{\mathbf{u}}_{t}\in\mathcal{U}}\int_% {0}^{T}\frac{1}{2}\lVert{\mathbf{u}}_{t}\rVert^{2}{\textnormal{d}}t+{\mathbf{x% }}_{1}^{\mathsf{T}}R{\mathbf{x}}_{1}$
	$\displaystyle s.t\quad$	$\displaystyle{\textnormal{d}}{\mathbf{m}}_{t}=\begin{bmatrix}0&1\\ 0&0\\ \end{bmatrix}{\mathbf{m}}_{t}{\textnormal{d}}t+{\mathbf{u}}_{t}{\textnormal{d}% }t+{\mathbf{g}}_{t}{\textnormal{d}}{\mathbf{w}}_{t}$
		$\displaystyle{\mathbf{m}}_{0}=m_{0}$

is given by

\displaystyle{\mathbf{u}}^{*}_{t}=-{\mathbf{g}}{\mathbf{g}}^{\mathsf{T}}{% \mathbf{P}}_{t}^{-1}\left({\mathbf{m}}_{t}-\Phi(t,1){\mathbf{m}}_{1}\right)

where ${\mathbf{P}}_{t}$ follows the Lyapunov equation (eq.19) with boundary condition ${\mathbf{P}}_{1}=\mathbf{0}$ , and $\Phi(t,s)$ is the transition matrix from time-step $s$ to time-step $t$ of the uncontrolled dynamics.

The resulting controlled process is indeed the stochastic bridge of the following system:

		$\displaystyle{\textnormal{d}}{\mathbf{m}}_{t}=\begin{bmatrix}0&1\\ 0&0\\ \end{bmatrix}{\mathbf{m}}_{t}{\textnormal{d}}t+{\mathbf{u}}_{t}{\textnormal{d}% }t+g{\textnormal{d}}{\mathbf{w}}_{t}$		(22)
		$\displaystyle{\mathbf{m}}_{0}=m_{0},\quad{\mathbf{m}}_{1}=m_{1}$		(23)

Proof.

See page 8 in Chen & Georgiou (2015). ∎

D.3 Mean and Covariance of SDE

Plugging the optimal control into the system gives the closed-loop dynamics:

	$\displaystyle{\textnormal{d}}{\mathbf{m}}_{t}$	$\displaystyle=\begin{bmatrix}{\mathbf{v}}_{t}\\ {\mathbf{F}}_{t}\\ \end{bmatrix}{\textnormal{d}}t+{\mathbf{g}}_{t}{\textnormal{d}}{\mathbf{w}}_{t}$
		$\displaystyle=\begin{bmatrix}{\mathbf{v}}_{t}\\ g_{t}^{2}P_{11}\left(\frac{{\mathbf{x}}_{1}-{\mathbf{x}}_{t}}{1-t}-{\mathbf{v}% }_{t}\right)\\ \end{bmatrix}{\textnormal{d}}t+{\mathbf{g}}_{t}{\textnormal{d}}{\mathbf{w}}_{t}$
		$\displaystyle=\underbrace{\begin{bmatrix}{\mathbf{0}}&\mathbf{1}\\ -\frac{g_{t}^{2}P_{11}}{1-t}&-g_{t}^{2}P_{11}\\ \end{bmatrix}}_{\tilde{{\mathbf{F}}_{t}}}\begin{bmatrix}{\mathbf{x}}_{t}\\ {\mathbf{v}}_{t}\end{bmatrix}{\textnormal{d}}t+\underbrace{\begin{bmatrix}{% \mathbf{0}}\\ \frac{g_{t}^{2}P_{11}}{1-t}{\mathbf{x}}_{1}\end{bmatrix}}_{\tilde{{\mathbf{D}}% }_{t}}{\textnormal{d}}t+{\mathbf{g}}_{t}{\textnormal{d}}{\mathbf{w}}_{t}$

We follow the recipe of Särkkä & Solin (2019). The mean ${\bm{\mu}_{t}}$ and covariance ${\bm{\Sigma}_{t}}$ of the random vector ${\mathbf{m}}_{t}$ obey the following ordinary differential equations (ODEs), respectively:

	$\displaystyle{\textnormal{d}}{\bm{\mu}_{t}}$	$\displaystyle=\tilde{{\mathbf{F}}}_{t}{\bm{\mu}_{t}}{\textnormal{d}}t+\tilde{{% \mathbf{D}}}_{t}{\textnormal{d}}t$
	$\displaystyle{\textnormal{d}}{\bm{\Sigma}_{t}}$	$\displaystyle=\tilde{{\mathbf{F}}}_{t}{\bm{\Sigma}_{t}}{\textnormal{d}}t+\left% [\tilde{{\mathbf{F}}}_{t}{\bm{\Sigma}_{t}}\right]^{\mathsf{T}}{\textnormal{d}}% t+{\mathbf{g}}{\mathbf{g}}^{\mathsf{T}}{\textnormal{d}}t$

One can solve them numerically, since both ODEs are only two-dimensional, or use software such as Inc. (2022) to get analytic solutions. The latter approach gives:

	$\displaystyle{\mu^{x}_{t}}$	$\displaystyle=\frac{1}{3}{\mathbf{x}}_{1}t^{2}(t^{2}-4t+6)$
	$\displaystyle{\mu^{v}_{t}}$	$\displaystyle=\frac{4t{\mathbf{x}}_{1}}{3}(t^{2}-3t+3)$
	$\displaystyle{\Sigma^{xx}_{t}}$	$\displaystyle=-\frac{1}{9}\left\{(-1+t)^{2}\left[-9+2(-1+k)t\left(3+(-3+t)t% \right)\left(3+t\left[3+(-3+t)t\right]\right)\right]\right\}$
	$\displaystyle{\Sigma^{xv}_{t}}$	$\displaystyle=\frac{1}{9}\left\{(-1+t)\left[t\left(3+(-3+t)t\right)\left(9+8t% \left(3+(-3+t)t\right)\right)+k\left(9-t\left(3+(-3+t)t\right)\left(9+8t\left(% 3+(-3+t)t\right)\right)\right)\right]\right\}$
	$\displaystyle{\Sigma^{vv}_{t}}$	$\displaystyle=1-\frac{8}{9}(-1+k)t\left[3+(-3+t)t\right]\left\{-3+4t\left(3+(-% 3+t)t\right)\right\}$

Remark 13.

The expressions above are rather complicated. Hence, we provide Python functions in Appendix.E.1, with general initial covariance and diffusion coefficient, for easy copy-paste. The equations above are the ones used throughout this paper; feel free to experiment with other hyperparameters.
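The numerical route mentioned above can be sketched for the mean. The snippet below is an illustrative check under stated assumptions: it takes $\omega=0$ , so that $g_{t}^{2}P_{11}=4/(1-t)$ , starts from ${\bm{\mu}}_{0}={\mathbf{0}}$ , treats ${\mathbf{x}}_{1}$ as a scalar, and integrates the mean ODE with a classical RK4 scheme. The function names and the step count are hypothetical choices.

```python
def mean_rhs(t, mu_x, mu_v, x1):
    """Right-hand side of the mean ODE d(mu) = F~ mu dt + D~ dt."""
    c = 4.0 / (1.0 - t)  # g_t^2 P_11 when omega = 0 (assumption)
    return mu_v, c * ((x1 - mu_x) / (1.0 - t) - mu_v)


def integrate_mean(x1, t_end, steps=2000):
    """Classical RK4 from mu_0 = (0, 0) at t = 0 up to t_end < 1."""
    h = t_end / steps
    t, x, v = 0.0, 0.0, 0.0
    for _ in range(steps):
        k1x, k1v = mean_rhs(t, x, v, x1)
        k2x, k2v = mean_rhs(t + h / 2, x + h / 2 * k1x, v + h / 2 * k1v, x1)
        k3x, k3v = mean_rhs(t + h / 2, x + h / 2 * k2x, v + h / 2 * k2v, x1)
        k4x, k4v = mean_rhs(t + h, x + h * k3x, v + h * k3v, x1)
        x += h / 6 * (k1x + 2 * k2x + 2 * k3x + k4x)
        v += h / 6 * (k1v + 2 * k2v + 2 * k3v + k4v)
        t += h
    return x, v


def closed_form_mean(t, x1):
    """Analytic mean quoted in the text."""
    mu_x = x1 * t**2 * (t**2 - 4 * t + 6) / 3
    mu_v = 4 * t * x1 * (t**2 - 3 * t + 3) / 3
    return mu_x, mu_v
```

For example, `integrate_mean(2.0, 0.8)` and `closed_form_mean(0.8, 2.0)` should agree closely. The integration stops before $t=1$ because the drift coefficients are singular there.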

D.4 Derivation from SDE to ODE for phase dynamics

One can represent the dynamics in the form of,

\displaystyle\begin{bmatrix}{\textnormal{d}}{\mathbf{x}}_{t}\\ {\textnormal{d}}{\mathbf{v}}_{t}\end{bmatrix}=\begin{bmatrix}{\mathbf{v}}_{t}% \\ {\mathbf{F}}_{t}\end{bmatrix}{\textnormal{d}}t+\begin{bmatrix}{\mathbf{0}}&{% \mathbf{0}}\\ {\mathbf{0}}&g_{t}\end{bmatrix}{\textnormal{d}}{\mathbf{w}}_{t}\quad\text{s.t}% \quad{\mathbf{m}}_{0}:=\begin{bmatrix}{\mathbf{x}}_{0}\\ {\mathbf{v}}_{0}\end{bmatrix}\sim\mathcal{N}({\bm{\mu}}_{0},{\bm{\Sigma}}_{0})

(24)

\displaystyle{\textnormal{d}}{\mathbf{m}}_{t}=f({\mathbf{m}}_{t}){\textnormal{% d}}t+{\mathbf{g}}_{t}{\textnormal{d}}{\mathbf{w}}_{t}

Its corresponding Fokker-Planck partial differential equation (Øksendal, 2003) reads,

\displaystyle\frac{\partial p_{t}}{\partial t}=-\sum_{d}\frac{\partial}{% \partial{\mathbf{m}}_{i}}[f_{i}({\mathbf{m}},t)p_{t}({\mathbf{m}}_{t})]+\frac{% 1}{2}\sum_{d}\frac{\partial^{2}}{\partial{\mathbf{m}}_{i}{\mathbf{m}}_{j}}% \left[\sum_{d}{\mathbf{g}}_{t}{\mathbf{g}}_{t}^{\mathsf{T}}p_{t}({\mathbf{m}}_% {t})\right]

(25)

According to eq.(37) in Song et al. (2020b), one can rewrite this PDE as,

$\displaystyle\frac{\partial p_{t}}{\partial t}$	$\displaystyle=-\sum_{d}\frac{\partial}{\partial{\mathbf{m}}_{i}}\left\{f_{i}({% \mathbf{m}}_{t},t)p_{t}({\mathbf{m}}_{t})-\frac{1}{2}\left[p({\mathbf{m}}_{t})% \nabla_{{\mathbf{m}}}\cdot({\mathbf{g}}_{t}{\mathbf{g}}_{t}^{\mathsf{T}})+p({% \mathbf{m}}_{t}){\mathbf{g}}_{t}{\mathbf{g}}_{t}^{\mathsf{T}}\nabla_{{\mathbf{% m}}}\log p({\mathbf{m}}_{t})\right]\right\}$	(26)
	$\displaystyle\text{due to the fact }{\mathbf{g}}_{t}\equiv\begin{bmatrix}{% \mathbf{0}}&{\mathbf{0}}\\ {\mathbf{0}}&g_{t}\end{bmatrix}$	(27)
	$\displaystyle=-\sum_{d}\frac{\partial}{\partial{\mathbf{m}}_{i}}\left\{f_{i}({% \mathbf{m}}_{t},t)p_{t}({\mathbf{m}}_{t})-\frac{1}{2}p({\mathbf{m}}_{t})\left[% g_{t}^{2}\nabla_{{\mathbf{v}}}\log p({\mathbf{m}}_{t})\right]\right\}$	(28)

Then one can get the equivalent ODE:

\displaystyle{\textnormal{d}}{\mathbf{m}}_{t}=\left[f({\mathbf{m}}_{t},t)-% \frac{1}{2}g_{t}^{2}\nabla_{{\mathbf{v}}}\log p({\mathbf{m}},t)\right]{% \textnormal{d}}t

(29)

D.5 Decomposition of Covariance Matrix and representation of score

Here we follow the procedure in Dockhorn et al. (2021). Given the covariance matrix ${\bm{\Sigma}_{t}}$ , the decomposition of the positive definite symmetric matrix is,

\displaystyle{\bm{\Sigma}_{t}}={\mathbf{L}}_{t}{\mathbf{L}}_{t}^{\mathsf{T}}

(30)

Where,

\displaystyle{\mathbf{L}}_{t}=\begin{bmatrix}{L^{xx}_{t}}&0\\ {L^{xv}_{t}}&{L^{vv}_{t}}\end{bmatrix}=\begin{bmatrix}\sqrt{\Sigma_{t}^{xx}}&0\\ \frac{\Sigma_{t}^{xv}}{\sqrt{\Sigma_{t}^{xx}}}&\sqrt{\frac{\Sigma_{t}^{xx}\Sigma_{t}^{vv}-(\Sigma_{t}^{xv})^{2}}{\Sigma_{t}^{xx}}}\end{bmatrix}

(31)

Borrowing results from Dockhorn et al. (2021), the score function reads,

	$\displaystyle\nabla_{{\mathbf{m}}_{t}}\log p({\mathbf{m}}_{t}|{\mathbf{m}}_{1})$	$\displaystyle=-\nabla_{{\mathbf{m}}_{t}}\frac{1}{2}({\mathbf{m}}_{t}-{\bm{\mu}_{t}})^{\mathsf{T}}{\bm{\Sigma}_{t}}^{-1}({\mathbf{m}}_{t}-{\bm{\mu}_{t}})$
		$\displaystyle=-{\bm{\Sigma}_{t}}^{-1}({\mathbf{m}}_{t}-{\bm{\mu}_{t}})$
		(using the Cholesky decomposition of ${\bm{\Sigma}_{t}}$ )
		$\displaystyle=-{\mathbf{L}}_{t}^{-T}{\mathbf{L}}_{t}^{-1}({\mathbf{m}}_{t}-{\bm{\mu}_{t}})$
		$\displaystyle=-{\mathbf{L}}_{t}^{-T}{\bm{\epsilon}}$

The form of ${\mathbf{L}}$ reads,

\displaystyle{\mathbf{L}}_{t}=\begin{bmatrix}\sqrt{\Sigma_{t}^{xx}}&0\\ \frac{\Sigma_{t}^{xv}}{\sqrt{\Sigma_{t}^{xx}}}&\sqrt{\frac{\Sigma_{t}^{xx}\Sigma_{t}^{vv}-(\Sigma_{t}^{xv})^{2}}{\Sigma_{t}^{xx}}}\end{bmatrix}

and the transpose inverse of ${\mathbf{L}}$ reads,

\displaystyle{\mathbf{L}}_{t}^{-T}=\begin{bmatrix}\frac{1}{\sqrt{\Sigma_{t}^{xx}+\epsilon_{xx}}}&\frac{-\Sigma_{t}^{xv}}{\sqrt{\Sigma_{t}^{xx}}\sqrt{\Sigma_{t}^{xx}\Sigma_{t}^{vv}-(\Sigma_{t}^{xv})^{2}}}\\ 0&\frac{\sqrt{\Sigma_{t}^{xx}}}{\sqrt{\Sigma_{t}^{xx}\Sigma_{t}^{vv}-(\Sigma_{t}^{xv})^{2}}}\end{bmatrix}

Hence, the score function reads,

\displaystyle\nabla_{{\mathbf{v}}}\log p({\mathbf{m}}_{t}|{\mathbf{m}}_{1})

\displaystyle=-\underbrace{\frac{\sqrt{\Sigma_{t}^{xx}}}{\sqrt{(\Sigma_{t}^{xx}+\epsilon_{xx})(\Sigma_{t}^{vv}+\epsilon_{vv})-(\Sigma_{t}^{xv})^{2}}}}_{\ell_{t}}{\bm{\epsilon}_{1}}
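The relation between the Cholesky factor and the velocity score can be checked on a scalar-per-channel example. The sketch below is illustrative only: it assumes a lower-triangular factor with ${\mathbf{m}}_{t}={\bm{\mu}_{t}}+{\mathbf{L}}_{t}{\bm{\epsilon}}$ , drops the stabilizing constants $\epsilon_{xx}$ and $\epsilon_{vv}$ , and uses hypothetical function names.

```python
import math


def cholesky_2x2(sxx, sxv, svv):
    """Lower-triangular factor entries (Lxx, Lxv, Lvv) of a 2x2 covariance."""
    lxx = math.sqrt(sxx)
    lxv = sxv / lxx
    lvv = math.sqrt((sxx * svv - sxv**2) / sxx)
    return lxx, lxv, lvv


def gaussian_score(m, mu, sxx, sxv, svv):
    """-Sigma^{-1} (m - mu), using the explicit inverse of a 2x2 matrix."""
    det = sxx * svv - sxv**2
    dx, dv = m[0] - mu[0], m[1] - mu[1]
    score_x = -(svv * dx - sxv * dv) / det
    score_v = -(-sxv * dx + sxx * dv) / det
    return score_x, score_v


def score_v_from_noise(eps1, sxx, sxv, svv):
    """Velocity score written as -ell_t * eps1 (stabilizers omitted)."""
    ell = math.sqrt(sxx / (sxx * svv - sxv**2))
    return -ell * eps1
```

Sampling ${\mathbf{m}}_{t}$ from noise $({\bm{\epsilon}_{0}},{\bm{\epsilon}_{1}})$ and evaluating both functions should give the same velocity component, which is why only ${\bm{\epsilon}_{1}}$ appears in the final expression.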

D.6 Representation of acceleration ${\mathbf{a}}_{t}$

As shown in Proposition.3, the optimal control can be represented as,

	$\displaystyle{\mathbf{a}}_{t}^{*}$	$\displaystyle=g_{t}^{2}P_{11}\left(\frac{{\mathbf{x}}_{1}-{\mathbf{x}}_{t}}{1-% t}-{\mathbf{v}}_{t}\right)$
		$\displaystyle=g_{t}^{2}P_{11}\frac{{\mathbf{x}}_{1}}{1-t}-g_{t}^{2}P_{11}\left% (\frac{{\mathbf{x}}_{t}}{1-t}+{\mathbf{v}}_{t}\right)$
		$\displaystyle=g_{t}^{2}P_{11}\frac{{\mathbf{x}}_{1}}{1-t}-g_{t}^{2}P_{11}\left% (\frac{{\mu^{x}_{t}}+{L^{xx}_{t}}{\bm{\epsilon}_{0}}}{1-t}+({\mu^{v}_{t}}+{L^{% xv}_{t}}{\bm{\epsilon}_{0}}+{L^{vv}_{t}}{\bm{\epsilon}_{1}})\right)$
		$\displaystyle=g_{t}^{2}P_{11}\left[\left(\frac{{\mathbf{x}}_{1}-{\mu^{x}_{t}}}% {1-t}-{\mu^{v}_{t}}\right)-\left(\frac{{L^{xx}_{t}}}{1-t}{\bm{\epsilon}_{0}}+{% L^{xv}_{t}}{\bm{\epsilon}_{0}}+{L^{vv}_{t}}{\bm{\epsilon}_{1}}\right)\right]$
		$\displaystyle\text{Solving the mean ODE in Appendix D.3 gives: }{\mu^{x}_{t}}=\frac{1}{3}{\mathbf{x}}_{1}t^{2}(t^{2}-4t+6),\quad{\mu^{v}_{t}}=\frac{4t{\mathbf{x}}_{1}}{3}(t^{2}-3t+3)$
		$\displaystyle\text{Plugging in }{\mu^{x}_{t}},{\mu^{v}_{t}}:$
		$\displaystyle=g_{t}^{2}P_{11}\left[\left(\frac{{\mathbf{x}}_{1}-\frac{1}{3}{% \mathbf{x}}_{1}t^{2}\left(6-4t+t^{2}\right)}{1-t}-\frac{4t{\mathbf{x}}_{1}}{3}% (t^{2}-3t+3)\right)-\left(\frac{{L^{xx}_{t}}}{1-t}{\bm{\epsilon}_{0}}+{L^{xv}_% {t}}{\bm{\epsilon}_{0}}+{L^{vv}_{t}}{\bm{\epsilon}_{1}}\right)\right]$
		$\displaystyle=g_{t}^{2}P_{11}\left[\left(\frac{(-t^{4}+4t^{3}-6t^{2}+3)}{3(1-t% )}-\frac{4t}{3}(t^{2}-3t+3)\right){\mathbf{x}}_{1}-\left(\frac{{L^{xx}_{t}}}{1% -t}{\bm{\epsilon}_{0}}+{L^{xv}_{t}}{\bm{\epsilon}_{0}}+{L^{vv}_{t}}{\bm{% \epsilon}_{1}}\right)\right]$
		$\displaystyle=g_{t}^{2}P_{11}\left[\left(\frac{-(t-1)(t^{3}-3t^{2}+3t+3)}{3(1-% t)}-\frac{4t}{3}(t^{2}-3t+3)\right){\mathbf{x}}_{1}-\left(\frac{{L^{xx}_{t}}}{% 1-t}{\bm{\epsilon}_{0}}+{L^{xv}_{t}}{\bm{\epsilon}_{0}}+{L^{vv}_{t}}{\bm{% \epsilon}_{1}}\right)\right]$
		$\displaystyle=g_{t}^{2}P_{11}\left[\left(\frac{(t^{3}-3t^{2}+3t+3)}{3}-\frac{1% }{3}(4t^{3}-12t^{2}+12t)\right){\mathbf{x}}_{1}-\left(\frac{{L^{xx}_{t}}}{1-t}% {\bm{\epsilon}_{0}}+{L^{xv}_{t}}{\bm{\epsilon}_{0}}+{L^{vv}_{t}}{\bm{\epsilon}% _{1}}\right)\right]$
		$\displaystyle=g_{t}^{2}P_{11}\left[(1-t)^{3}{\mathbf{x}}_{1}-\left(\frac{{L^{% xx}_{t}}}{1-t}{\bm{\epsilon}_{0}}+{L^{xv}_{t}}{\bm{\epsilon}_{0}}+{L^{vv}_{t}}% {\bm{\epsilon}_{1}}\right)\right]$
		$\displaystyle=4(1-t)^{2}{\mathbf{x}}_{1}-g_{t}^{2}P_{11}\left(\frac{{L^{xx}_{t}}}{1-t}{\bm{\epsilon}_{0}}+{L^{xv}_{t}}{\bm{\epsilon}_{0}}+{L^{vv}_{t}}{\bm{\epsilon}_{1}}\right)$
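The chain of equalities above can be sanity-checked numerically. The sketch below is an illustration under stated assumptions: it uses $\omega=0$ , so that $g_{t}^{2}P_{11}=4/(1-t)$ , treats all quantities as scalars, and compares the direct form of ${\mathbf{a}}_{t}^{*}$ with the decomposed form in which the noise terms enter with a minus sign (the form restated in Appendix D.8). The function names are hypothetical.

```python
def closed_form_mean(t, x1):
    """Analytic mean (mu_x, mu_v) of the bridge for mu_0 = 0."""
    mu_x = x1 * t**2 * (t**2 - 4 * t + 6) / 3
    mu_v = 4 * t * x1 * (t**2 - 3 * t + 3) / 3
    return mu_x, mu_v


def accel_direct(x1, x_t, v_t, t):
    """a* = g^2 P11 * ((x1 - x_t) / (1 - t) - v_t) with g^2 P11 = 4 / (1 - t)."""
    c = 4.0 / (1.0 - t)
    return c * ((x1 - x_t) / (1.0 - t) - v_t)


def accel_decomposed(x1, t, lxx, lxv, lvv, eps0, eps1):
    """a* = 4 (1 - t)^2 x1 - g^2 P11 * ((Lxx / (1 - t) + Lxv) eps0 + Lvv eps1)."""
    c = 4.0 / (1.0 - t)
    return 4.0 * (1.0 - t) ** 2 * x1 - c * (
        (lxx / (1.0 - t) + lxv) * eps0 + lvv * eps1
    )


def sample_state(x1, t, lxx, lxv, lvv, eps0, eps1):
    """Reparameterized (x_t, v_t) = mu_t + L_t eps for given noise."""
    mu_x, mu_v = closed_form_mean(t, x1)
    return mu_x + lxx * eps0, mu_v + lxv * eps0 + lvv * eps1
```

The identity depends only on the mean, so it holds for any values of the Cholesky entries; the two functions should return the same number for every state produced by `sample_state`.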

D.7 Loss Reweight

In practice, we use the following loss function

	$\displaystyle\mathcal{L}=\min_{\theta}\mathbb{E}_{t\in[0,1]}\mathbb{E}_{{% \mathbf{x}}_{1}\sim p_{\rm{data}}}\mathbb{E}_{{\mathbf{m}}_{t}\sim p_{t}({% \mathbf{m}}_{t}\|{\mathbf{x}}_{1})}\lambda(t)\left[\lVert{\mathbf{F}}_{t}^{% \theta}({\mathbf{m}}_{t},t;\theta)-{\mathbf{F}}_{t}({\mathbf{m}}_{t},t)\rVert_% {2}^{2}\right]$		(32)
	$\displaystyle\propto\min_{\theta}\mathbb{E}_{t\in[0,1]}\mathbb{E}_{{\mathbf{x}% }_{1}\sim p_{\rm{data}}}\mathbb{E}_{{\mathbf{m}}_{t}\sim p_{t}({\mathbf{m}}_{t% }\|{\mathbf{x}}_{1})}\frac{1}{1-t}\left[\lVert{\mathbf{F}}_{t}^{\theta}({% \mathbf{m}}_{t},t;\theta)-{\mathbf{F}}_{t}({\mathbf{m}}_{t},t)/{\mathbf{z}}_{t% }\rVert_{2}^{2}\right]$		(33)

We admit that this might not be an optimal choice. The motivation is simply to increase the training weight as $t\rightarrow 1$ and to normalize the label with the normalizer ${\mathbf{z}}_{t}$ .

D.8 Normalizer of AGM-SDE and AGM-ODE

The optimal control term can be represented as

\displaystyle{\mathbf{a}}^{*}({\mathbf{m}}_{t},t)=4{\mathbf{x}}_{1}(1-t)^{2}-g_{t}^{2}P_{11}\left[\left(\frac{L_{t}^{xx}}{1-t}+L_{t}^{xv}\right){\bm{\epsilon}_{0}}+L_{t}^{vv}{\bm{\epsilon}_{1}}\right].

We therefore introduce the normalizers as

	$\displaystyle{\mathbf{z}}_{SDE}$	$\displaystyle=\sqrt{(4(1-t)^{2}\cdot\sigma_{data})^{2}+g_{t}^{2}P_{11}\left[\left(\frac{L_{t}^{xx}}{1-t}+L_{t}^{xv}\right)^{2}+(L_{t}^{vv})^{2}\right]}$
	$\displaystyle{\mathbf{z}}_{ODE}$	$\displaystyle=\sqrt{(4(1-t)^{2}\cdot\sigma_{data})^{2}+g_{t}^{2}P_{11}+g_{t}^{2}P_{11}\left(\frac{L_{t}^{xx}}{1-t}+L_{t}^{xv}\right)^{2}+\left[\left(g_{t}^{2}P_{11}L_{t}^{vv}-\frac{1}{2}g_{t}^{2}\ell_{t}\right)^{2}\right]}$

where $\ell_{t}:=\sqrt{\frac{{\Sigma^{xx}_{t}}}{{\Sigma^{xx}_{t}}{\Sigma^{vv}_{t}}-({\Sigma^{xv}_{t}})^{2}}}$ .

D.9 Exponential Integrator Derivation

As suggested by Zhang & Chen (2022), one can write the discretized dynamics as,

	$\displaystyle\begin{bmatrix}{\mathbf{x}}_{t_{i+1}}\\ {\mathbf{v}}_{t_{i+1}}\end{bmatrix}$	$\displaystyle=\Phi(t_{i+1},t_{i})\begin{bmatrix}{\mathbf{x}}_{t_{i}}\\ {\mathbf{v}}_{t_{i}}\end{bmatrix}+\sum_{j=0}^{r}C_{i,j}\begin{bmatrix}{\mathbf{0}}\\ {\mathbf{s}}_{\theta}({\mathbf{m}}_{t_{i-j}},t_{i-j})\end{bmatrix}$		(34)
	$\displaystyle\text{Where}\ C_{i,j}$	$\displaystyle=\int_{t}^{t+\delta_{t}}\Phi(t+\delta_{t},\tau)\begin{bmatrix}{% \mathbf{0}}&{\mathbf{0}}\\ {\mathbf{0}}&{\mathbf{z}}_{\tau}\end{bmatrix}\prod_{k\neq j}\left[\frac{\tau-t% _{i-k}}{t_{i-j}-t_{i-k}}\right]{\textnormal{d}}\tau,\quad\Phi(t,s)=\begin{% bmatrix}1&t-s\\ 0&1\end{bmatrix}$		(34)

After plugging in the transition kernel $\Phi(t,s)$ , one can easily obtain the results shown in (11).
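To make the coefficients concrete, the sketch below evaluates $C_{i,j}$ by simple quadrature. It is an illustration, not the paper's implementation: the midpoint rule, the function names, and the representation of each coefficient by its two non-trivial entries (the position and velocity weights applied to ${\mathbf{s}}_{\theta}$ ) are assumptions made here, and `z` stands for a user-supplied scalar normalizer ${\mathbf{z}}_{\tau}$ .

```python
def lagrange_weight(tau, nodes, j):
    """Lagrange basis polynomial for node j, evaluated at tau."""
    w = 1.0
    for k, t_k in enumerate(nodes):
        if k != j:
            w *= (tau - t_k) / (nodes[j] - t_k)
    return w


def ei_coefficients(t_hist, t_next, z, n_quad=200):
    """Approximate C_{i,j} for j = 0..r.

    t_hist = [t_i, t_{i-1}, ..., t_{i-r}] holds the current and past times.
    Returns a list of (c_x, c_v): the weights multiplying the network output
    in the position and velocity updates, respectively.
    """
    t_i = t_hist[0]
    h = (t_next - t_i) / n_quad
    coeffs = []
    for j in range(len(t_hist)):
        c_x = c_v = 0.0
        for q in range(n_quad):
            tau = t_i + (q + 0.5) * h  # midpoint of the q-th sub-interval
            w = z(tau) * lagrange_weight(tau, t_hist, j) * h
            c_x += (t_next - tau) * w  # upper-right entry of Phi(t_next, tau)
            c_v += w                   # lower-right entry of Phi(t_next, tau)
        coeffs.append((c_x, c_v))
    return coeffs
```

With $r=0$ and a constant normalizer equal to one, the coefficients reduce to $(\delta_{t}^{2}/2,\ \delta_{t})$ , which is the familiar constant-acceleration update for position and velocity.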

Remark 14.

For momentum systems, there are numerous methods for solving the dynamics with high accuracy. However, their practical performance in generative modeling remains untested. We recommend that readers consult classical numerical physics textbooks or recent momentum dynamics solvers (Pandey et al., 2023; Dockhorn et al., 2021).

D.10 Proof of Proposition.5

The estimated data point ${\mathbf{x}}_{1}$ can be represented as

\displaystyle\tilde{{\mathbf{x}}}_{1}^{SDE}=(1-t)\left(\frac{{\mathbf{F}}_{t}^{\theta}}{g_{t}^{2}P_{11}}+{\mathbf{v}}_{t}\right)+{\mathbf{x}}_{t},\ \

\displaystyle\text{or}\quad\tilde{{\mathbf{x}}}^{ODE}_{1}=\frac{{\mathbf{F}}_{t}^{\theta}+g_{t}^{2}P_{11}(\alpha_{t}{\mathbf{x}}_{t}+\beta_{t}{\mathbf{v}}_{t})}{4(t-1)^{2}+g_{t}^{2}P_{11}(\alpha_{t}{\mu^{x}_{t}}+\beta_{t}{\mu^{v}_{t}})}

(35)

for SDE and probabilistic ODE dynamics respectively, and $\beta_{t}={L^{vv}_{t}}+\frac{1}{2P_{11}}$ , $\alpha_{t}=\frac{(\frac{L_{t}^{xx}}{1-t}+L_{t}^{xv})-\beta_{t}L^{xv}_{t}}{L^{xx}_{t}}$ .

Proof.

The representation of ${\mathbf{x}}_{1}$ for the SDE follows directly from the fact that the network is essentially estimating:

	$\displaystyle{\mathbf{F}}_{t}^{\theta}\approx g_{t}^{2}P_{11}\left(\frac{{\mathbf{x}}_{1}-{\mathbf{x}}_{t}}{1-t}-{\mathbf{v}}_{t}\right)$
	$\displaystyle\Leftrightarrow{\mathbf{x}}_{1}\approx(1-t)\left(\frac{{\mathbf{F}}_{t}^{\theta}}{g_{t}^{2}P_{11}}+{\mathbf{v}}_{t}\right)+{\mathbf{x}}_{t}$

The probabilistic ODE case is slightly more complicated. We notice that

	$\displaystyle{\mathbf{m}}_{t}$	$\displaystyle={\bm{\mu}_{t}}+{\mathbf{L}}_{t}{\bm{\epsilon}}$
	$\displaystyle\Leftrightarrow\quad{\mathbf{x}}_{t}={\mu^{x}_{t}}+L_{t}^{xx}{\bm{\epsilon}_{0}},$	$\displaystyle\quad{\mathbf{v}}_{t}={\mu^{v}_{t}}+{L^{xv}_{t}}{\bm{\epsilon}_{0}}+{L^{vv}_{t}}{\bm{\epsilon}_{1}}$

In probabilistic ODE case, the force term can be represented as,

\displaystyle{\mathbf{F}}({\mathbf{m}}_{t},t)=4{\mathbf{x}}_{1}(1-t)^{2}-g_{t}% ^{2}P_{11}\left[\left(\frac{L_{t}^{xx}}{1-t}+L_{t}^{xv}\right){\bm{\epsilon}_{% 0}}+L_{t}^{vv}{\bm{\epsilon}_{1}}\right]-\frac{1}{2}g_{t}^{2}\ell{\bm{\epsilon% }_{1}}

In order to use linear combination of ${\mathbf{x}}_{t}$ and ${\mathbf{v}}_{t}$ to represent ${\mathbf{F}}$ one needs to match the stochastic term in ${\mathbf{F}}_{t}$ by using

	$\displaystyle\alpha_{t}{L^{xx}_{t}}+\beta_{t}{L^{xv}_{t}}$	$\displaystyle=\underbrace{\frac{L_{t}^{xx}}{1-t}+L_{t}^{xv}}_{\hat{\zeta}_{t}},$
	$\displaystyle\beta_{t}{L^{vv}_{t}}$	$\displaystyle=\underbrace{{L^{vv}_{t}}+\frac{1}{2P_{11}}}_{\zeta_{t}}.$

The solution can be obtained by:

	$\displaystyle\beta_{t}$	$\displaystyle=\frac{\zeta_{t}}{{L^{vv}_{t}}}$
	$\displaystyle\alpha_{t}$	$\displaystyle=\frac{\hat{\zeta}_{t}-\beta_{t}{L^{xv}_{t}}}{{L^{xx}_{t}}}$

Substituting these back into ${\mathbf{F}}_{t}$ , one gets:

	$\displaystyle{\mathbf{F}}({\mathbf{m}}_{t},t)$	$\displaystyle=4{\mathbf{x}}_{1}(1-t)^{2}-g_{t}^{2}P_{11}\left[\alpha_{t}({% \mathbf{x}}_{t}-{\mu^{x}_{t}})+\beta_{t}({\mathbf{v}}_{t}-{\mu^{v}_{t}})\right]$
		$\displaystyle=\left[4(1-t)^{2}+g_{t}^{2}P_{11}(\alpha_{t}{\mu^{x}_{t}}+\beta_{% t}{\mu^{v}_{t}})\right]{\mathbf{x}}_{1}-g_{t}^{2}P_{11}\left[\alpha_{t}{% \mathbf{x}}_{t}+\beta_{t}{\mathbf{v}}_{t}\right]$
	$\displaystyle\Leftrightarrow{\mathbf{x}}_{1}$	$\displaystyle=\frac{{\mathbf{F}}_{t}^{\theta}+g_{t}^{2}P_{11}(\alpha_{t}{% \mathbf{x}}_{t}+\beta_{t}{\mathbf{v}}_{t})}{4(t-1)^{2}+g_{t}^{2}P_{11}(\alpha_% {t}{\mu^{x}_{t}}+\beta_{t}{\mu^{v}_{t}})}$

∎
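The SDE case of the early data prediction can be sketched by inverting the relation that the network is trained to estimate, ${\mathbf{F}}_{t}^{\theta}\approx g_{t}^{2}P_{11}\left(\frac{{\mathbf{x}}_{1}-{\mathbf{x}}_{t}}{1-t}-{\mathbf{v}}_{t}\right)$ , for ${\mathbf{x}}_{1}$ . The snippet below is a minimal scalar illustration: the function names are hypothetical, and the value $g_{t}^{2}P_{11}=4/(1-t)$ used in the usage note assumes $\omega=0$ .

```python
def target_force(x1, x_t, v_t, t, g2_p11):
    """Force the network regresses onto: g^2 P11 ((x1 - x_t)/(1 - t) - v_t)."""
    return g2_p11 * ((x1 - x_t) / (1.0 - t) - v_t)


def predict_x1_sde(force, x_t, v_t, t, g2_p11):
    """Solve the force relation for x1 given a (predicted) force."""
    return x_t + (1.0 - t) * (force / g2_p11 + v_t)
```

For instance, with `g2_p11 = 4.0 / (1.0 - t)`, feeding the exact target force into `predict_x1_sde` recovers the data point, which is what allows an estimate of ${\mathbf{x}}_{1}$ at any intermediate time once the network output is available.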

Appendix E Experimental Details

Training: We stick with the hyperparameters introduced in section.4. We use AdamW (Loshchilov & Hutter, 2017) as our optimizer and Exponential Moving Averaging with an exponential decay rate of 0.9999. We use 8 $\times$ Nvidia A100 GPUs for all experiments. For further training setup, please refer to Table.6.

Table 6: Additional experimental details

dataset	Training Iter	Learning rate	Batch Size	network architecture
toy	0.05M	1e-3	1024	ResNet(Dockhorn et al., 2021)
CIFAR-10	0.5M	1e-3	512	NCSN++(Karras et al., 2022)
AFHQv2	0.5M	1e-3	512	NCSN++(Karras et al., 2022)
ImageNet-64	1.6M	2e-4	512	ADM(Dhariwal & Nichol, 2021)

Sampling: For the Exponential Integrator, we choose the multistep order $w=2$ consistently for all experiments. Different from previous work (Dockhorn et al., 2021; Karras et al., 2022; Zhang et al., 2023), we use a quadratic timestep scheme with $\kappa=2$ :

\displaystyle t_{i}=\left(\frac{N-i}{N}t_{0}^{\frac{1}{\kappa}}+\frac{i}{N}t_{N}^{\frac{1}{\kappa}}\right)^{\kappa}

This is the opposite of the classical DM schedule: the time step gets larger as the dynamics is propagated closer to the data. For numerical stability, we use $t_{0}=10^{-5}$ for all experiments. For $NFE=5$ we use $t_{N}=0.5$ , and for $NFE=10$ we use $t_{N}=0.7$ . For the rest of the sampling, we use $t_{N}=0.999$ .
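The schedule above is easy to reproduce. The following is a small sketch with a hypothetical function name; the default arguments mirror the values quoted in the text ( $t_{0}=10^{-5}$ , $t_{N}=0.999$ , $\kappa=2$ ).

```python
def quadratic_timesteps(n, t0=1e-5, t_n=0.999, kappa=2.0):
    """Return [t_0, ..., t_N] following the power-law schedule in the text."""
    a, b = t0 ** (1.0 / kappa), t_n ** (1.0 / kappa)
    return [((n - i) / n * a + i / n * b) ** kappa for i in range(n + 1)]
```

For example, `quadratic_timesteps(10)` starts at $t_{0}$ , ends at $t_{N}$ , and has step sizes that grow monotonically with $i$ , which matches the statement that the discretization becomes coarser near the data.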

Because EDM (Karras et al., 2022) uses a second-order ODE solver, in practice we allow it one extra NFE beyond the value reported in all the tables.

E.1 Code Example for Covariance

We will abuse the notation in this coding section. Here we provide example code for computing the covariance matrix. We consider the general case where ${\bm{\Sigma}}_{0}:=\begin{bmatrix}m&-k\sqrt{mn}\\ -k\sqrt{mn}&n\\ \end{bmatrix}$ and the diffusion coefficient is $g(t):=p(tt-t)$ , where $p$ is the scaling coefficient and $tt$ is the damping coefficient.

    import numpy as np
    import torch

    # t must be a torch tensor with entries in [0, 1), because torch.log(1 - t) is used.
    # k is the correlation coefficient of Sigma_0. The original listing used k without
    # passing it in, so it is added here as an explicit argument.

    def Sigmaxx(t,p,tt,m,n,k):
        # Position variance Sigma^{xx}_t; equals m at t = 0.
        return  \
        (t - 1)**2*(30*m*(t**3 - 3*t**2 + 3*t + 3)**2\
        - 60*p**2*(t - 1)**3*torch.log(1 - t) \
        - t*(60*k*np.sqrt(m*n)*(t**5 - 6*t**4 + 15*t**3 - 15*t**2 + 9)\
        - 30*n*t*(t**2 - 3*t + 3)**2 + p**2*(t**5*(6*tt**2 + 3*tt + 1) \
        - 6*t**4*(6*tt**2 + 3*tt + 1)\
        + 15*t**3*(6*tt**2 + 3*tt + 1)\
        - 10*t**2*(9*tt**2 + 11) + 150*t - 60)))/270

    def Sigmaxv(t,p,tt,m,n,k):
        # Position-velocity covariance Sigma^{xv}_t; equals -k*sqrt(m*n) at t = 0.
        return  \
        (1/270 - t/270)*(30*k*np.sqrt(m*n)*(8*t**6 - 48*t**5\
        + 120*t**4 - 135*t**3 + 45*t**2 + 27*t - 9) +\
        150*p**2*(t - 1)**3*torch.log(1 - t)\
        + t*(-120*m*(t**5 - 6*t**4 + 15*t**3 - 15*t**2 + 9)\
        - 30*n*(4*t**5 - 24*t**4 + 60*t**3 - 75*t**2 + 45*t - 9)\
        + p**2*(4*t**5*(6*tt**2 + 3*tt + 1) - 24*t**4*(6*tt**2 + 3*tt + 1)\
        + 60*t**3*(6*tt**2 + 3*tt + 1) - 5*t**2*(81*tt**2 + 18*tt + 55)\
        + 15*t*(9*tt**2 + 25) - 150)))

    def Sigmavv(t,p,tt,m,n,k):
        # Velocity variance Sigma^{vv}_t; equals n at t = 0.
        return  \
        n*(-4*t**3 + 12*t**2 - 12*t + 3)**2/9\
        - 8*p**2*(t - 1)**3*torch.log(1 - t)/9\
        + t*(-120*k*np.sqrt(m*n)*(4*t**5 - 24*t**4 + 60*t**3\
        - 75*t**2 + 45*t - 9) + 240*m*t*(t**2 - 3*t + 3)**2 \
        + p**2*(-8*t**5*(6*tt**2 + 3*tt + 1) + 48*t**4*(6*tt**2 + 3*tt + 1)\
        - 120*t**3*(6*tt**2 + 3*tt + 1) + 5*t**2*(180*tt**2 + 72*tt + 53)\
        - 15*t*(36*tt**2 + 9*tt + 20) + 135*tt**2 + 120))/135

Appendix F Conditional Generation Details

Here we provide the details of conditional generation.

F.1 Stroke Based Generation

For stroke-based generation, we provide two types of conditional generation.

Initial Velocity (IV): Please refer to Section 4.
Dynamics Velocity (dyn-V): Since the mean and covariance of the velocity and position are available, one can specify a valid velocity conditioned on the current position. In this case, we set the velocity as

$\displaystyle v_{t}=\mu_{t}^{v_{t}|x_{t}}+\Sigma^{v_{t}|x_{t}}_{t}\epsilon$		(36)

where

	$\displaystyle\mu_{t}^{v_{t}|x_{t}}$	$\displaystyle=\mu_{t}^{v}+\frac{\Sigma^{xv}_{t}}{\Sigma^{xx}_{t}}({\mathbf{x}}_{t}-\mu^{x}_{t})$		(37)
	$\displaystyle\Sigma_{t}^{v_{t}|x_{t}}$	$\displaystyle=\Sigma^{vv}_{t}-\frac{(\Sigma^{xv}_{t})^{2}}{\Sigma^{xx}_{t}}$		(38)

This choice is used when $t\leq c$, where $c$ is the guidance length. We typically set $c=0.25$.
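
The dyn-V rule can be sketched in a few lines. The following is a minimal illustration rather than the released implementation: the function names and argument names are hypothetical, all quantities are assumed to be applied element-wise (so plain floats, NumPy arrays, or torch tensors all work), and the noise `eps` is assumed to be drawn by the caller from a standard Gaussian. One further assumption is worth flagging: the sketch treats the quantity in eq. 38 as a conditional variance and therefore scales the noise by its square root, whereas eq. 36 is written with the quantity itself as the multiplier.

```python
def conditional_velocity_moments(x_t, mu_x, mu_v, sigma_xx, sigma_xv, sigma_vv):
    """Element-wise mean (eq. 37) and variance (eq. 38) of v_t given x_t."""
    mean = mu_v + sigma_xv / sigma_xx * (x_t - mu_x)
    var = sigma_vv - sigma_xv ** 2 / sigma_xx
    return mean, var


def dyn_velocity(x_t, mu_x, mu_v, sigma_xx, sigma_xv, sigma_vv, eps, t, c=0.25):
    """Guided velocity for t <= c; returns None outside the guidance window.

    Assumption: eps is standard Gaussian noise supplied by the caller, and the
    conditional variance is converted to a standard deviation before scaling.
    """
    if t > c:
        return None  # hypothetical convention: caller keeps its own velocity
    mean, var = conditional_velocity_moments(
        x_t, mu_x, mu_v, sigma_xx, sigma_xv, sigma_vv
    )
    return mean + var ** 0.5 * eps
```

Passing `eps` in, rather than sampling inside the function, keeps the sketch independent of any particular array library and makes the draw reproducible under the caller's own random seed.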

F.2 Inpainting

In the inpainting case, we apply a strategy similar to dyn-V. Specifically, we construct $\hat{{\mathbf{x}}}_{1}$ from $\tilde{{\mathbf{x}}}_{1}$ as:

$\displaystyle\hat{{\mathbf{x}}}_{1}:=\text{MASK}\odot\mu_{t}^{x}+(1-\text{MASK})\odot\tilde{{\mathbf{x}}}_{1}$		(39)

where MASK is the mask matrix that zeroes out pixels of the original image. This $\hat{{\mathbf{x}}}_{1}$ then serves as the source for estimating $\mu_{t}^{x}$ in eq. 37.
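
The masked combination in eq. 39 amounts to an element-wise blend. The snippet below is a minimal sketch under stated assumptions: the function name `inpainting_source` is hypothetical, the mask is assumed to hold values in {0, 1} with the same shape as the image, and the arithmetic is written so that it applies equally to floats, NumPy arrays, or torch tensors.

```python
def inpainting_source(mask, mu_x, x1_tilde):
    """Element-wise blend of eq. 39.

    Where mask == 1 the result takes mu_x; where mask == 0 it takes x1_tilde.
    """
    return mask * mu_x + (1 - mask) * x1_tilde


# Per-pixel illustration on three scalar "pixels" (values are made up).
mask = [1.0, 0.0, 1.0]
mu_x = [0.2, 0.4, 0.6]
x1_tilde = [0.9, 0.8, 0.7]
x1_hat = [inpainting_source(m, a, b) for m, a, b in zip(mask, mu_x, x1_tilde)]
```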

F.3 Inpainting Based Generation

For stroke-based generation, we provide two types of conditional generation.

Appendix G Ablation Study of Stroke-Based Conditional Generation

To investigate the diversity and faithfulness of stroke-based conditional generation, we conduct an ablation study with respect to the hyperparameter $\xi$.

Appendix H Additional Figures

We demonstrate the samples for different datasets with varying NFE.
