ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models

Fei Kong1  **hao Duan2  Lichao Sun3  Hao Cheng4  Ren**g Xu4
Hengtao Shen1  Xiaofeng Zhu1  Xiaoshuang Shi1  Kaidi Xu2 *
1University of Electronic Science and Technology of China
2Drexel University
3Lehigh University
4The Hong Kong University of Science and Technology (Guangzhou)
Equal corresponding author
Abstract

Though diffusion models excel in image generation, their step-by-step denoising leads to slow generation speeds. Consistency training addresses this issue with single-step sampling but often produces lower-quality generations and requires high training costs. In this paper, we show that optimizing the consistency training loss minimizes the Wasserstein distance between the target and generated distributions. As the timestep increases, the upper bound accumulates previous consistency training losses. Therefore, larger batch sizes are needed to reduce both current and accumulated losses. We propose Adversarial Consistency Training (ACT), which directly minimizes the Jensen-Shannon (JS) divergence between distributions at each timestep using a discriminator. Theoretically, ACT enhances generation quality and convergence. By incorporating a discriminator into the consistency training framework, our method achieves improved FID scores on the CIFAR10, ImageNet 64×64, and LSUN Cat 256×256 datasets, retains zero-shot image inpainting capabilities, and uses less than 1/6 of the original batch size and fewer than 1/2 of the model parameters and training steps compared to the baseline method, which leads to a substantial reduction in resource consumption. Our code is available at https://github.com/kong13661/ACT

1 Introduction

Diffusion models, known for their success in image generation [19, 44, 43, 53, 12, 31], utilize diffusion processes to produce high-quality, diverse images. They also perform tasks like zero-shot inpainting [32] and audio generation [36, 25, 24]. However, they have a significant drawback: lengthy sampling times. These models generate samples from the target distribution by iteratively denoising a Gaussian noise input, gradually reducing the noise until the samples match the target distribution. This limitation affects their practicality and efficiency in real-world applications.

The lengthy sampling times of diffusion models have spurred the creation of various strategies to tackle this issue. Several models and techniques have been suggested to enhance the efficiency of diffusion-based image generation [4, 29, 57]. Recently, consistency models [45] have been introduced to speed up the diffusion models’ sampling process. A consistency function is one that consistently yields the same output along a specific trajectory. To use consistency models, the trajectory from noise to the target sample must be obtained. By fitting the consistency function, the model can generate data within 1 or 2 steps.

The score-based model [44], an extension of the diffusion model in continuous time, gradually samples from a normal distribution $p_T$ to the sample distribution $p_0$. In deterministic sampling, it essentially solves an Ordinary Differential Equation (ODE), with each sample representing an ODE trajectory. Consistency models generate samples using a consistency function that aligns every point on the ODE trajectory with the ODE endpoint. However, deriving the true ODE trajectory is complex. To tackle this, consistency models suggest two methods. The first, consistency distillation, trains a score-based model to obtain the ODE trajectory. The second, consistency training, approximates the trajectory using a conditional one. Compared to distillation, consistency training has a larger error, leading to lower sample quality. The consistency function is trained by equating the model’s output at time $t_{n+1}$ with its output at time $t_n$.

Generative Adversarial Networks (GANs) [3, 55, 15], unlike consistency training, can directly minimize the distance between the model’s generated and target distributions via the discriminator, independent of the model’s output at the previous time $t_{n-1}$. Drawing from GANs, we introduce Adversarial Consistency Training. We first theoretically explain the need for large batch sizes in consistency training by showing its equivalence to optimizing the upper bound of the Wasserstein distance between the model’s generated and target distributions. This upper bound consists of the accumulated consistency training loss $\mathcal{L}^{t_k}_{CT}$, the distance between sampling distributions, and the accumulated error, all of which increase with $t$. Hence, a large batch size is crucial to minimize the error from the previous time $t$. To mitigate the impact of $\mathcal{L}^{t_k}_{CT}$ and the accumulated error, we incorporate the discriminator into consistency training, enabling direct reduction of the JS divergence between the generated and target distributions at each timestep $t$. Our experiments on CIFAR10 [26], ImageNet 64×64 [7], and LSUN Cat 256×256 [51] show that ACT significantly surpasses consistency training while needing less than 1/6 of the original batch size and less than 1/2 of the original model parameters and training steps, leading to considerable resource savings. For comparison, we use 1 NVIDIA GeForce RTX 3090 for CIFAR10, 4 NVIDIA A100 GPUs for ImageNet 64×64, and 8 NVIDIA A100 GPUs for LSUN Cat 256×256, while consistency training requires 8, 64, and 64 A100 GPUs for CIFAR10, ImageNet 64×64, and LSUN Cat 256×256, respectively.

Our contributions are summarized as follows:

  • We demonstrate that consistency training is equivalent to optimizing the upper bound of the W-distance. By analyzing this upper bound, we have identified one reason why consistency training requires a larger batch size.

  • Following our analysis, we propose Adversarial Consistency Training (ACT) to directly optimize the JS divergence between the sampling distribution and the target distribution at each timestep $t$, by incorporating a discriminator into the consistency training process.

  • Experimental results demonstrate that the proposed ACT significantly outperforms the original consistency training with less than 1/6 of the original batch size and less than 1/2 of the training steps. This leads to a substantial reduction in resource consumption.

2 Related works

Generative Adversarial Networks   GANs have achieved tremendous success in various domains, including image generation [15, 52, 54] and audio synthesis [10]. However, GAN training faces challenges such as instability and mode collapse, where the generator fails to capture the diversity of the training data. Several methods have been proposed to address these issues, including spectral normalization, gradient penalty, and differentiable data augmentation. Spectral normalization [33] constrains the Lipschitz constant of the discriminator, promoting more stable training. Gradient penalty, as employed in WGAN-GP [17], penalizes the discriminator's gradient to limit its range, avoiding the tendency of weights to concentrate around extreme values that arises when using weight clipping in WGAN [1]. [48] introduces the zero-centered gradient penalty, and StyleGAN2 [47] introduces lazy regularization, which performs multiple iterations before computing the gradient penalty to improve efficiency. Moreover, differentiable data augmentation techniques [56] have been introduced to enhance the diversity and robustness of GAN models during training. StyleGAN2-ADA [46] improves GAN performance on small datasets by employing adaptive differentiable data augmentation.

Diffusion Models   Diffusion models have emerged as highly successful approaches for generating images [37, 38]. In contrast to Generative Adversarial Networks (GANs), which involve a generator and a discriminator, diffusion models generate samples by modeling the reverse of a diffusion process, starting from Gaussian noise. Diffusion models train more stably than GANs and effectively address issues such as checkerboard artifacts [40, 11, 13]. The diffusion process is defined as $\boldsymbol{x}_t = \sqrt{\alpha_t}\,\boldsymbol{x}_{t-1} + \sqrt{\beta_t}\,\boldsymbol{\epsilon}_t$, $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. As $t$ increases, $\beta_t$ gradually increases, causing $\boldsymbol{x}_t$ to approximate random Gaussian noise. In the reverse diffusion process, $\boldsymbol{x}'_t$ follows a Gaussian distribution, assuming the same variance as in the forward diffusion process. The mean of $\boldsymbol{x}'_t$ is defined as $\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}}\left(\boldsymbol{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\bar{\boldsymbol{\epsilon}}_\theta(\boldsymbol{x}_t, t)\right)$, where $\bar{\alpha}_t = \prod_{k=0}^{t}\alpha_k$ and $\bar{\alpha}_t + \bar{\beta}_t = 1$. The reverse diffusion process becomes $\boldsymbol{x}_{t-1} = \tilde{\boldsymbol{\mu}}_t + \sqrt{\beta_t}\,\boldsymbol{\epsilon}$, $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. The loss function is defined as $\mathbb{E}_{x_0, \bar{\boldsymbol{\epsilon}}_t}\left[\left\|\bar{\boldsymbol{\epsilon}}_t - \boldsymbol{\epsilon}_\theta\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\bar{\boldsymbol{\epsilon}}_t, t\right)\right\|^2\right]$. Score-based models [44] transform the discrete-time diffusion process into a continuous-time process and employ Stochastic Differential Equations (SDEs) to express it. Moreover, the forward and backward processes are no longer restricted to the diffusion process. They employ the forward process $d\boldsymbol{x} = \left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}\left(g_t^2 - \sigma_t^2\right)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right)dt + \sigma_t\,d\boldsymbol{w}$, and the corresponding backward process is $d\boldsymbol{x} = \left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}\left(g_t^2 + \sigma_t^2\right)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right)dt + \sigma_t\,d\bar{\boldsymbol{w}}$, where $\boldsymbol{w}$ is the forward-time Brownian motion and $\bar{\boldsymbol{w}}$ is the backward-time Brownian motion. Compared to GANs, diffusion models have longer sampling times. Several methods have been proposed to accelerate the generation process, including [39, 9, 50], DDIM [42], Consistency models [45], etc.
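
The discrete-time forward process and the noise-prediction loss above can be illustrated with a short sketch. It is not the authors' implementation: it assumes scalar data, a linear $\beta_t$ schedule with $\alpha_t = 1 - \beta_t$, and an arbitrary callable in place of the network $\boldsymbol{\epsilon}_\theta$. The helper names `make_alpha_bar`, `q_sample`, and `noise_prediction_loss`, and the schedule endpoints, are hypothetical choices made for illustration.

```python
import math
import random

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products alpha_bar_t for a linear beta schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        running *= 1.0 - beta  # assumes alpha_t = 1 - beta_t
        alpha_bar.append(running)
    return alpha_bar

def q_sample(x0, alpha_bar_t, eps):
    """Closed-form forward sample: sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

def noise_prediction_loss(eps_model, x0_batch, alpha_bar, rng):
    """Monte Carlo estimate of E[(eps - eps_theta(x_t, t))^2] for scalar data."""
    total = 0.0
    for x0 in x0_batch:
        t = rng.randrange(len(alpha_bar))   # uniformly sampled timestep index
        eps = rng.gauss(0.0, 1.0)           # standard normal noise
        x_t = q_sample(x0, alpha_bar[t], eps)
        total += (eps - eps_model(x_t, t)) ** 2
    return total / len(x0_batch)

rng = random.Random(0)
alpha_bar = make_alpha_bar(1000)
data = [rng.uniform(-1.0, 1.0) for _ in range(256)]
# Placeholder "network" that always predicts zero noise.
loss = noise_prediction_loss(lambda x_t, t: 0.0, data, alpha_bar, rng)
```

With the zero predictor, the loss is simply an estimate of the noise variance, so any useful $\boldsymbol{\epsilon}_\theta$ should score lower. The list index here is zero-based, whereas the text indexes the product $\bar{\alpha}_t$ from $k = 0$ to $t$.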

Consistency type models   A function is called a consistency function if its output is the same at every point on a trajectory. Formally, given a trajectory $\boldsymbol{x}_t$, $t \in [0, T]$, the function satisfies $f(\boldsymbol{x}_{t_1}) = \mathbb{E}[f(\boldsymbol{x}_{t_2})]$ for $t_1, t_2 \in [0, T]$. If the trajectory is not a probability trajectory, the expectation $\mathbb{E}$ in the above formula can be removed. [6] proposed Consistency Diffusion Models (CDM), proving that when the forward diffusion process satisfies $d\boldsymbol{x}_t = g(t)\,d\boldsymbol{w}_t$, $\boldsymbol{h}(\boldsymbol{x}, t) = \nabla \log q_t(\boldsymbol{x})\,g^2(t) + \boldsymbol{x}$ is a consistency function. They add this consistency property as a regularizer during training to improve the sampling effectiveness of the model. [45] proposed consistency models. Unlike consistency diffusion models, Consistency Models (CM) utilize deterministic sampling to obtain a one-step sampling model by learning the mapping from each point $\boldsymbol{x}_t$ on the trajectory to $\boldsymbol{x}_0$. When a diffusion model is trained to obtain the trajectory $\boldsymbol{x}_t$, the approach is called consistency distillation. When conditional trajectories are used to approximate non-conditional trajectories, it is called consistency training. Compared to consistency distillation, consistency training has lower sampling effectiveness. Concurrently, [22] introduces a new temporal variable, calculates the previous step’s $x$ through multi-step iteration, and incorporates a discriminator after a period of training, achieving SOTA results in distillation. Our work instead concentrates on energy-efficient training from scratch, with different objective functions.

3 Method

3.1 Preliminary

3.1.1 Score-Based Generative Models

Score-based generative models [44] extend diffusion models to continuous time, and the forward and backward processes are no longer limited to the diffusion process. Consider a family of distributions $p_t$, $t \in [0, T]$, where $p_0$ is the data distribution and $p_T$ is a normal distribution. From $p_0$ to $p_T$, the distribution increasingly approximates a normal distribution. We sample $\boldsymbol{x}_t$ from $p_t$. If we can obtain $\boldsymbol{x}_{t'}$ from the formula $d\boldsymbol{x} = \left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}\left(g_t^2 - \sigma_t^2\right)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right)dt + \sigma_t\,d\boldsymbol{w}$, where $\boldsymbol{w}$ is the forward-time Brownian motion and $t' > t$, then we can also obtain $\boldsymbol{x}_{t'}$ from the formula $d\boldsymbol{x} = \left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}\left(g_t^2 + \sigma_t^2\right)\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right)dt + \sigma_t\,d\boldsymbol{w}$, where $\boldsymbol{w}$ is the backward-time Brownian motion and $t' < t$. If $\sigma_t = 0$, this formula turns into an ordinary differential equation (ODE), $d\boldsymbol{x} = \left(\boldsymbol{f}_t(\boldsymbol{x}) - \frac{1}{2}g_t^2\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\right)dt$. We can generate a new sample by numerically solving this ODE. For each $\boldsymbol{x}_T \sim p_T$, this ODE describes a trajectory from $\boldsymbol{x}_T$ to $\boldsymbol{x}_0$.
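
To make the deterministic sampling step concrete, the following sketch integrates the ODE above with a fixed-step Euler solver on a toy problem. The setup is an assumption made purely for illustration: a one-dimensional Gaussian data distribution (so the score is known in closed form), $\boldsymbol{f}_t = 0$, and $g_t^2 = 2t$. In practice the score would come from a trained network, and the function names here are hypothetical.

```python
import math

def score_gaussian(x, t, data_std):
    """Score of p_t = N(0, data_std^2 + t^2), i.e. d/dx log p_t(x)."""
    return -x / (data_std ** 2 + t ** 2)

def ode_drift(x, t, data_std):
    """Drift f_t(x) - 0.5 * g_t^2 * score, with f_t = 0 and g_t^2 = 2t (assumed)."""
    return -0.5 * (2.0 * t) * score_gaussian(x, t, data_std)

def euler_solve(x_T, T, data_std, num_steps=10_000):
    """Integrate the ODE backward from t = T to t = 0 with fixed-step Euler."""
    x, t = x_T, T
    dt = -T / num_steps  # negative step: time runs from T down to 0
    for _ in range(num_steps):
        x += ode_drift(x, t, data_std) * dt
        t += dt
    return x

T, data_std, x_T = 80.0, 0.5, 40.0
x0_numeric = euler_solve(x_T, T, data_std)
# For this toy problem the trajectory is x_t proportional to sqrt(data_std^2 + t^2),
# so the endpoint is known exactly.
x0_exact = x_T * data_std / math.sqrt(data_std ** 2 + T ** 2)
```

The numerical endpoint approaches the closed-form one as `num_steps` grows. Each sample requires many drift evaluations, which is the cost that one-step consistency models aim to remove.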

3.1.2 Consistency Training

Let $\{\boldsymbol{x}_t\}$ denote an ODE trajectory. A function $\boldsymbol{g}$ is called a consistency function if $\boldsymbol{g}(\boldsymbol{x}_{t_1}, t_1) = \boldsymbol{g}(\boldsymbol{x}_{t_2}, t_2)$ for any $\boldsymbol{x}_{t_1}, \boldsymbol{x}_{t_2} \in \{\boldsymbol{x}_t\}$. To reduce the time consumption for sampling from diffusion models, consistency training utilizes a model to fit the consistency function $\boldsymbol{g}(\boldsymbol{x}_{t_1}, t_1) = \boldsymbol{g}(\boldsymbol{x}_{t_2}, t_2) = \boldsymbol{x}_0$. The ODE trajectory selected by consistency training is

$$d\boldsymbol{x} = t\,\nabla_{\boldsymbol{x}}\log p_t(\boldsymbol{x})\,dt, \quad t \in [0, T]. \tag{1}$$

In this setting, the distribution $p_t$ is

$$p_t(\boldsymbol{x}) = p_0(\boldsymbol{x}) \ast \mathcal{N}(0, t^2\boldsymbol{I}),$$

where $\ast$ is the convolution operator. The consistency model is denoted as $\boldsymbol{f}(\boldsymbol{x}_t, t, \boldsymbol{\theta})$ and is defined as

$$\boldsymbol{f}(\boldsymbol{x}_t, t, \boldsymbol{\theta}) = \frac{0.5^2}{r_t^2 + 0.5^2}\,\boldsymbol{x}_t + \frac{0.5\,r_t}{\sqrt{0.5^2 + r_t^2}}\,\boldsymbol{F}_{\boldsymbol{\theta}}\!\left(\frac{1}{\sqrt{r_t^2 + 0.5^2}}\,\boldsymbol{x}_t, t\right), \tag{2}$$

where $\boldsymbol{\theta}$ represents the parameters of the model, $\boldsymbol{F}_{\boldsymbol{\theta}}$ is the output of the network, $r_t = t - \epsilon$, and $\epsilon$ is a small number for numerical stability.
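
The parameterization in Eq. 2 can be read as a skip connection plus a scaled network output. The sketch below evaluates the three coefficients for scalar inputs and shows how the boundary condition at $t = \epsilon$ arises. It is an illustration only: the value `eps=0.002`, the names `DATA_SCALE`, `consistency_coefficients`, and `consistency_model`, and the stand-in `toy_network` are assumptions that do not come from the text.

```python
import math

DATA_SCALE = 0.5  # the constant 0.5 appearing in Eq. 2 (the name is ours)

def consistency_coefficients(t, eps=0.002):
    """Return (c_skip, c_out, c_in) from Eq. 2 with r_t = t - eps."""
    r = t - eps
    denom = r ** 2 + DATA_SCALE ** 2
    c_skip = DATA_SCALE ** 2 / denom           # weight on the input x_t
    c_out = DATA_SCALE * r / math.sqrt(denom)  # weight on the network output
    c_in = 1.0 / math.sqrt(denom)              # scaling of the network input
    return c_skip, c_out, c_in

def consistency_model(x_t, t, network, eps=0.002):
    """Evaluate f(x_t, t) = c_skip * x_t + c_out * F(c_in * x_t, t) for scalar x_t."""
    c_skip, c_out, c_in = consistency_coefficients(t, eps)
    return c_skip * x_t + c_out * network(c_in * x_t, t)

# Any callable works as a stand-in for the network F_theta.
toy_network = lambda x, t: math.tanh(x)

at_boundary = consistency_model(1.234, 0.002, toy_network)  # evaluated at t = eps
coeffs_large_t = consistency_coefficients(80.0)
```

At $t = \epsilon$ we have $r_t = 0$, so the skip weight is 1 and the output weight is 0: the model returns its input unchanged regardless of the network. For large $t$ the skip weight vanishes and the output is dominated by the network.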

To train the consistency model $\boldsymbol{f}(\boldsymbol{x}_t, t, \boldsymbol{\theta})$, we need to divide the time interval $[0, T]$ into several discrete time steps, denoted as $t_0 = \epsilon < t_1 < t_2 < \dots < t_N = T$. $N$ gradually increases as the training progresses, satisfying

$$N(k) = \left\lceil \sqrt{\frac{k}{K}\left((s_1 + 1)^2 - s_0^2\right) + s_0^2} - 1 \right\rceil + 1,$$

where $K$ denotes the total number of training steps, $s_1$ is the end of time steps, $s_0$ is the beginning of time steps, and $k$ refers to the current training step. Denote
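
As a quick check of how the discretization schedule $N(k)$ behaves, the sketch below evaluates the formula directly. The values of `K`, `s0`, and `s1` are hypothetical, chosen only to make the endpoints easy to verify, and the function name `num_timesteps` is ours.

```python
import math

def num_timesteps(k, K, s0, s1):
    """N(k) = ceil(sqrt(k/K * ((s1 + 1)^2 - s0^2) + s0^2) - 1) + 1."""
    inner = (k / K) * ((s1 + 1) ** 2 - s0 ** 2) + s0 ** 2
    return math.ceil(math.sqrt(inner) - 1) + 1

# Hypothetical values chosen only for illustration.
K, s0, s1 = 1000, 2, 150
schedule = [num_timesteps(k, K, s0, s1) for k in range(K + 1)]
```

With these values the schedule starts at $N(0) = s_0 = 2$ and ends at $N(K) = s_1 + 1 = 151$, never decreasing in between. Training therefore begins on a coarse time grid and refines it as it progresses.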

CDn=k=1n𝔼[d(𝒇(𝒙tk,tk,𝜽),𝒇(𝒙tk1Φ,tk1,𝜽))],superscriptsubscript𝐶𝐷𝑛superscriptsubscript𝑘1𝑛𝔼delimited-[]𝑑𝒇subscript𝒙subscript𝑡𝑘subscript𝑡𝑘𝜽𝒇superscriptsubscript𝒙subscript𝑡𝑘1Φsubscript𝑡𝑘1superscript𝜽\mathcal{L}_{CD}^{n}=\sum_{k=1}^{n}\mathbb{E}[d(\boldsymbol{f}(\boldsymbol{x}_% {t_{k}},t_{k},\boldsymbol{\theta}),\boldsymbol{f}(\boldsymbol{x}_{t_{k-1}}^{% \Phi},t_{k-1},\boldsymbol{\theta}^{-}))],caligraphic_L start_POSTSUBSCRIPT italic_C italic_D end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT = ∑ start_POSTSUBSCRIPT italic_k = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT blackboard_E [ italic_d ( bold_italic_f ( bold_italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ ) , bold_italic_f ( bold_italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT roman_Φ end_POSTSUPERSCRIPT , italic_t start_POSTSUBSCRIPT italic_k - 1 end_POSTSUBSCRIPT , bold_italic_θ start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT ) ) ] ,

where $d(\cdot)$ is a distance function, $\boldsymbol{\theta}^{-}$ is the exponential moving average (EMA) of $\boldsymbol{\theta}$ across batches, and $\boldsymbol{x}_{t_{n+1}}\sim p_{t_{n+1}}$. $\boldsymbol{x}_{t_{n}}^{\Phi}$ is obtained from $\boldsymbol{x}_{t_{n+1}}$ through the ODE solver $\Phi$ using Eq. 1.
The parameters $\boldsymbol{\theta}^{-}$ are updated as $\boldsymbol{\theta}^{-}_{k+1}=\mu(k)\boldsymbol{\theta}_{k}^{-}+(1-\mu(k))\boldsymbol{\theta}_{k}$, where $\mu(k)=\exp\left(\frac{s_{0}\log\mu_{0}}{N(k)}\right)$ and $\mu_{0}$ is the initial EMA coefficient.
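The schedule $N(k)$ and the EMA coefficient $\mu(k)$ can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names and the numeric values in the usage note ($s_0=2$, $s_1=150$, $\mu_0=0.9$, $K=1000$) are assumptions chosen for the example, and parameters are represented as plain Python lists.

```python
import math


def num_timesteps(k: int, K: int, s0: int, s1: int) -> int:
    """Schedule N(k): number of time steps used at training step k."""
    inner = (k / K) * ((s1 + 1) ** 2 - s0 ** 2) + s0 ** 2
    return math.ceil(math.sqrt(inner) - 1) + 1


def ema_decay(k: int, K: int, s0: int, s1: int, mu0: float) -> float:
    """EMA coefficient mu(k) = exp(s0 * log(mu0) / N(k))."""
    return math.exp(s0 * math.log(mu0) / num_timesteps(k, K, s0, s1))


def ema_update(theta_ema, theta, mu):
    """theta_ema_{k+1} = mu * theta_ema_k + (1 - mu) * theta_k, element-wise."""
    return [mu * e + (1.0 - mu) * p for e, p in zip(theta_ema, theta)]
```

With these illustrative values, `num_timesteps(0, 1000, 2, 150)` equals $s_0$ and `num_timesteps(1000, 1000, 2, 150)` equals $s_1+1$, while `ema_decay` starts at $\mu_0$ and moves toward 1 as $N(k)$ grows.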

However, calculating $\mathcal{L}^{\Phi}_{CD}$ requires training another score-based generative model. They also propose using conditional trajectories to approximate $\boldsymbol{x}_{t_{n}}^{\Phi}$. This loss is denoted as

$$\mathcal{L}^{n}_{CT}=\sum_{k=1}^{n}\mathbb{E}\left[d\left(\boldsymbol{f}(\boldsymbol{x}_{0}+t_{k}\boldsymbol{z},t_{k},\boldsymbol{\theta}),\boldsymbol{f}(\boldsymbol{x}_{0}+t_{k-1}\boldsymbol{z},t_{k-1},\boldsymbol{\theta}^{-})\right)\right],$$

where $\boldsymbol{x}_{0}\sim p_{0}$ and $\boldsymbol{z}\sim\mathcal{N}(0,I)$. $\mathcal{L}_{CT}=\mathcal{L}^{N}_{CT}$ is called the consistency training loss, and training the consistency model with this loss is called consistency training. This loss is proven [45] to satisfy

$$\mathcal{L}^{n}_{CT}=\mathcal{L}^{n}_{CD}+o(\Delta t),\tag{3}$$

when the ODE solver $\Phi$ is the Euler solver.
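To make the structure of the consistency training loss concrete, the following sketch evaluates the sum inside the expectation for a single pair $(\boldsymbol{x}_0,\boldsymbol{z})$. It is only an illustration under stated assumptions: the names `ct_loss_sample` and `squared_l2` are hypothetical, the squared L2 distance is one possible choice for $d(\cdot)$, and `f` stands for any callable with the signature `f(x, t, theta)`.

```python
def squared_l2(u, v):
    """One possible distance d: squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def ct_loss_sample(f, theta, theta_ema, x0, z, ts, d=squared_l2):
    """Sum over k = 1..n of d(f(x0 + t_k z, t_k, theta), f(x0 + t_{k-1} z, t_{k-1}, theta_ema)).

    ts is the increasing list [t_0, t_1, ..., t_n]. Both noisy points share the
    same x0 and the same z, so they lie on one conditional trajectory.
    """
    total = 0.0
    for k in range(1, len(ts)):
        x_tk = [a + ts[k] * b for a, b in zip(x0, z)]
        x_tkm1 = [a + ts[k - 1] * b for a, b in zip(x0, z)]
        total += d(f(x_tk, ts[k], theta), f(x_tkm1, ts[k - 1], theta_ema))
    return total
```

Averaging `ct_loss_sample` over many draws of $\boldsymbol{x}_0\sim p_0$ and $\boldsymbol{z}\sim\mathcal{N}(0,I)$ would give a Monte Carlo estimate of $\mathcal{L}^{n}_{CT}$. No pre-trained score model appears anywhere in the computation.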

3.1.3 Generative Adversarial Networks

Generative Adversarial Networks (GANs) are generative models with two components that are trained jointly. The generator, denoted as $G(\cdot)$, produces samples from an approximation of the target distribution. The discriminator, denoted as $D(\cdot)$, judges whether a sample is real or generated. Training alternates between two steps: 1) train $D(\cdot)$ to distinguish generated samples from real ones; 2) train $G(\cdot)$ to deceive the discriminator. One type of GAN can be described as the following minimax problem:

$$\min_{G}\max_{D}V(G,D)=\mathbb{E}_{\boldsymbol{x}\sim p_{\text{data}}(\boldsymbol{x})}[\log D(\boldsymbol{x})]+\mathbb{E}_{\boldsymbol{z}\sim p_{\boldsymbol{z}}(\boldsymbol{z})}[\log(1-D(G(\boldsymbol{z})))].$$

It can be proven that this minimax problem is equivalent to minimizing the JS divergence between $p_{\text{data}}$ and the distribution of $G(\boldsymbol{z})$, where $\boldsymbol{z}\sim p_{\boldsymbol{z}}$.
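As a small illustration of the value function $V(G,D)$, the sketch below estimates it from discriminator outputs on a batch of real samples and a batch of generated samples. The function name `gan_value` is hypothetical, and the inputs are assumed to be probabilities strictly between 0 and 1.

```python
import math


def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(G, D).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

When the discriminator cannot tell the two distributions apart and outputs $1/2$ everywhere, the estimate equals $-\log 4$. This is the value attained at the optimum of the minimax problem, where the JS divergence is zero.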

To improve the training stability of GANs, many methods have been proposed. A practical approach is the zero-centered gradient penalty. This is achieved by using the following regularization:

$$\mathcal{L}_{gp}=\|\nabla_{\boldsymbol{x}}D(\boldsymbol{x})\|^{2},\quad\boldsymbol{x}\sim p_{\text{data}}.\tag{4}$$

To reduce computational overhead, this regularization can be applied intermittently every few training steps, rather than at every step.
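The penalty in Eq. 4 and its intermittent application can be sketched as follows. This is a toy illustration, not the training code: the gradient is approximated by central finite differences so that the example needs no autodiff library (a real implementation would use automatic differentiation), and the names `grad_penalty`, `discriminator_loss`, `gp_weight` and `gp_every` as well as their default values are assumptions.

```python
def grad_penalty(D, x, eps=1e-5):
    """Squared L2 norm of a central-difference estimate of the gradient of D at x."""
    total = 0.0
    for i in range(len(x)):
        hi = list(x)
        lo = list(x)
        hi[i] += eps
        lo[i] -= eps
        g_i = (D(hi) - D(lo)) / (2.0 * eps)
        total += g_i ** 2
    return total


def discriminator_loss(base_loss, D, real_batch, step, gp_weight=10.0, gp_every=16):
    """Add the gradient penalty on real samples only every `gp_every` steps."""
    if step % gp_every != 0:
        return base_loss  # skip the penalty on most steps to save compute
    penalty = sum(grad_penalty(D, x) for x in real_batch) / len(real_batch)
    return base_loss + gp_weight * penalty
```

For a linear discriminator such as `D = lambda x: 3 * x[0] - 4 * x[1]`, the gradient is $(3,-4)$ everywhere, so `grad_penalty` returns approximately 25.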

3.2 Analysis of the Loss Function

Theorem 3.1.

If the consistency model satisfies the Lipschitz condition, i.e., there exists $L>0$ such that for all $\boldsymbol{x}$, $\boldsymbol{y}$ and $t$ we have $\|\boldsymbol{f}(\boldsymbol{x},t,\boldsymbol{\theta})-\boldsymbol{f}(\boldsymbol{y},t,\boldsymbol{\theta})\|_{2}\leq L\|\boldsymbol{x}-\boldsymbol{y}\|_{2}$, then minimizing the consistency loss reduces an upper bound on the W-distance between the two distributions. Formally:

$$\begin{aligned}\mathcal{W}[f_{t_{k}},g_{t_{k}}]&=\mathcal{W}[f_{t_{k}},p_{0}]\\&\leq L\mathcal{W}[q_{t_{k}},p_{t_{k}}]+\mathcal{L}^{t_{k}}_{CT}+t_{k}O(\Delta t)+o(\Delta t),\end{aligned}\tag{5}$$

where the definitions of $p_{t}$, $\boldsymbol{f}$, $\mathcal{L}_{CT}^{t_{k}}$ and $\boldsymbol{g}$ are consistent with those in Sec. 3.1.2, and $\Delta t=\max(t_{k}-t_{k-1})$. The distribution $f_{t}$ is defined as that of $\boldsymbol{f}(\boldsymbol{x}_{t},t,\boldsymbol{\theta})$, where $\boldsymbol{x}_{t}\sim q_{t}$, and the distribution $g_{t}$ is defined as that of $\boldsymbol{g}(\boldsymbol{y}_{t},t)$, where $\boldsymbol{y}_{t}\sim p_{t}$. The distribution $q_{t}$ represents the noise distribution when generating samples.

Proof.

The W-distance (Wasserstein-distance) is defined as follows:

$$\mathcal{W}_{\rho}[p,q]=\inf_{\gamma\in\prod[p,q]}\iint\gamma(\boldsymbol{x},\boldsymbol{y})\|\boldsymbol{x}-\boldsymbol{y}\|_{\rho}\,d\boldsymbol{x}\,d\boldsymbol{y},$$

where $\gamma$ ranges over the joint distributions of $p$ and $q$. For convenience, we take the case of $\rho=2$ and simply write $\|\cdot\|$ for $\|\cdot\|_{2}$ and $\mathcal{W}[p,q]$ for $\mathcal{W}_{2}[p,q]$. Let $\{\boldsymbol{x}_{t_{k}}\}$ or $\{\boldsymbol{y}_{t_{k}}\}$ be the points on the same trajectory defined by the ODE in Eq. 1. For $\mathcal{W}[f_{t_{k}},g_{t_{k}}]$, we have the following inequality:

$$\begin{aligned}
\mathcal{W}[f_{t_{k}},g_{t_{k}}]
&=\inf_{\gamma^{*}\in\prod[f_{t_{k}},g_{t_{k}}]}\iint\gamma^{*}(\hat{\boldsymbol{x}}_{t_{k}},\hat{\boldsymbol{y}}_{t_{k}})\|\hat{\boldsymbol{x}}_{t_{k}}-\hat{\boldsymbol{y}}_{t_{k}}\|\,d\hat{\boldsymbol{x}}_{t_{k}}\,d\hat{\boldsymbol{y}}_{t_{k}}\\
&\overset{(i)}{\leq}\iint\gamma(\hat{\boldsymbol{x}}_{t_{k}},\hat{\boldsymbol{y}}_{t_{k}})\|\hat{\boldsymbol{x}}_{t_{k}}-\hat{\boldsymbol{y}}_{t_{k}}\|\,d\hat{\boldsymbol{x}}_{t_{k}}\,d\hat{\boldsymbol{y}}_{t_{k}},\quad\gamma\in\prod[f_{t_{k}},g_{t_{k}}]\\
&=\mathbb{E}_{\hat{\boldsymbol{x}}_{t_{k}},\hat{\boldsymbol{y}}_{t_{k}}\sim\gamma\in\prod[f_{t_{k}},g_{t_{k}}]}[\|\hat{\boldsymbol{x}}_{t_{k}}-\hat{\boldsymbol{y}}_{t_{k}}\|]\\
&\overset{(ii)}{=}\mathbb{E}_{\boldsymbol{x}_{t_{k}},\boldsymbol{y}_{t_{k}}\sim\gamma\in\prod[q_{t_{k}},p_{t_{k}}]}[\|\boldsymbol{f}(\boldsymbol{x}_{t_{k}},t_{k},\boldsymbol{\theta})-\boldsymbol{g}(\boldsymbol{y}_{t_{k}},t_{k})\|].
\end{aligned}$$

Here, (i) holds because $\gamma$ is an arbitrary joint distribution of $f_{t_{k}}$ and $g_{t_{k}}$, so its cost is no smaller than the infimum. (ii) is obtained through the law of the unconscious statistician. Since the joint distribution $\gamma\in\prod[q_{t_{k}},p_{t_{k}}]$ in the above formula is arbitrary, we choose the one satisfying $\mathbb{E}_{\boldsymbol{x}_{t_{k}},\boldsymbol{y}_{t_{k}}\sim\gamma^{*}}[\|\boldsymbol{y}_{t_{k}}-\boldsymbol{x}_{t_{k}}\|]=\mathcal{W}[q_{t_{k}},p_{t_{k}}]$ and denote it as $\gamma^{*}$.
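Step (i) uses the fact that the cost of any particular coupling upper-bounds the infimum over all couplings. The sketch below illustrates this numerically in a deliberately simplified setting that is not the paper's: one-dimensional samples with equal weights, where pairing the sorted samples is known to be an optimal coupling for the cost $|x-y|$. The function names are hypothetical.

```python
def coupling_cost(xs, ys):
    """Average |x - y| under the coupling that pairs xs[i] with ys[i]."""
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)


def wasserstein_1d(xs, ys):
    """W-distance between two equal-size 1-D empirical distributions.

    In one dimension, pairing the sorted samples attains the infimum.
    """
    return coupling_cost(sorted(xs), sorted(ys))
```

For `xs = [0.0, 1.0, 2.0]` and `ys = [2.5, 0.5, 1.5]`, pairing the samples in the given order costs more than the sorted pairing returned by `wasserstein_1d`, which is the relation that step (i) relies on.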
The expectation $\mathbb{E}_{\boldsymbol{x}_{t_{k}},\boldsymbol{y}_{t_{k}}\sim\gamma^{*}}[\|\boldsymbol{f}(\boldsymbol{x}_{t_{k}},t_{k},\boldsymbol{\theta})-\boldsymbol{g}(\boldsymbol{y}_{t_{k}},t_{k})\|]$ satisfies the following inequality:

$$\begin{aligned}
&\mathbb{E}_{\boldsymbol{x}_{t_{k}},\boldsymbol{y}_{t_{k}}\sim\gamma^{*}}[\|\boldsymbol{f}(\boldsymbol{x}_{t_{k}},t_{k},\boldsymbol{\theta})-\boldsymbol{g}(\boldsymbol{y}_{t_{k}},t_{k})\|]\\
\leq{}&\mathbb{E}_{\boldsymbol{y}_{t_{k}}\sim p_{t_{k}}}[\|\boldsymbol{g}(\boldsymbol{y}_{t_{k}},t_{k})-\boldsymbol{f}(\boldsymbol{y}_{t_{k}},t_{k},\boldsymbol{\theta})\|]+L\mathcal{W}[q_{t_{k}},p_{t_{k}}].
\end{aligned}\tag{6}$$
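One way to see where Eq. 6 comes from is to combine the triangle inequality with the Lipschitz condition of Theorem 3.1. The following is a sketch of that reasoning and may differ in presentation from the full proof in Appendix B:

```latex
\[
\begin{aligned}
\|\boldsymbol{f}(\boldsymbol{x}_{t_k},t_k,\boldsymbol{\theta})-\boldsymbol{g}(\boldsymbol{y}_{t_k},t_k)\|
&\leq \|\boldsymbol{f}(\boldsymbol{x}_{t_k},t_k,\boldsymbol{\theta})-\boldsymbol{f}(\boldsymbol{y}_{t_k},t_k,\boldsymbol{\theta})\|
 +\|\boldsymbol{f}(\boldsymbol{y}_{t_k},t_k,\boldsymbol{\theta})-\boldsymbol{g}(\boldsymbol{y}_{t_k},t_k)\| \\
&\leq L\|\boldsymbol{x}_{t_k}-\boldsymbol{y}_{t_k}\|
 +\|\boldsymbol{g}(\boldsymbol{y}_{t_k},t_k)-\boldsymbol{f}(\boldsymbol{y}_{t_k},t_k,\boldsymbol{\theta})\|.
\end{aligned}
\]
```

Taking the expectation over $(\boldsymbol{x}_{t_k},\boldsymbol{y}_{t_k})\sim\gamma^{*}$, the first term becomes $L\mathcal{W}[q_{t_k},p_{t_k}]$ by the choice of $\gamma^{*}$, and the second term depends only on $\boldsymbol{y}_{t_k}$, whose marginal under $\gamma^{*}$ is $p_{t_k}$.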

If the ODE solver is the Euler solver, we have:

$$\begin{aligned}
&\mathbb{E}_{\boldsymbol{y}_{t_{k}}\sim p_{t_{k}}}[\|\boldsymbol{g}(\boldsymbol{y}_{t_{k}},t_{k})-\boldsymbol{f}(\boldsymbol{y}_{t_{k}},t_{k},\boldsymbol{\theta})\|]\\
\leq{}&\mathbb{E}_{\boldsymbol{y}_{t_{k-1}}\sim p_{t_{k-1}}}[\|\boldsymbol{g}(\boldsymbol{y}_{t_{k-1}},t_{k-1})-\boldsymbol{f}(\boldsymbol{y}_{t_{k-1}},t_{k-1},\boldsymbol{\theta})\|]\\
&\quad+L(t_{k}-t_{k-1})O(t_{k}-t_{k-1})\\
&\quad+\mathbb{E}_{\boldsymbol{y}_{t_{k}}\sim p_{t_{k}}}[\|\boldsymbol{f}(\boldsymbol{y}_{t_{k-1}}^{\Phi},t_{k-1},\boldsymbol{\theta})-\boldsymbol{f}(\boldsymbol{y}_{t_{k}},t_{k},\boldsymbol{\theta})\|].
\end{aligned}\tag{7}$$

The detailed proofs for the aforementioned inequalities can be found in Appendix B. We apply Eq. 7 repeatedly until reaching $t_{0}$. At this point, from Eq. 2, we have $\|\boldsymbol{g}(\boldsymbol{y}_{t_{0}},t_{0})-\boldsymbol{f}(\boldsymbol{y}_{t_{0}},t_{0},\boldsymbol{\theta})\|=0$. So, we can obtain the inequality below:

$$
\begin{aligned}
\mathbb{E}_{\boldsymbol{y}_{t_k}\sim p_{t_k}}\left[\|\boldsymbol{g}(\boldsymbol{y}_{t_k},t_k)-\boldsymbol{f}(\boldsymbol{y}_{t_k},t_k,\boldsymbol{\theta})\|\right]
&\leq \mathcal{L}^{k}_{CD}+\sum_{i=1}^{k}L(t_i-t_{i-1})\,O((t_i-t_{i-1})) \\
&\overset{(i)}{=} \mathcal{L}^{k}_{CT}+\sum_{i=1}^{k}t_k\,O((\Delta t))+o(\Delta t).
\end{aligned}
$$

Here, (i) holds because $\Delta t=\max(t_k-t_{k-1})$ and because of the relationship between $\mathcal{L}^{k}_{CD}$ and $\mathcal{L}^{k}_{CT}$ in Eq. 3. Since the consistency function satisfies $\boldsymbol{g}(\boldsymbol{x}_t,t)=\boldsymbol{x}_0$, it follows that $\mathcal{W}[f_{t_k},g_{t_k}]=\mathcal{W}[f_{t_k},p_0]$. Putting these together completes the proof. ∎

Analyzing Eq. 5, $\mathcal{W}[q_{t_k},p_{t_k}]$ is the W-distance between the two sampling distributions, which is independent of the model. We set $q_t=p_t$ to eliminate $\mathcal{W}[q_{t_k},p_{t_k}]$. The terms $o(\Delta t)$ and $t_kO(\Delta t)$ originate from approximation errors, and $t_kO(\Delta t)$ grows with $t_k$. The remaining term is $\mathcal{L}^{k}_{CT}=\sum_{i=1}^{k}\mathbb{E}[d(f(\boldsymbol{x}_0+t_i\boldsymbol{z},t_i,\boldsymbol{\theta}),f(\boldsymbol{x}_0+t_{i-1}\boldsymbol{z},t_{i-1},\boldsymbol{\theta}^{-}))]$, which also accumulates errors. The quality of the model's generation therefore depends not only on the current loss at $t_k$, $\mathbb{E}[d(f(\boldsymbol{x}_0+t_k\boldsymbol{z},t_k,\boldsymbol{\theta}),f(\boldsymbol{x}_0+t_{k-1}\boldsymbol{z},t_{k-1},\boldsymbol{\theta}^{-}))]$, but also on the sum of all losses at indices smaller than $k$. These two accumulated errors may be one of the reasons why consistency training requires as large a batch size and model size as possible: during training, it is necessary to ensure a small loss not only at the current $t_k$ but also at all previous $t$ values. Besides, reducing $\Delta t$ can help to lower this upper bound. However, as described in the original work [45], reducing $\Delta t$ in practical applications does not always lead to performance improvements.
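To make the accumulation concrete, the following minimal sketch computes $\mathcal{L}^{k}_{CT}$ as a running sum of per-interval losses. The per-interval values are made-up placeholders chosen for illustration, not measurements, and the function name is hypothetical.

```python
from itertools import accumulate


def accumulated_ct_loss(per_step_losses):
    """Return [L^1_CT, ..., L^K_CT], where L^k_CT sums the first k per-interval losses."""
    return list(accumulate(per_step_losses))


# Hypothetical values of E[d(f(x0 + t_i z, t_i), f(x0 + t_{i-1} z, t_{i-1}))] for i = 1..4.
per_step = [0.5, 0.25, 0.25, 1.0]
bounds = accumulated_ct_loss(per_step)

# The bound at index k is never smaller than the bound at index k - 1,
# because each per-interval loss is non-negative.
print(bounds)
```

Even if the loss on the last interval is driven to zero, the accumulated term at that index stays at least as large as the sum over all earlier intervals.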

3.3 Enhancing Consistency Training with Discriminator

Following the analysis in Sec. 3.2, the W-distance at time $t_k$ depends not only on the loss at $t_k$, but also on the losses at previous times. This could be one of the reasons why consistency training requires as large a batch size and model size as possible. However, at each time $t_k$, the ultimate goal is to reduce the distance between the generated distribution and the target distribution. To reduce the gap between the two distributions, we propose using not only the W-distance but also other distances, such as the JS-divergence. Inspired by GANs, we incorporate a discriminator into the training process.

It can be proven that when the generator training loss is given by

$$
\mathcal{L}_{G}=\log\left(1-D(\boldsymbol{f}(\boldsymbol{x}+t_{n+1}\boldsymbol{z},t_{n+1},\boldsymbol{\theta}_g),t_{n+1},\boldsymbol{\theta}_d)\right), \tag{8}
$$

and the discriminator training loss is given by

$$
\begin{split}
\mathcal{L}_{D}=&-\log\left(1-D(\boldsymbol{f}(\boldsymbol{x}_g+t_{n+1}\boldsymbol{z},t_{n+1},\boldsymbol{\theta}_g),t_{n+1},\boldsymbol{\theta}_d)\right)\\
&-\log\left(D(\boldsymbol{x}_r,t_{n+1},\boldsymbol{\theta}_d)\right),
\end{split} \tag{9}
$$

minimizing the loss leads to $\min_{\boldsymbol{f}}\left(-2\log 2+2\,\mathrm{JSD}(f_{t_k}\|p_0)\right)$, which is equivalent to minimizing the JS-divergence. Here, $D$ is the discriminator. This loss does not depend on the losses at previous $t_k$, and it directly optimizes the distance between the distributions at the current $t_k$. Therefore, the required batch size and model size can be smaller than in consistency training.
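The identity behind this statement is the standard GAN argument: with the optimal discriminator $D^{*}=p_0/(p_0+f_{t_k})$, the adversarial value equals $-2\log 2+2\,\mathrm{JSD}(f_{t_k}\|p_0)$. The sketch below checks this numerically on small discrete distributions. The two probability vectors are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions (natural log)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(u, v):
        return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def value_at_optimal_discriminator(p, q):
    """E_p[log D*] + E_q[log(1 - D*)] with D* = p / (p + q)."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            total += a * math.log(a / (a + b))
        if b > 0:
            total += b * math.log(b / (a + b))
    return total


# Illustrative stand-ins for the target distribution p_0 and the generated one f_{t_k}.
p0 = [0.5, 0.3, 0.2]
f_tk = [0.2, 0.3, 0.5]

lhs = value_at_optimal_discriminator(p0, f_tk)
rhs = -2 * math.log(2) + 2 * jsd(p0, f_tk)
print(lhs, rhs)  # the two numbers agree up to floating-point error
```

When the two distributions coincide, the JS-divergence is zero and the value reduces to $-2\log 2$, its minimum.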

Although the ultimate goals of the two distances are the same, the corresponding losses can behave differently during training. For example, when the JS-divergence is $0$, the W-distance is also $0$, and the gradient of the discriminator is $0$ as well. At this point, however, the gradient of $\mathcal{L}_{CT}$ may not be $0$ due to the aforementioned error. Moreover, when $\mathcal{L}_{CT}$ is relatively large, the optimization direction of $\mathcal{L}_{CT}$ may conflict with that of $\mathcal{L}_{G}$. Consider the extreme case where the output of $f_{t_n}$ is completely random: $\mathcal{L}_{CT}$ and $\mathcal{L}_{G}$ are then clearly in conflict when training $\boldsymbol{f}$ at time $t_{n+1}$. On the other hand, when $\mathcal{L}_{CT}$ is relatively small, the model $f$ is easier to fit at $t_n$ than at $t_{n+1}$, and thus generates better quality at $t_n$. Also, since $x_{t_n}$ and $x_{t_{n+1}}$ are close enough, their discriminators are also close enough, so the two losses jointly improve the generation quality. Therefore, we employ the coefficient $\lambda$ to balance the proportion between $\mathcal{L}_{CT}$ and $\mathcal{L}_{G}$. Furthermore, as $\mathcal{L}^{k}_{CT}$ increases with $k$, the W-distance also increases. To improve the performance of consistency training, the weight of $\mathcal{L}_{G}$ should also increase. We use Eq. 10 to give $\mathcal{L}_{G}$ more weight at larger $n$, where $w$ is the weight at $n=N-1$ and $w_{mid}$ is the weight at $n=(N-1)/2$.

$$
\lambda_{N}(n)=w\left(\frac{n}{N-1}\right)^{\log_{\frac{1}{2}}\left(\frac{w_{mid}}{w}\right)}. \tag{10}
$$
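As a quick check of Eq. 10, the sketch below evaluates the schedule and confirms that it returns $w$ at $n=N-1$ and $w_{mid}$ at $n=(N-1)/2$. The function name and the numeric values of $N$, $w$, and $w_{mid}$ are illustrative assumptions, not settings taken from the experiments.

```python
import math


def adversarial_weight(n, N, w, w_mid):
    """Eq. 10: lambda_N(n) = w * (n / (N - 1)) ** log_{1/2}(w_mid / w).

    Assumes 0 < w_mid < w, so the exponent is positive and the weight
    grows monotonically from 0 at n = 0 to w at n = N - 1.
    """
    exponent = math.log(w_mid / w, 0.5)
    return w * (n / (N - 1)) ** exponent


# Hypothetical settings: N = 11 steps, w = 0.6, w_mid = 0.2.
N, w, w_mid = 11, 0.6, 0.2
schedule = [adversarial_weight(n, N, w, w_mid) for n in range(N)]
print([round(v, 4) for v in schedule])
```

Choosing $w_{mid}<w/2$ makes the curve convex, so most of the adversarial weight is concentrated at the largest timesteps.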

Note that, even though the fitting targets of all $f_{t_k}$ are $q_0$, we choose the form $D(\boldsymbol{x}_t,t,\boldsymbol{\theta}_d)$ rather than $D(\boldsymbol{x}_t,\boldsymbol{\theta}_d)$ when constructing the discriminator. Theoretically, the optimal distribution of the generator trained with either discriminator is $p_0$, and for two similar samples, a discriminator of the form $D(\boldsymbol{x}_t,\boldsymbol{\theta}_d)$ will produce similar gradients at different $t$. Nevertheless, we find in our experiments (Sec. 4.3.3) that this form of discriminator is not as effective as $D(\boldsymbol{x}_t,t,\boldsymbol{\theta}_d)$. The training algorithm is described in Algorithm 1.

3.4 Gradient Penalty Based Adaptive Data Augmentation

For smaller datasets, the GAN literature offers many data augmentation techniques that improve generation quality. Inspired by StyleGAN2-ADA [46], we also utilize adaptive differentiable data augmentation. StyleGAN2-ADA adjusts the probability of data augmentation based on the accuracy of the discriminator over time. In our model, however, it is difficult to adjust the augmentation probability through the accuracy of a single discriminator, because the training difficulty varies across different $t$. As described in Sec. 4.3.2, we find that the stability of the discriminator's gradient has a significant impact on training. This may be due to the interaction between $\mathcal{L}_{CT}$ and $\mathcal{L}_{G}$. We therefore propose to adjust the probability of data augmentation based on the value of the gradient penalty over time. Given a differentiable data augmentation function $A(\boldsymbol{x},p_{aug})$, where $p_{aug}$ is the probability of applying the data augmentation, the augmented discriminator is defined by:

$$
D_{aug}(\boldsymbol{x}_t,t,p_{aug},\boldsymbol{\theta}_d)=D(A(\boldsymbol{x}_t,p_{aug}),t,\boldsymbol{\theta}_d).
$$

The probability $p_{aug}$ is updated by

$$
p_{aug}\leftarrow\text{Clip}_{[0,1]}\left(p_{aug}+2\left([\mathcal{L}_{gp}^{-}\geq\tau]-0.5\right)p_r\right),
$$

where $[\cdot]$ denotes the indicator function, which takes a value of $1$ when the condition is true and $0$ otherwise. $\text{Clip}_{[0,1]}(\cdot)$ represents the operation of clipping the value to the interval $[0,1]$. $p_r$ denotes the update rate at each iteration, and $\mathcal{L}_{gp}^{-}$ is the exponential moving average of $\mathcal{L}_{gp}$, defined as $\mathcal{L}_{gp}^{-}=\mu_p\mathcal{L}_{gp}^{-}+(1-\mu_p)\mathcal{L}_{gp}$. $p_r$ and $\mu_p$ are constants within the range $[0,1]$. This algorithm is described in Algorithm 2 in Appendix D. Our motivation for using data augmentation is to mitigate overfitting in the discriminator. We conduct experiments on CIFAR10 to verify the method. However, the performance of data augmentation on large datasets, such as ImageNet 64×64, remains to be explored.
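The update rule above can be sketched in a few lines. This is an illustration only: the function name, the order of the two updates (EMA first, then $p_{aug}$), and all numeric values are assumptions made here, and Algorithm 2 in Appendix D remains the authoritative description.

```python
def update_augmentation_probability(p_aug, gp_ema, gp, tau, p_r, mu_p):
    """One update of p_aug driven by the EMA of the gradient penalty.

    Assumption: the EMA is refreshed before it is compared with tau.
    """
    gp_ema = mu_p * gp_ema + (1.0 - mu_p) * gp
    # 2 * ([gp_ema >= tau] - 0.5) equals +1 when the condition holds and -1 otherwise.
    direction = 1.0 if gp_ema >= tau else -1.0
    p_aug = min(1.0, max(0.0, p_aug + direction * p_r))  # Clip to [0, 1]
    return p_aug, gp_ema


# Hypothetical trace: threshold tau = 0.5, update rate p_r = 0.125, EMA decay mu_p = 0.5.
p_aug, gp_ema = 0.0, 0.0
history = []
for gp in [2.0, 0.0, 0.0]:
    p_aug, gp_ema = update_augmentation_probability(p_aug, gp_ema, gp, 0.5, 0.125, 0.5)
    history.append(p_aug)
print(history)
```

In this trace the probability rises while the smoothed penalty stays at or above the threshold and falls again once it drops below, which is the intended feedback behavior.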

Table 1: Training steps and model parameter size are reported. BS stands for Batch Size. For ACT, Params represent parameters of the consistency model + discriminator.

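| Dataset | Method | BS | Steps | Params | FID |
|---|---|---|---|---|---|
| CIFAR10 | CT | 512 | 800K | 73.9M | 8.7 |
| CIFAR10 | CT | 256 | 800K | 73.9M | 10.4 |
| CIFAR10 | CT | 128 | 800K | 73.9M | 14.4 |
| CIFAR10 | ACT-Aug | 80 | 300K | 27.5M+14.1M | 6.0 |
| ImageNet | CT | 2048 | 800K | 282M | 13.0 |
| ImageNet | ACT | 320 | 400K | 107M+54M | 10.6 |
| LSUN Cat | CT | 2048 | 1000K | 458M | 20.7 |
| LSUN Cat | ACT | 320 | 165K | 113M+57M | 13.0 |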

Table 2: Sample quality of ACT on the ImageNet dataset with the resolution of 64×64. Our ACT significantly outperforms CT.

| Method | NFE (↓) | FID (↓) | Prec. (↑) | Rec. (↑) |
|---|---|---|---|---|
| BigGAN-deep [3] | 1 | 4.06 | 0.79 | 0.48 |
| ADM [8] | 250 | 2.07 | 0.74 | 0.63 |
| EDM [21] | 79 | 2.44 | 0.71 | 0.67 |
| DDPM [19] | 250 | 11.0 | 0.67 | 0.58 |
| DDIM [42] | 50 | 13.7 | 0.65 | 0.56 |
| DDIM [42] | 10 | 18.3 | 0.60 | 0.49 |
| CT | 1 | 13.0 | 0.71 | 0.47 |
| ACT | 1 | 10.6 | 0.67 | 0.56 |

Algorithm 1 Adversarial Consistency Training
1: Input: dataset $\mathcal{D}$, initial consistency model parameter $\boldsymbol{\theta}_g$, discriminator parameter $\boldsymbol{\theta}_d$, step schedule $N(\cdot)$, EMA decay rate schedule $\mu(\cdot)$, optimizer $\text{opt}(\cdot,\cdot)$, discriminator $D(\cdot,\cdot,\boldsymbol{\theta}_d)$, adversarial rate schedule $\lambda(\cdot)$, gradient penalty weight $w_{gp}$, gradient penalty interval $I_{gp}$.
2: $\boldsymbol{\theta}_g^{-}\leftarrow\boldsymbol{\theta}_g$ and $k\leftarrow 0$
3: repeat
4:     Sample $\boldsymbol{x}\sim\mathcal{D}$, and $n\sim\mathcal{U}[\![1,N(k)]\!]$
5:     Sample $\boldsymbol{z}\sim\mathcal{N}(0,\boldsymbol{I})$    ▷ Train Consistency Model
6:     $\mathcal{L}_{CT}\leftarrow$
7:       $d(\boldsymbol{f}(\boldsymbol{x}+t_{n+1}\boldsymbol{z},t_{n+1},\boldsymbol{\theta}_g),\boldsymbol{f}(\boldsymbol{x}+t_{n}\boldsymbol{z},t_{n},\boldsymbol{\theta}_g^{-}))$
8:     $\mathcal{L}_{G}\leftarrow$
9:       $\log(1-D(\boldsymbol{f}(\boldsymbol{x}+t_{n+1}\boldsymbol{z},t_{n+1},\boldsymbol{\theta}_g),t_{n+1},\boldsymbol{\theta}_d))$
10:     $\mathcal{L}_{f}\leftarrow(1-\lambda_{N(k)}(n+1))\mathcal{L}_{CT}+\lambda_{N(k)}(n+1)\mathcal{L}_{G}$
11:     $\boldsymbol{\theta}_g\leftarrow\text{opt}(\boldsymbol{\theta}_g,\nabla_{\boldsymbol{\theta}_g}(\mathcal{L}_{f}))$
12:     $\boldsymbol{\theta}_g^{-}\leftarrow\text{stopgrad}(\mu(k)\boldsymbol{\theta}_g^{-}+(1-\mu(k))\boldsymbol{\theta}_g)$
13:
14:     Sample $\boldsymbol{x}_g\sim\mathcal{D}$, $\boldsymbol{x}_r\sim\mathcal{D}$, and $n\sim\mathcal{U}[\![1,N(k)]\!]$
15:     Sample $\boldsymbol{z}\sim\mathcal{N}(0,\boldsymbol{I})$    ▷ Train Discriminator
16:     $\mathcal{L}_{D}\leftarrow-\log(D(\boldsymbol{x}_r,t_{n+1},\boldsymbol{\theta}_d))$
17:       $-\log(1-D(\boldsymbol{f}(\boldsymbol{x}_g+t_{n+1}\boldsymbol{z},t_{n+1},\boldsymbol{\theta}_g),t_{n+1},\boldsymbol{\theta}_d))$
18:     $\mathcal{L}_{gp}\leftarrow$
19:       $w_{gp}\|\nabla_{\boldsymbol{x}_r}D(\boldsymbol{x}_r,t_{n+1},\boldsymbol{\theta}_d)\|^{2}\,[k \bmod I_{gp}=0]$
20:     $\mathcal{L}_{d}\leftarrow\lambda_{N(k)}(n+1)\mathcal{L}_{D}+\lambda_{N(k)}(n+1)\mathcal{L}_{gp}$
21:     $\boldsymbol{\theta}_d\leftarrow\text{opt}(\boldsymbol{\theta}_d,\nabla_{\boldsymbol{\theta}_d}(\mathcal{L}_{d}))$
22:     $k\leftarrow k+1$
23: until convergence
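The way Algorithm 1 combines its loss terms can be sketched with scalar stand-ins. The function names and numeric inputs below are hypothetical. In an actual implementation the losses would be tensors produced by the consistency model and the discriminator, and the optimizer and EMA updates are omitted here.

```python
def generator_objective(loss_ct, loss_g, lam):
    """Generator update target: L_f = (1 - lam) * L_CT + lam * L_G."""
    return (1.0 - lam) * loss_ct + lam * loss_g


def discriminator_objective(loss_d, grad_norm_sq, lam, w_gp, k, i_gp):
    """Discriminator update target: L_d = lam * L_D + lam * L_gp.

    The gradient penalty L_gp = w_gp * ||grad_x D||^2 is active only on
    iterations where k is a multiple of the interval i_gp.
    """
    loss_gp = w_gp * grad_norm_sq if k % i_gp == 0 else 0.0
    return lam * loss_d + lam * loss_gp


# Hypothetical scalar values for one iteration.
print(generator_objective(2.0, -1.0, 0.25))
print(discriminator_objective(1.0, 4.0, 0.5, 10.0, 16, 16))
```

Setting `lam` to 0 recovers the pure consistency training objective, and setting it to 1 leaves only the adversarial term.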

4 Experiments

In this section, we report experimental settings and results on CIFAR-10, ImageNet64 and LSUN Cat 256 datasets.

4.1 Generation Performance

In this section, we report the performance of our model on the CIFAR10, ImageNet 64×64, and LSUN Cat 256×256 datasets. The results demonstrate a significant improvement of our method over the original approach. We report the results on CIFAR10 in Tab. 3, on ImageNet 64×64 in Tab. 2, and on LSUN Cat 256×256 in Tab. 4. The FID improves from 8.7 to 6.0 on CIFAR10, from 13.0 to 10.6 on ImageNet 64×64, and from 20.7 to 13.0 on LSUN Cat 256×256.

Furthermore, Tab. 1 reports the performance of consistency training at different batch sizes, together with the model sizes used by the proposed method and by consistency training. As the table shows, the batch size has a significant impact on consistency training: with a batch size of 256, the FID degrades from 8.7 to 10.4, and with a batch size of 128, it rises to 14.4. On CIFAR10, the proposed method outperforms consistency training, achieving an FID of 6.0 with a batch size of 80, versus 8.7 with a batch size of 512. On ImageNet 64×64, it achieves an FID of 10.6 with a batch size of 320, compared to consistency training's 13.0 with a batch size of 2048. On LSUN Cat 256×256, the proposed method attains an FID of 13.0 with a batch size of 320, better than consistency training's 20.7 with a batch size of 2048. Fig. 1 shows generated samples from models trained on ImageNet 64×64 and LSUN Cat 256×256. Appendix E and Fig. E7 show more generated samples from the model trained on LSUN Cat 256×256. Appendix A provides explanations for all metrics. Appendix E shows zero-shot image inpainting.
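The batch-size and training-step comparisons reduce to simple ratios, which the following sketch computes from the Table 1 entries. The dictionary layout and function name are illustrative, and the CIFAR10 "ACT" entry refers to the ACT-Aug row.

```python
from fractions import Fraction

# (batch size, training steps in thousands) copied from Table 1.
table1 = {
    "CIFAR10": {"CT": (512, 800), "ACT": (80, 300)},
    "ImageNet 64x64": {"CT": (2048, 800), "ACT": (320, 400)},
    "LSUN Cat 256x256": {"CT": (2048, 1000), "ACT": (320, 165)},
}


def ratios(dataset):
    """Return (batch-size ratio, step ratio) of ACT relative to CT as exact fractions."""
    ct_bs, ct_steps = table1[dataset]["CT"]
    act_bs, act_steps = table1[dataset]["ACT"]
    return Fraction(act_bs, ct_bs), Fraction(act_steps, ct_steps)


for name in table1:
    bs_ratio, step_ratio = ratios(name)
    print(f"{name}: batch size x{float(bs_ratio):.3f}, steps x{float(step_ratio):.3f}")
```

The batch-size ratio is 5/32 on all three datasets, which is below 1/6. The step ratio is 3/8 on CIFAR10, exactly 1/2 on ImageNet 64×64, and 33/200 on LSUN Cat 256×256.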

4.2 Resource Consumption

We utilize the DDPM model architecture as our backbone. While DDPM's performance is not as high as that of [8] and [44], it has fewer parameters and attention layers, enabling faster execution. Our model is significantly smaller than the 63.8M model used by consistency training on CIFAR10, with only 27.5M parameters (41.6M with the discriminator during training). On the ImageNet 64×64 dataset, our model, with only 107M parameters (161M with the discriminator during training), is smaller than the 282M model used by consistency training. The smaller model and batch size reduce resource consumption. In our experiments on CIFAR10, we use 1 NVIDIA GeForce RTX 3090, as opposed to the 8 NVIDIA A100 GPUs used for consistency training. For the ImageNet 64×64 experiments, we employ 4 NVIDIA A100 GPUs, in contrast to the 64 A100 GPUs used in the consistency training setup. For the LSUN Cat 256×256 experiments, we employ 8 NVIDIA A100 GPUs, in contrast to the 64 A100 GPUs used in the consistency training setup [45].

Table 3: Sample quality of ACT on the CIFAR10 dataset. We compare ACT with state-of-the-art GANs and (efficient) diffusion models. We show that ACT achieves the best FID and IS among all the one-step diffusion models.

| Method | NFE (↓) | FID (↓) | IS (↑) |
|---|---|---|---|
| BigGAN [3] | 1 | 14.7 | 9.22 |
| AutoGAN [14] | 1 | 12.4 | 8.40 |
| ViTGAN [28] | 1 | 6.66 | 9.30 |
| TransGAN [20] | 1 | 9.26 | 9.05 |
| StyleGAN2-ADA [46] | 1 | 2.92 | 9.83 |
| StyleGAN-XL [41] | 1 | 1.85 | - |
| Score SDE [44] | 2000 | 2.20 | 9.89 |
| DDPM [19] | 1000 | 3.17 | 9.46 |
| EDM [21] | 36 | 2.04 | 9.84 |
| DDIM [42] | 50 | 4.67 | - |
| DDIM [42] | 20 | 6.84 | - |
| DDIM [42] | 10 | 8.23 | - |
| 1-Rectified Flow [30] | 1 | 378 | 1.13 |
| Glow [23] | 1 | 48.9 | 3.92 |
| Residual Flow [4] | 1 | 46.4 | - |
| DenseFlow [16] | 1 | 34.9 | - |
| DC-VAE [35] | 1 | 17.9 | 8.20 |
| CT [45] | 1 | 8.70 | 8.49 |
| ACT | 1 | 6.4 | 8.93 |
| ACT-Aug | 1 | 6.0 | 9.15 |

4.3 Ablation Study

4.3.1 Impacts of $\lambda_N$

When $\lambda_N \equiv 0$, the method reduces to consistency training. Conversely, when $\lambda_N \equiv 1$, it becomes a Generative Adversarial Network (GAN). According to the analysis in Sec. 3.2, as $\lambda_N$ increases, adversarial consistency training gains the capacity to improve model performance at smaller batch sizes by leveraging the discriminator. However, as discussed in Sec. 3.3, an overly large $\lambda_N$ can lead to an excessive consistency training loss, causing a conflict between $\mathcal{L}_{CT}$ and $\mathcal{L}_{G}$. Furthermore, it has been noted in the literature that, for GANs, high-dimensional inputs may degrade model performance [34]. Therefore, as $\lambda_N$ increases, model performance first improves and then declines. We first demonstrate mode collapse when $\lambda_N \approx 1$ on CIFAR10, as illustrated in Fig. E6. Apart from the initial $t_k$, where the residual structure from Eq. 2 produces outputs with a substantial input component and thereby prevents mode collapse, all other $t_k$ values exhibit mode collapse.
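The two limiting cases described above are consistent with a convex combination of the two losses. The sketch below assumes that form, $(1-\lambda_N)\mathcal{L}_{CT} + \lambda_N\mathcal{L}_{G}$, purely for illustration; the exact objective and the schedule of $\lambda_N$ are defined in Sec. 3 and may differ. The function name `generator_objective` is hypothetical.

```python
def generator_objective(loss_ct, loss_g, lambda_n):
    """Blend the consistency loss and the adversarial generator loss.

    Assumed form: (1 - lambda_n) * L_CT + lambda_n * L_G. It is chosen only
    because it reproduces the two limits stated in the text:
    lambda_n = 0 gives pure consistency training, lambda_n = 1 gives a pure GAN loss.
    """
    if not 0.0 <= lambda_n <= 1.0:
        raise ValueError("lambda_n must lie in [0, 1]")
    return (1.0 - lambda_n) * loss_ct + lambda_n * loss_g


# Sweep the weight between the two limits for fixed, made-up loss values.
for lam in (0.0, 0.3, 1.0):
    print(lam, generator_objective(loss_ct=2.0, loss_g=5.0, lambda_n=lam))
```

Under this assumed form, increasing $\lambda_N$ shifts gradient weight from the consistency term to the adversarial term, which is the trade-off the ablation examines.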

For a score-based model as defined in Sec. 3.1.1, the learned sampling process is the reverse of the diffusion process, $p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)$. However, the distribution $q_t(\boldsymbol{x}_0|\boldsymbol{x}_t)$ learned via Eqs. 8 and 9 does not consider the forward process of the diffusion. We conduct further experiments in which the discriminator takes the form $D(\boldsymbol{x}_0,\boldsymbol{x}_t,t,\boldsymbol{\theta}_d)$; it can be proven (Appendix C) that the distribution learned by the generator is then $p_t(\boldsymbol{x}_0|\boldsymbol{x}_t)$. However, we still observe mode collapse in these experiments. Fig. 2 illustrates the training collapse on ImageNet 64×64 when $\lambda_N \equiv 0.3$: at around 150k training steps, $\mathcal{L}_{CT}$ becomes unstable, and it collapses completely at around 170k steps.
The training curves for a proper $\lambda_N$ are included in Fig. E5, where $\mathcal{L}_{CT}$ and the other training losses remain stable. Essentially, a smaller $w_{mid}$ and a larger $w$ are preferable choices.

4.3.2 Connection between gradient penalty and training stability

In Sec. 3.3, we analyze the relationship between $\mathcal{L}_{CT}$ and $\mathcal{L}_{G}$, highlighting the importance of gradient stability. In this section, we conduct experiments to validate that analysis and to justify the ACT-Aug method proposed in Sec. 3.4.

Fig. 2 illustrates the relationship among the gradient penalty ($\mathcal{L}_{gp}$), the consistency training loss ($\mathcal{L}_{CT}$), and FID. Almost every instance of instability in $\mathcal{L}_{CT}$ is accompanied by a relatively large $\mathcal{L}_{gp}$. Fig. 3 illustrates the relationship among these three quantities on CIFAR10. In the mid-stage of training, $\mathcal{L}_{gp}$ begins to increase slowly, accompanied by a gradual increase in $\mathcal{L}_{CT}$ and FID. We therefore believe that gradient stability is crucial for adversarial consistency training. Based on this, we propose ACT-Aug (Sec. 3.4) for small datasets, which uses $\mathcal{L}_{gp}$ as an indicator to adjust the probability of data augmentation, thereby stabilizing $\mathcal{L}_{gp}$ around a certain value.
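The idea of steering the augmentation probability with $\mathcal{L}_{gp}$ can be sketched as a simple feedback controller. The version below is a hypothetical illustration, not the update rule of Sec. 3.4: it assumes an exponential moving average of $\mathcal{L}_{gp}$ compared against a target level, with a fixed step for the probability. The class name, parameter names, and the values used in the demo are all assumptions.

```python
class AugmentationController:
    """Hypothetical feedback rule: raise the augmentation probability p when
    the smoothed gradient penalty exceeds a target level, lower it otherwise."""

    def __init__(self, target_gp, ema_momentum, step):
        self.target_gp = target_gp        # level around which L_gp should stabilize
        self.ema_momentum = ema_momentum  # smoothing factor for the running L_gp
        self.step = step                  # change applied to p on each update
        self.ema_gp = None                # running average of L_gp
        self.p = 0.0                      # current augmentation probability

    def update(self, loss_gp):
        """Fold in a new L_gp value and return the updated probability."""
        if self.ema_gp is None:
            self.ema_gp = loss_gp
        else:
            m = self.ema_momentum
            self.ema_gp = m * self.ema_gp + (1.0 - m) * loss_gp
        direction = 1.0 if self.ema_gp > self.target_gp else -1.0
        # Keep p a valid probability.
        self.p = min(1.0, max(0.0, self.p + direction * self.step))
        return self.p


# Demo with made-up penalty values: a rising L_gp pushes p upward.
controller = AugmentationController(target_gp=0.5, ema_momentum=0.9, step=0.1)
for gp in (0.2, 0.4, 0.9, 1.2, 1.5):
    print(gp, controller.update(gp))
```

The design choice here is a bang-bang controller on a smoothed signal, which is one common way to hold a noisy training statistic near a set point; other rules would serve the same purpose.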

Table 4: Sample quality of ACT on the LSUN Cat dataset at a resolution of 256×256. ACT significantly outperforms CT. PD and CD are distillation techniques.

| Method | NFE (↓) | FID (↓) | Prec. (↑) | Rec. (↑) |
|---|---|---|---|---|
| DDPM [19] | 1000 | 17.1 | 0.53 | 0.48 |
| ADM [8] | 1000 | 5.57 | 0.63 | 0.52 |
| EDM [21] | 79 | 6.69 | 0.70 | 0.43 |
| PD [39] | 1 | 18.3 | 0.60 | 0.49 |
| CD [45] | 1 | 11.0 | 0.65 | 0.36 |
| CT [45] | 1 | 20.7 | 0.56 | 0.23 |
| ACT | 1 | 13.0 | 0.69 | 0.30 |

Figure 1: Generated samples on ImageNet 64×64 (top two rows) and LSUN Cat 256×256 (third row).
Figure 2: $\mathcal{L}_{gp}$, $\mathcal{L}_{CT}$, and FID of ACT on ImageNet 64×64 with $\lambda_N \equiv 0.3$. An overly large $\lambda_N$ leads to training collapse. Additionally, drastic changes in $\mathcal{L}_{gp}$ closely follow changes in $\mathcal{L}_{CT}$.
Figure 3: $\mathcal{L}_{gp}$, $\mathcal{L}_{CT}$, and FID of ACT on CIFAR10 with $\lambda_N \equiv 0.3$, an appropriate $\lambda_N$. In the later stages of training, without data augmentation, $\mathcal{L}_{CT}$, $\mathcal{L}_{gp}$, and FID all show relatively large increases.

4.3.3 Discriminator

Activation Function   GANs generally employ LeakyReLU as the activation function for the discriminator, as it is typically considered to provide better gradients for the generator. On the other hand, SiLU is the activation function chosen for DDPM, and it is generally regarded as a stronger activation function than LeakyReLU. Tab. 5 displays the FID scores of different activation functions on CIFAR10 at 50k and 150k training steps. Contrary to previous findings, we find that using SiLU for the discriminator leads to faster convergence and better final performance. A possible reason is that $\mathcal{L}_{CT}$ provides an additional gradient direction, which mitigates overfitting of the discriminator.

Different Backbone   Tab. 5 also displays the FID scores of different architectures on CIFAR10 at 50k and 150k training steps. We evaluate the discriminators of StyleGAN2 and ProjectedGAN, as well as the downsampling part of DDPM (simply denoted as DDPM) described in Appendix A. Because residual structures play a significant role in the design of GAN discriminators, we also incorporate residual connections between different downsampling blocks in DDPM, denoted as DDPM-res. DDPM performs the best. Although DDPM-res converges faster during the early stages of training, its performance in the later stages is not as good as that of DDPM. Furthermore, we find that DDPM demonstrates superior training stability compared to DDPM-res. We also experiment with whether or not to feed $t$ into the discriminator, denoted as $t$-emb, and find that feeding $t$ yields better results. This might be because the optimal discriminator varies with $t_k$, hence the necessity of $t$-emb for better fitting.
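One way to see why the optimal discriminator may depend on $t_k$ is the classical result for the standard GAN objective of [15]. As an illustration only, assume that objective and write $p_{\boldsymbol{\theta},t_k}$ (a symbol introduced here, not taken from the paper) for the distribution of the generator outputs $\boldsymbol{f}(\boldsymbol{x}_{t_k},t_k,\boldsymbol{\theta})$. For a fixed generator, the optimal discriminator is then

```latex
D^{*}(\boldsymbol{x}, t_k)
  = \frac{p_{\mathrm{data}}(\boldsymbol{x})}
         {p_{\mathrm{data}}(\boldsymbol{x}) + p_{\boldsymbol{\theta},t_k}(\boldsymbol{x})}
```

Under this assumption, whenever $p_{\boldsymbol{\theta},t_k}$ differs across timesteps during training, $D^{*}$ differs as well, and a discriminator that does not receive $t$ has to fit all of these targets with a single function of $\boldsymbol{x}$.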

Table 5: Ablation study of the discriminator.

| Discriminator | Activation | $t$-emb | FID (50k) | FID (150k) |
|---|---|---|---|---|
| DDPM-res | LeakyReLU | False | 18.7 | 10.6 |
| DDPM-res | LeakyReLU | True | 11.5 | 7.4 |
| DDPM-res | SiLU | True | 9.9 | 7.0 |
| DDPM | SiLU | True | 12.5 | 6.5 |
| StyleGAN2 | LeakyReLU | True | 16.7 | 9.5 |
| ProjectedGAN | LeakyReLU | True | 19.4 | 16.6 |

5 Conclusion

We proposed Adversarial Consistency Training (ACT), an improvement over consistency training. By analyzing the consistency training loss, which is proven to be an upper bound of the Wasserstein distance between the sampling and target distributions, we introduced a method that directly employs the Jensen-Shannon divergence to minimize the distance between the generated and target distributions. This approach enables superior generation quality with less than 1/6 of the original batch size and approximately 1/2 of the original model parameters and training steps, thereby reducing resource consumption. Our method retains the beneficial capabilities of consistency models, such as inpainting. Additionally, we proposed gradient penalty-based adaptive data augmentation to improve performance on small datasets. The effectiveness of the method has been validated on the CIFAR10, ImageNet 64×64, and LSUN Cat 256×256 datasets, highlighting its potential for broader application in image generation.

However, the interaction between $\mathcal{L}_{CT}$ and $\mathcal{L}_{G}$ can be further explored to improve our method. In addition to the JS divergence, other distances can also be used to reduce the distance between the generated and target distributions. In the future, we will focus on these two aspects to further boost performance.

6 Acknowledgement

Fei Kong and Xiaoshuang Shi were supported by the National Natural Science Foundation of China (No. 62276052).

References

  • Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan, 2017.
  • Barratt and Sharma [2018] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv: Machine Learning, abs/1801.01973, 2018.
  • Brock et al. [2019] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019.
  • Chen et al. [2019] Ricky T. Q. Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. In Conference on Neural Information Processing Systems, pages 9913–9923, 2019.
  • Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. Proceedings - IEEE Computer Society Conference on Computer Vision and Pattern Recognition, abs/1512.00567(1):2818–2826, 2016.
  • Daras et al. [2023] Giannis Daras, Yuval Dagan, Alexandros G Dimakis, and Constantinos Daskalakis. Consistent diffusion models: Mitigating sampling drift by learning to be consistent. arXiv preprint arXiv:2302.09057, 2023.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In Advances in neural information processing systems, pages 8780–8794, 2021.
  • Dockhorn et al. [2022] Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score-based generative modeling with critically-damped langevin diffusion. In International Conference on Learning Representations, 2022.
  • Donahue et al. [2018] Chris Donahue, Julian McAuley, and Miller Puckette. Adversarial audio synthesis. arXiv preprint arXiv:1802.04208, 2018.
  • Donahue et al. [2017] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In International Conference on Learning Representations, 2017.
  • Duan et al. [2023] Jinhao Duan, Fei Kong, Shiqi Wang, Xiaoshuang Shi, and Kaidi Xu. Are diffusion models vulnerable to membership inference attacks? In International Conference on Machine Learning, 2023.
  • Dumoulin et al. [2017] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martín Arjovsky, Olivier Mastropietro, and Aaron C. Courville. Adversarially learned inference. In International Conference on Learning Representations, 2017.
  • Gong et al. [2019] Xinyu Gong, Shiyu Chang, Yifan Jiang, and Zhangyang Wang. Autogan: Neural architecture search for generative adversarial networks. In IEEE International Conference on Computer Vision, pages 3223–3233, 2019.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, 2014.
  • Grcić et al. [2021] Matej Grcić, Ivan Grubišić, and Siniša Šegvić. Densely connected normalizing flows. In Conference on Neural Information Processing Systems, pages 23968–23982, 2021.
  • Gulrajani et al. [2017] Ishaan Gulrajani, Faruk Ahmed, Martín Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, 2017.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Conference on Neural Information Processing Systems, 2017.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, pages 6840–6851, 2020.
  • Jiang et al. [2021] Yifan Jiang, Shiyu Chang, and Zhangyang Wang. Transgan: Two pure transformers can make one strong gan, and that can scale up. In Advances in Neural Information Processing Systems, pages 14745–14758, 2021.
  • Karras et al. [2022] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Conference on Neural Information Processing Systems, 2022.
  • Kim et al. [2023] Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ode trajectory of diffusion. arXiv preprint arXiv:2310.02279, 2023.
  • Kingma and Dhariwal [2018] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Conference on Neural Information Processing Systems, 2018.
  • Kong et al. [2023] Fei Kong, Jinhao Duan, RuiPeng Ma, Hengtao Shen, Xiaofeng Zhu, Xiaoshuang Shi, and Kaidi Xu. An efficient membership inference attack for the diffusion model by proximal initialization. arXiv preprint arXiv:2305.18355, 2023.
  • Kong et al. [2021] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations, 2021.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images, 2009.
  • Kynkäänniemi et al. [2019] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Advances in neural information processing systems, pages 3929–3938, 2019.
  • Lee et al. [2022] Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, and Ce Liu. Vitgan: Training gans with vision transformers. In International Conference on Learning Representations, 2022.
  • Liu et al. [2022] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations, 2022.
  • Liu et al. [2023] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations, 2023.
  • Liu et al. [2024] Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177, 2024.
  • Lugmayr et al. [2022] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.
  • Miyato et al. [2018] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
  • Padala et al. [2021] Manisha Padala, Debojit Das, and Sujit Gujar. Effect of input noise dimension in gans. In Neural Information Processing, pages 558–569. Springer, 2021.
  • Parmar et al. [2021] Gaurav Parmar, Dacheng Li, Kwonjoon Lee, and Zhuowen Tu. Dual contradistinctive generative autoencoder. Proceedings - IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 823–832, 2021.
  • Popov et al. [2021] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov. Grad-tts: A diffusion probabilistic model for text-to-speech. In International Conference on Machine Learning, pages 8599–8608, 2021.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10674–10685, 2022.
  • Salimans and Ho [2022] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022.
  • Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, 2016.
  • Sauer et al. [2022] Axel Sauer, Katja Schwarz, and Andreas Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In International Conference on Computer Graphics and Interactive Techniques, pages 1–10, 2022.
  • Song et al. [2021a] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a.
  • Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pages 11895–11907, 2019.
  • Song et al. [2021b] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021b.
  • Song et al. [2023] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. Computing Research Repository, abs/2303.01469, 2023.
  • Karras et al. [2020a] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. In Conference on Neural Information Processing Systems, pages 12104–12114, 2020a.
  • Karras et al. [2020b] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Computer Vision and Pattern Recognition, pages 8107–8116, 2020b.
  • Thanh-Tung et al. [2019] Hoang Thanh-Tung, Truyen Tran, and Svetha Venkatesh. Improving generalization and stability of generative adversarial networks. In International Conference on Learning Representations, 2019.
  • von Platen et al. [2022] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
  • Xiao et al. [2022] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion gans. In International Conference on Learning Representations, 2022.
  • Yu et al. [2015] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
  • Yuan and Moghaddam [2020] Chenxi Yuan and Mohsen Moghaddam. Attribute-aware generative design with generative adversarial networks. IEEE Access, 8:190710–190721, 2020.
  • Yuan et al. [2023a] Chenxi Yuan, Jinhao Duan, Nicholas J Tustison, Kaidi Xu, Rebecca A Hubbard, and Kristin A Linn. Remind: Recovery of missing neuroimaging using diffusion models with application to alzheimer’s disease. medRxiv, pages 2023–08, 2023a.
  • Yuan et al. [2023b] Chenxi Yuan, Tucker Marion, and Mohsen Moghaddam. Dde-gan: Integrating a data-driven design evaluator into generative adversarial networks for desirable and diverse concept generation. Journal of Mechanical Design, 145(4):041407, 2023b.
  • Zhang et al. [2020] Han Zhang, Zizhao Zhang, Augustus Odena, and Honglak Lee. Consistency regularization for generative adversarial networks. In International Conference on Learning Representations, 2020.
  • Zhao et al. [2020] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient gan training. In Conference on Neural Information Processing Systems, pages 7559–7570, 2020.
  • Zheng et al. [2023] Hongkai Zheng, Weili Nie, Arash Vahdat, Kamyar Azizzadenesheli, and Anima Anandkumar. Fast sampling of diffusion models via operator learning. International Conference on Machine Learning, 2023.

Supplementary Material

Appendix A Architecture and Experiment settings

Architecture   For the consistency model, we employ a structure similar to that of DDPM [19], except that the corresponding embeddings are changed to continuous time. We use the Python library diffusers [49]. For the discriminator, we employ the downsampling structure of DDPM, preserving it up to the mid-block, and then add a linear layer that maps the output to $\mathbb{R}$. Additionally, the layers-per-block parameter of the discriminator is set to 150% of that in the consistency model, with all other parameters remaining the same. The parameters passed to UNet2DModel are listed in Tab. A1, where B = 128. For the block types, ‘D’ represents DownBlock2D, ‘A’ stands for either AttnDownBlock2D or AttnUpBlock2D, and ‘U’ means UpBlock2D.

| Parameter | CIFAR10 | ImageNet 64×64 | LSUN Cat 256×256 |
|---|---|---|---|
| layers_per_block | 2 | 2 | 2 |
| block_out_channels | (1B,1B,2B,2B) | (1B,2B,2B,4B,4B) | (1B,1B,2B,2B,4B,4B) |
| down_block_types | DADD | DDADD | DDDDAD |
| up_block_types | UUAU | UUAUU | UAUUUU |
| attention_head_dim | 8 | 16 | 16 |

Table A1: The parameters passed to the UNet2DModel. For those not listed, the default settings from the diffusers library are used.
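The shorthand in Tab. A1 can be expanded mechanically into keyword arguments. The sketch below does this in plain Python so that it runs without diffusers installed; the helper name `unet_kwargs` and the constant names are illustrative, and the commented-out constructor call uses values for `sample_size`, `in_channels`, and `out_channels` that are assumptions rather than settings taken from the table.

```python
# Mapping from the one-letter codes of Tab. A1 to diffusers block class names.
BLOCK_NAMES = {
    "down": {"D": "DownBlock2D", "A": "AttnDownBlock2D"},
    "up": {"U": "UpBlock2D", "A": "AttnUpBlock2D"},
}


def unet_kwargs(down, up, channel_multipliers, attention_head_dim,
                layers_per_block=2, base_channels=128):
    """Expand the Tab. A1 shorthand into keyword arguments for UNet2DModel."""
    if not (len(down) == len(up) == len(channel_multipliers)):
        raise ValueError("down, up and channel_multipliers must have equal length")
    return {
        "layers_per_block": layers_per_block,
        # B = 128, so a multiplier of 2 means 256 output channels.
        "block_out_channels": tuple(m * base_channels for m in channel_multipliers),
        "down_block_types": tuple(BLOCK_NAMES["down"][c] for c in down),
        "up_block_types": tuple(BLOCK_NAMES["up"][c] for c in up),
        "attention_head_dim": attention_head_dim,
    }


CIFAR10 = unet_kwargs("DADD", "UUAU", (1, 1, 2, 2), attention_head_dim=8)
IMAGENET64 = unet_kwargs("DDADD", "UUAUU", (1, 2, 2, 4, 4), attention_head_dim=16)
LSUN_CAT256 = unet_kwargs("DDDDAD", "UAUUUU", (1, 1, 2, 2, 4, 4), attention_head_dim=16)

# With diffusers installed, a model could then be built along these lines
# (sample_size and the channel counts are assumptions, not values from Tab. A1):
#   from diffusers import UNet2DModel
#   model = UNet2DModel(sample_size=32, in_channels=3, out_channels=3, **CIFAR10)
print(CIFAR10)
```

Keeping the table in its compact form and expanding it in code makes it easy to check that the down path, up path, and channel list have matching lengths for each dataset.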

Experiment settings   In this section, we report the configuration of the hyperparameters used in our experiments. Tab. A2 provides a summary of the experimental setup. Unless otherwise specified, the learning rate is identical for the consistency model and the discriminator. The experiments conducted during the ablation study (Sec. 4.3) follow the settings outlined in this table, with the exception of the parameters specifically varied for the ablation. Additionally, when ProjectedGAN is used as the discriminator, the learning rate of the discriminator is set to 0.002, with $w$ and $w_{mid}$ set to 0.1.

Metrics   The metrics used are IS, FID, Improved Precision and Improved Recall. The Inception Score (IS), introduced in [40], assesses a model’s ability to generate convincing images of distinct ImageNet classes and capture the overall class distribution. However, it has a limitation in that it doesn’t incentivize capturing the full distribution or the diversity within classes, leading to models with high IS even if they only memorize a small portion of the dataset, as noted in [2]. To address the need for a metric that better reflects diversity, the Fréchet Inception Distance (FID) was introduced in [18]. This metric is argued to align more closely with human judgment than IS, and it quantifies the similarity between two image distributions in the latent space of Inception-V3 as detailed in [5]. Additionally, [27] developed Improved Precision and Recall metrics that evaluate the fidelity of generated samples by determining the proportion that aligns with the data manifold (precision) and the diversity by the proportion of real samples that are represented in the generated sample manifold (recall).
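For reference, the two Inception-based metrics described above have the following standard closed forms. Here $p_g$ is the distribution of generated images, $p(y\mid\boldsymbol{x})$ is the class posterior of the Inception network, $p(y)$ is its marginal over generated images, and $(\boldsymbol{\mu}_r,\boldsymbol{\Sigma}_r)$ and $(\boldsymbol{\mu}_g,\boldsymbol{\Sigma}_g)$ are the mean and covariance of Inception features for real and generated images. The notation is chosen here for illustration and is not taken from the paper.

```latex
\mathrm{IS} = \exp\!\Big(\mathbb{E}_{\boldsymbol{x}\sim p_g}
  \big[D_{\mathrm{KL}}\big(p(y\mid\boldsymbol{x})\,\|\,p(y)\big)\big]\Big),
\qquad
\mathrm{FID} = \|\boldsymbol{\mu}_r-\boldsymbol{\mu}_g\|_2^2
  + \operatorname{Tr}\!\big(\boldsymbol{\Sigma}_r+\boldsymbol{\Sigma}_g
  - 2(\boldsymbol{\Sigma}_r\boldsymbol{\Sigma}_g)^{1/2}\big)
```

A higher IS and a lower FID are better, which matches the arrows used in the table headers.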

| Hyperparameter | CIFAR10 | ImageNet 64×64 | LSUN Cat 256×256 |
|---|---|---|---|
| Discriminator | DDPM | DDPM | DDPM |
| Learning rate | 1e-4 | 5e-5 | 1e-5 |
| Batch size | 80 | 320 | 320 |
| $\mu_0$ | 0.9 | 0.95 | 0.95 |
| $s_0$ | 2 | 2 | 2 |
| $s_1$ | 150 | 200 | 150 |
| $w_{mid}$ | 0.3 | 0.2 | 0.1 |
| $w$ | 0.3 | 0.6 | 0.6 |
| $I_{gp}$ | 16 | 16 | 16 |
| $w_{gp}$ | 10 | 10 | 10 |
| $\tau$ | 0.55 | - | - |
| $\mu_p$ | 0.93 | - | - |
| $p_r$ | 0.05 | - | - |
| Training iterations | 300k | 400k | 165k |
| Mixed-Precision | No | Yes | Yes |
| Number of GPUs | 1×RTX 3090 | 4×A100 | 8×A100 |

Table A2: Summary of the experimental setup.

Appendix B Details of the Proof for Theorem 3.1

Details for Eq. 6:

$$
\begin{aligned}
&\mathbb{E}_{\boldsymbol{x}_{t_k},\boldsymbol{y}_{t_k}\sim\gamma^{*}}\left[\|\boldsymbol{f}(\boldsymbol{x}_{t_k},t_k,\boldsymbol{\theta})-\boldsymbol{g}(\boldsymbol{y}_{t_k},t_k)\|\right] \\
={}&\mathbb{E}_{\boldsymbol{x}_{t_k},\boldsymbol{y}_{t_k}\sim\gamma^{*}}\left[\|\boldsymbol{g}(\boldsymbol{y}_{t_k},t_k)-\boldsymbol{f}(\boldsymbol{y}_{t_k},t_k,\boldsymbol{\theta})+\boldsymbol{f}(\boldsymbol{y}_{t_k},t_k,\boldsymbol{\theta})-\boldsymbol{f}(\boldsymbol{x}_{t_k},t_k,\boldsymbol{\theta})\|\right] \\
\leq{}&\mathbb{E}_{\boldsymbol{x}_{t_k},\boldsymbol{y}_{t_k}\sim\gamma^{*}}\left[\|\boldsymbol{g}(\boldsymbol{y}_{t_k},t_k)-\boldsymbol{f}(\boldsymbol{y}_{t_k},t_k,\boldsymbol{\theta})\|+\|\boldsymbol{f}(\boldsymbol{y}_{t_k},t_k,\boldsymbol{\theta})-\boldsymbol{f}(\boldsymbol{x}_{t_k},t_k,\boldsymbol{\theta})\|\right]
\end{aligned}
$$
(i)𝑖\displaystyle\overset{({i})}{\leq}start_OVERACCENT ( italic_i ) end_OVERACCENT start_ARG ≤ end_ARG 𝔼𝒙tk,𝒚tkγ[𝒈(𝒚tk,tk)𝒇(𝒚tk,tk,𝜽)\displaystyle\mathbb{E}_{\boldsymbol{x}_{t_{k}},\boldsymbol{y}_{t_{k}}\sim% \gamma^{*}}[\|\boldsymbol{g}(\boldsymbol{y}_{t_{k}},t_{k})-\boldsymbol{f}(% \boldsymbol{y}_{t_{k}},t_{k},\boldsymbol{\theta})\|blackboard_E start_POSTSUBSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∼ italic_γ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT [ ∥ bold_italic_g ( bold_italic_y start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - bold_italic_f ( bold_italic_y start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ ) ∥
+L𝒚tk𝒙tk]\displaystyle\qquad\qquad\qquad+L\|\boldsymbol{y}_{t_{k}}-\boldsymbol{x}_{t_{k% }}\|]+ italic_L ∥ bold_italic_y start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT - bold_italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ ]
=\displaystyle== 𝔼𝒙tk,𝒚tkγ[𝒈(𝒚tk,tk)𝒇(𝒚tk,tk,𝜽)]subscript𝔼similar-tosubscript𝒙subscript𝑡𝑘subscript𝒚subscript𝑡𝑘superscript𝛾delimited-[]norm𝒈subscript𝒚subscript𝑡𝑘subscript𝑡𝑘𝒇subscript𝒚subscript𝑡𝑘subscript𝑡𝑘𝜽\displaystyle\mathbb{E}_{\boldsymbol{x}_{t_{k}},\boldsymbol{y}_{t_{k}}\sim% \gamma^{*}}[\|\boldsymbol{g}(\boldsymbol{y}_{t_{k}},t_{k})-\boldsymbol{f}(% \boldsymbol{y}_{t_{k}},t_{k},\boldsymbol{\theta})\|]blackboard_E start_POSTSUBSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∼ italic_γ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT [ ∥ bold_italic_g ( bold_italic_y start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - bold_italic_f ( bold_italic_y start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ ) ∥ ]
+L𝔼𝒙tk,𝒚tkγ[𝒚tk𝒙tk]𝐿subscript𝔼similar-tosubscript𝒙subscript𝑡𝑘subscript𝒚subscript𝑡𝑘superscript𝛾delimited-[]normsubscript𝒚subscript𝑡𝑘subscript𝒙subscript𝑡𝑘\displaystyle\qquad\qquad\qquad+L\mathbb{E}_{\boldsymbol{x}_{t_{k}},% \boldsymbol{y}_{t_{k}}\sim\gamma^{*}}[\|\boldsymbol{y}_{t_{k}}-\boldsymbol{x}_% {t_{k}}\|]+ italic_L blackboard_E start_POSTSUBSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT , bold_italic_y start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∼ italic_γ start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT [ ∥ bold_italic_y start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT - bold_italic_x start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∥ ]
=\displaystyle{=}= 𝔼𝒚tkptk[𝒈(𝒚tk,tk)𝒇(𝒚tk,tk,𝜽)]+L𝒲[qtk,ptk].subscript𝔼similar-tosubscript𝒚subscript𝑡𝑘subscript𝑝subscript𝑡𝑘delimited-[]norm𝒈subscript𝒚subscript𝑡𝑘subscript𝑡𝑘𝒇subscript𝒚subscript𝑡𝑘subscript𝑡𝑘𝜽𝐿𝒲subscript𝑞subscript𝑡𝑘subscript𝑝subscript𝑡𝑘\displaystyle\mathbb{E}_{\boldsymbol{y}_{t_{k}}\sim p_{t_{k}}}[\|\boldsymbol{g% }(\boldsymbol{y}_{t_{k}},t_{k})-\boldsymbol{f}(\boldsymbol{y}_{t_{k}},t_{k},% \boldsymbol{\theta})\|]+L\mathcal{W}[q_{t_{k}},p_{t_{k}}].blackboard_E start_POSTSUBSCRIPT bold_italic_y start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ∼ italic_p start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ ∥ bold_italic_g ( bold_italic_y start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ) - bold_italic_f ( bold_italic_y start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT , bold_italic_θ ) ∥ ] + italic_L caligraphic_W [ italic_q start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT italic_t start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT end_POSTSUBSCRIPT ] .

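Here, (i) holds because $\boldsymbol{f}$ satisfies the Lipschitz condition with constant $L$. The final equality uses that $\gamma^{*}$ is the optimal coupling of $q_{t_k}$ and $p_{t_k}$: its $\boldsymbol{y}_{t_k}$-marginal is $p_{t_k}$, and its expected distance $\mathbb{E}_{\gamma^{*}}[\|\boldsymbol{y}_{t_k}-\boldsymbol{x}_{t_k}\|]$ equals $\mathcal{W}[q_{t_k},p_{t_k}]$.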
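The bound above can be sanity-checked numerically in one dimension. The following is a minimal sketch under stated assumptions, not the paper's implementation: `f` and `g` are hypothetical toy stand-ins for the learned and ground-truth consistency functions, the two Gaussians stand in for $q_{t_k}$ and $p_{t_k}$, and the Wasserstein-1 distance is estimated by pairing sorted samples, which is an optimal coupling for equal-size one-dimensional samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
L = 0.8  # Lipschitz constant of the toy f below (assumption)

def f(x):
    # Hypothetical learned consistency function; |f'(x)| <= 0.8, so f is L-Lipschitz.
    return L * np.sin(x)

def g(y):
    # Hypothetical ground-truth consistency function.
    return 0.5 * np.tanh(y)

# Sorted samples: pairing the i-th smallest x with the i-th smallest y
# is an optimal coupling for the cost |x - y| in one dimension.
x = np.sort(rng.normal(0.3, 1.2, n))  # stand-in for samples from q_{t_k}
y = np.sort(rng.normal(0.0, 1.0, n))  # stand-in for samples from p_{t_k}

w1 = np.mean(np.abs(x - y))                    # empirical W[q, p]
lhs = np.mean(np.abs(f(x) - g(y)))             # E ||f(x) - g(y)||
rhs = np.mean(np.abs(g(y) - f(y))) + L * w1    # E ||g(y) - f(y)|| + L * W[q, p]

print(f"lhs = {lhs:.4f}, rhs = {rhs:.4f}")
```

Because the triangle inequality and the Lipschitz bound hold for every sample pair, `lhs <= rhs` holds here for any coupling; the optimal coupling simply gives the tightest right-hand side.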

Details for LABEL:E2:

$$
\begin{aligned}
&\mathbb{E}_{\boldsymbol{y}_{t_k}\sim p_{t_k}}\big[\|\boldsymbol{g}(\boldsymbol{y}_{t_k},t_k)-\boldsymbol{f}(\boldsymbol{y}_{t_k},t_k,\boldsymbol{\theta})\|\big] \\
\overset{(i)}{=}{}&\mathbb{E}_{\boldsymbol{y}_{t_k}\sim p_{t_k}}\big[\|\boldsymbol{g}(\boldsymbol{y}_{t_{k-1}},t_{k-1})-\boldsymbol{f}(\boldsymbol{y}_{t_{k-1}},t_{k-1},\boldsymbol{\theta}) \\
&\quad+\boldsymbol{f}(\boldsymbol{y}_{t_{k-1}},t_{k-1},\boldsymbol{\theta})-\boldsymbol{f}(\boldsymbol{y}_{t_{k-1}}^{\phi},t_{k-1},\boldsymbol{\theta}) \\
&\quad+\boldsymbol{f}(\boldsymbol{y}_{t_{k-1}}^{\phi},t_{k-1},\boldsymbol{\theta})-\boldsymbol{f}(\boldsymbol{y}_{t_k},t_k,\boldsymbol{\theta})\|\big] \\
\leq{}&\mathbb{E}_{\boldsymbol{y}_{t_k}\sim p_{t_k}}\big[\|\boldsymbol{g}(\boldsymbol{y}_{t_{k-1}},t_{k-1})-\boldsymbol{f}(\boldsymbol{y}_{t_{k-1}},t_{k-1},\boldsymbol{\theta})\|\big] \\
&\quad+\mathbb{E}_{\boldsymbol{y}_{t_k}\sim p_{t_k}}\big[\|\boldsymbol{f}(\boldsymbol{y}_{t_{k-1}},t_{k-1},\boldsymbol{\theta})-\boldsymbol{f}(\boldsymbol{y}_{t_{k-1}}^{\phi},t_{k-1},\boldsymbol{\theta})\|\big] \\
&\quad+\mathbb{E}_{\boldsymbol{y}_{t_k}\sim p_{t_k}}\big[\|\boldsymbol{f}(\boldsymbol{y}_{t_{k-1}}^{\phi},t_{k-1},\boldsymbol{\theta})-\boldsymbol{f}(\boldsymbol{y}_{t_k},t_k,\boldsymbol{\theta})\|\big] \\
\overset{(ii)}{\leq}{}&\mathbb{E}_{\boldsymbol{y}_{t_k}\sim p_{t_k}}\big[\|\boldsymbol{g}(\boldsymbol{y}_{t_{k-1}},t_{k-1})-\boldsymbol{f}(\boldsymbol{y}_{t_{k-1}},t_{k-1},\boldsymbol{\theta})\|\big] \\
&\quad+L\,\mathbb{E}_{\boldsymbol{y}_{t_k}\sim p_{t_k}}\big[\|\boldsymbol{y}_{t_{k-1}}-\boldsymbol{y}_{t_{k-1}}^{\phi}\|\big] \\
&\quad+\mathbb{E}_{\boldsymbol{y}_{t_k}\sim p_{t_k}}\big[\|\boldsymbol{f}(\boldsymbol{y}_{t_{k-1}}^{\phi},t_{k-1},\boldsymbol{\theta})-\boldsymbol{f}(\boldsymbol{y}_{t_k},t_k,\boldsymbol{\theta})\|\big] \\
\overset{(iii)}{=}{}&\mathbb{E}_{\boldsymbol{y}_{t_{k-1}}\sim p_{t_{k-1}}}\big[\|\boldsymbol{g}(\boldsymbol{y}_{t_{k-1}},t_{k-1})-\boldsymbol{f}(\boldsymbol{y}_{t_{k-1}},t_{k-1},\boldsymbol{\theta})\|\big] \\
&\quad+L\,(t_k-t_{k-1})\,O(t_k-t_{k-1}) \\
&\quad+\mathbb{E}_{\boldsymbol{y}_{t_k}\sim p_{t_k}}\big[\|\boldsymbol{f}(\boldsymbol{y}_{t_{k-1}}^{\phi},t_{k-1},\boldsymbol{\theta})-\boldsymbol{f}(\boldsymbol{y}_{t_k},t_k,\boldsymbol{\theta})\|\big]
\end{aligned}
$$

Here, (i) holds because $\boldsymbol{g}$ is a consistency function, so $\boldsymbol{g}(\boldsymbol{y}_{t_k},t_k)=\boldsymbol{g}(\boldsymbol{y}_{t_{k-1}},t_{k-1})$. (ii) holds because $\boldsymbol{f}$ satisfies the Lipschitz condition. (iii) holds because $\Phi$ is an Euler solver, hence $\|\boldsymbol{y}_{t_{k-1}}-\boldsymbol{y}_{t_{k-1}}^{\phi}\|$ does not exceed the truncation error $O((t_k-t_{k-1})^{2})$.
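The quadratic truncation error used in step (iii) can be illustrated on a toy problem. The sketch below is an assumption-laden illustration rather than the paper's setting: it replaces the probability-flow ODE with the hypothetical scalar ODE $dy/dt=-y$, takes one backward Euler-solver step from $t_k$ to $t_{k-1}=t_k-h$, and compares it with the exact solution for several step sizes $h$.

```python
import math

def euler_step_error(h, t_k=1.0, y_k=1.0):
    """One-step error of an explicit Euler step from t_k back to t_k - h.

    Toy ODE (assumption): dy/dt = -y, with exact solution
    y(t) = y_k * exp(-(t - t_k)).
    """
    t_prev = t_k - h
    y_exact = y_k * math.exp(-(t_prev - t_k))   # exact y_{t_{k-1}}
    y_euler = y_k + (t_prev - t_k) * (-y_k)     # Euler estimate y^phi_{t_{k-1}}
    return abs(y_exact - y_euler)

hs = [0.1, 0.05, 0.025]
errors = [euler_step_error(h) for h in hs]
# Halving h should shrink an O(h^2) error by a factor close to 4.
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]

for h, e in zip(hs, errors):
    print(f"h = {h:<6} error = {e:.3e}")
print("error ratios when h is halved:", [round(r, 3) for r in ratios])
```

The ratios approach 4 as $h$ shrinks, which matches a local error of order $(t_k-t_{k-1})^2$ per step.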

Appendix C Conditional Discriminator

Theorem C.1.

Given a generator $G(\boldsymbol{z},\boldsymbol{x}_t,t)$ and a discriminator $D(\boldsymbol{x}_0,\boldsymbol{x}_t,t)$, the distribution of the optimal solution of $G(\cdot,\boldsymbol{x}_t,t)$ for the problem in Eq. 11 is $p_g(\cdot|\boldsymbol{x}_t)=p(\cdot|\boldsymbol{x}_t)$, where $p_g(\cdot|\boldsymbol{x}_t)$ is the sample distribution of $G(\boldsymbol{z},\boldsymbol{x}_t,t)$ with $\boldsymbol{z}\sim p_{\boldsymbol{z}}(\boldsymbol{z}|\boldsymbol{x}_t)$, and $p_{\boldsymbol{z}}(\cdot|\boldsymbol{x}_t)$ is a normal distribution. Here $\boldsymbol{x}_t\sim p_t$ and $\boldsymbol{x}_0\sim p_0$, where $p_t$ is the marginal distribution of a diffusion process.

$$
\begin{split}
\min_{G}\max_{D}V(G,D)={}&\mathbb{E}_{\boldsymbol{x}_0,\boldsymbol{x}_t\sim p(\boldsymbol{x}_0,\boldsymbol{x}_t)}\big[\log D(\boldsymbol{x}_0,\boldsymbol{x}_t)\big] \\
&+\mathbb{E}_{\boldsymbol{z}\sim p_{\boldsymbol{z}}(\boldsymbol{z}|\boldsymbol{x}_t),\,\boldsymbol{x}_t\sim p_t}\big[\log(1-D(G(\boldsymbol{z},\boldsymbol{x}_t,t),\boldsymbol{x}_t))\big]
\end{split}
\tag{11}
$$

Proof.

By expressing Eq. 11 in integral form, we have the following equation:

$$
\begin{aligned}
&\iint_{\boldsymbol{x}_0,\boldsymbol{x}_t}p(\boldsymbol{x}_0,\boldsymbol{x}_t)\log(D(\boldsymbol{x}_0,\boldsymbol{x}_t))\,d\boldsymbol{x}_0\,d\boldsymbol{x}_t \\
&\quad+\iint_{\boldsymbol{z},\boldsymbol{x}_t}p_{\boldsymbol{z}}(\boldsymbol{z},\boldsymbol{x}_t)\log(1-D(G(\boldsymbol{z},\boldsymbol{x}_t),\boldsymbol{x}_t))\,d\boldsymbol{z}\,d\boldsymbol{x}_t \\
={}&\int_{\boldsymbol{x}_t}p_t(\boldsymbol{x}_t)\Big(\int_{\boldsymbol{x}_0}p(\boldsymbol{x}_0|\boldsymbol{x}_t)\log(D(\boldsymbol{x}_0,\boldsymbol{x}_t))\,d\boldsymbol{x}_0 \\
&\quad+\int_{\boldsymbol{z}}p_{\boldsymbol{z}}(\boldsymbol{z}|\boldsymbol{x}_t)\log(1-D(G(\boldsymbol{z},\boldsymbol{x}_t),\boldsymbol{x}_t))\,d\boldsymbol{z}\Big)\,d\boldsymbol{x}_t \\
={}&\mathbb{E}_{\boldsymbol{x}_t\sim p_t}\Big[\int_{\boldsymbol{x}_0}\Big(p(\boldsymbol{x}_0|\boldsymbol{x}_t)\log(D(\boldsymbol{x}_0,\boldsymbol{x}_t)) \\
&\quad+p_g(\boldsymbol{x}_0|\boldsymbol{x}_t)\log(1-D(\boldsymbol{x}_0,\boldsymbol{x}_t))\Big)\,d\boldsymbol{x}_0\Big]
\end{aligned}
$$

The optimal $D$ is:

$$
D_{G}^{*}=\frac{p(\boldsymbol{x}_0|\boldsymbol{x}_t)}{p(\boldsymbol{x}_0|\boldsymbol{x}_t)+p_g(\boldsymbol{x}_0|\boldsymbol{x}_t)}
$$
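
For completeness, the form of the optimal discriminator follows from maximizing the integrand pointwise, as in the standard GAN argument. The short derivation below is a sketch; the shorthand $a$, $b$, and $\varphi$ is introduced here for illustration and does not appear in the original text. Fix $\boldsymbol{x}_t$ and $\boldsymbol{x}_0$, and write $a=p(\boldsymbol{x}_0|\boldsymbol{x}_t)$, $b=p_g(\boldsymbol{x}_0|\boldsymbol{x}_t)$, and $D=D(\boldsymbol{x}_0,\boldsymbol{x}_t)\in(0,1)$:

$$
\begin{aligned}
\varphi(D)&=a\log D+b\log(1-D), \\
\varphi'(D)&=\frac{a}{D}-\frac{b}{1-D}=0\;\Longrightarrow\;D=\frac{a}{a+b}, \\
\varphi''(D)&=-\frac{a}{D^{2}}-\frac{b}{(1-D)^{2}}<0\quad\text{for }(a,b)\neq(0,0).
\end{aligned}
$$

Since $\varphi$ is strictly concave, the stationary point $D=a/(a+b)$ is its unique maximizer wherever $a+b>0$.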

Substituting $D^{*}$ into $V$, we obtain the following equation:

$$
\begin{aligned}
&\max_{D}V(G,D) \\
={}&\mathbb{E}_{\boldsymbol{x}_t\sim p_t}\Big[\mathbb{E}_{\boldsymbol{x}_0\sim p(\boldsymbol{x}_0|\boldsymbol{x}_t)}\Big[\log\frac{p(\boldsymbol{x}_0|\boldsymbol{x}_t)}{p(\boldsymbol{x}_0|\boldsymbol{x}_t)+p_g(\boldsymbol{x}_0|\boldsymbol{x}_t)}\Big] \\
&\quad+\mathbb{E}_{\boldsymbol{x}_0\sim p_g(\boldsymbol{x}_0|\boldsymbol{x}_t)}\Big[\log\frac{p_g(\boldsymbol{x}_0|\boldsymbol{x}_t)}{p(\boldsymbol{x}_0|\boldsymbol{x}_t)+p_g(\boldsymbol{x}_0|\boldsymbol{x}_t)}\Big]\Big] \\
={}&\mathbb{E}_{\boldsymbol{x}_t\sim p_t}\big[-\log 4+2\,\mathrm{JSD}\big(p(\cdot|\boldsymbol{x}_t)\,\|\,p_g(\cdot|\boldsymbol{x}_t)\big)\big]
\end{aligned}
$$

In the equation above, JSD denotes the Jensen-Shannon divergence. Since the JSD is non-negative and equals zero only when its two arguments coincide, the minimum value $-\log 4$ is attained only when $p_{g}(\cdot|\boldsymbol{x}_{t})=p(\cdot|\boldsymbol{x}_{t})$. This concludes the proof. ∎
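The identity above can be checked numerically on a small discrete example. The following Python sketch is illustrative only: it fixes a single $\boldsymbol{x}_{t}$, replaces the two conditionals by hypothetical probability vectors `p` and `p_g` (assumed strictly positive so that all logarithms are finite), and compares $V(G,D^{*})$ with $-\log 4+2\,\textit{JSD}$. The helper names are ours and are not taken from the released code.

```python
import math

def optimal_discriminator(p, p_g):
    """Pointwise optimal discriminator D* = p / (p + p_g)."""
    return [a / (a + b) for a, b in zip(p, p_g)]

def value_at_optimum(p, p_g):
    """V(G, D*) = E_p[log D*] + E_{p_g}[log(1 - D*)] for discrete distributions."""
    d_star = optimal_discriminator(p, p_g)
    return sum(a * math.log(d) + b * math.log(1.0 - d)
               for a, b, d in zip(p, p_g, d_star))

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture m = (p + q) / 2."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical conditionals p(. | x_t) and p_g(. | x_t) over three outcomes.
p = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]

lhs = value_at_optimum(p, p_g)
rhs = -math.log(4.0) + 2.0 * jsd(p, p_g)
print(lhs, rhs)  # the two values agree up to floating-point error
```

Setting `p_g = p` in this sketch makes the JSD term vanish, so the value reduces to $-\log 4$, in line with the final step of the proof.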

Appendix D ACT-Aug

This section provides the details of ACT-Aug. The differences from ACT are highlighted in red. The procedure is listed in Algorithm 2.

Algorithm 2 Adversarial Consistency Training with Augmentation
Input: dataset $\mathcal{D}$, initial consistency model parameter $\boldsymbol{\theta}_{g}$, discriminator parameter $\boldsymbol{\theta}_{d}$, step schedule $N(\cdot)$, EMA decay rate schedule $\mu(\cdot)$, optimizer $\text{opt}(\cdot,\cdot)$, discriminator with augmentation $D_{aug}(\cdot,\cdot,\cdot,\boldsymbol{\theta}_{d})$, adversarial rate schedule $\lambda(\cdot)$, gradient penalty weight $w_{gp}$, gradient penalty interval $I_{gp}$, gradient penalty threshold $\tau$, augmentation probability update rate $p_{r}$
$\boldsymbol{\theta}_{g}^{-}\leftarrow\boldsymbol{\theta}_{g}$, $k\leftarrow 0$, $p_{aug}\leftarrow 0$ and $\mathcal{L}_{gp}^{-}\leftarrow\tau$
repeat
       $\triangleright$ Train Consistency Model
       Sample $\boldsymbol{x}\sim\mathcal{D}$, and $n\sim\mathcal{U}[\![1,N(k)]\!]$
       Sample $\boldsymbol{z}\sim\mathcal{N}(0,\boldsymbol{I})$
       $\mathcal{L}_{CT}\leftarrow d(\boldsymbol{f}(\boldsymbol{x}+t_{n+1}\boldsymbol{z},t_{n+1},\boldsymbol{\theta}_{g}),\boldsymbol{f}(\boldsymbol{x}+t_{n}\boldsymbol{z},t_{n},\boldsymbol{\theta}_{g}^{-}))$
       $\mathcal{L}_{G}\leftarrow\log(1-{\color{red}D_{aug}}(\boldsymbol{f}(\boldsymbol{x}+t_{n+1}\boldsymbol{z},t_{n+1},\boldsymbol{\theta}_{g}),t_{n+1},{\color{red}p_{aug}},\boldsymbol{\theta}_{d}))$
       $\mathcal{L}_{f}\leftarrow(1-\lambda_{N(k)}(n+1))\mathcal{L}_{CT}+\lambda_{N(k)}(n+1)\mathcal{L}_{G}$
       $\boldsymbol{\theta}_{g}\leftarrow\text{opt}(\boldsymbol{\theta}_{g},\nabla_{\boldsymbol{\theta}_{g}}(\mathcal{L}_{f}))$
       $\boldsymbol{\theta}_{g}^{-}\leftarrow\text{stopgrad}(\mu(k)\boldsymbol{\theta}_{g}^{-}+(1-\mu(k))\boldsymbol{\theta}_{g})$

       $\triangleright$ Train Discriminator
       Sample $\boldsymbol{x}_{g}\sim\mathcal{D}$, $\boldsymbol{x}_{r}\sim\mathcal{D}$, and $n\sim\mathcal{U}[\![1,N(k)]\!]$
       Sample $\boldsymbol{z}\sim\mathcal{N}(0,\boldsymbol{I})$
       $\mathcal{L}_{D}\leftarrow-\log({\color{red}D_{aug}}(\boldsymbol{x}_{r},t_{n+1},{\color{red}p_{aug}},\boldsymbol{\theta}_{d}))-\log(1-{\color{red}D_{aug}}(\boldsymbol{f}(\boldsymbol{x}_{g}+t_{n+1}\boldsymbol{z},t_{n+1},\boldsymbol{\theta}_{g}),t_{n+1},{\color{red}p_{aug}},\boldsymbol{\theta}_{d}))$
       $\mathcal{L}_{gp}\leftarrow w_{gp}[k \bmod I_{gp}=0]*\|\nabla_{\boldsymbol{x}_{r}}{\color{red}D_{aug}}(\boldsymbol{x}_{r},t_{n+1},{\color{red}p_{aug}},\boldsymbol{\theta}_{d})\|^{2}$
       $\mathcal{L}_{d}\leftarrow\lambda_{N(k)}(n+1)\mathcal{L}_{D}+\lambda_{N(k)}(n+1)\mathcal{L}_{gp}$
       $\boldsymbol{\theta}_{d}\leftarrow\text{opt}(\boldsymbol{\theta}_{d},\nabla_{\boldsymbol{\theta}_{d}}(\mathcal{L}_{d}))$
       if $k \bmod I_{gp}=0$ then
             $p_{aug}\leftarrow\text{Clip}_{[0,1]}(p_{aug}+2([\mathcal{L}_{gp}^{-}\geq\tau]-0.5)p_{r})$
             $\mathcal{L}_{gp}^{-}\leftarrow\mu_{p}\mathcal{L}_{gp}^{-}+(1-\mu_{p})\mathcal{L}_{gp}$
       end if
       $k\leftarrow k+1$
until convergence
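The final steps of Algorithm 2, which adapt the augmentation probability, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the function name `update_augmentation` is ours, and the values `p_r=0.05` and `mu_p=0.5` in the usage loop are hypothetical; only the initialization ($p_{aug}=0$, $\mathcal{L}_{gp}^{-}=\tau$) and the update order follow the algorithm. The threshold $\tau=0.55$ is the value reported for CIFAR10 in Appendix E.

```python
def update_augmentation(p_aug, gp_ema, gp_loss, tau, p_r, mu_p):
    """One update of the augmentation probability and the gradient-penalty EMA.

    As in Algorithm 2, p_aug is first moved using the previous EMA value
    (gp_ema), and only then is the EMA refreshed with the current penalty.
    """
    # 2 * ([gp_ema >= tau] - 0.5) is +1 when the EMA is at or above tau, else -1.
    direction = 1.0 if gp_ema >= tau else -1.0
    # Clip_{[0, 1]}: keep the probability inside [0, 1].
    p_aug = min(1.0, max(0.0, p_aug + direction * p_r))
    # Exponential moving average of the gradient penalty with decay mu_p.
    gp_ema = mu_p * gp_ema + (1.0 - mu_p) * gp_loss
    return p_aug, gp_ema

# Hypothetical sequence of gradient-penalty values observed every I_gp steps.
tau = 0.55
p_aug, gp_ema = 0.0, tau  # initialization from Algorithm 2
history = []
for gp_loss in [0.70, 0.70, 0.40, 0.40]:
    p_aug, gp_ema = update_augmentation(p_aug, gp_ema, gp_loss,
                                        tau=tau, p_r=0.05, mu_p=0.5)
    history.append((p_aug, gp_ema))
print(history)
```

In this sketch the probability rises by `p_r` while the smoothed penalty is at or above $\tau$ and falls by `p_r` otherwise, so $p_{aug}$ acts as a simple bang-bang controller that steers $\mathcal{L}_{gp}^{-}$ toward $\tau$.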

Appendix E More Experiment Results

Zero-shot Image Inpainting   An important capability of consistency models is zero-shot image inpainting, which depends on the properties of the diffusion process and $\mathcal{L}_{CT}$. Since we introduce a discriminator during training, a natural question is whether it affects these properties. Fig. E3 shows the inpainting results, obtained with the same algorithm as in [45]. ACT retains this capability of consistency models.

We further display the sampling results from the conditional trajectory $\{\boldsymbol{x}_{0}+t_{k}\boldsymbol{z}\}$, $\boldsymbol{x}_{0}\sim p_{0}$, $\boldsymbol{z}\sim\mathcal{N}(0,\boldsymbol{I})$, on ImageNet 64$\times$64, where $k$ ranges from $0$ to $N$ over $10$ equidistant points. The samples generated from $t_{k}$ and $t_{k-1}$ are highly similar, which further substantiates that ACT does not disrupt the properties of $\mathcal{L}_{CT}$ and consistency models.
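As an illustration of how such a conditional trajectory can be constructed, the Python sketch below perturbs one fixed $\boldsymbol{x}_{0}$ with one fixed noise draw $\boldsymbol{z}$ at 10 equidistant indices $k$. Several pieces are assumptions rather than details from this appendix: the Karras-style noise schedule with `t_min=0.002`, `t_max=80.0` and `rho=7.0` (a common choice for consistency models), the 0-based indexing of the time grid, the value `n_steps=100`, and the 4-dimensional vector standing in for an image. The function names are hypothetical.

```python
import random

def karras_schedule(n_steps, t_min=0.002, t_max=80.0, rho=7.0):
    """Increasing noise levels from t_min to t_max (assumed schedule)."""
    lo, hi = t_min ** (1.0 / rho), t_max ** (1.0 / rho)
    return [(lo + i / (n_steps - 1) * (hi - lo)) ** rho for i in range(n_steps)]

def conditional_trajectory(x0, z, times, n_points=10):
    """Return [(t_k, x0 + t_k * z)] at n_points equidistant indices k.

    The same x0 and the same z are reused at every noise level, so all
    points lie on one trajectory conditioned on (x0, z).
    """
    last = len(times) - 1
    indices = [round(j * last / (n_points - 1)) for j in range(n_points)]
    return [(times[k], [x + times[k] * n for x, n in zip(x0, z)])
            for k in indices]

random.seed(0)
x0 = [random.uniform(-1.0, 1.0) for _ in range(4)]  # stand-in for an image
z = [random.gauss(0.0, 1.0) for _ in range(4)]      # shared noise z ~ N(0, I)
times = karras_schedule(n_steps=100)
trajectory = conditional_trajectory(x0, z, times)
print([round(t, 3) for t, _ in trajectory])
```

Feeding each `(x_t, t)` pair of such a trajectory to the trained consistency model and comparing the outputs at adjacent indices is the kind of check described above; the model call itself is omitted here.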

Figure E1: $\mathcal{L}_{gp}$, $\mathcal{L}_{CT}$, and FID of ACT on ImageNet 64$\times$64 ($w_{mid}=0.2$, $w=0.6$, a suitable parameter set). Under these parameters, all three metrics demonstrate stability.
Figure E2: $\mathcal{L}_{gp}$, $\mathcal{L}_{CT}$, and FID of ACT-Aug on CIFAR10 ($\lambda_{N}\equiv 0.3$, a suitable parameter set). Under these parameters, all three metrics demonstrate stability.
Figure E3: The results of zero-shot inpainting. First Row: original images; Second Row: masked images; Bottom Row: inpainted images.

Generation Visualization on Conditional Trajectory   In this section, we show samples generated from the conditional trajectory $\{\boldsymbol{x}_{0}+t_{k}\boldsymbol{z}\}$ on ImageNet 64$\times$64, further illustrating that our method preserves the properties of consistency training. Fig. E4 shows the conditional trajectory $\{\boldsymbol{x}_{0}+t_{k}\boldsymbol{z}\}$ itself, while Fig. E5 displays the samples generated from it. Samples at adjacent values of $t$ are highly similar, further validating that our method retains the properties of $\mathcal{L}_{CT}$.

Figure E4: The conditional trajectory $\{\boldsymbol{x}_{0}+t_{k}\boldsymbol{z}\}$ (ImageNet 64$\times$64).
Figure E5: Generated from the conditional trajectory $\{\boldsymbol{x}_{0}+t_{k}\boldsymbol{z}\}$ (ImageNet 64$\times$64).

Examples of proper $\lambda_{N}$   In this section, we present the stability of $\mathcal{L}_{CT}$, $\mathcal{L}_{gp}$, and the FID score under an appropriate selection of $\lambda_{N}$. As depicted in Fig. E1, all three metrics remain stable during training. For $\mathcal{L}_{gp}$ specifically, there is an initial decrease followed by an increase; however, the variation remains within a range of $0.1$ until the end of training.

Fig. E2 illustrates the stability of $\mathcal{L}_{gp}$, $\mathcal{L}_{CT}$, and the FID score for ACT-Aug under an appropriate selection of $\lambda_{N}$. All three metrics remain stable. Furthermore, compared with ACT on CIFAR10 as shown in Fig. 3, $\mathcal{L}_{gp}$ stabilizes around the set threshold $\tau=0.55$, and both $\mathcal{L}_{CT}$ and the FID score continue to decrease. This validates the effectiveness of the augmentation.

More samples.   Fig. E6 shows failed generations on the CIFAR10 dataset. Fig. E7 shows more samples on the LSUN Cat 256$\times$256 dataset.

(a) Generated from the conditional trajectory $\{\boldsymbol{x}_{0}+t_{k}\boldsymbol{z}\}$.
(b) Sampling from $T\boldsymbol{z}$.
Figure E6: Failed generations. Mode collapse occurs when $\lambda_{N}\approx 1$. Experiments are conducted on the CIFAR10 dataset.
Figure E7: Generated samples (ACT trained on LSUN Cat 256$\times$256).