Posterior Distillation Sampling

Juil Koo     Chanho Park     Minhyuk Sung
KAIST
{63days,charlieppark,mhsung}@kaist.ac.kr
Abstract

We introduce Posterior Distillation Sampling (PDS), a novel optimization method for parametric image editing based on diffusion models. Existing optimization-based methods, which leverage the powerful 2D prior of diffusion models to handle various parametric images, have mainly focused on generation. Unlike generation, editing requires a balance between conforming to the target attribute and preserving the identity of the source content. Recent 2D image editing methods have achieved this balance by leveraging the stochastic latent encoded in the generative process of diffusion models. To extend the editing capabilities of diffusion models shown in pixel space to parameter space, we reformulate the 2D image editing method into an optimization form named PDS. PDS matches the stochastic latents of the source and the target, enabling the sampling of targets in diverse parameter spaces that align with a desired attribute while maintaining the source’s identity. We demonstrate that this optimization resembles running a generative process with the target attribute, but aligning this process with the trajectory of the source’s generative process. Extensive editing results in Neural Radiance Fields and Scalable Vector Graphics representations demonstrate that PDS is capable of sampling targets to fulfill the aforementioned balance across various parameter spaces. Our project page is at https://posterior-distillation-sampling.github.io.

Figure 1: Parametric image editing results obtained by Posterior Distillation Sampling (PDS). PDS is an optimization tailored for editing across diverse parameter spaces. It preserves the original details of the source content while aligning them with the input texts.
Figure 2: A comparison of 3D scene editing between PDS and other baselines. Given input 3D scenes on the left, PDS, marked by green boxes on the rightmost side, successfully performs complex editing, such as geometric changes and adding objects, according to the input texts. On the other hand, the baselines either fail to change the input 3D scenes or produce results that greatly deviate from the input scenes, losing their identity.

1 Introduction

Diffusion models [13, 48, 50, 47, 49] have recently led to rapid development in text-conditioned generation and editing across diverse domains, including 2D images [22, 51, 15, 54, 11], 3D objects [18, 34, 23, 21], and audio [14, 7, 57]. In particular, 2D image diffusion models [39, 41, 43, 5, 28] have demonstrated a powerful generative prior, aided by Internet-scale image and text datasets [45, 44, 3]. Nonetheless, this rich 2D generative prior has been confined to pixel space, limiting its broader applicability. A pioneering work overcoming this limitation, DreamFusion [36], introduced Score Distillation Sampling (SDS). It leverages the generative prior of text-to-image diffusion models to synthesize 3D scenes represented by Neural Radiance Fields (NeRFs) [30] from texts. Beyond NeRF representations [25, 53, 46, 59, 38, 4, 52], SDS has been widely applied to various parameter spaces, where images are represented not by pixels but by specific parameterizations, such as texture [27, 1], material [56], and Scalable Vector Graphics (SVGs) [17, 55, 16].

While SDS [36] has achieved great advances in generating parametric images, editing is also an essential element for full freedom in handling visual content. Editing differs from generation in that it requires consideration of both the target text and the original source content, thereby emphasizing two key aspects: (1) alignment with the target text prompt and (2) preservation of the source content’s identity. To extend SDS, which lacks the latter aspect, Hertz et al. [10] propose Delta Denoising Score (DDS). DDS reduces the noisy gradients inherent in SDS, which better maintains background details and yields sharper editing outputs. However, the optimization function of DDS still lacks an explicit term for identity preservation.

To address the absence of preserving the source’s identity in SDS [36] and DDS [10], we turn our attention to recent 2D image editing methods [54, 15] based on diffusion models, known as stochastic diffusion inversion. Their primary objective is to compute the stochastic latent of an input image within the generative process of diffusion models. Once the stochastic latent of a source image is computed, the source image can be edited by running a generative process with new conditions, such as new target text prompts, while feeding the source’s stochastic latent into the process. Feeding the source’s stochastic latent into the target image’s generative process ensures that the target image maintains the structural details of the source while moving towards the direction of the target text. Thus, this editing process reflects the aforementioned two key aspects of editing.

To extend the editing capabilities of the stochastic diffusion inversion method from pixel space to parameter space, we reformulate this method into an optimization form named Posterior Distillation Sampling (PDS). Unlike SDS [36] and DDS [10], which match two noise variables, PDS aims to match the stochastic latents of the source and the optimized target. We demonstrate that our optimization process resembles aligning forward process posteriors of the source and the target, ensuring that the target’s generative process trajectory does not significantly deviate from that of the source.

For parametric images represented by NeRFs [30], Haque et al. [9] have recently introduced a promising text-driven NeRF editing method called Iterative Dataset Update (Iterative DU). To edit 3D scenes, it performs the editing process in 2D space, bypassing direct editing in 3D space. Thus, when a text prompt induces large variations in 2D space across different views, it has difficulty producing the right edit in 3D space. On the other hand, our method directly updates NeRF in 3D space, thus gradually transforming a 3D scene into its edited version in a view-consistent manner even in the case where text prompts induce large variations, such as large geometric changes or the addition of objects to unspecified regions.

Our extensive editing experiment results, including NeRF editing (Section 6.1) and SVG editing (Section 6.2), demonstrate the versatility of our method for parametric image editing. In NeRF editing, we are the first to produce large geometric changes or to add objects to arbitrary regions without specifying local regions to be edited. Figure 2 shows these examples. Qualitative and quantitative comparisons of SVG editing with other optimization methods, namely SDS [36] and DDS [10], have demonstrated that PDS produces only the necessary changes to source SVGs, effectively aligning them with the target prompts.

2 Related Work

2.1 Score Distillation Sampling

Following the remarkable success of diffusion models in text-to-image generation, there have been attempts to leverage the 2D prior of diffusion models for various other types of generative tasks. In these tasks, images are represented through rendering processes with specific parameters, including Neural Radiance Fields [36, 52, 17], texture [1, 27], material [56], and Scalable Vector Graphics (SVGs) [17, 55, 16]. The primary method employed in these tasks is Score Distillation Sampling (SDS). SDS is an optimization approach that updates the rendering parameters towards the image distribution of diffusion models by enforcing the noise prediction on noisy rendered images to match the sampled noise. Concurrently, Wang et al. [52] have also introduced Score Jacobian Chaining, which arrives at an algorithm similar to SDS through a different mathematical derivation. Wang et al. [53] have proposed Variational Score Distillation (VSD) to address over-saturation, over-smoothing, and low-diversity problems in SDS [36]. Instead of updating a single data point, VSD updates multiple data points to align an optimized distribution with the diffusion model’s image distribution. Zhu and Zhuang [59] use more accurate predictions of diffusion models via iterative denoising at every SDS update step.

When it comes to editing, Hertz et al. [10] propose Delta Denoising Score (DDS), an adaptation of SDS for editing tasks. It reduces the noisy gradient directions in SDS to better maintain the input image details. Nonetheless, its optimization function lacks an explicit term to preserve the identity of the input image, thus often producing outputs that significantly deviate from the input images. To alleviate this issue, we propose Posterior Distillation Sampling, a novel optimization approach that incorporates a term dedicated to preserving the identity of the source in its optimization function.

2.2 Text-Driven NeRF Editing

Haque et al. [9] have proposed a text-driven NeRF editing method, known as Iterative Dataset Update (Iterative DU). It iteratively replaces the reference images, initially used for NeRF [30] reconstruction, with images edited by Instruct-Pix2Pix [2]. By applying a reconstruction loss with these iteratively updated images to an input NeRF [30] scene, the scene is gradually transformed into its edited counterpart. Mirzaei et al. [31] improve Instruct-NeRF2NeRF [9] by computing local regions to be edited. However, this iterative image replacement method struggles with edits that involve large variations across different views, such as complex geometric changes or adding objects to unspecified regions. Thus, these methods have mainly focused on appearance changes.

Instead of the Iterative DU method, several recent works [35, 24, 60] directly apply SDS [36] or DDS [10] to NeRF editing. However, these optimizations do not fully consider the preservation of the source’s identity and are thus prone to producing outputs that substantially diverge from the input scenes. In contrast, our novel optimization inherently guarantees the preservation of the source’s identity, facilitating involved NeRF editing while maintaining the identity of the original scene.

2.3 Diffusion Inversion

Diffusion inversion computes the latent representation of an input image encoded in diffusion models. This allows for real image editing by finding the corresponding latent that can fairly reconstruct the given image. The computed latent is then decoded into a new image through a generative process. Using the deterministic generative process of Denoising Diffusion Implicit Models (DDIM) [48], one can approximately run the ODE of the generative process in reverse [48, 6], referred to as DDIM inversion. Several recent works have improved DDIM inversion by adjusting text features [33, 8, 32], introducing new cross-attention maps during a generative process [11] or alternatively coupling intermediate latents from two inversion trajectories [51]. Meanwhile, an alternative approach, known as DDPM inversion [15, 54], employs the stochastic generative process of Denoising Diffusion Probabilistic Models (DDPM) [13]. They focus on capturing the structural details of an input image encoded in its stochastic latent. We extend the editing capabilities of this DDPM inversion method to parameter space by reformulating the method into an optimization form.

3 Preliminaries

We first discuss existing optimization-based approaches to handle parametric images, then introduce our novel parametric image editing method in Section 4.

3.1 Score Distillation Sampling (SDS) [36]

Score Distillation Sampling (SDS) [36] is proposed to generate parametric images by leveraging the 2D prior of pre-trained text-to-image diffusion models. Given input data $\mathbf{x}_0$ and a text prompt $y$, the training objective of diffusion models is to predict the injected noise $\boldsymbol{\epsilon}$ using a noise predictor $\boldsymbol{\epsilon}_\phi$:

$$\mathcal{L}(\mathbf{x}_0) = \mathbb{E}_{t \sim \mathcal{U}(0,1),\, \boldsymbol{\epsilon}_t}\left[ w(t) \left\| \boldsymbol{\epsilon}_\phi(\mathbf{x}_t, y, t) - \boldsymbol{\epsilon}_t \right\|_2^2 \right], \tag{1}$$

where $w(t)$ is a weighting function and $\mathbf{x}_t$ results from the forward process of diffusion models:

$$\mathbf{x}_t := \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_t, \quad \boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \tag{2}$$

with variance schedule variables $\bar{\alpha}_t := \prod_{s=1}^{t} \alpha_s$. When the input data $\mathbf{x}_0$ is generated by a differentiable image generator $\mathbf{x}_0 = g(\theta)$, parameterized by $\theta$, SDS updates $\theta$ by backpropagating the gradient of Equation 1 while omitting the U-Net Jacobian term $\frac{\partial \boldsymbol{\epsilon}_\phi}{\partial \mathbf{x}_t}$ for computational efficiency:

$$\nabla_\theta \mathcal{L}_{\text{SDS}}(\mathbf{x}_0 = g(\theta)) = \mathbb{E}_{t, \boldsymbol{\epsilon}_t}\left[ w(t) \left( \boldsymbol{\epsilon}_\phi(\mathbf{x}_t, y, t) - \boldsymbol{\epsilon}_t \right) \frac{\partial \mathbf{x}_0}{\partial \theta} \right], \tag{3}$$

where we denote a noise prediction of diffusion models with classifier-free guidance [12] by $\boldsymbol{\epsilon}_\phi$ for simplicity. Through this optimization process, SDS is capable of generating a parametric image that conforms to the input text prompt $y$.
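
To make the structure of the SDS update in Equation 3 concrete, the following is a minimal Python sketch, not the implementation used in DreamFusion. It assumes an identity generator $g(\theta) = \theta$, $w(t) = 1$, a linear $\beta_t$ schedule, and a hypothetical noise predictor `toy_eps_pred` that is exact when the data distribution is concentrated at a single point `TARGET`; none of these choices come from the text. Under these assumptions the sampled noise cancels in the residual, so every update moves $\theta$ towards `TARGET`.

```python
import math
import random

T = 1000
# Linear beta schedule (an assumption; the text does not fix a schedule).
betas = [1e-4 + (0.02 - 1e-4) * s / (T - 1) for s in range(T)]
alpha_bars = [1.0]  # alpha_bars[t] = prod_{s=1..t} (1 - beta_s); alpha_bars[0] = 1
for b in betas:
    alpha_bars.append(alpha_bars[-1] * (1.0 - b))

TARGET = [0.5, -1.0, 2.0]  # hypothetical "image" favoured by the toy prior


def toy_eps_pred(x_t, t):
    """Hypothetical stand-in for eps_phi(x_t, y, t): the exact noise predictor
    when the data distribution is a point mass at TARGET."""
    a = alpha_bars[t]
    return [(x - math.sqrt(a) * m) / math.sqrt(1.0 - a) for x, m in zip(x_t, TARGET)]


def sds_grad(theta, t, rng):
    """Single-sample estimate of Equation 3 with g(theta) = theta and w(t) = 1."""
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in theta]
    # Forward process, Equation 2.
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(theta, eps)]
    eps_hat = toy_eps_pred(x_t, t)
    # The U-Net Jacobian is skipped, and d x_0 / d theta is the identity here.
    return [eh - e for eh, e in zip(eps_hat, eps)]


def run_sds(steps=2000, lr=0.01, seed=0):
    """Plain gradient descent on theta using the SDS gradient estimate."""
    rng = random.Random(seed)
    theta = [0.0, 0.0, 0.0]
    for _ in range(steps):
        t = rng.randint(20, 980)  # avoid the extreme timesteps
        grad = sds_grad(theta, t, rng)
        theta = [p - lr * g for p, g in zip(theta, grad)]
    return theta
```

Calling `run_sds()` returns parameters close to `TARGET`. With a real diffusion model the residual stays noisy and the prompt $y$ enters through the predictor, so the sketch only illustrates the shape of the update, not its practical behavior.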

3.2 Delta Denoising Score (DDS) [10]

Even though SDS has been widely used for various parametric images, its optimization is designed for generation and thus does not reflect one of the key aspects of editing: preserving the source identity.

To extend SDS to editing, Hertz et al. [10] have proposed Delta Denoising Score (DDS). Given source data $\mathbf{x}^{\text{src}}$ and its corresponding text prompt $y^{\text{src}}$, the goal of DDS is to synthesize new target data $\mathbf{x}^{\text{tgt}}$ that is aligned with a target text prompt $y^{\text{tgt}}$. In Equation 3, DDS replaces the randomly sampled noise $\boldsymbol{\epsilon}_t$ with a noise prediction given a source data-text pair, $\boldsymbol{\epsilon}_\phi(\mathbf{x}^{\text{src}}_t, y^{\text{src}}, t)$:

$$\nabla_\theta \mathcal{L}_{\text{DDS}} = \mathbb{E}_{t, \boldsymbol{\epsilon}_t}\left[ w(t) \left( \boldsymbol{\epsilon}_\phi(\mathbf{x}^{\text{tgt}}_t, y^{\text{tgt}}, t) - \boldsymbol{\epsilon}_\phi(\mathbf{x}^{\text{src}}_t, y^{\text{src}}, t) \right) \frac{\partial \mathbf{x}^{\text{tgt}}_0}{\partial \theta} \right], \tag{4}$$

where the same noise $\boldsymbol{\epsilon}_t$ is shared for $\mathbf{x}_t^{\text{src}}$ and $\mathbf{x}_t^{\text{tgt}}$:

$$\begin{aligned}
\boldsymbol{\epsilon}_t &\sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \\
\mathbf{x}_t^{\text{src}} &= \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0^{\text{src}} + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_t, \\
\mathbf{x}_t^{\text{tgt}} &= \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0^{\text{tgt}} + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_t.
\end{aligned} \tag{5}$$

While DDS extends SDS for editing tasks, it lacks an explicit term in its optimization to preserve the identity of the source. As a result, DDS is still prone to produce editing results that significantly deviate from the source.
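
The shared-noise construction of Equations 4 and 5 can be sketched on a toy problem. The Python code below is an illustration under stated assumptions, not the DDS implementation: the generator is the identity, $w(t) = 1$, the $\beta_t$ schedule is linear, and `PROMPT_MEAN` together with `toy_eps_pred` form a hypothetical prompt-conditioned predictor that is exact for a point-mass distribution per prompt.

```python
import math
import random

T = 1000
# Linear beta schedule (an assumption; the text does not fix a schedule).
betas = [1e-4 + (0.02 - 1e-4) * s / (T - 1) for s in range(T)]
alpha_bars = [1.0]  # alpha_bars[t] = prod_{s=1..t} (1 - beta_s); alpha_bars[0] = 1
for b in betas:
    alpha_bars.append(alpha_bars[-1] * (1.0 - b))

# Hypothetical per-prompt modes of a toy conditional prior.
PROMPT_MEAN = {"a cat": [1.0, 0.0], "a dog": [1.0, 2.0]}


def toy_eps_pred(x_t, y, t):
    """Hypothetical stand-in for eps_phi(x_t, y, t): exact for a point mass at PROMPT_MEAN[y]."""
    a = alpha_bars[t]
    return [(x - math.sqrt(a) * m) / math.sqrt(1.0 - a) for x, m in zip(x_t, PROMPT_MEAN[y])]


def add_noise(x0, eps, t):
    """Forward process applied with a given noise sample."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]


def dds_grad(x_tgt, y_tgt, x_src, y_src, t, rng):
    """Single-sample estimate of Equation 4 with g(theta) = theta and w(t) = 1."""
    eps = [rng.gauss(0.0, 1.0) for _ in x_tgt]  # one draw, shared as in Equation 5
    eps_tgt = toy_eps_pred(add_noise(x_tgt, eps, t), y_tgt, t)
    eps_src = toy_eps_pred(add_noise(x_src, eps, t), y_src, t)
    return [et - es for et, es in zip(eps_tgt, eps_src)]


def run_dds(x_src, y_src, y_tgt, steps=2000, lr=0.01, seed=0):
    """Gradient descent on the target, initialised at the source."""
    rng = random.Random(seed)
    x_tgt = list(x_src)
    for _ in range(steps):
        t = rng.randint(20, 980)  # avoid the extreme timesteps
        grad = dds_grad(x_tgt, y_tgt, x_src, y_src, t, rng)
        x_tgt = [x - lr * g for x, g in zip(x_tgt, grad)]
    return x_tgt
```

When the target equals the source and both prompts coincide, the two predictions are identical and the gradient is exactly zero, which is the property that removes the noisy component of the SDS gradient. In this toy, `run_dds([0.8, 0.3], "a cat", "a dog")` settles at the source shifted by the difference of the two prompt modes; real diffusion models offer no such closed form.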

3.3 Stochastic Latent in Generative Process

To achieve both conformity to the text and preservation of the source’s identity, we turn our attention to the rich information encoded in the stochastic generative process of DDPM [13]. When $\beta_t := 1 - \alpha_t$ are small, it is well-known that the posterior of the forward process also follows a Gaussian distribution according to a property of Gaussians. The forward process posteriors are represented as:

$$q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\boldsymbol{\mu}(\mathbf{x}_t, \mathbf{x}_0), \sigma_t \mathbf{I}), \tag{6}$$

where $\sigma_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$ and the posterior mean $\boldsymbol{\mu}$ is a linear combination of $\mathbf{x}_0$ and $\mathbf{x}_t$: $\boldsymbol{\mu}(\mathbf{x}_t, \mathbf{x}_0) := \gamma_t \mathbf{x}_0 + \delta_t \mathbf{x}_t$ with $\gamma_t := \frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)}{1-\bar{\alpha}_t}$ and $\delta_t := \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}$.
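
As a numerical sanity check on these coefficients, note that averaging the posterior mean over $\mathbf{x}_t$ given $\mathbf{x}_0$ must recover the forward-process mean of $\mathbf{x}_{t-1}$, which yields the identity $\gamma_t + \delta_t \sqrt{\bar{\alpha}_t} = \sqrt{\bar{\alpha}_{t-1}}$. The short Python sketch below evaluates $\gamma_t$, $\delta_t$, and $\sigma_t$ and measures how far this identity is from holding. The linear $\beta_t$ schedule, the convention $\bar{\alpha}_0 := 1$, and the function names are assumptions made for illustration.

```python
import math

T = 1000
# Linear beta schedule (an assumption; the text does not fix a schedule).
betas = [1e-4 + (0.02 - 1e-4) * s / (T - 1) for s in range(T)]
alpha_bars = [1.0]  # alpha_bars[t] = prod_{s=1..t} (1 - beta_s); alpha_bars[0] = 1
for b in betas:
    alpha_bars.append(alpha_bars[-1] * (1.0 - b))


def posterior_coeffs(t):
    """Return (gamma_t, delta_t, sigma_t) as defined in the text, for 1 <= t <= T."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
    gamma_t = math.sqrt(ab_prev) * (1.0 - alpha_t) / (1.0 - ab_t)
    delta_t = math.sqrt(alpha_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    sigma_t = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
    return gamma_t, delta_t, sigma_t


def mean_identity_gap(t):
    """|gamma_t + delta_t * sqrt(abar_t) - sqrt(abar_{t-1})|, zero up to rounding."""
    gamma_t, delta_t, _ = posterior_coeffs(t)
    return abs(gamma_t + delta_t * math.sqrt(alpha_bars[t]) - math.sqrt(alpha_bars[t - 1]))
```

At $t = 1$ the coefficients reduce to $\gamma_1 = 1$, $\delta_1 = 0$, and $\sigma_1 = 0$, so the last step places all weight on the clean image and carries no noise.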

Since $\mathbf{x}_0$ is unknown during a generative process, we approximate $\mathbf{x}_0$ with a one-step denoised estimate as follows:

$$\tilde{\mathbf{x}}_0(\mathbf{x}_t, y; \boldsymbol{\epsilon}_\phi) := \frac{1}{\sqrt{\bar{\alpha}_t}}\left( \mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\phi(\mathbf{x}_t, y, t) \right). \tag{7}$$

Consequently, one step of the generative process is represented as follows:

$$\mathbf{x}_{t-1} = \boldsymbol{\mu}_\phi(\mathbf{x}_t, y; \boldsymbol{\epsilon}_\phi) + \sigma_t \mathbf{z}_t, \quad \mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \tag{8}$$

where $\boldsymbol{\mu}_\phi(\mathbf{x}_t, y; \boldsymbol{\epsilon}_\phi) = \gamma_t \tilde{\mathbf{x}}_0(\mathbf{x}_t, y; \boldsymbol{\epsilon}_\phi) + \delta_t \mathbf{x}_t$.

Using Equation 8, one can compute the stochastic latent $\tilde{\mathbf{z}}_t$ that captures the structural details of $\mathbf{x}_0$. This involves computing $\mathbf{x}_t$ and $\mathbf{x}_{t-1}$ via the forward process and then rearranging Equation 8 as follows:

$$\tilde{\mathbf{z}}_t(\mathbf{x}_0, y; \boldsymbol{\epsilon}_\phi) = \frac{\mathbf{x}_{t-1} - \boldsymbol{\mu}_\phi(\mathbf{x}_t, y; \boldsymbol{\epsilon}_\phi)}{\sigma_t}. \tag{9}$$

Several recent works [54, 15], known as DDPM inversion, have utilized the stochastic latent for image editing tasks. To edit an image using $\tilde{\mathbf{z}}_t$, they first pre-compute $\tilde{\mathbf{z}}_t$ of the source image across all $t$ in the generative process. They then run a new generative process with a new target prompt while incorporating the pre-computed $\tilde{\mathbf{z}}_t$ of the source into the process instead of randomly sampled noise $\mathbf{z}_t$.

Although these works [54, 15] have utilized the rich information encoded in $\tilde{\mathbf{z}}_t$ for editing purposes, their applications have been limited to 2D pixel space due to reliance on the generative process. In our work, we broaden the application of the stochastic latent to parameter space by reformulating the method into an optimization form, enabling parametric image editing.
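
The stochastic latent of Equation 9, which DDPM inversion pre-computes for every timestep, can be sketched for a single $t$ as follows. This is a minimal Python illustration, not the code of [54, 15]: `toy_eps_pred` is a hypothetical deterministic stand-in for $\boldsymbol{\epsilon}_\phi$, the $\beta_t$ schedule is assumed linear, $\sigma_t$ is used exactly as written in Equations 8 and 9, and $\mathbf{x}_{t-1}$ and $\mathbf{x}_t$ are assumed to be drawn with independent noise samples, matching the two noise variables in the expectation of Equation 12.

```python
import math
import random

T = 1000
# Linear beta schedule (an assumption; the text does not fix a schedule).
betas = [1e-4 + (0.02 - 1e-4) * s / (T - 1) for s in range(T)]
alpha_bars = [1.0]  # alpha_bars[t] = prod_{s=1..t} (1 - beta_s); alpha_bars[0] = 1
for b in betas:
    alpha_bars.append(alpha_bars[-1] * (1.0 - b))


def toy_eps_pred(x_t, y, t):
    """Hypothetical stand-in for eps_phi(x_t, y, t); the prompt y is ignored."""
    return [math.tanh(x) for x in x_t]


def forward(x0, eps, t):
    """Forward process, Equation 2."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]


def coeffs(t):
    """gamma_t, delta_t and sigma_t as defined for Equation 6."""
    beta_t, ab_t, ab_prev = betas[t - 1], alpha_bars[t], alpha_bars[t - 1]
    gamma_t = math.sqrt(ab_prev) * beta_t / (1.0 - ab_t)
    delta_t = math.sqrt(1.0 - beta_t) * (1.0 - ab_prev) / (1.0 - ab_t)
    sigma_t = (1.0 - ab_prev) / (1.0 - ab_t) * beta_t
    return gamma_t, delta_t, sigma_t


def posterior_mean(x_t, y, t):
    """mu_phi(x_t, y; eps_phi) = gamma_t * x0_tilde + delta_t * x_t."""
    a = alpha_bars[t]
    gamma_t, delta_t, _ = coeffs(t)
    eps_hat = toy_eps_pred(x_t, y, t)
    # One-step denoised estimate, Equation 7.
    x0_tilde = [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a) for x, e in zip(x_t, eps_hat)]
    return [gamma_t * x0e + delta_t * x for x0e, x in zip(x0_tilde, x_t)]


def stochastic_latent(x0, y, t, rng):
    """Equation 9 for 2 <= t <= T (sigma_1 = 0 because alpha_bars[0] = 1).

    Returns (z_tilde, x_prev, x_t) so the caller can check the round trip.
    """
    eps_prev = [rng.gauss(0.0, 1.0) for _ in x0]
    eps_t = [rng.gauss(0.0, 1.0) for _ in x0]
    x_prev = forward(x0, eps_prev, t - 1)
    x_t = forward(x0, eps_t, t)
    mu = posterior_mean(x_t, y, t)
    sigma_t = coeffs(t)[2]
    z_tilde = [(xp - m) / sigma_t for xp, m in zip(x_prev, mu)]
    return z_tilde, x_prev, x_t
```

By construction, substituting the returned `z_tilde` for $\mathbf{z}_t$ in Equation 8 with the same prompt reproduces `x_prev` up to rounding. This is why replaying the source latents keeps a generative process close to the source.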

4 Posterior Distillation Sampling

Here, we introduce Posterior Distillation Sampling (PDS), a novel optimization function designed for parametric image editing.

Our objective is to synthesize $\mathbf{x}_0^{\text{tgt}}$ that is aligned with $y^{\text{tgt}}$ while it retains the identity of $\mathbf{x}_0^{\text{src}}$. To achieve this, we employ the stochastic latent $\tilde{\mathbf{z}}_t$ in our optimization. For simplicity, we denote the stochastic latents of the source and the target as follows:

$$\tilde{\mathbf{z}}_t^{\text{src}} := \tilde{\mathbf{z}}_t(\mathbf{x}_0^{\text{src}}, y^{\text{src}}; \boldsymbol{\epsilon}_\phi) \tag{10}$$

$$\tilde{\mathbf{z}}_t^{\text{tgt}} := \tilde{\mathbf{z}}_t(\mathbf{x}_0^{\text{tgt}}, y^{\text{tgt}}; \boldsymbol{\epsilon}_\phi). \tag{11}$$

Using the stochastic latents, we define a novel objective function as follows:

$$\mathcal{L}_{\tilde{\mathbf{z}}_{t}}(\mathbf{x}_{0}^{\text{tgt}} = g(\theta)) := \mathbb{E}_{t, \boldsymbol{\epsilon}_{t-1}, \boldsymbol{\epsilon}_{t}}\left[\|\tilde{\mathbf{z}}_{t}^{\text{tgt}} - \tilde{\mathbf{z}}_{t}^{\text{src}}\|_{2}^{2}\right], \tag{12}$$

where, similar to Equation 3.2, $\tilde{\mathbf{z}}^{\text{src}}_{t}$ and $\tilde{\mathbf{z}}^{\text{tgt}}_{t}$ share the same noises, denoted by $\boldsymbol{\epsilon}_{t-1}$ and $\boldsymbol{\epsilon}_{t}$, when computing their respective $\mathbf{x}_{t-1}$ and $\mathbf{x}_{t}$.
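As a rough illustration of Equation 12, the following sketch estimates the objective by Monte Carlo sampling on 2D toy data. It assumes the stochastic latent has the form $(\mathbf{x}_{t-1} - \boldsymbol{\mu}(\mathbf{x}_{t}, y, t)) / \sigma_{t}$; `toy_posterior_mean`, the schedule values, and the function names are hypothetical and only serve to show how the noise draws are shared between source and target.

```python
import math
import random


def stochastic_latent(x0, y, t, eps_prev, eps_t, alpha_bars, sigmas, posterior_mean):
    """(x_{t-1} - mu(x_t, y, t)) / sigma_t, with x_{t-1} and x_t drawn from q(. | x0)."""
    a_prev, a_t = alpha_bars[t - 1], alpha_bars[t]
    x_prev = [math.sqrt(a_prev) * x + math.sqrt(1.0 - a_prev) * e for x, e in zip(x0, eps_prev)]
    x_t = [math.sqrt(a_t) * x + math.sqrt(1.0 - a_t) * e for x, e in zip(x0, eps_t)]
    mu = posterior_mean(x_t, y, t)
    return [(p - m) / sigmas[t] for p, m in zip(x_prev, mu)]


def pds_objective(x0_tgt, y_tgt, x0_src, y_src, alpha_bars, sigmas,
                  posterior_mean, n_samples=32, seed=0):
    """Monte Carlo estimate of the squared distance between the two latents."""
    rng = random.Random(seed)
    T = len(alpha_bars) - 1
    total = 0.0
    for _ in range(n_samples):
        t = rng.randint(1, T)
        # The same two noise draws are used for both the source and the target.
        eps_prev = [rng.gauss(0.0, 1.0) for _ in x0_tgt]
        eps_t = [rng.gauss(0.0, 1.0) for _ in x0_tgt]
        z_tgt = stochastic_latent(x0_tgt, y_tgt, t, eps_prev, eps_t,
                                  alpha_bars, sigmas, posterior_mean)
        z_src = stochastic_latent(x0_src, y_src, t, eps_prev, eps_t,
                                  alpha_bars, sigmas, posterior_mean)
        total += sum((a - b) ** 2 for a, b in zip(z_tgt, z_src))
    return total / n_samples


def toy_posterior_mean(x_t, y, t):
    """Hypothetical conditional mean; y is a scalar class label here."""
    return [0.9 * v + 0.1 * y for v in x_t]


alpha_bars = [1.0, 0.9, 0.7, 0.4, 0.1]
sigmas = [0.0, 0.2, 0.3, 0.4, 0.5]  # sigmas[0] is unused
src = [0.5, -1.0]
same = pds_objective(src, 1.0, src, 1.0, alpha_bars, sigmas, toy_posterior_mean)
moved = pds_objective([0.8, -0.4], 2.0, src, 1.0, alpha_bars, sigmas, toy_posterior_mean)
```

Because the noises are shared, the estimate is exactly zero when the target coincides with the source under the same condition, and it becomes positive once either the sample or the condition changes.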

Rather than matching noise variables as in SDS [36] and DDS [10], we match the stochastic latents of the source and the target via the optimization. By taking the gradient of $\mathcal{L}_{\tilde{\mathbf{z}}_{t}}$ with respect to $\theta$ and ignoring the U-Net Jacobian term, as in previous works [36, 10, 52], one can obtain PDS as follows:

$$\nabla_{\theta}\mathcal{L}_{\text{PDS}} := \mathbb{E}_{t, \boldsymbol{\epsilon}_{t}, \boldsymbol{\epsilon}_{t-1}}\left[w(t)(\tilde{\mathbf{z}}_{t}^{\text{tgt}} - \tilde{\mathbf{z}}_{t}^{\text{src}})\frac{\partial \mathbf{x}_{0}^{\text{tgt}}}{\partial \theta}\right]. \tag{13}$$

Expanding Equation 13, the following detailed formulation is derived:

$$\nabla_{\theta}\mathcal{L}_{\text{PDS}} := \mathbb{E}_{t, \boldsymbol{\epsilon}_{t}, \boldsymbol{\epsilon}_{t-1}}\left[\left(\psi(t)(\mathbf{x}_{0}^{\text{tgt}} - \mathbf{x}_{0}^{\text{src}}) + \chi(t)(\hat{\boldsymbol{\epsilon}}_{t}^{\text{tgt}} - \hat{\boldsymbol{\epsilon}}_{t}^{\text{src}})\right)\frac{\partial \mathbf{x}_{0}^{\text{tgt}}}{\partial \theta}\right], \tag{14}$$

where $\hat{\boldsymbol{\epsilon}}_{t}^{\text{src}} := \boldsymbol{\epsilon}_{\phi}(\mathbf{x}_{t}^{\text{src}}, y^{\text{src}}, t)$ and $\hat{\boldsymbol{\epsilon}}_{t}^{\text{tgt}} := \boldsymbol{\epsilon}_{\phi}(\mathbf{x}_{t}^{\text{tgt}}, y^{\text{tgt}}, t)$. We leave a more detailed derivation to the supplementary material.
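To make the structure of Equation 14 concrete, the sketch below evaluates the term inside the expectation for a single sample and applies one descent step. It assumes the simplest parameterization, $g(\theta) = \theta$, so that the Jacobian is the identity; the coefficient values standing in for $\psi(t)$ and $\chi(t)$, the noise predictions, and the function names are made up for illustration.

```python
def pds_direction(x0_tgt, x0_src, eps_hat_tgt, eps_hat_src, psi_t, chi_t):
    """Term inside the expectation of Equation 14, before the Jacobian
    d x0_tgt / d theta is applied."""
    return [
        psi_t * (xt - xs) + chi_t * (et - es)
        for xt, xs, et, es in zip(x0_tgt, x0_src, eps_hat_tgt, eps_hat_src)
    ]


def gradient_step(theta, direction, lr):
    """One descent step for the identity parameterization x0_tgt = theta."""
    return [p - lr * d for p, d in zip(theta, direction)]


# Illustrative numbers only: psi_t, chi_t and the noise predictions are made up.
direction = pds_direction(
    x0_tgt=[1.0, 2.0], x0_src=[0.0, 0.0],
    eps_hat_tgt=[0.5, -0.5], eps_hat_src=[0.0, 0.5],
    psi_t=2.0, chi_t=4.0,
)
theta_next = gradient_step([1.0, 2.0], direction, lr=0.25)
```

The first term pulls the target sample towards the source sample, while the second term follows the difference between the two noise predictions; in the second coordinate of this example the two terms cancel, so that coordinate is left unchanged.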

Matching $\tilde{\mathbf{z}}_{t}^{\text{tgt}}$ with $\tilde{\mathbf{z}}_{t}^{\text{src}}$ ensures that the posteriors of $\mathbf{x}_{0}^{\text{tgt}}$ and $\mathbf{x}_{0}^{\text{src}}$ do not significantly diverge, despite being steered by different prompts, $y^{\text{tgt}}$ and $y^{\text{src}}$. This approach is akin to running a generative process with $y^{\text{tgt}}$ while remaining near the trajectory made by the posteriors of $\mathbf{x}_{0}^{\text{src}}$. Consequently, PDS enables the sampling of $\mathbf{x}_{0}^{\text{tgt}}$ that aligns with $y^{\text{tgt}}$, while also retaining the identity of $\mathbf{x}_{0}^{\text{src}}$. This is achieved through the distillation of the posteriors of $\mathbf{x}_{0}^{\text{src}}$ into the target sampling process.

4.1 Comparison with SDS [36] and DDS [10]

Figure 3: A visual comparison of the editing process through SDS [36], DDS [10] and PDS. The figure illustrates the trajectories of samples drawn from $p(\mathbf{x}_{0}|y=1)$ as they are shifted towards $p(\mathbf{x}_{0}|y=2)$. PDS notably moves the samples near the boundary of the two marginals, which is the optimal endpoint in that it balances the necessary change with the original identity.

In Figure 3, we visually illustrate the difference among the three optimization methods: SDS [36], DDS [10] and PDS. Here, we model a 2D distribution $\mathbf{x}_{0} \sim p(\mathbf{x}_{0}) \in \mathbb{R}^{2}$ that is separated by two marginals, $p(\mathbf{x}_{0}|y=1)$ and $p(\mathbf{x}_{0}|y=2)$, which are colored in red and blue, respectively. Then, we train a diffusion model conditioned on the class labels $y$. Using the pre-trained conditional diffusion model, we aim to transition $\mathbf{x}_{0}^{\text{tgt}}$, starting from $\mathbf{x}_{0}^{\text{src}} \sim p(\mathbf{x}_{0}|y=1)$, towards the other marginal $p(\mathbf{x}_{0}|y=2)$. The trajectories of the three optimization methods are plotted in Figure 3 with their endpoints denoted by stars. As illustrated, SDS and DDS significantly displace the data from the initial position, whereas our method terminates near the boundary of the two marginals. This is the optimal endpoint for editing as it indicates proximity to both the starting points and $p(\mathbf{x}_{0}|y=2)$, thereby achieving a balance between the necessary change and the original identity.

4.2 Comparison with Iterative DU

Figure 4: An example of editing inducing large variations across different views. The figure shows NeRF editing results of ours and Iterative DU methods, IN2N [9] and Inv2N, with their corresponding 2D editing results obtained by IP2P [2] and DDPM Inversion [15], respectively. When 2D editing leads to large variations, the Iterative DU methods fail to produce accurate edits in 3D space.

When a parameterization of images is given as NeRF [30], recent works [9, 31] have shown promising NeRF editing results based on a method known as Iterative Dataset Update (Iterative DU). This method bypasses 3D editing by performing the editing process within 2D space. Given an image dataset $\{I^{\text{src}}_{v}\}_{v=1}^{N}$ used for NeRF [30] reconstruction with viewpoints $v$, they randomly replace $I^{\text{src}}_{v}$ with its 2D edited version using Instruct-Pix2Pix (IP2P) [2]. By iteratively updating the input images, they progressively transform the input NeRF scene into an edited version of it.
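The Iterative DU loop described above can be sketched as follows. This is a toy illustration under strong simplifications: an "image" is a single number, the "scene" stores one value per view, and `toy_edit` and `toy_fit_step` are hypothetical stand-ins for a 2D editor such as IP2P and for a NeRF reconstruction step.

```python
import random


def iterative_dataset_update(source_images, edit_2d, fit_step, scene, n_iters, seed=0):
    """Alternate between replacing one training image with its 2D edit and
    fitting the scene to the current, partially edited dataset."""
    rng = random.Random(seed)
    dataset = list(source_images)
    for _ in range(n_iters):
        v = rng.randrange(len(dataset))         # pick a random viewpoint
        dataset[v] = edit_2d(source_images[v])  # swap in its 2D-edited version
        scene = fit_step(scene, dataset)        # one reconstruction update
    return scene, dataset


def toy_edit(image):
    """Hypothetical 2D editor: simply shifts the image value by 1."""
    return image + 1.0


def toy_fit_step(scene, dataset, rate=0.5):
    """Hypothetical reconstruction step: move each view towards its image."""
    return [s + rate * (d - s) for s, d in zip(scene, dataset)]


sources = [0.0, 2.0, 4.0]
scene, dataset = iterative_dataset_update(
    sources, toy_edit, toy_fit_step, list(sources), n_iters=200
)
```

The toy editor is deterministic and therefore consistent across views; a real 2D editor gives no such guarantee, since each view is edited independently.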

In contrast to Iterative DU, which performs editing in 2D space, our approach directly edits NeRFs [30] in 3D space. To visually demonstrate this difference, Figure 4 presents a qualitative comparison of ours and various methods based on Iterative DU. Specifically, we compare ours with Instruct-NeRF2NeRF (IN2N) [9], which uses IP2P [2] for 2D editing. Additionally, we include another Iterative-DU-based method, Inversion2NeRF (Inv2N), which employs DDPM inversion [15] for its 2D editing process. Given the prompt “raising his arms”, the figure illustrates significant variations in 2D edited images across different views: the man raises either only one arm or both arms, as marked by the red circle. Furthermore, the red arrow highlights the inconsistency in the poses of raising arms across different views. Such notable discrepancies in 2D editing hinder the Iterative DU methods from transferring these edits into 3D space. Particularly noteworthy is the comparison of our method with Inv2N, both of which leverage the stochastic latent for editing. However, while Inv2N confines its editing within 2D space, ours directly updates NeRF parameters in 3D space by reformulating the 2D image editing method [15] into an optimization form. Consequently, as shown in Figure 4 and Figure 2, ours is the only one to facilitate complex geometric changes and the addition of objects in 3D scenes. This demonstrates that the strength of our method lies in the novel optimization design, which allows for direct 3D editing, rather than merely in the editing capabilities of DDPM inversion [15].

5 NeRF Editing with PDS

As one of the applications of PDS, we present a detailed pipeline for NeRF [30] editing. NeRF can be seen as a parameterized rendering function. The rendering process is expressed as $I_{v} = g(v; \theta)$, where the function takes a specific viewpoint $v$ to render the image $I_{v}$ at that viewpoint with the rendering parameter $\theta$. Using the publicly available Stable Diffusion [41] as our diffusion prior model, we encode the current rendering at viewpoint $v$ to obtain the target latent $\mathbf{x}_{0,v}^{\text{tgt}}$: $\mathbf{x}^{\text{tgt}}_{0,v} := \mathcal{E}(g(v; \theta))$, where $\mathcal{E}$ is a pre-trained encoder. Similarly, given the original source images $\{I^{\text{src}}_{v}\}$ used for NeRF [30] reconstruction, the source latent $\mathbf{x}^{\text{src}}_{0,v}$ is also computed by encoding the source image at viewpoint $v$: $\mathbf{x}^{\text{src}}_{0,v} := \mathcal{E}(I_{v}^{\text{src}})$.

For real scenes, there are no given source prompts. Thus, we manually create descriptions for the real scenes, such as “a photo of a man” in Figure 1. For target prompts $y^{\text{tgt}}$, we adjust $y^{\text{src}}$ by appending a description of a desired attribute (e.g., “…raising his arms” in Figure 4) or by substituting an existing word in $y^{\text{src}}$ with a new one, such as changing “deer doll” to “unicorn doll” in the last row of Figure 2. Given a pre-fixed set of viewpoints $\{v\}$, we randomly select a viewpoint $v$ to compute $\mathbf{x}_{0,v}^{\text{src}}$ and $\mathbf{x}_{0,v}^{\text{tgt}}$. The pairs $(\mathbf{x}_{0,v}^{\text{src}}, y^{\text{src}})$ and $(\mathbf{x}_{0,v}^{\text{tgt}}, y^{\text{tgt}})$ are fed into the PDS optimization to update $\theta$ in a direction dictated by the target prompt. After the optimization, the updated NeRF parameter $\tilde{\theta}$ renders an edited 3D scene that is aligned with the target prompt: $\tilde{I}_{v} := g(v; \tilde{\theta})$.
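The optimization loop described in this section can be outlined as below. This is only a structural sketch: the renderer, the encoder, the PDS term, and the backpropagation step are passed in as callables, and the toy versions used here (one scalar per view, an identity encoder, and a constant standing in for the prompt-driven noise difference) are hypothetical and do not correspond to NeRF, Stable Diffusion, or the actual coefficients of PDS.

```python
import random


def edit_scene_with_pds(theta, source_images, render, encode, pds_direction,
                        backprop, lr, n_iters, seed=0):
    """Sample a view, build source and target latents, and take one PDS step."""
    rng = random.Random(seed)
    for _ in range(n_iters):
        v = rng.randrange(len(source_images))      # random view from the fixed set
        x0_tgt = encode(render(v, theta))          # latent of the current rendering
        x0_src = encode(source_images[v])          # latent of the original image
        direction = pds_direction(x0_tgt, x0_src)  # term inside the PDS gradient
        grad = backprop(direction, v, theta)       # apply d x0_tgt / d theta
        theta = [p - lr * g for p, g in zip(theta, grad)]
    return theta


# Toy stand-ins (all hypothetical): theta holds one scalar "pixel" per view.
def toy_render(v, theta):
    return theta[v]


def toy_encode(image):
    return image


def toy_direction(x0_tgt, x0_src, psi=1.0, chi=0.5):
    # The identity term pulls towards the source; the constant -1.0 stands in
    # for the difference between the target and source noise predictions.
    return psi * (x0_tgt - x0_src) + chi * (-1.0)


def toy_backprop(direction, v, theta):
    return [direction if i == v else 0.0 for i in range(len(theta))]


sources = [0.0, 2.0]
edited = edit_scene_with_pds(list(sources), sources, toy_render, toy_encode,
                             toy_direction, toy_backprop, lr=0.5, n_iters=400)
```

In this toy setting every view settles at a fixed offset from its source value, where the pull towards the source and the push from the stand-in prompt term cancel each other.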

To further improve the final output, we add a refinement stage inspired by DreamBooth3D [38]. During iterations of the refinement stage, we randomly select an edited rendering $\tilde{I}_{v}$ and refine it into a more realistic-looking image using SDEdit [26]. The NeRF scenes edited through PDS optimization are then further refined by a reconstruction loss with these repeatedly updated images.

For some of the source prompts we create, we observe a gap between the ideal text prompt, which would reconstruct the input image through the generative process, and the actual prompt we provide. To alleviate this discrepancy, we have found it effective to finetune Stable Diffusion [41] with $\{I^{\text{src}}_{v}\}$ and $y^{\text{src}}$ following the DreamBooth [42] setup.

6 Experiment Results

In this section, we conduct editing experiments across two types of parameterized images. Section 6.1 presents NeRF editing results, comparing our NeRF editing capabilities to the state-of-the-art NeRF editing methods. Furthermore, Section 6.2 shows SVG editing results to compare PDS against other optimization methods, namely SDS [36] and DDS [10].

6.1 NeRF Editing

Datasets.

We use real scenes we capture as well as the scenes from IN2N [9] and LLFF [29]. The total number of scenes is 13, and the final number of pairs of source scenes and target text prompts is 37, with multiple target prompts for each scene.

Baselines.

For extensive comparisons, we evaluate our method against three baselines: Instruct-NeRF2NeRF (IN2N) [9], DDS [10] and Inversion2NeRF (Inv2N). First, we compare ours with IN2N [9], which is a state-of-the-art NeRF editing method with its code publicly available. Additionally, as introduced in Section 4.2, we conduct a comparison with Inv2N, another method based on Iterative DU. Like IN2N, it performs editing within 2D space rather than directly in 3D space, but it employs DDPM inversion [15] instead of IP2P [2] for 2D editing.

Results.

Figure 2 presents the qualitative comparisons of NeRF editing. Notably, as depicted in rows 1 and 2, our method is the only one that makes large geometric changes in 3D scenes from the input text, folding the man’s arms to create natural poses of him reading a book or drinking coffee. In contrast, Iterative-DU-based methods like IN2N [9] and Inv2N fail to produce the right edits in 3D space. DDS [10] produces outputs that completely lose the identity of the input scenes, focusing solely on conforming to the input texts. Rows 3 and 4 of Figure 2 show the editing scenarios of adding objects in outdoor scenes without specifying local regions, which also leads to large variations. Here, our method successfully adds objects like windmills and hot air balloons to the input scenes while maintaining their background details. On the other hand, the baselines either fail to add the objects in 3D space or produce outputs that significantly deviate from the original scenes. When it comes to appearance change, which induces relatively little variation across different views, both our method and IN2N [9] effectively produce the desired appearance change in 3D scenes, as shown in the last row of Figure 2. However, ours best preserves the original identity of the input scene, such as the object’s color, while making appropriate changes. Additional qualitative results are presented through videos on our project page (https://posterior-distillation-sampling.github.io).

To further assess the perceptual quality of the editing results, we conduct a user study compared to the baselines. Following Ritchie [40], participants were shown input NeRF scene videos, editing prompts, and edited NeRF scene videos produced by ours and the baselines. They were then asked to choose the most appropriate edited NeRF scene video. As illustrated in Table 1, our editing results are most preferred over the baselines in human evaluation by a large margin: 49.33% (Ours) vs. 27.71% (IN2N [9], the second best). See the supplementary material for a more detailed user study setup.

For a quantitative evaluation, we report the CLIP [37] Score, which measures the similarity between edited 2D renderings and target text prompts in CLIP [37] space. As shown in Table 1, ours achieves the highest CLIP Score among the compared methods. This is corroborated by the qualitative results illustrated in Figure 2, especially in scenarios of geometric changes or object addition, where the other baselines have difficulty in making the right edits.
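A score of this kind can be sketched as follows, assuming it is computed as the cosine similarity between the CLIP embedding of each rendering and that of the target prompt, averaged over renderings. The exact protocol is not spelled out here, so this form is an assumption; the embeddings below are made-up 2D vectors rather than outputs of a real CLIP model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(image_embeddings, text_embedding):
    """Mean cosine similarity between each rendering's embedding and the prompt's."""
    sims = [cosine_similarity(e, text_embedding) for e in image_embeddings]
    return sum(sims) / len(sims)


# Made-up 2D embeddings; real CLIP embeddings have hundreds of dimensions.
score = clip_score([[1.0, 0.0], [0.0, 1.0]], [1.0, 0.0])
```

In this example one rendering is perfectly aligned with the prompt embedding and the other is orthogonal to it, so the averaged score is 0.5.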

Table 1: A quantitative comparison of NeRF editing between ours and other baselines. Ours outperforms the baselines quantitatively. Bold indicates the best result for each column.

| Methods | CLIP [37] Score ↑ | User Preference Rate (%) ↑ |
|---|---|---|
| IN2N [9] | 0.2280 | 27.71 |
| DDS [10] | 0.2210 | 13.71 |
| Inv2N | 0.2232 | 9.24 |
| PDS (Ours) | **0.2477** | **49.33** |

6.2 SVG Editing

Figure 5: A qualitative comparison of SVG editing using three different optimization methods: SDS [36], DDS [10] and PDS. PDS makes changes according to input text while most preserving the structural semantics of the input SVGs.

Table 2: A quantitative comparison of SVG editing between SDS [36], DDS [10] and PDS. Ours outperforms the others in LPIPS [58] while achieving a CLIP [37] score that is on par with the others. Bold indicates the best result for each column.

| Methods | CLIP [37] Score ↑ | LPIPS [58] ↓ | User Preference Rate (%) ↑ |
|---|---|---|---|
| SDS [36] | **0.2606** | 0.4855 | 30.83 |
| DDS [10] | 0.2460 | 0.5982 | 20.24 |
| PDS (Ours) | 0.2504 | **0.3121** | **48.94** |

Experimental Setup.

We use pairs of SVGs and their corresponding text prompts used in VectorFusion [17] as input. By manually creating target text prompts, we conduct experiments with a total of 48 pairs of input SVGs and target text prompts. For comparison, we evaluate our method against other optimization methods, SDS [36] and DDS [10]. To perform editing with SDS, we start with a source SVG as an initial updated SVG and then update it using a target prompt according to the SDS [36] optimization. Following DDS, we use CLIP [37] score and LPIPS [58] as quantitative metrics.

Results.

Qualitative results of SVG editing are shown in Figure 5. It demonstrates that while all the methods effectively change input SVGs according to the target text prompts, ours best preserves the structural semantics of the input SVGs. This is particularly evident in row 2 of Figure 5, where ours maintains the overall color pattern of the input SVG.

The trends from the qualitative results are mirrored in our quantitative results. As seen in Table 2, ours surpasses the others by a large margin in LPIPS [58] (0.3121 vs. 0.4855 for SDS and 0.5982 for DDS), which measures the fidelity to the input SVG, while our CLIP score is on par with the others. This demonstrates that our method introduces only the minimal changes necessary to meet the attributes described in the target text prompts.

We further provide a user study result of SVG editing in Table 2. We use the same user study setup used in NeRF editing (Section 6.1). Consistent with the qualitative and quantitative results, ours are most preferred in human evaluation.

7 Conclusion

We propose Posterior Distillation Sampling (PDS), an optimization method for parametric image editing. PDS matches the stochastic latents of the source and the target to fulfill both conformity to the target text and preservation of the source identity in parameter space. We demonstrate the versatility of PDS in parametric image editing through a comparative analysis between ours and other optimization methods and extensive experiments across various parameter spaces.

Acknowledgements

This work was supported by NRF grant (RS-2023-00209723) and IITP grants (2022-0-00594, RS-2023-00227592) funded by the Korean government (MSIT), Seoul R&BD Program (CY230112), and grants from the DRB-KAIST SketchTheFuture Research Center, Hyundai NGV, KT, NCSOFT, and Samsung Electronics.

References

  • Anonymous [2023] Anonymous. Learning pseudo 3D guidance for view-consistent 3D texturing with 2D diffusion. In Submitted to The Twelfth International Conference on Learning Representations, 2023. under review.
  • Brooks et al. [2023] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In CVPR, 2023.
  • Byeon et al. [2022] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
  • Chen et al. [2023] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3D: Disentangling geometry and appearance for high-quality text-to-3D content creation. In ICCV, 2023.
  • [5] DeepFloyd. Deepfloyd if. https://www.deepfloyd.ai/deepfloyd-if/.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. NeurIPS, 2021.
  • Ghosal et al. [2023] Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. Text-to-audio generation using instruction tuned llm and latent diffusion model. arXiv preprint arXiv:2304.13731, 2023.
  • Han et al. [2023] Ligong Han, Song Wen, Qi Chen, Zhixing Zhang, Kunpeng Song, Mengwei Ren, Ruijiang Gao, Yuxiao Chen, Di Liu, Qilong Zhangli, et al. Improving negative-prompt inversion via proximal guidance. arXiv preprint arXiv:2306.05414, 2023.
  • Haque et al. [2023] Ayaan Haque, Matthew Tancik, Alexei Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-NeRF2NeRF: Editing 3D scenes with instructions. In ICCV, 2023.
  • Hertz et al. [2023a] Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta denoising score. In ICCV, 2023a.
  • Hertz et al. [2023b] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-or. Prompt-to-prompt image editing with cross-attention control. In ICLR, 2023b.
  • Ho and Salimans [2021] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.
  • Huang et al. [2023] Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. arXiv preprint arXiv:2301.12661, 2023.
  • Huberman-Spiegelglas et al. [2023] Inbar Huberman-Spiegelglas, Vladimir Kulikov, and Tomer Michaeli. An edit friendly DDPM noise space: Inversion and manipulations. arXiv preprint arXiv:2304.06140, 2023.
  • Iluz et al. [2023] Shir Iluz, Yael Vinker, Amir Hertz, Daniel Berio, Daniel Cohen-Or, and Ariel Shamir. Word-as-image for semantic typography. ACM TOG, 2023.
  • Jain et al. [2023] Ajay Jain, Amber Xie, and Pieter Abbeel. Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models. In CVPR, 2023.
  • Jun and Nichol [2023] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
  • Kawar et al. [2023] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In CVPR, 2023.
  • Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM TOG, 2023.
  • Koo et al. [2023] Juil Koo, Seungwoo Yoo, Minh Hieu Nguyen, and Minhyuk Sung. SALAD: Part-level latent diffusion for 3d shape generation and manipulation. In ICCV, 2023.
  • Lee et al. [2023] Yuseung Lee, Kunho Kim, Hyunjin Kim, and Minhyuk Sung. Syncdiffusion: Coherent montage via synchronized joint diffusions. In NeurIPS, 2023.
  • Li et al. [2023a] Muheng Li, Yueqi Duan, Jie Zhou, and Jiwen Lu. Diffusion-sdf: Text-to-shape via voxelized diffusion. In CVPR, 2023a.
  • Li et al. [2023b] Yuhan Li, Yishun Dou, Yue Shi, Yu Lei, Xuanhong Chen, Yi Zhang, Peng Zhou, and Bingbing Ni. Focaldreamer: Text-driven 3D editing via focal-fusion assembly. arXiv preprint arXiv:2308.10608, 2023b.
  • Lin et al. [2023] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. In CVPR, 2023.
  • Meng et al. [2022] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
  • Metzer et al. [2023] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-nerf for shape-guided generation of 3D shapes and textures. In CVPR, 2023.
  • [28] Midjourney. Midjourney. https://www.midjourney.com/.
  • Mildenhall et al. [2019] Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM TOG, 2019.
  • Mildenhall et al. [2020] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  • Mirzaei et al. [2023] Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G Derpanis, and Igor Gilitschenski. Watch your steps: Local image and scene editing by text instructions. arXiv preprint arXiv:2308.08947, 2023.
  • Miyake et al. [2023] Daiki Miyake, Akihiro Iohara, Yu Saito, and Toshiyuki Tanaka. Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models. arXiv preprint arXiv:2305.16807, 2023.
  • Mokady et al. [2023] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In CVPR, 2023.
  • Nichol et al. [2022] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
  • Park et al. [2023] Jangho Park, Gihyun Kwon, and Jong Chul Ye. ED-NeRF: Efficient text-guided editing of 3D scene using latent space NeRF. arXiv preprint arXiv:2310.02712, 2023.
  • Poole et al. [2023] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3D using 2D diffusion. In ICLR, 2023.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
  • Raj et al. [2023] Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman, Michael Rubinstein, Jonathan Barron, et al. Dreambooth3D: Subject-driven text-to-3D generation. In ICCV, 2023.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • [40] Daniel Ritchie. Rudimentary framework for running two-alternative forced choice (2afc) perceptual studies on mechanical turk.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  • Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
  • Schuhmann et al. [2021] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
  • Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 2022.
  • Shi et al. [2023] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. arXiv preprint arXiv:2308.16512, 2023.
  • Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
  • Song et al. [2021a] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021a.
  • Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019.
  • Song et al. [2021b] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021b.
  • Wallace et al. [2023] Bram Wallace, Akash Gokul, and Nikhil Naik. EDICT: Exact diffusion inversion via coupled transformations. In CVPR, 2023.
  • Wang et al. [2023a] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In CVPR, 2023a.
  • Wang et al. [2023b] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. In NeurIPS, 2023b.
  • Wu and la Torre [2023] Chen Henry Wu and Fernando De la Torre. A latent space of stochastic diffusion models for zero-shot image editing and guidance. In ICCV, 2023.
  • Xing et al. [2023] Ximing Xing, Chuang Wang, Haitao Zhou, Jing Zhang, Qian Yu, and Dong Xu. Diffsketcher: Text guided vector sketch synthesis through latent diffusion models. In NeurIPS, 2023.
  • Xu et al. [2023] Xudong Xu, Zhaoyang Lyu, Xingang Pan, and Bo Dai. Matlaber: Material-aware text-to-3D via latent BRDF auto-encoder. arXiv preprint arXiv:2308.09278, 2023.
  • Yang et al. [2023] Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  • Zhu and Zhuang [2023] Joseph Zhu and Peiye Zhuang. HiFA: High-fidelity text-to-3D with advanced diffusion guidance. arXiv preprint arXiv:2305.18766, 2023.
  • Zhuang et al. [2023] Jingyu Zhuang, Chen Wang, Lingjie Liu, Liang Lin, and Guanbin Li. Dreameditor: Text-driven 3D scene editing with neural fields. arXiv preprint arXiv:2306.13455, 2023.

Appendix

A.1 Editing 3D Gaussian Splats [20] and 2D Images

Figure A6: Editing of more diverse representations: 3D Gaussian Splats [20] and 2D images. PDS consistently outperforms the baselines. The target attributes are “Batman” and “raising the arms.”

PDS covers a range of editing scenarios and is not confined to a specific parameter space. To further assess its versatility and generalizability, we include both 3D Gaussian Splat (3DGS) [20] editing and 2D image editing. As in NeRF editing, Figure A6 shows that PDS outperforms Instruct-NeRF2NeRF [9] on the 3DGS representation and is the only one of the two methods that realizes geometric changes. In 2D image editing, PDS outperforms Imagic [19], a 2D image editing method built on pre-trained 2D diffusion models. PDS edits the input image while preserving the other details with high fidelity, whereas Imagic [19] leaves artifacts and loses the identity of the source content.

A.2 Derivation of Posterior Distillation Sampling

To derive Equation 4 in full, we first recall that the objective function of PDS is expressed as:

\begin{align}
\mathcal{L}_{\tilde{\mathbf{z}}_t}(\mathbf{x}_0^{\text{tgt}})
&= \mathbb{E}\left[\big\|\tilde{\mathbf{z}}_t^{\text{tgt}} - \tilde{\mathbf{z}}_t^{\text{src}}\big\|_2^2\right] \tag{15} \\
&= \mathbb{E}\left[\Big\|\frac{\mathbf{x}_{t-1}^{\text{tgt}} - \boldsymbol{\mu}_\phi(\mathbf{x}_t^{\text{tgt}}, y^{\text{tgt}}; \boldsymbol{\epsilon}_\phi)}{\sigma_t} - \frac{\mathbf{x}_{t-1}^{\text{src}} - \boldsymbol{\mu}_\phi(\mathbf{x}_t^{\text{src}}, y^{\text{src}}; \boldsymbol{\epsilon}_\phi)}{\sigma_t}\Big\|_2^2\right] \tag{16} \\
&= \mathbb{E}\left[\frac{1}{\sigma_t^2}\big\|(\mathbf{x}_{t-1}^{\text{tgt}} - \mathbf{x}_{t-1}^{\text{src}}) - \left(\boldsymbol{\mu}_\phi(\mathbf{x}_t^{\text{tgt}}, y^{\text{tgt}}; \boldsymbol{\epsilon}_\phi) - \boldsymbol{\mu}_\phi(\mathbf{x}_t^{\text{src}}, y^{\text{src}}; \boldsymbol{\epsilon}_\phi)\right)\big\|_2^2\right]. \tag{17}
\end{align}

Given that $\tilde{\mathbf{z}}_t^{\text{src}}$ and $\tilde{\mathbf{z}}_t^{\text{tgt}}$ share the same noises $\boldsymbol{\epsilon}_{t-1}$ and $\boldsymbol{\epsilon}_t$ for their respective $\mathbf{x}_{t-1}$ and $\mathbf{x}_t$, the difference between $\mathbf{x}_{t-1}^{\text{tgt}}$ and $\mathbf{x}_{t-1}^{\text{src}}$ results in a constant multiple of the difference between $\mathbf{x}_0^{\text{tgt}}$ and $\mathbf{x}_0^{\text{src}}$:

\begin{align}
\mathbf{x}_{t-1}^{\text{tgt}} - \mathbf{x}_{t-1}^{\text{src}} &= \sqrt{\bar{\alpha}_{t-1}}\,(\mathbf{x}_0^{\text{tgt}} - \mathbf{x}_0^{\text{src}}). \tag{18}
\end{align}
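The intermediate step behind Equation 18 can be written out as follows. This is a sketch that assumes the standard DDPM forward marginal $\mathbf{x}_{t} = \sqrt{\bar{\alpha}_{t}}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t}}\,\boldsymbol{\epsilon}_{t}$, the same relation used to pass from Equation 22 to Equation 23, applied at step $t-1$ with the noise $\boldsymbol{\epsilon}_{t-1}$ shared by source and target:

```latex
\begin{align*}
\mathbf{x}_{t-1}^{\text{tgt}} &= \sqrt{\bar{\alpha}_{t-1}}\,\mathbf{x}_0^{\text{tgt}} + \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\epsilon}_{t-1}, \\
\mathbf{x}_{t-1}^{\text{src}} &= \sqrt{\bar{\alpha}_{t-1}}\,\mathbf{x}_0^{\text{src}} + \sqrt{1-\bar{\alpha}_{t-1}}\,\boldsymbol{\epsilon}_{t-1}, \\
\mathbf{x}_{t-1}^{\text{tgt}} - \mathbf{x}_{t-1}^{\text{src}}
&= \sqrt{\bar{\alpha}_{t-1}}\,(\mathbf{x}_0^{\text{tgt}} - \mathbf{x}_0^{\text{src}})
 + \sqrt{1-\bar{\alpha}_{t-1}}\,(\boldsymbol{\epsilon}_{t-1} - \boldsymbol{\epsilon}_{t-1}) \\
&= \sqrt{\bar{\alpha}_{t-1}}\,(\mathbf{x}_0^{\text{tgt}} - \mathbf{x}_0^{\text{src}}).
\end{align*}
```

The noise term cancels only because both branches reuse the same $\boldsymbol{\epsilon}_{t-1}$; with independent noises the difference would retain a stochastic component.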

Following our notation $\hat{\boldsymbol{\epsilon}}_t^{\text{src}} := \boldsymbol{\epsilon}_\phi(\mathbf{x}_t^{\text{src}}, y^{\text{src}}, t)$ and $\hat{\boldsymbol{\epsilon}}_t^{\text{tgt}} := \boldsymbol{\epsilon}_\phi(\mathbf{x}_t^{\text{tgt}}, y^{\text{tgt}}, t)$ introduced in Section 4, the difference between the approximated posterior means is also expressed as follows:

\begin{align}
\boldsymbol{\mu}_\phi(\mathbf{x}_t^{\text{tgt}}, y^{\text{tgt}}; \boldsymbol{\epsilon}_\phi) - \boldsymbol{\mu}_\phi(\mathbf{x}_t^{\text{src}}, y^{\text{src}}; \boldsymbol{\epsilon}_\phi)
&= (\gamma_t + \delta_t\sqrt{\bar{\alpha}_t})(\mathbf{x}_0^{\text{tgt}} - \mathbf{x}_0^{\text{src}}) - \gamma_t\sqrt{\frac{1}{\bar{\alpha}_t} - 1}\,(\hat{\boldsymbol{\epsilon}}_t^{\text{tgt}} - \hat{\boldsymbol{\epsilon}}_t^{\text{src}}), \tag{19}
\end{align}

where $\boldsymbol{\mu}_\phi(\mathbf{x}_t, y; \boldsymbol{\epsilon}_\phi)$ can be expanded as follows:

\begin{align}
\boldsymbol{\mu}_\phi(\mathbf{x}_t, y; \boldsymbol{\epsilon}_\phi)
&= \gamma_t\,\tilde{\mathbf{x}}_0(\mathbf{x}_t, y; \boldsymbol{\epsilon}_\phi) + \delta_t\mathbf{x}_t \tag{20} \\
&= \gamma_t\left(\frac{1}{\sqrt{\bar{\alpha}_t}}\left(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\phi(\mathbf{x}_t, y, t)\right)\right) + \delta_t\mathbf{x}_t \tag{21} \\
&= \left(\frac{\gamma_t}{\sqrt{\bar{\alpha}_t}} + \delta_t\right)\mathbf{x}_t - \gamma_t\sqrt{\frac{1}{\bar{\alpha}_t} - 1}\,\boldsymbol{\epsilon}_\phi(\mathbf{x}_t, y, t) \tag{22} \\
&= (\gamma_t + \delta_t\sqrt{\bar{\alpha}_t})\,\mathbf{x}_0 + \sqrt{\frac{1}{\bar{\alpha}_t} - 1}\,(\gamma_t + \delta_t\sqrt{\bar{\alpha}_t})\,\boldsymbol{\epsilon}_t - \gamma_t\sqrt{\frac{1}{\bar{\alpha}_t} - 1}\,\boldsymbol{\epsilon}_\phi(\mathbf{x}_t, y, t). \tag{23}
\end{align}
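The algebra from Equation 20 to Equation 23 can be checked numerically. The sketch below evaluates the posterior mean both ways on random scalar inputs. It assumes the forward relation $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_t$. The function names and the random values for $\gamma_t$, $\delta_t$, and $\bar{\alpha}_t$ are illustrative placeholders and not the paper's noise schedule, since the identity holds for any such coefficients.

```python
import math
import random


def posterior_mean_direct(x0, eps_t, eps_hat, abar_t, gamma_t, delta_t):
    """Eqs. 20-21: mu = gamma_t * (predicted x0) + delta_t * x_t."""
    x_t = math.sqrt(abar_t) * x0 + math.sqrt(1.0 - abar_t) * eps_t
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(abar_t)
    return gamma_t * x0_pred + delta_t * x_t


def posterior_mean_expanded(x0, eps_t, eps_hat, abar_t, gamma_t, delta_t):
    """Eq. 23: mu written in terms of x0, eps_t and the predicted noise."""
    s = math.sqrt(1.0 / abar_t - 1.0)
    c = gamma_t + delta_t * math.sqrt(abar_t)
    return c * x0 + s * c * eps_t - gamma_t * s * eps_hat


random.seed(0)
max_gap = 0.0
for _ in range(1000):
    # Arbitrary placeholder coefficients (hypothetical, not a real schedule).
    abar_t = random.uniform(0.05, 0.95)
    gamma_t = random.uniform(0.0, 1.0)
    delta_t = random.uniform(0.0, 1.0)
    x0, eps_t, eps_hat = (random.gauss(0.0, 1.0) for _ in range(3))
    a = posterior_mean_direct(x0, eps_t, eps_hat, abar_t, gamma_t, delta_t)
    b = posterior_mean_expanded(x0, eps_t, eps_hat, abar_t, gamma_t, delta_t)
    max_gap = max(max_gap, abs(a - b))

print(max_gap)  # difference should be at floating-point rounding level
```

Because $\boldsymbol{\epsilon}_t$ enters Equation 23 with the same coefficient for source and target, it cancels when the two posterior means are subtracted, which is why Equation 19 contains no $\boldsymbol{\epsilon}_t$ term.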

Incorporating Equation 18 and Equation 19 into Equation 17, we can reformulate the objective function of PDS as follows:

\begin{align}
\mathcal{L}_{\tilde{\mathbf{z}}_t}(\mathbf{x}_0^{\text{tgt}})
&= \mathbb{E}\Big[\frac{1}{\sigma_t^2}\big\|(\sqrt{\bar{\alpha}_{t-1}} - \gamma_t - \delta_t\sqrt{\bar{\alpha}_t})(\mathbf{x}_0^{\text{tgt}} - \mathbf{x}_0^{\text{src}}) + \gamma_t\sqrt{\frac{1}{\bar{\alpha}_t} - 1}\,(\hat{\boldsymbol{\epsilon}}_t^{\text{tgt}} - \hat{\boldsymbol{\epsilon}}_t^{\text{src}})\big\|_2^2\Big] \tag{24} \\
&= \mathbb{E}\Big[\frac{1}{\sigma_t^2}\Big((\sqrt{\bar{\alpha}_{t-1}} - \gamma_t - \delta_t\sqrt{\bar{\alpha}_t})^2(\mathbf{x}_0^{\text{tgt}} - \mathbf{x}_0^{\text{src}})^2 \notag \\
&\qquad + 2(\sqrt{\bar{\alpha}_{t-1}} - \gamma_t - \delta_t\sqrt{\bar{\alpha}_t})\,\gamma_t\sqrt{\frac{1}{\bar{\alpha}_t} - 1}\,(\mathbf{x}_0^{\text{tgt}} - \mathbf{x}_0^{\text{src}})(\hat{\boldsymbol{\epsilon}}_t^{\text{tgt}} - \hat{\boldsymbol{\epsilon}}_t^{\text{src}}) \notag \\
&\qquad + \gamma_t^2\Big(\frac{1}{\bar{\alpha}_t} - 1\Big)(\hat{\boldsymbol{\epsilon}}_t^{\text{tgt}} - \hat{\boldsymbol{\epsilon}}_t^{\text{src}})^2\Big)\Big]. \tag{25}
\end{align}

By taking the gradient of $\mathcal{L}_{\tilde{\mathbf{z}}_t}$ with respect to $\theta$ while ignoring the U-Net Jacobian term $\frac{\partial \hat{\boldsymbol{\epsilon}}_t^{\text{tgt}}}{\partial \mathbf{x}_0^{\text{tgt}}}$, that is, treating $\hat{\boldsymbol{\epsilon}}_t^{\text{tgt}}$ as constant with respect to $\mathbf{x}_0^{\text{tgt}}$, one can obtain PDS as follows:

\begin{align}
\nabla_{\theta}\mathcal{L}_{\text{PDS}} &= \frac{\partial\mathcal{L}_{\tilde{\mathbf{z}}_{t}}(\mathbf{x}_{0}^{\text{tgt}})}{\partial\mathbf{x}_{0}^{\text{tgt}}}\cdot\frac{\partial\mathbf{x}_{0}^{\text{tgt}}}{\partial\theta} \tag{26}\\
&= \mathbb{E}\left[\frac{2}{\sigma_{t}^{2}}\left(\left(\sqrt{\bar{\alpha}_{t-1}}-\gamma_{t}-\delta_{t}\sqrt{\bar{\alpha}_{t}}\right)^{2}(\mathbf{x}_{0}^{\text{tgt}}-\mathbf{x}_{0}^{\text{src}})+\left(\sqrt{\bar{\alpha}_{t-1}}-\gamma_{t}-\delta_{t}\sqrt{\bar{\alpha}_{t}}\right)\gamma_{t}\sqrt{\frac{1}{\bar{\alpha}_{t}}-1}\,(\hat{\boldsymbol{\epsilon}}^{\text{tgt}}_{t}-\hat{\boldsymbol{\epsilon}}^{\text{src}}_{t})\right)\frac{\partial\mathbf{x}_{0}^{\text{tgt}}}{\partial\theta}\right]. \tag{27}
\end{align}

Thus, the coefficients $\psi(t)$ and $\chi(t)$ in Equation 4 are as follows:

\begin{align}
\psi(t) &= \frac{2\left(\sqrt{\bar{\alpha}_{t-1}}-\gamma_{t}-\delta_{t}\sqrt{\bar{\alpha}_{t}}\right)^{2}}{\sigma_{t}^{2}}, \tag{28}\\
\chi(t) &= \frac{2\left(\sqrt{\bar{\alpha}_{t-1}}-\gamma_{t}-\delta_{t}\sqrt{\bar{\alpha}_{t}}\right)}{\sigma_{t}^{2}}\,\gamma_{t}\sqrt{\frac{1}{\bar{\alpha}_{t}}-1}. \tag{29}
\end{align}

In practice, we sample non-consecutive timesteps for $t-1$ and $t$ as in DDIM [48], since the coefficients become $0$ when they are consecutive. Given a sequence of non-consecutive timesteps $[\tau_{i}]_{i=1}^{S}$, a more generalized form of PDS is represented as follows:

\begin{align}
\nabla_{\theta}\mathcal{L}_{\text{PDS}}=\mathbb{E}_{i,\boldsymbol{\epsilon}_{\tau_{i}},\boldsymbol{\epsilon}_{\tau_{i-1}}}\left[\left(\psi(i)(\mathbf{x}_{0}^{\text{tgt}}-\mathbf{x}_{0}^{\text{src}})+\chi(i)(\hat{\boldsymbol{\epsilon}}_{\tau_{i}}^{\text{tgt}}-\hat{\boldsymbol{\epsilon}}_{\tau_{i}}^{\text{src}})\right)\frac{\partial\mathbf{x}_{0}^{\text{tgt}}}{\partial\theta}\right], \tag{30}
\end{align}

where

\begin{align}
\psi(i) &= \frac{2\left(\sqrt{\bar{\alpha}_{\tau_{i-1}}}-\gamma_{\tau_{i}}-\delta_{\tau_{i}}\sqrt{\bar{\alpha}_{\tau_{i}}}\right)^{2}}{\sigma_{\tau_{i}}^{2}}, \tag{31}\\
\chi(i) &= \frac{2\left(\sqrt{\bar{\alpha}_{\tau_{i-1}}}-\gamma_{\tau_{i}}-\delta_{\tau_{i}}\sqrt{\bar{\alpha}_{\tau_{i}}}\right)}{\sigma_{\tau_{i}}^{2}}\,\gamma_{\tau_{i}}\sqrt{\frac{1}{\bar{\alpha}_{\tau_{i}}}-1}. \tag{32}
\end{align}

For more details on timestep sampling, refer to the implementation details in the next section.
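
As a rough numerical illustration of Equations 31 and 32, the following sketch evaluates $\psi$ and $\chi$ for a timestep pair $(\tau_{i-1},\tau_{i})$. This section does not restate how $\gamma$, $\delta$, and $\sigma$ are defined, so the code assumes the standard DDPM posterior coefficients with $\bar{\alpha}_{t-1}$ replaced by $\bar{\alpha}_{\tau_{i-1}}$, together with a linear $\beta$ schedule. The function names `make_schedule` and `pds_coefficients` are hypothetical and are not taken from the released implementation.

```python
import math


def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption). Index 0 is a placeholder with alpha_bar_0 = 1."""
    betas = [0.0] + [beta_start + (beta_end - beta_start) * k / (T - 1) for k in range(T)]
    alphas_bar = [1.0]
    for t in range(1, T + 1):
        alphas_bar.append(alphas_bar[-1] * (1.0 - betas[t]))
    return betas, alphas_bar


def pds_coefficients(s, t, betas, alphas_bar):
    """Return (psi, chi) of Eqs. 31-32 for the pair (s, t) = (tau_{i-1}, tau_i), with 1 <= s < t."""
    abar_s, abar_t, beta_t = alphas_bar[s], alphas_bar[t], betas[t]
    # Assumed DDPM-posterior-style coefficients; the paper's exact definitions
    # of gamma, delta, and sigma for non-consecutive pairs may differ.
    gamma = math.sqrt(abar_s) * beta_t / (1.0 - abar_t)
    delta = math.sqrt(1.0 - beta_t) * (1.0 - abar_s) / (1.0 - abar_t)
    sigma2 = (1.0 - abar_s) / (1.0 - abar_t) * beta_t
    # Factor shared by psi and chi: sqrt(abar_s) - gamma - delta * sqrt(abar_t).
    shared = math.sqrt(abar_s) - gamma - delta * math.sqrt(abar_t)
    psi = 2.0 * shared * shared / sigma2
    chi = 2.0 * shared * gamma * math.sqrt(1.0 / abar_t - 1.0) / sigma2
    return psi, chi


if __name__ == "__main__":
    betas, alphas_bar = make_schedule()
    print("consecutive     (499, 500):", pds_coefficients(499, 500, betas, alphas_bar))
    print("non-consecutive (498, 500):", pds_coefficients(498, 500, betas, alphas_bar))
```

Under these assumptions, the shared factor cancels for a consecutive pair, so both coefficients evaluate to zero up to floating-point error, whereas a pair with a gap of two steps yields small positive values. This is consistent with the motivation for sampling non-consecutive timesteps.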

A.3 Implementation Details

In this section, we provide the implementation details of NeRF and SVG editing presented in Section 6.1 and Section 6.2, respectively.

NeRF Editing.

We run the PDS optimization for $30{,}000$ iterations with classifier-free guidance [12] weights within $[30,100]$, depending on the complexity of the edit. As detailed in Section A.2, we sample non-consecutive timesteps $\tau_{i-1}$ and $\tau_{i}$ since the coefficients $\psi(\cdot)$ and $\chi(\cdot)$ become zero when the sampled timesteps are consecutive. For this, we define non-consecutive timesteps $[\tau_{i}]_{i=1}^{S}$, a subsequence of the total forward process timesteps of the diffusion model, $[1,\dots,T]$. Specifically, we select these timesteps such that $\tau_{i}=\lfloor 2i\rfloor$, resulting in a subsequence of length $S=500$ out of the total $T=1000$ timesteps. We then randomly sample the index $i$ within a ratio range of $[0.02,0.98]$, i.e., $i\sim\mathcal{U}(10,490)$.
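
The timestep sampling described above can be sketched as follows. This is a minimal illustration rather than the released code: the helper name `sample_timestep_pair` is hypothetical, and treating $\mathcal{U}(10,490)$ as a uniform draw over integers with both endpoints included is an assumption.

```python
import random

T = 1000         # total forward-process timesteps
STRIDE = 2       # tau_i = 2 * i
S = T // STRIDE  # length of the subsequence, 500


def sample_timestep_pair(rng, lo_ratio=0.02, hi_ratio=0.98):
    """Draw an index i from the ratio range and return (i, tau_{i-1}, tau_i)."""
    lo = round(lo_ratio * S)  # 10 for the NeRF setting
    hi = round(hi_ratio * S)  # 490 for the NeRF setting
    i = rng.randint(lo, hi)   # both endpoints included (assumption)
    return i, STRIDE * (i - 1), STRIDE * i


if __name__ == "__main__":
    rng = random.Random(0)
    i, tau_prev, tau = sample_timestep_pair(rng)
    print(f"i={i}, tau_(i-1)={tau_prev}, tau_i={tau}")
```

Every sampled pair is separated by two diffusion steps, so the consecutive case in which the coefficients vanish never occurs.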

During the refinement stage, we randomly choose and replace $\tilde{I}_{v}$ every $10$ iterations, over a total of $15{,}000$ iterations. We denote an SDEdit [26] operator by $\mathcal{S}(\mathbf{x}_{0};t_{0},\boldsymbol{\epsilon}_{\phi})$, which samples $\mathbf{x}_{t_{0}}\sim\mathcal{N}(\sqrt{\bar{\alpha}_{t_{0}}}\mathbf{x}_{0},(1-\bar{\alpha}_{t_{0}})\mathbf{I})$ and then denoises it starting from $t_{0}$ using $\boldsymbol{\epsilon}_{\phi}$. For the denoising process, we randomly sample $t_{0}$ within a ratio range of $[0,0.2]$ out of the total $N=20$ denoising steps.
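
The operator $\mathcal{S}$ can be sketched on plain Python lists as below. Several choices here are assumptions made for illustration only: a linear $\beta$ schedule, a deterministic DDIM-style update for the denoising steps, and reading the ratio range $[0,0.2]$ as "start from one of the first $0$ to $4$ levels of a $20$-step schedule". The names `sdedit`, `linear_alphas_bar`, and `eps_model` are hypothetical, and `eps_model(x, t)` stands in for the noise predictor $\boldsymbol{\epsilon}_{\phi}$.

```python
import math
import random


def linear_alphas_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear schedule; index 0 holds 1.0."""
    alphas_bar = [1.0]
    for k in range(T):
        beta = beta_start + (beta_end - beta_start) * k / (T - 1)
        alphas_bar.append(alphas_bar[-1] * (1.0 - beta))
    return alphas_bar


def sdedit(x0, k, eps_model, alphas_bar, schedule, rng):
    """Noise x0 to t0 = schedule[k - 1], then run k deterministic denoising steps to t = 0."""
    if k == 0:
        return list(x0)  # no noise added, nothing to denoise
    a0 = alphas_bar[schedule[k - 1]]
    # Forward sample: x_{t0} ~ N(sqrt(abar_{t0}) x0, (1 - abar_{t0}) I).
    x = [math.sqrt(a0) * v + math.sqrt(1.0 - a0) * rng.gauss(0.0, 1.0) for v in x0]
    for j in range(k - 1, -1, -1):
        t = schedule[j]
        t_prev = schedule[j - 1] if j > 0 else 0
        a_t, a_prev = alphas_bar[t], alphas_bar[t_prev]
        eps = eps_model(x, t)
        # Predict the clean sample, then move to the previous noise level.
        x0_hat = [(xv - math.sqrt(1.0 - a_t) * ev) / math.sqrt(a_t) for xv, ev in zip(x, eps)]
        x = [math.sqrt(a_prev) * xh + math.sqrt(1.0 - a_prev) * ev for xh, ev in zip(x0_hat, eps)]
    return x


if __name__ == "__main__":
    N = 20
    abar = linear_alphas_bar()
    schedule = [50 * (j + 1) for j in range(N)]  # ascending noise levels 50, 100, ..., 1000
    rng = random.Random(0)
    k = rng.randint(0, round(0.2 * N))           # start level within the ratio range [0, 0.2]
    zero_eps = lambda x, t: [0.0] * len(x)       # placeholder predictor, not a trained model
    print(k, sdedit([0.5, -1.0, 2.0], k, zero_eps, abar, schedule, rng))
```

Because the starting level is restricted to the lowest fifth of the schedule, only a small amount of noise is injected, which matches the role of the refinement stage as an artifact-removal step rather than a second round of editing.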

SVG Editing.

Across all optimization methods (SDS [36], DDS [10], and our proposed PDS), we apply the same classifier-free guidance weight of $100$. For SDS [36], we sample $t$ within a ratio range of $[0.05,0.95]$, following VectorFusion [17]. For DDS [10], we follow its original setup, sampling $t$ within $[0.02,0.98]$. For PDS, we sample $i$ within a ratio range of $[0.1,0.98]$.

Figure A7: NeRF editing user study screenshots. The participants are presented with NeRF scene videos and editing prompts, and are asked to answer the following question: When editing the video in the black box as described right next to it, which video do you expect to see? Please choose the most appropriate one.
Figure A8: SVG editing user study screenshots. Given SVG images and editing prompts, the participants are asked to answer the following question: When editing the image in the black box as described right next to it, which image do you expect to see? Please choose the most appropriate one.
Figure A9: The effect of the refinement stage. The overall editing outcomes are determined before the refinement stage, whereas the refinement stage plays the role of removing artifacts. The target attributes are “Batman” and “raising the arms.”

A.4 Details of User Studies

We conduct user studies for the human evaluation of NeRF and SVG editing through Amazon Mechanical Turk. We collect survey responses only from participants who pass our vigilance tasks. To design the vigilance tasks, we create examples in which every choice except the correct answer is replaced with one from a different scene or an unrelated SVG example. Screenshots of our NeRF and SVG editing user studies, including examples of vigilance tasks, are displayed in Figure A7 and Figure A8, respectively. In the NeRF and SVG editing user studies, we received 42 and 17 valid responses, respectively.

A.5 Effect of the Refinement Stage

Figure A9 illustrates an ablation study of the refinement stage across various editing methods. As depicted, the desired complex edit, making the man raise his arms, is achieved solely through the PDS optimization. The overall editing outcomes are realized before the refinement stage, and the refinement stage further enhances the fidelity of the outputs.