Posterior Distillation Sampling

Juil Koo Chanho Park Minhyuk Sung
KAIST
{63days,charlieppark,mhsung}@kaist.ac.kr

Abstract

We introduce Posterior Distillation Sampling (PDS), a novel optimization method for parametric image editing based on diffusion models. Existing optimization-based methods, which leverage the powerful 2D prior of diffusion models to handle various parametric images, have mainly focused on generation. Unlike generation, editing requires a balance between conforming to the target attribute and preserving the identity of the source content. Recent 2D image editing methods have achieved this balance by leveraging the stochastic latent encoded in the generative process of diffusion models. To extend the editing capabilities of diffusion models shown in pixel space to parameter space, we reformulate the 2D image editing method into an optimization form named PDS. PDS matches the stochastic latents of the source and the target, enabling the sampling of targets in diverse parameter spaces that align with a desired attribute while maintaining the source’s identity. We demonstrate that this optimization resembles running a generative process with the target attribute, but aligning this process with the trajectory of the source’s generative process. Extensive editing results in Neural Radiance Fields and Scalable Vector Graphics representations demonstrate that PDS is capable of sampling targets to fulfill the aforementioned balance across various parameter spaces. Our project page is at https://posterior-distillation-sampling.github.io.

Figure 1: Parametric image editing results obtained by Posterior Distillation Sampling (PDS). PDS is an optimization tailored for editing across diverse parameter spaces. It preserves the original details of the source content while aligning them with the input texts.

Figure 2: A comparison of 3D scene editing between PDS and other baselines. Given input 3D scenes on the left, PDS, marked by green boxes on the rightmost side, successfully performs complex editing, such as geometric changes and adding objects, according to the input texts. On the other hand, the baselines either fail to change the input 3D scenes or produce results that greatly deviate from the input scenes, losing their identity.

1 Introduction

Diffusion models [13, 48, 50, 47, 49] have recently led to rapid development in text-conditioned generation and editing across diverse domains, including 2D images [22, 51, 15, 54, 11], 3D objects [18, 34, 23, 21], and audio [14, 7, 57]. Among these, 2D image diffusion models [39, 41, 43, 5, 28] have demonstrated a powerful generative prior, aided by Internet-scale image and text datasets [45, 44, 3]. Nonetheless, this rich 2D generative prior has been confined to pixel space, limiting its broader applicability. A pioneering work overcoming this limitation, DreamFusion [36], introduced Score Distillation Sampling (SDS). It leverages the generative prior of text-to-image diffusion models to synthesize 3D scenes represented by Neural Radiance Fields (NeRFs) [30] from text. Beyond NeRF representations [25, 53, 46, 59, 38, 4, 52], SDS has been widely applied to various parameter spaces, where images are represented not by pixels but by specific parameterizations, such as texture [27, 1], material [56], and Scalable Vector Graphics (SVGs) [17, 55, 16].

While SDS [36] has achieved great advances in generating parametric images, editing is also an essential element for full freedom in handling visual content. Editing differs from generation in that it must account for both the target text and the original source content, thereby emphasizing two key aspects: (1) alignment with the target text prompt and (2) preservation of the source content’s identity. To extend SDS, which lacks the latter aspect, Hertz et al. [10] propose Delta Denoising Score (DDS). DDS reduces the noisy gradients inherent in SDS, leading to better-preserved background details and sharper editing outputs. However, the optimization function of DDS still lacks an explicit term for identity preservation.

To address the lack of source identity preservation in SDS [36] and DDS [10], we turn our attention to recent 2D image editing methods [54, 15] based on diffusion models, known as stochastic diffusion inversion. Their primary objective is to compute the stochastic latent of an input image within the generative process of diffusion models. Once the stochastic latent of a source image is computed, the source image can be edited by running a generative process with new conditions, such as new target text prompts, while feeding the source’s stochastic latent into the process. Feeding the source’s stochastic latent into the target image’s generative process ensures that the target image maintains the structural details of the source while moving in the direction of the target text. Thus, this editing process reflects the aforementioned two key aspects of editing.

To extend the editing capabilities of the stochastic diffusion inversion method from pixel space to parameter space, we reformulate this method into an optimization form named Posterior Distillation Sampling (PDS). Unlike SDS [36] and DDS [10], which match two noise variables, PDS aims to match the stochastic latents of the source and the optimized target. We demonstrate that our optimization process resembles aligning forward process posteriors of the source and the target, ensuring that the target’s generative process trajectory does not significantly deviate from that of the source.

When parametric images come from NeRF [30], Haque et al. [9] have recently introduced a promising text-driven NeRF editing method called Iterative Dataset Update (Iterative DU). To edit 3D scenes, it performs the editing process in 2D space, bypassing direct editing in 3D space. Thus, when a text prompt induces large variations in 2D space across different views, it has difficulty producing the right edit in 3D space. On the other hand, our method directly updates NeRF in 3D space, thus gradually transforming a 3D scene into its edited version in a view-consistent manner, even when text prompts induce large variations such as large geometric changes or the addition of objects to unspecified regions.

Our extensive editing experiment results, including NeRF editing (Section 6.1) and SVG editing (Section 6.2), demonstrate the versatility of our method for parametric image editing. In NeRF editing, we are the first to produce large geometric changes or to add objects to arbitrary regions without specifying local regions to be edited. Figure 2 shows these examples. Qualitative and quantitative comparisons of SVG editing with other optimization methods, namely SDS [36] and DDS [10], have demonstrated that PDS produces only the necessary changes to source SVGs, effectively aligning them with the target prompts.

2 Related Work

2.1 Score Distillation Sampling

Following the remarkable success of diffusion models in text-to-image generation, there have been attempts to leverage the 2D prior of diffusion models for various other types of generative tasks. In these tasks, images are represented through rendering processes with specific parameters, including Neural Radiance Fields [36, 52, 17], texture [1, 27], material [56] and Scalable Vector Graphics (SVGs) [17, 55, 16]. The primary method employed in these tasks is Score Distillation Sampling (SDS). SDS is an optimization approach that updates the rendering parameter towards the image distribution of diffusion models by enforcing the noise prediction on noisy rendered images to match sampled noise. Concurrently, Wang et al. [52] also have introduced Score Jacobian Chaining which converges toward a similar algorithm as SDS but from a different mathematical derivation. Wang et al. [53] have proposed Variational Score Distillation (VSD) to address over-saturation, over-smoothing, and low-diversity problems in SDS [36]. Instead of updating a single data point, VSD updates multiple data points to align an optimized distribution with the diffusion model’s image distribution. Zhu and Zhuang [59] use more accurate predictions of diffusion models via iterative denoising at every SDS update step.

When it comes to editing, Hertz et al. [10] propose Delta Denoising Score (DDS), an adaptation of SDS for editing tasks. It reduces the noisy gradient directions in SDS to better maintain the input image details. Nonetheless, its optimization function lacks an explicit term to preserve the identity of the input image, thus often producing outputs that significantly deviate from the input images. To alleviate this issue, we propose Posterior Distillation Sampling, a novel optimization approach that incorporates a term dedicated to preserving the identity of the source in its optimization function.

2.2 Text-Driven NeRF Editing

Haque et al. [9] have proposed a text-driven NeRF editing method, known as Iterative Dataset Update (Iterative DU). It iteratively replaces reference images, initially used for NeRF [30] reconstruction, with edited images using Instruct-Pix2Pix [2]. By applying a reconstruction loss with these iteratively updated images to an input NeRF [30] scene, the scene is gradually transformed into its edited counterpart. Mirzaei et al. [31] improve Instruct-NeRF2NeRF [9] by computing local regions to be edited. However, this iterative image replacement method struggles with edits that involve large variations across different views, such as complex geometric changes or adding objects to unspecified regions. Thus, these methods have mainly focused on appearance changes.

Instead of the Iterative DU method, several recent works [35, 24, 60] directly apply SDS [36] or DDS [10] to NeRF editing. However, these optimizations do not fully consider the preservation of the source’s identity and are thus prone to producing outputs that substantially diverge from the input scenes. In contrast, our novel optimization inherently guarantees the preservation of the source’s identity, facilitating involved NeRF editing while maintaining the identity of the original scene.

2.3 Diffusion Inversion

Diffusion inversion computes the latent representation of an input image encoded in diffusion models. This allows for real image editing by finding the corresponding latent that can fairly reconstruct the given image. The computed latent is then decoded into a new image through a generative process. Using the deterministic generative process of Denoising Diffusion Implicit Models (DDIM) [48], one can approximately run the ODE of the generative process in reverse [48, 6], referred to as DDIM inversion. Several recent works have improved DDIM inversion by adjusting text features [33, 8, 32], introducing new cross-attention maps during a generative process [11] or alternatively coupling intermediate latents from two inversion trajectories [51]. Meanwhile, an alternative approach, known as DDPM inversion [15, 54], employs the stochastic generative process of Denoising Diffusion Probabilistic Models (DDPM) [13]. They focus on capturing the structural details of an input image encoded in its stochastic latent. We extend the editing capabilities of this DDPM inversion method to parameter space by reformulating the method into an optimization form.

3 Preliminaries

We first discuss existing optimization-based approaches to handle parametric images, then introduce our novel parametric image editing method in Section 4.

3.1 Score Distillation Sampling (SDS) [36]

Score Distillation Sampling (SDS) [36] was proposed to generate parametric images by leveraging the 2D prior of pre-trained text-to-image diffusion models. Given input data $\mathbf{x}_{0}$ and a text prompt $y$, the training objective of diffusion models is to predict the injected noise $\boldsymbol{\epsilon}_{t}$ using a noise predictor $\boldsymbol{\epsilon}_{\phi}$:

$$\mathcal{L}(\mathbf{x}_{0})=\mathbb{E}_{t\sim\mathcal{U}(0,1),\boldsymbol{\epsilon}_{t}}\left[w(t)\|\boldsymbol{\epsilon}_{\phi}(\mathbf{x}_{t},y,t)-\boldsymbol{\epsilon}_{t}\|_{2}^{2}\right],\tag{1}$$

where $w(t)$ is a weighting function and $\mathbf{x}_{t}$ results from the forward process of diffusion models:

$$\mathbf{x}_{t}:=\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}_{t},\quad\boldsymbol{\epsilon}_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I})\tag{2}$$

with variance schedule variables $\bar{\alpha}_{t}:=\prod_{s=1}^{t}\alpha_{s}$. When the input data $\mathbf{x}_{0}$ is generated by a differentiable image generator $\mathbf{x}_{0}=g(\theta)$, parameterized by $\theta$, SDS updates $\theta$ by backpropagating the gradient of Equation 1 while omitting the U-Net Jacobian term $\frac{\partial\boldsymbol{\epsilon}_{\phi}}{\partial\mathbf{x}_{t}}$ for computational efficiency:

$$\nabla_{\theta}\mathcal{L}_{\text{SDS}}(\mathbf{x}_{0}=g(\theta))=\mathbb{E}_{t,\boldsymbol{\epsilon}_{t}}\left[w(t)(\boldsymbol{\epsilon}_{\phi}(\mathbf{x}_{t},y,t)-\boldsymbol{\epsilon}_{t})\frac{\partial\mathbf{x}_{0}}{\partial\theta}\right],\tag{3}$$

where we denote a noise prediction of diffusion models with classifier-free guidance [12] by $\boldsymbol{\epsilon}_{\phi}$ for simplicity. Through this optimization process, SDS is capable of generating a parametric image which conforms to the input text prompt $y$ .
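As a rough illustration of Equation 3, the following Python sketch estimates the SDS gradient on a toy problem. Everything beyond the equations is an assumption made for the example: the generator is the identity $g(\theta)=\theta$, the weighting is $w(t)=1$, the schedule is a linear $\beta$ schedule, and the learned noise predictor is replaced by the closed-form optimal predictor for a Gaussian data distribution. The function names are illustrative and do not come from the paper.

```python
import numpy as np

def make_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Return alpha_bar[0..T] with alpha_bar[0] = 1 (linear beta schedule, an assumption)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.concatenate([[1.0], np.cumprod(1.0 - betas)])

def eps_gaussian(x_t, abar_t, mean, std):
    """Optimal noise predictor when p(x_0 | y) = N(mean, std^2 I); toy stand-in for eps_phi."""
    var_t = abar_t * std**2 + (1.0 - abar_t)
    return np.sqrt(1.0 - abar_t) * (x_t - np.sqrt(abar_t) * mean) / var_t

def sds_grad(theta, mean, std, alpha_bar, rng, batch=64):
    """Monte Carlo estimate of Equation 3 with g(theta) = theta and w(t) = 1."""
    T = len(alpha_bar) - 1
    t = rng.integers(1, T + 1, size=(batch, 1))       # one timestep per sample
    abar_t = alpha_bar[t]                             # shape (batch, 1)
    eps = rng.standard_normal((batch, theta.size))
    x_t = np.sqrt(abar_t) * theta + np.sqrt(1.0 - abar_t) * eps  # Equation 2
    residual = eps_gaussian(x_t, abar_t, mean, std) - eps
    return residual.mean(axis=0)                      # d x_0 / d theta is the identity here

def run_sds(theta0, mean, std=0.5, steps=300, lr=0.1, seed=0):
    """Plain gradient descent on theta using the estimated SDS gradient."""
    rng = np.random.default_rng(seed)
    alpha_bar = make_schedule()
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta -= lr * sds_grad(theta, mean, std, alpha_bar, rng)
    return theta
```

In this toy setting the update pulls $\theta$ toward the mean of the conditional distribution regardless of where it started, which is the behavior expected of a generation-oriented objective.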

3.2 Delta Denoising Score (DDS) [10]

Even though SDS has been widely used for various parametric images, its optimization is designed for generation, thus it does not reflect one of the key aspects of editing: preserving the source identity.

To extend SDS to editing, Hertz et al. [10] have proposed Delta Denoising Score (DDS). Given source data $\mathbf{x}^{\text{src}}$ and its corresponding text prompt $y^{\text{src}}$, the goal of DDS is to synthesize new target data $\mathbf{x}^{\text{tgt}}$ that is aligned with a target text prompt $y^{\text{tgt}}$. In the SDS gradient of Equation 3, DDS replaces the randomly sampled noise $\boldsymbol{\epsilon}_{t}$ with a noise prediction given a source data-text pair, $\boldsymbol{\epsilon}_{\phi}(\mathbf{x}^{\text{src}}_{t},y^{\text{src}},t)$:

$$\nabla_{\theta}\mathcal{L}_{\text{DDS}}=\mathbb{E}_{t,\boldsymbol{\epsilon}_{t}}\left[w(t)\left(\boldsymbol{\epsilon}_{\phi}(\mathbf{x}^{\text{tgt}}_{t},y^{\text{tgt}},t)-\boldsymbol{\epsilon}_{\phi}(\mathbf{x}^{\text{src}}_{t},y^{\text{src}},t)\right)\frac{\partial\mathbf{x}^{\text{tgt}}_{0}}{\partial\theta}\right],\tag{4}$$

where the same noise $\boldsymbol{\epsilon}_{t}$ is shared for $\mathbf{x}_{t}^{\text{src}}$ and $\mathbf{x}_{t}^{\text{tgt}}$ :

$$\begin{aligned}\boldsymbol{\epsilon}_{t}&\sim\mathcal{N}(\mathbf{0},\mathbf{I}),\\ \mathbf{x}_{t}^{\text{src}}&=\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}^{\text{src}}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}_{t},\\ \mathbf{x}_{t}^{\text{tgt}}&=\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}^{\text{tgt}}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}_{t}.\end{aligned}\tag{5}$$

While DDS extends SDS for editing tasks, it lacks an explicit term in its optimization to preserve the identity of the source. As a result, DDS is still prone to produce editing results that significantly deviate from the source.
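To make Equations 4 and 5 concrete, here is a minimal single-sample sketch of the DDS direction. It rests on the same kind of assumptions as any toy example: an identity generator, $w(t)=1$, and a closed-form Gaussian noise predictor standing in for $\boldsymbol{\epsilon}_{\phi}$, with each "prompt" represented by the mean of its class. The names `eps_gaussian`, `forward` and `dds_grad` are hypothetical.

```python
import numpy as np

def eps_gaussian(x_t, abar_t, mean, std):
    """Optimal noise predictor when p(x_0 | y) = N(mean, std^2 I); toy stand-in for eps_phi."""
    var_t = abar_t * std**2 + (1.0 - abar_t)
    return np.sqrt(1.0 - abar_t) * (x_t - np.sqrt(abar_t) * mean) / var_t

def forward(x0, abar_t, eps):
    """Forward process of Equation 5: noise x0 to level abar_t with the given noise."""
    return np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps

def dds_grad(x_tgt, x_src, abar_t, eps, mean_tgt, mean_src, std=0.5):
    """Single-sample DDS direction (Equation 4) with g(theta) = theta and w(t) = 1.

    The same noise `eps` is used for the source and the target, as in Equation 5.
    """
    eps_tgt = eps_gaussian(forward(x_tgt, abar_t, eps), abar_t, mean_tgt, std)
    eps_src = eps_gaussian(forward(x_src, abar_t, eps), abar_t, mean_src, std)
    return eps_tgt - eps_src

# Usage: start the target at the source and query the update direction.
rng = np.random.default_rng(0)
x_src = np.array([-2.0, 0.0])
mean_src, mean_tgt = np.array([-2.0, 0.0]), np.array([2.0, 0.0])
grad = dds_grad(x_src.copy(), x_src, 0.5, rng.standard_normal(2), mean_tgt, mean_src)
```

In this Gaussian toy case the shared noise cancels exactly between the two predictions, so the direction does not depend on the sampled noise. With a learned predictor the cancellation would only be approximate.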

3.3 Stochastic Latent in Generative Process

To achieve both conformity to the text and preservation of the source’s identity, we turn our attention to the rich information encoded in the stochastic generative process of DDPM [13]. When $\beta_{t}:=1-\alpha_{t}$ are small, it is well-known that the posterior of the forward process also follows a Gaussian distribution according to a property of Gaussians. The forward process posteriors are represented as:

$$q(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})=\mathcal{N}(\boldsymbol{\mu}(\mathbf{x}_{t},\mathbf{x}_{0}),\sigma_{t}\mathbf{I}),\tag{6}$$

where $\sigma_{t}:=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}$ and the posterior mean $\boldsymbol{\mu}$ is a linear combination of $\mathbf{x}_{0}$ and $\mathbf{x}_{t}$: $\boldsymbol{\mu}(\mathbf{x}_{t},\mathbf{x}_{0}):=\gamma_{t}\mathbf{x}_{0}+\delta_{t}\mathbf{x}_{t}$ with $\gamma_{t}:=\frac{\sqrt{\bar{\alpha}_{t-1}}(1-\alpha_{t})}{1-\bar{\alpha}_{t}}$ and $\delta_{t}:=\frac{\sqrt{\alpha_{t}}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_{t}}$.
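The coefficients above are simple functions of the variance schedule. The sketch below tabulates them for all timesteps, following the definitions in the text literally (including $\sigma_{t}$ exactly as written) and using the convention $\bar{\alpha}_{0}=1$. The function name and the array layout, with index 0 used as padding, are choices made for this example.

```python
import numpy as np

def posterior_coeffs(betas):
    """Return (gamma, delta, sigma, alpha_bar), each indexed by t = 0..T.

    Index 0 is padding so that array index t matches timestep t, with alpha_bar[0] = 1.
    """
    betas = np.concatenate([[0.0], np.asarray(betas, dtype=float)])  # betas[t] for t >= 1
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)                        # alpha_bar[0] = 1
    abar_prev = np.concatenate([[1.0], alpha_bar[:-1]])   # alpha_bar[t-1]
    denom = 1.0 - alpha_bar
    denom[0] = 1.0                                        # avoid 0/0 at the padding index
    gamma = np.sqrt(abar_prev) * betas / denom            # weight on x_0
    delta = np.sqrt(alphas) * (1.0 - abar_prev) / denom   # weight on x_t
    sigma = (1.0 - abar_prev) / denom * betas             # sigma_t as defined in the text
    return gamma, delta, sigma, alpha_bar

# Usage with a linear schedule (the schedule values are an assumption).
gamma, delta, sigma, alpha_bar = posterior_coeffs(np.linspace(1e-4, 2e-2, 1000))
```

A useful sanity check is that the posterior mean is consistent with the forward marginals, that is, $\gamma_{t}+\delta_{t}\sqrt{\bar{\alpha}_{t}}=\sqrt{\bar{\alpha}_{t-1}}$ for every $t\geq 1$. Note also that $\sigma_{1}=0$ under the convention $\bar{\alpha}_{0}=1$, so any expression that divides by $\sigma_{t}$ needs $t\geq 2$.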

Since $\mathbf{x}_{0}$ is unknown during a generative process, we approximate $\mathbf{x}_{0}$ with a one-step denoised estimate as follows:

$$\tilde{\mathbf{x}}_{0}(\mathbf{x}_{t},y;\boldsymbol{\epsilon}_{\phi}):=\frac{1}{\sqrt{\bar{\alpha}_{t}}}(\mathbf{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}_{\phi}(\mathbf{x}_{t},y,t)).\tag{7}$$

Consequently, one step of the generative process is represented as follows:

$$\mathbf{x}_{t-1}=\boldsymbol{\mu}_{\phi}(\mathbf{x}_{t},y;\boldsymbol{\epsilon}_{\phi})+\sigma_{t}\mathbf{z}_{t},\quad\mathbf{z}_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),\tag{8}$$

where $\boldsymbol{\mu}_{\phi}(\mathbf{x}_{t},y;\boldsymbol{\epsilon}_{\phi})=\gamma_{t}\tilde{\mathbf{x}}_{0}(\mathbf{x}_{t},y;\boldsymbol{\epsilon}_{\phi})+\delta_{t}\mathbf{x}_{t}$.

Using Equation 8, one can compute stochastic latent $\tilde{\mathbf{z}}_{t}$ that captures the structural details of $\mathbf{x}_{0}$ . This involves computing $\mathbf{x}_{t}$ and $\mathbf{x}_{t-1}$ via the forward process and then rearranging Equation 8 as follows:

$$\tilde{\mathbf{z}}_{t}(\mathbf{x}_{0},y;\boldsymbol{\epsilon}_{\phi})=\frac{\mathbf{x}_{t-1}-\boldsymbol{\mu}_{\phi}(\mathbf{x}_{t},y;\boldsymbol{\epsilon}_{\phi})}{\sigma_{t}}.\tag{9}$$

Several recent works [54, 15], known as DDPM inversion, have utilized the stochastic latent for image editing tasks. To edit an image using $\tilde{\mathbf{z}}_{t}$ , they first pre-compute $\tilde{\mathbf{z}}_{t}$ of the source image across all $t$ in the generative process. They then run a new generative process with a new target prompt while incorporating the pre-computed $\tilde{\mathbf{z}}_{t}$ of the source into the process instead of randomly sampled noise $\mathbf{z}_{t}$ .
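The computation in Equations 7 to 9 can be sketched as follows. This is a minimal illustration rather than the implementation used in [54, 15]: the linear schedule, the helper names, and the choice to pass the noise prediction in as an array (instead of calling a network) are all assumptions, and $\sigma_{t}$ is taken exactly as defined in the text.

```python
import numpy as np

def schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumption); index 0 is padding so alpha_bar[0] = 1."""
    betas = np.concatenate([[0.0], np.linspace(beta_start, beta_end, T)])
    return betas, np.cumprod(1.0 - betas)

def noisy_pair(x0, t, alpha_bar, rng):
    """Sample x_t and x_{t-1} from x0 via the forward process, with independent noises."""
    eps_t, eps_prev = rng.standard_normal(x0.shape), rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps_t
    x_prev = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps_prev
    return x_t, x_prev

def posterior_mean(x_t, eps_pred, t, betas, alpha_bar):
    """mu_phi of Equation 8, built from the one-step estimate of Equation 7."""
    abar_t, abar_prev, beta_t = alpha_bar[t], alpha_bar[t - 1], betas[t]
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)  # Equation 7
    gamma = np.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    delta = np.sqrt(1.0 - beta_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return gamma * x0_hat + delta * x_t

def sigma(t, betas, alpha_bar):
    """sigma_t as defined alongside Equation 6."""
    return (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * betas[t]

def stochastic_latent(x_t, x_prev, eps_pred, t, betas, alpha_bar):
    """Equation 9; requires t >= 2 because sigma_1 = 0 when alpha_bar[0] = 1."""
    mu = posterior_mean(x_t, eps_pred, t, betas, alpha_bar)
    return (x_prev - mu) / sigma(t, betas, alpha_bar)
```

For inversion, `noisy_pair` would supply `x_t` and `x_prev` from the source image, and `eps_pred` would be the network's prediction at `x_t` under the source prompt. Feeding the resulting latent back into Equation 8 reproduces `x_prev` by construction.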

Although these works [54, 15] have utilized the rich information encoded in $\tilde{\mathbf{z}}_{t}$ for an editing purpose, their applications have been limited within 2D-pixel space due to reliance on the generative process. In our work, we broaden the application of the stochastic latent to parameter space by reformulating the method as an optimization form, enabling parametric image editing.

4 Posterior Distillation Sampling

Here, we introduce Posterior Distillation Sampling (PDS), a novel optimization function designed for parametric image editing.

Our objective is to synthesize $\mathbf{x}_{0}^{\text{tgt}}$ that is aligned with $y^{\text{tgt}}$ while it retains the identity of $\mathbf{x}_{0}^{\text{src}}$ . To achieve this, we employ the stochastic latent $\tilde{\mathbf{z}}_{t}$ in our optimization. For simplicity, we denote the stochastic latents of the source and the target as follows:

$$\tilde{\mathbf{z}}_{t}^{\text{src}}:=\tilde{\mathbf{z}}_{t}(\mathbf{x}_{0}^{\text{src}},y^{\text{src}};\boldsymbol{\epsilon}_{\phi})\tag{10}$$
$$\tilde{\mathbf{z}}_{t}^{\text{tgt}}:=\tilde{\mathbf{z}}_{t}(\mathbf{x}_{0}^{\text{tgt}},y^{\text{tgt}};\boldsymbol{\epsilon}_{\phi}).\tag{11}$$

Using the stochastic latents, we define a novel objective function as follows:

$$\mathcal{L}_{\tilde{\mathbf{z}}_{t}}(\mathbf{x}_{0}^{\text{tgt}}=g(\theta)):=\mathbb{E}_{t,\boldsymbol{\epsilon}_{t-1},\boldsymbol{\epsilon}_{t}}\left[\|\tilde{\mathbf{z}}_{t}^{\text{tgt}}-\tilde{\mathbf{z}}_{t}^{\text{src}}\|_{2}^{2}\right],\tag{12}$$

where, similar to Equation 5, $\tilde{\mathbf{z}}^{\text{src}}_{t}$ and $\tilde{\mathbf{z}}^{\text{tgt}}_{t}$ share the same noises, denoted by $\boldsymbol{\epsilon}_{t-1}$ and $\boldsymbol{\epsilon}_{t}$, when computing their respective $\mathbf{x}_{t-1}$ and $\mathbf{x}_{t}$.

Rather than matching noise variables as in SDS [36] and DDS [10], we match the stochastic latents of the source and the target via the optimization. By taking the gradient of $\mathcal{L}_{\tilde{\mathbf{z}}_{t}}$ with respect to $\theta$ and ignoring the U-Net Jacobian term as in previous works [36, 10, 52], one can obtain PDS as follows:

$$\nabla_{\theta}\mathcal{L}_{\text{PDS}}:=\mathbb{E}_{t,\boldsymbol{\epsilon}_{t},\boldsymbol{\epsilon}_{t-1}}\left[w(t)(\tilde{\mathbf{z}}_{t}^{\text{tgt}}-\tilde{\mathbf{z}}_{t}^{\text{src}})\frac{\partial\mathbf{x}_{0}^{\text{tgt}}}{\partial\theta}\right].\tag{13}$$

Expanding Equation 13, the following detailed formulation is derived:

$$\nabla_{\theta}\mathcal{L}_{\text{PDS}}:=\mathbb{E}_{t,\boldsymbol{\epsilon}_{t},\boldsymbol{\epsilon}_{t-1}}\left[(\psi(t)(\mathbf{x}_{0}^{\text{tgt}}-\mathbf{x}_{0}^{\text{src}})+\chi(t)(\hat{\boldsymbol{\epsilon}}_{t}^{\text{tgt}}-\hat{\boldsymbol{\epsilon}}_{t}^{\text{src}}))\frac{\partial\mathbf{x}_{0}^{\text{tgt}}}{\partial\theta}\right],\tag{14}$$

where $\hat{\boldsymbol{\epsilon}}_{t}^{\text{src}}:=\boldsymbol{\epsilon}_{\phi}(\mathbf{x}_{t}^{\text{src}},y^{\text{src}},t)$ and $\hat{\boldsymbol{\epsilon}}_{t}^{\text{tgt}}:=\boldsymbol{\epsilon}_{\phi}(\mathbf{x}_{t}^{\text{tgt}},y^{\text{tgt}},t)$. We leave a more detailed derivation to the supplementary material.

Matching $\tilde{\mathbf{z}}_{t}^{\text{tgt}}$ with $\tilde{\mathbf{z}}_{t}^{\text{src}}$ ensures that the posteriors of $\mathbf{x}_{0}^{\text{tgt}}$ and $\mathbf{x}_{0}^{\text{src}}$ do not significantly diverge, despite being steered by different prompts, $y^{\text{tgt}}$ and $y^{\text{src}}$. This approach is akin to running a generative process with $y^{\text{tgt}}$ while remaining near the trajectory made by the posteriors of $\mathbf{x}_{0}^{\text{src}}$. Consequently, PDS enables the sampling of $\mathbf{x}_{0}^{\text{tgt}}$ that aligns with $y^{\text{tgt}}$, while also retaining the identity of $\mathbf{x}_{0}^{\text{src}}$. This is achieved through the distillation of the posteriors of $\mathbf{x}_{0}^{\text{src}}$ into the target sampling process.
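A single-sample version of Equation 13 can be sketched by computing both stochastic latents directly from Equation 9 and taking their difference, rather than going through $\psi(t)$ and $\chi(t)$, whose closed forms are deferred to the supplementary material. The sketch assumes an identity generator, $w(t)=1$, a linear schedule, and a closed-form Gaussian noise predictor in place of $\boldsymbol{\epsilon}_{\phi}$, with each "prompt" represented by a class mean. All function names are hypothetical.

```python
import numpy as np

def schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear beta schedule (an assumption); index 0 is padding so alpha_bar[0] = 1."""
    betas = np.concatenate([[0.0], np.linspace(beta_start, beta_end, T)])
    return betas, np.cumprod(1.0 - betas)

def eps_gaussian(x_t, abar_t, mean, std):
    """Optimal noise predictor when p(x_0 | y) = N(mean, std^2 I); toy stand-in for eps_phi."""
    var_t = abar_t * std**2 + (1.0 - abar_t)
    return np.sqrt(1.0 - abar_t) * (x_t - np.sqrt(abar_t) * mean) / var_t

def z_tilde(x0, mean, t, eps_t, eps_prev, betas, alpha_bar, std=0.5):
    """Stochastic latent of Equation 9 for x0 under the toy 'prompt' given by `mean`."""
    abar_t, abar_prev, beta_t = alpha_bar[t], alpha_bar[t - 1], betas[t]
    x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps_t
    x_prev = np.sqrt(abar_prev) * x0 + np.sqrt(1.0 - abar_prev) * eps_prev
    eps_pred = eps_gaussian(x_t, abar_t, mean, std)
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(abar_t)   # Equation 7
    gamma = np.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    delta = np.sqrt(1.0 - beta_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    sigma_t = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t                 # as defined in the text
    return (x_prev - (gamma * x0_hat + delta * x_t)) / sigma_t

def pds_grad(x_tgt, x_src, mean_tgt, mean_src, t, rng, betas, alpha_bar):
    """Single-sample PDS direction (Equation 13) with g(theta) = theta and w(t) = 1."""
    eps_t = rng.standard_normal(x_tgt.shape)      # shared between source and target
    eps_prev = rng.standard_normal(x_tgt.shape)   # shared between source and target
    z_tgt = z_tilde(x_tgt, mean_tgt, t, eps_t, eps_prev, betas, alpha_bar)
    z_src = z_tilde(x_src, mean_src, t, eps_t, eps_prev, betas, alpha_bar)
    return z_tgt - z_src
```

As expected from Equation 12, the direction vanishes when the target coincides with the source and both are evaluated under the same prompt, since the two latents are then identical.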

4.1 Comparison with SDS [36] and DDS [10]

In Figure 3, we visually illustrate the difference among the three optimization methods: SDS [36], DDS [10] and PDS. Here, we model a 2D distribution $\mathbf{x}_{0}\sim p(\mathbf{x}_{0})$, $\mathbf{x}_{0}\in\mathbb{R}^{2}$, that is separated into two marginals, $p(\mathbf{x}_{0}|y=1)$ and $p(\mathbf{x}_{0}|y=2)$, which are colored red and blue, respectively. Then, we train a diffusion model conditioned on the class labels $y$. Using the pre-trained conditional diffusion model, we aim to transition $\mathbf{x}_{0}^{\text{tgt}}$, starting from $\mathbf{x}_{0}^{\text{src}}\sim p(\mathbf{x}_{0}|y=1)$, towards the other marginal $p(\mathbf{x}_{0}|y=2)$. The trajectories of the three optimization methods are plotted in Figure 3 with their endpoints denoted by stars. As illustrated, SDS and DDS significantly displace the data from the initial position, whereas our method terminates near the boundary of the two marginals. This is the optimal endpoint for editing purposes, as it indicates proximity to both the starting point and $p(\mathbf{x}_{0}|y=2)$, thereby achieving a balance between the necessary change and the original identity.

4.2 Comparison with Iterative DU

When a parameterization of images is given as NeRF [30], recent works [9, 31] have shown promising NeRF editing results based on a method known as Iterative Dataset Update (Iterative DU). This method bypasses 3D editing by performing the editing process within 2D space. Given an image dataset $\{I^{\text{src}}_{v}\}_{v=1}^{N}$ used for NeRF [30] reconstruction with viewpoints $v$ , they randomly replace $I^{\text{src}}_{v}$ with its 2D edited version using Instruct-Pix2Pix (IP2P) [2]. By iteratively updating the input images, they progressively transform the input NeRF scene into an edited version of it.

In contrast to Iterative DU, which performs editing in 2D space, our approach directly edits NeRFs [30] in 3D space. To visually demonstrate this difference, Figure 4 presents a qualitative comparison of ours and various methods based on Iterative DU. Specifically, we compare ours with Instruct-NeRF2NeRF (IN2N) [9], which uses IP2P [2] for 2D editing. Additionally, we include another Iterative-DU-based method, Inversion2NeRF (Inv2N), which employs DDPM inversion [15] for its 2D editing process. Given the prompt “raising his arms”, the figure illustrates significant variations in 2D edited images across different views: the man raises either only one arm or both arms, as marked by the red circle. Furthermore, the red arrow highlights the inconsistency in the poses of raising arms across different views. Such notable discrepancies in 2D editing hinder the Iterative DU methods from transferring these edits into 3D space. Particularly noteworthy is the comparison of our method with Inv2N, both of which leverage the stochastic latent for editing. However, while Inv2N confines its editing to 2D space, ours directly updates NeRF parameters in 3D space by reformulating the 2D image editing method [15] into an optimization form. Consequently, as shown in Figure 4 and Figure 2, ours is the only one to facilitate complex geometric changes and the addition of objects in 3D scenes. This demonstrates that the strength of our method lies in the novel optimization design, which allows for direct 3D editing, rather than merely in the editing capabilities of DDPM inversion [15].

5 NeRF Editing with PDS

As one of the applications of PDS, we present a detailed pipeline for NeRF [30] editing. NeRF can be seen as a parameterized rendering function. The rendering process is expressed as $I_{v}=g(v;\theta)$ , where the function takes a specific viewpoint $v$ to render the image $I_{v}$ at that viewpoint with the rendering parameter $\theta$ . Using the publicly available Stable Diffusion [41] as our diffusion prior model, we encode the current rendering at viewpoint $v$ to obtain the target latent $\mathbf{x}_{0,v}^{\text{tgt}}$ : $\mathbf{x}^{\text{tgt}}_{0,v}:=\mathcal{E}(g(v;\theta))$ , where $\mathcal{E}$ is a pre-trained encoder. Similarly, given the original source images $\{I^{\text{src}}_{v}\}$ used for NeRF [30] reconstruction, the source latent $\mathbf{x}^{\text{src}}_{0,v}$ is also computed by encoding the source image at viewpoint $v$ : $\mathbf{x}^{\text{src}}_{0,v}:=\mathcal{E}(I_{v}^{\text{src}})$ .

For real scenes, there are no given source prompts. Thus, we manually create descriptions for the real scenes, such as “a photo of a man” in Figure 1. For target prompts $y^{\text{tgt}}$, we adjust $y^{\text{src}}$ by appending a description of a desired attribute (e.g., “…raising his arms” in Figure 4) or by substituting an existing word in $y^{\text{src}}$ with a new one, such as changing “deer doll” to “unicorn doll” in the last row of Figure 2. Given a pre-fixed set of viewpoints $\{v\}$, we randomly select a viewpoint $v$ to compute $\mathbf{x}_{0,v}^{\text{src}}$ and $\mathbf{x}_{0,v}^{\text{tgt}}$. The pairs $(\mathbf{x}_{0,v}^{\text{src}},y^{\text{src}})$ and $(\mathbf{x}_{0,v}^{\text{tgt}},y^{\text{tgt}})$ are fed into the PDS optimization to update $\theta$ in a direction dictated by the target prompt. After the optimization, the updated NeRF parameter $\tilde{\theta}$ renders an edited 3D scene that is aligned with the target prompt: $\tilde{I}_{v}:=g(v;\tilde{\theta})$.

To further improve the final output, we take a refinement stage inspired by DreamBooth3D [38]. During iterations of the refinement stage, we randomly select an edited rendering $\tilde{I}_{v}$ and refine it into a more realistic-looking image using SDEdit [26]. The edited NeRF scenes through PDS optimization are then further refined by a reconstruction loss with these repeatedly updated images.

In some cases of source prompts we create, we observe some gap between the ideal text prompt, which would ideally reconstruct the input image through the generative process, and the actual prompt we provide. To alleviate this discrepancy issue, we have found it effective to finetune the Stable Diffusion [41] with $\{I^{\text{src}}_{v}\}$ and $y^{\text{src}}$ following the DreamBooth [42] setup.

6 Experiment Results

In this section, we conduct editing experiments across two types of parameterized images. Section 6.1 presents NeRF editing results, comparing our NeRF editing capabilities to the state-of-the-art NeRF editing methods. Furthermore, Section 6.2 shows SVG editing results to compare PDS against other optimization methods, namely SDS [36] and DDS [10].

6.1 NeRF Editing

Datasets.

We use real scenes we capture as well as the scenes from IN2N [9] and LLFF [29]. The total number of scenes is $13$ , and the final number of pairs of source scenes and target text prompts is $37$ with multiple target prompts for each scene.

Baselines.

For extensive comparisons, we evaluate our method against three baselines: Instruct-NeRF2NeRF (IN2N) [9], DDS [10] and Inversion2NeRF (Inv2N). First, we compare ours with IN2N [9], which is a state-of-the-art NeRF editing method with its code publicly available. Additionally, as introduced in Section 4.2, we conduct a comparison with Inv2N, another method based on Iterative DU, which performs editing within 2D space rather than directly in 3D space, but employs DDPM inversion [15] instead of IP2P [2] for 2D editing.

Results.

Figure 2 presents the qualitative comparisons of NeRF editing. Notably, as depicted in rows 1 and 2, our method is the only one that makes large geometric changes in 3D scenes from the input text, folding the man’s arms to create natural poses of him reading a book or drinking coffee. In contrast, Iterative-DU-based methods like IN2N [9] and Inv2N fail to produce the right edits in 3D space. DDS [10] produces outputs that completely lose the identity of the input scenes, focusing solely on conforming to the input texts. Rows 3 and 4 of Figure 2 show the editing scenarios of adding objects in outdoor scenes without specifying local regions, which also leads to large variations. Here, our method successfully adds objects like windmills and hot air balloons to the input scenes while maintaining their background details. On the other hand, the baselines either fail to add the objects in 3D space or produce outputs that significantly deviate from the original scenes. When it comes to appearance change, which induces relatively little variation across different views, both our method and IN2N [9] effectively produce the desired appearance change in 3D scenes, as shown in the last row of Figure 2. However, ours best preserves the original identity of the input scene, such as the object’s color, while making appropriate changes. Additional qualitative results are presented through videos on our project page (https://posterior-distillation-sampling.github.io).

To further assess the perceptual quality of the editing results, we conduct a user study compared to the baselines. Following Ritchie [40], participants were shown input NeRF scene videos, editing prompts, and edited NeRF scene videos produced by ours and the baselines. They were then asked to choose the most appropriate edited NeRF scene video. As illustrated in Table 1, our editing results are most preferred over the baselines in human evaluation by a large margin: 49.33% (Ours) vs. 27.71% (IN2N [9], the second best). See the supplementary material for a more detailed user study setup.

For a quantitative evaluation, we use the CLIP [37] score, which measures the similarity between edited 2D renderings and target text prompts in CLIP [37] space. As shown in Table 1, ours outperforms the baselines quantitatively. This is corroborated by the qualitative results illustrated in Figure 2, especially in scenarios of geometric changes or object addition, where the other baselines have difficulty in making the right edits.
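For readers unfamiliar with this metric, a common way to compute such a score is the cosine similarity between image and text embeddings. The sketch below assumes the embeddings have already been extracted with a CLIP model and simply averages the cosine similarity over rendered views. The averaging over views and the function name `clip_score` are assumptions for this example and may differ from the exact protocol used in our evaluation.

```python
import numpy as np

def clip_score(image_embs, text_emb):
    """Mean cosine similarity between per-view image embeddings and one text embedding.

    image_embs: array of shape (num_views, dim), assumed to be precomputed CLIP image features.
    text_emb:   array of shape (dim,), assumed to be the precomputed CLIP text feature.
    """
    image_embs = np.atleast_2d(np.asarray(image_embs, dtype=float))
    text_emb = np.asarray(text_emb, dtype=float)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(np.mean(image_embs @ text_emb))
```

Because both sides are normalized, the score lies in $[-1, 1]$ and is insensitive to the scale of the embeddings.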

Table 1: A quantitative comparison of NeRF editing between ours and other baselines. Ours outperforms the baselines quantitatively. Bold indicates the best result for each column.

Methods	CLIP [37] Score $\uparrow$	User Preference Rate (%) $\uparrow$
IN2N [9]	0.2280	27.71
DDS [10]	0.2210	13.71
Inv2N	0.2232	9.24
PDS (Ours)	0.2477	49.33

6.2 SVG Editing

Table 2: A quantitative comparison of SVG editing between SDS [36], DDS [10] and PDS. Ours outperforms the others in LPIPS [58] while achieving a CLIP [37] score that is on par with the others. Bold indicates the best result for each column.

Methods	CLIP [37] Score $\uparrow$	LPIPS [58] $\downarrow$	User Preference Rate (%) $\uparrow$
SDS [36]	0.2606	0.4855	30.83
DDS [10]	0.2460	0.5982	20.24
PDS (Ours)	0.2504	0.3121	48.94

Experimental Setup.

We use pairs of SVGs and their corresponding text prompts used in VectorFusion [17] as input. By manually creating target text prompts, we conduct experiments with a total of $48$ pairs of input SVGs and target text prompts. For comparison, we evaluate our method against other optimization methods, SDS [36] and DDS [10]. To perform editing with SDS, we initialize the SVG being optimized with the source SVG and then update it with the target prompt according to the SDS [36] optimization. Following DDS, we use CLIP [37] score and LPIPS [58] as quantitative metrics.

Results.

Qualitative results of SVG editing are shown in Figure 5. They demonstrate that while all the methods effectively change input SVGs according to the target text prompts, ours best preserves the structural semantics of the input SVGs. This is particularly evident in row 2 of Figure 5, where ours maintains the overall color pattern of the input SVG.

The trends from the qualitative results are mirrored in our quantitative results. As seen in Table 2, ours surpasses the others by a large margin in LPIPS [58], which measures fidelity to the input SVG (0.3121 vs. 0.4855 for SDS [36], the second best), while our CLIP score is on par with the others. This demonstrates that our method introduces only the minimal changes necessary to meet the attributes described in the target text prompts.

We further provide user study results for SVG editing in Table 2, using the same user study setup as in NeRF editing (Section 6.1). Consistent with the qualitative and quantitative results, ours is the most preferred in human evaluation.

7 Conclusion

We propose Posterior Distillation Sampling (PDS), an optimization method for parametric image editing. PDS matches the stochastic latents of the source and the target to fulfill both conformity to the target text and preservation of the source identity in parameter space. We demonstrate the versatility of PDS in parametric image editing through a comparative analysis between ours and other optimization methods and extensive experiments across various parameter spaces.

Acknowledgements

This work was supported by NRF grant (RS-2023-00209723) and IITP grants (2022-0-00594, RS-2023-00227592) funded by the Korean government (MSIT), Seoul R&BD Program (CY230112), and grants from the DRB-KAIST SketchTheFuture Research Center, Hyundai NGV, KT, NCSOFT, and Samsung Electronics.

References

Anonymous [2023] Anonymous. Learning pseudo 3D guidance for view-consistent 3D texturing with 2D diffusion. In Submitted to The Twelfth International Conference on Learning Representations, 2023. under review.
Brooks et al. [2023] Tim Brooks, Aleksander Holynski, and Alexei A. Efros. InstructPix2Pix: Learning to follow image editing instructions. In CVPR, 2023.
Byeon et al. [2022] Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset. https://github.com/kakaobrain/coyo-dataset, 2022.
Chen et al. [2023] Rui Chen, Yongwei Chen, Ningxin Jiao, and Kui Jia. Fantasia3D: Disentangling geometry and appearance for high-quality text-to-3D content creation. In ICCV, 2023.
[5] DeepFloyd. Deepfloyd if. https://www.deepfloyd.ai/deepfloyd-if/.
Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. NeurIPS, 2021.
Ghosal et al. [2023] Deepanway Ghosal, Navonil Majumder, Ambuj Mehrish, and Soujanya Poria. Text-to-audio generation using instruction tuned llm and latent diffusion model. arXiv preprint arXiv:2304.13731, 2023.
Han et al. [2023] Ligong Han, Song Wen, Qi Chen, Zhixing Zhang, Kunpeng Song, Mengwei Ren, Ruijiang Gao, Yuxiao Chen, Di Liu, Qilong Zhangli, et al. Improving negative-prompt inversion via proximal guidance. arXiv preprint arXiv:2306.05414, 2023.
Haque et al. [2023] Ayaan Haque, Matthew Tancik, Alexei Efros, Aleksander Holynski, and Angjoo Kanazawa. Instruct-NeRF2NeRF: Editing 3D scenes with instructions. In ICCV, 2023.
Hertz et al. [2023a] Amir Hertz, Kfir Aberman, and Daniel Cohen-Or. Delta denoising score. In ICCV, 2023a.
Hertz et al. [2023b] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-or. Prompt-to-prompt image editing with cross-attention control. In ICLR, 2023b.
Ho and Salimans [2021] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 2020.
Huang et al. [2023] Rongjie Huang, Jiawei Huang, Dongchao Yang, Yi Ren, Luping Liu, Mingze Li, Zhenhui Ye, Jinglin Liu, Xiang Yin, and Zhou Zhao. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. arXiv preprint arXiv:2301.12661, 2023.
Huberman-Spiegelglas et al. [2023] Inbar Huberman-Spiegelglas, Vladimir Kulikov, and Tomer Michaeli. An edit friendly DDPM noise space: Inversion and manipulations. arXiv preprint arXiv:2304.06140, 2023.
Iluz et al. [2023] Shir Iluz, Yael Vinker, Amir Hertz, Daniel Berio, Daniel Cohen-Or, and Ariel Shamir. Word-as-image for semantic typography. ACM TOG, 2023.
Jain et al. [2023] Ajay Jain, Amber Xie, and Pieter Abbeel. Vectorfusion: Text-to-svg by abstracting pixel-based diffusion models. In CVPR, 2023.
Jun and Nichol [2023] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. arXiv preprint arXiv:2305.02463, 2023.
Kawar et al. [2023] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In CVPR, 2023.
Kerbl et al. [2023] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM TOG, 2023.
Koo et al. [2023] Juil Koo, Seungwoo Yoo, Minh Hieu Nguyen, and Minhyuk Sung. SALAD: Part-level latent diffusion for 3d shape generation and manipulation. In ICCV, 2023.
Lee et al. [2023] Yuseung Lee, Kunho Kim, Hyunjin Kim, and Minhyuk Sung. Syncdiffusion: Coherent montage via synchronized joint diffusions. In NeurIPS, 2023.
Li et al. [2023a] Muheng Li, Yueqi Duan, Jie Zhou, and Jiwen Lu. Diffusion-sdf: Text-to-shape via voxelized diffusion. In CVPR, 2023a.
Li et al. [2023b] Yuhan Li, Yishun Dou, Yue Shi, Yu Lei, Xuanhong Chen, Yi Zhang, Peng Zhou, and Bingbing Ni. Focaldreamer: Text-driven 3D editing via focal-fusion assembly. arXiv preprint arXiv:2308.10608, 2023b.
Lin et al. [2023] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3D: High-resolution text-to-3D content creation. In CVPR, 2023.
Meng et al. [2022] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In ICLR, 2022.
Metzer et al. [2023] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-nerf for shape-guided generation of 3D shapes and textures. In CVPR, 2023.
[28] Midjourney. Midjourney. https://www.midjourney.com/.
Mildenhall et al. [2019] Ben Mildenhall, Pratul P Srinivasan, Rodrigo Ortiz-Cayon, Nima Khademi Kalantari, Ravi Ramamoorthi, Ren Ng, and Abhishek Kar. Local light field fusion: Practical view synthesis with prescriptive sampling guidelines. ACM TOG, 2019.
Mildenhall et al. [2020] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
Mirzaei et al. [2023] Ashkan Mirzaei, Tristan Aumentado-Armstrong, Marcus A Brubaker, Jonathan Kelly, Alex Levinshtein, Konstantinos G Derpanis, and Igor Gilitschenski. Watch your steps: Local image and scene editing by text instructions. arXiv preprint arXiv:2308.08947, 2023.
Miyake et al. [2023] Daiki Miyake, Akihiro Iohara, Yu Saito, and Toshiyuki Tanaka. Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models. arXiv preprint arXiv:2305.16807, 2023.
Mokady et al. [2023] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In CVPR, 2023.
Nichol et al. [2022] Alex Nichol, Heewoo Jun, Prafulla Dhariwal, Pamela Mishkin, and Mark Chen. Point-e: A system for generating 3d point clouds from complex prompts. arXiv preprint arXiv:2212.08751, 2022.
Park et al. [2023] Jangho Park, Gihyun Kwon, and Jong Chul Ye. ED-NeRF: Efficient text-guided editing of 3D scene using latent space NeRF. arXiv preprint arXiv:2310.02712, 2023.
Poole et al. [2023] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3D using 2D diffusion. In ICLR, 2023.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
Raj et al. [2023] Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman, Michael Rubinstein, Jonathan Barron, et al. Dreambooth3D: Subject-driven text-to-3D generation. In ICCV, 2023.
Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[40] Daniel Ritchie. Rudimentary framework for running two-alternative forced choice (2afc) perceptual studies on mechanical turk.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, 2023.
Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS, 2022.
Schuhmann et al. [2021] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
Schuhmann et al. [2022] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. NeurIPS, 2022.
Shi et al. [2023] Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. MVDream: Multi-view diffusion for 3D generation. arXiv preprint arXiv:2308.16512, 2023.
Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, 2015.
Song et al. [2021a] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In ICLR, 2021a.
Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In NeurIPS, 2019.
Song et al. [2021b] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021b.
Wallace et al. [2023] Bram Wallace, Akash Gokul, and Nikhil Naik. EDICT: Exact diffusion inversion via coupled transformations. In CVPR, 2023.
Wang et al. [2023a] Haochen Wang, Xiaodan Du, Jiahao Li, Raymond A Yeh, and Greg Shakhnarovich. Score jacobian chaining: Lifting pretrained 2D diffusion models for 3D generation. In CVPR, 2023a.
Wang et al. [2023b] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. In NeurIPS, 2023b.
Wu and la Torre [2023] Chen Henry Wu and Fernando De la Torre. A latent space of stochastic diffusion models for zero-shot image editing and guidance. In ICCV, 2023.
Xing et al. [2023] Ximing Xing, Chuang Wang, Haitao Zhou, Jing Zhang, Qian Yu, and Dong Xu. Diffsketcher: Text guided vector sketch synthesis through latent diffusion models. In NeurIPS, 2023.
Xu et al. [2023] Xudong Xu, Zhaoyang Lyu, Xingang Pan, and Bo Dai. Matlaber: Material-aware text-to-3D via latent BRDF auto-encoder. arXiv preprint arXiv:2308.09278, 2023.
Yang et al. [2023] Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
Zhu and Zhuang [2023] Joseph Zhu and Peiye Zhuang. HiFA: High-fidelity text-to-3D with advanced diffusion guidance. arXiv preprint arXiv:2305.18766, 2023.
Zhuang et al. [2023] Jingyu Zhuang, Chen Wang, Lingjie Liu, Liang Lin, and Guanbin Li. Dreameditor: Text-driven 3D scene editing with neural fields. arXiv preprint arXiv:2306.13455, 2023.

Appendix

A.1 Editing 3D Gaussian Splats [20] and 2D Images

PDS covers various editing scenarios and is not confined to a specific parameter space. To further assess its versatility and generalizability in editing tasks, we include both 3D Gaussian Splat (3DGS) [20] editing and 2D image editing. As in NeRF editing, Figure A6 shows that PDS outperforms Instruct-NeRF2NeRF [9] on the 3DGS representation while uniquely realizing geometric changes. In 2D image editing, PDS demonstrates superior performance compared to Imagic [19], a method introduced for 2D image editing with pre-trained 2D diffusion models. PDS edits the input image while preserving other details with high fidelity. On the other hand, Imagic [19] leaves artifacts, losing the identity of the source content.

A.2 Derivation of Posterior Distillation Sampling

For a comprehensive derivation of Equation 4, we first recall that the objective function of PDS is expressed as:

\begin{align}
\mathcal{L}_{\tilde{\mathbf{z}}_{t}}(\mathbf{x}_{0}^{\text{tgt}}) &= \mathbb{E}\left[\|\tilde{\mathbf{z}}_{t}^{\text{tgt}}-\tilde{\mathbf{z}}_{t}^{\text{src}}\|_{2}^{2}\right] \tag{15}\\
&= \mathbb{E}\left[\Big\|\frac{\mathbf{x}_{t-1}^{\text{tgt}}-\boldsymbol{\mu}_{\phi}(\mathbf{x}_{t}^{\text{tgt}},y^{\text{tgt}};\boldsymbol{\epsilon}_{\phi})}{\sigma_{t}}-\frac{\mathbf{x}_{t-1}^{\text{src}}-\boldsymbol{\mu}_{\phi}(\mathbf{x}_{t}^{\text{src}},y^{\text{src}};\boldsymbol{\epsilon}_{\phi})}{\sigma_{t}}\Big\|_{2}^{2}\right] \tag{16}\\
&= \mathbb{E}\left[\frac{1}{\sigma_{t}^{2}}\big\|(\mathbf{x}_{t-1}^{\text{tgt}}-\mathbf{x}_{t-1}^{\text{src}})-\left(\boldsymbol{\mu}_{\phi}(\mathbf{x}_{t}^{\text{tgt}},y^{\text{tgt}};\boldsymbol{\epsilon}_{\phi})-\boldsymbol{\mu}_{\phi}(\mathbf{x}_{t}^{\text{src}},y^{\text{src}};\boldsymbol{\epsilon}_{\phi})\right)\big\|_{2}^{2}\right]. \tag{17}
\end{align}

Given that $\tilde{\mathbf{z}}^{\text{src}}_{t}$ and $\tilde{\mathbf{z}}^{\text{tgt}}_{t}$ share the same noises $\boldsymbol{\epsilon}_{t-1}$ and $\boldsymbol{\epsilon}_{t}$ for their respective $\mathbf{x}_{t-1}$ and $\mathbf{x}_{t}$, the difference between $\mathbf{x}_{t-1}^{\text{tgt}}$ and $\mathbf{x}_{t-1}^{\text{src}}$ is a constant multiple of the difference between $\mathbf{x}_{0}^{\text{tgt}}$ and $\mathbf{x}_{0}^{\text{src}}$:

\begin{equation}
\mathbf{x}_{t-1}^{\text{tgt}}-\mathbf{x}_{t-1}^{\text{src}}=\sqrt{\bar{\alpha}_{t-1}}(\mathbf{x}_{0}^{\text{tgt}}-\mathbf{x}_{0}^{\text{src}}). \tag{18}
\end{equation}
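The identity in Equation 18 can be checked numerically with a few lines of NumPy. This is a minimal sketch that assumes the standard forward process $\mathbf{x}_{t}=\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}_{t}$; the value of $\bar{\alpha}_{t-1}$ and the random arrays standing in for the source and target images are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha_bar_prev = 0.9                  # placeholder value for \bar{alpha}_{t-1}
x0_src = rng.normal(size=(4, 4))      # stand-in for the source image
x0_tgt = rng.normal(size=(4, 4))      # stand-in for the target image
eps_prev = rng.normal(size=(4, 4))    # noise eps_{t-1}, shared by source and target

def forward_sample(x0, alpha_bar, eps):
    """Forward-process sample: sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

x_prev_src = forward_sample(x0_src, alpha_bar_prev, eps_prev)
x_prev_tgt = forward_sample(x0_tgt, alpha_bar_prev, eps_prev)

# With shared noise, the noise terms cancel and only the scaled x0 difference remains.
lhs = x_prev_tgt - x_prev_src
rhs = np.sqrt(alpha_bar_prev) * (x0_tgt - x0_src)
```

If the two branches used independent noises instead, the difference would contain an extra term $\sqrt{1-\bar{\alpha}_{t-1}}(\boldsymbol{\epsilon}^{\text{tgt}}_{t-1}-\boldsymbol{\epsilon}^{\text{src}}_{t-1})$ and the identity would no longer hold.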

Following our notation $\hat{\boldsymbol{\epsilon}}_{t}^{\text{src}}:=\boldsymbol{\epsilon}_{\phi}(\mathbf{x}_{t}^{\text{src}},y^{\text{src}},t)$ and $\hat{\boldsymbol{\epsilon}}_{t}^{\text{tgt}}:=\boldsymbol{\epsilon}_{\phi}(\mathbf{x}_{t}^{\text{tgt}},y^{\text{tgt}},t)$ introduced in Section 4, the difference between the approximated posterior means is also expressed as follows:

\begin{equation}
\boldsymbol{\mu}_{\phi}(\mathbf{x}_{t}^{\text{tgt}},y^{\text{tgt}};\boldsymbol{\epsilon}_{\phi})-\boldsymbol{\mu}_{\phi}(\mathbf{x}_{t}^{\text{src}},y^{\text{src}};\boldsymbol{\epsilon}_{\phi})=(\gamma_{t}+\delta_{t}\sqrt{\bar{\alpha}_{t}})(\mathbf{x}_{0}^{\text{tgt}}-\mathbf{x}_{0}^{\text{src}})-\gamma_{t}\sqrt{\frac{1}{\bar{\alpha}_{t}}-1}(\hat{\boldsymbol{\epsilon}}_{t}^{\text{tgt}}-\hat{\boldsymbol{\epsilon}}_{t}^{\text{src}}), \tag{19}
\end{equation}

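where $\boldsymbol{\mu}_{\phi}(\mathbf{x}_{t},y;\boldsymbol{\epsilon}_{\phi})$ can be expanded as shown in the following equation, whose last step substitutes $\mathbf{x}_{t}=\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}_{t}$: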

\begin{align}
\boldsymbol{\mu}_{\phi}(\mathbf{x}_{t},y;\boldsymbol{\epsilon}_{\phi}) &= \gamma_{t}\tilde{\mathbf{x}}_{0}(\mathbf{x}_{t},y;\boldsymbol{\epsilon}_{\phi})+\delta_{t}\mathbf{x}_{t} \tag{20}\\
&= \gamma_{t}\left(\frac{1}{\sqrt{\bar{\alpha}_{t}}}\left(\mathbf{x}_{t}-\sqrt{1-\bar{\alpha}_{t}}\boldsymbol{\epsilon}_{\phi}(\mathbf{x}_{t},y,t)\right)\right)+\delta_{t}\mathbf{x}_{t} \tag{21}\\
&= \left(\frac{\gamma_{t}}{\sqrt{\bar{\alpha}_{t}}}+\delta_{t}\right)\mathbf{x}_{t}-\gamma_{t}\sqrt{\frac{1}{\bar{\alpha}_{t}}-1}\boldsymbol{\epsilon}_{\phi}(\mathbf{x}_{t},y,t) \tag{22}\\
&= (\gamma_{t}+\delta_{t}\sqrt{\bar{\alpha}_{t}})\mathbf{x}_{0}+\sqrt{\frac{1}{\bar{\alpha}_{t}}-1}(\gamma_{t}+\delta_{t}\sqrt{\bar{\alpha}_{t}})\boldsymbol{\epsilon}_{t}-\gamma_{t}\sqrt{\frac{1}{\bar{\alpha}_{t}}-1}\boldsymbol{\epsilon}_{\phi}(\mathbf{x}_{t},y,t). \tag{23}
\end{align}

Incorporating Equation 18 and Equation 19 into Equation 17, we can reformulate the objective function of PDS as follows:

\begin{align}
\mathcal{L}_{\tilde{\mathbf{z}}_{t}}(\mathbf{x}_{0}^{\text{tgt}}) &= \mathbb{E}\biggl[\frac{1}{\sigma_{t}^{2}}\big\|(\sqrt{\bar{\alpha}_{t-1}}-\gamma_{t}-\delta_{t}\sqrt{\bar{\alpha}_{t}})(\mathbf{x}_{0}^{\text{tgt}}-\mathbf{x}_{0}^{\text{src}})+\gamma_{t}\sqrt{\frac{1}{\bar{\alpha}_{t}}-1}(\hat{\boldsymbol{\epsilon}}^{\text{tgt}}_{t}-\hat{\boldsymbol{\epsilon}}^{\text{src}}_{t})\big\|_{2}^{2}\biggr] \tag{24}\\
&= \mathbb{E}\biggl[\frac{1}{\sigma_{t}^{2}}\biggl((\sqrt{\bar{\alpha}_{t-1}}-\gamma_{t}-\delta_{t}\sqrt{\bar{\alpha}_{t}})^{2}(\mathbf{x}_{0}^{\text{tgt}}-\mathbf{x}_{0}^{\text{src}})^{2} \tag{25}\\
&\qquad +2(\sqrt{\bar{\alpha}_{t-1}}-\gamma_{t}-\delta_{t}\sqrt{\bar{\alpha}_{t}})\gamma_{t}\sqrt{\frac{1}{\bar{\alpha}_{t}}-1}(\mathbf{x}_{0}^{\text{tgt}}-\mathbf{x}_{0}^{\text{src}})(\hat{\boldsymbol{\epsilon}}^{\text{tgt}}_{t}-\hat{\boldsymbol{\epsilon}}^{\text{src}}_{t}) \notag\\
&\qquad +\gamma_{t}^{2}\Bigl(\frac{1}{\bar{\alpha}_{t}}-1\Bigr)(\hat{\boldsymbol{\epsilon}}^{\text{tgt}}_{t}-\hat{\boldsymbol{\epsilon}}^{\text{src}}_{t})^{2}\biggr)\biggr]. \notag
\end{align}

By taking the gradient of $\mathcal{L}_{\tilde{\mathbf{z}}_{t}}$ with respect to $\theta$ while ignoring the U-Net Jacobian term, $\frac{\partial\hat{\boldsymbol{\epsilon}}_{t}^{\text{tgt}}}{\partial\mathbf{x}_{0}^{\text{tgt}}}=\mathbf{I}$, one can obtain PDS as follows:

\begin{align}
\nabla_{\theta}\mathcal{L}_{\text{PDS}} &= \frac{\partial\mathcal{L}_{\tilde{\mathbf{z}}_{t}}(\mathbf{x}_{0}^{\text{tgt}})}{\partial\mathbf{x}_{0}^{\text{tgt}}}\cdot\frac{\partial\mathbf{x}_{0}^{\text{tgt}}}{\partial\theta} \tag{26}\\
&= \mathbb{E}\biggl[\frac{2}{\sigma_{t}^{2}}\biggl((\sqrt{\bar{\alpha}_{t-1}}-\gamma_{t}-\delta_{t}\sqrt{\bar{\alpha}_{t}})^{2}(\mathbf{x}_{0}^{\text{tgt}}-\mathbf{x}_{0}^{\text{src}}) \tag{27}\\
&\qquad +(\sqrt{\bar{\alpha}_{t-1}}-\gamma_{t}-\delta_{t}\sqrt{\bar{\alpha}_{t}})\gamma_{t}\sqrt{\frac{1}{\bar{\alpha}_{t}}-1}(\hat{\boldsymbol{\epsilon}}^{\text{tgt}}_{t}-\hat{\boldsymbol{\epsilon}}^{\text{src}}_{t})\biggr)\frac{\partial\mathbf{x}_{0}^{\text{tgt}}}{\partial\theta}\biggr]. \notag
\end{align}

Thus, the coefficients $\psi(t)$ and $\chi(t)$ in Equation 4 are as follows:

\begin{align}
\psi(t) &= \frac{2(\sqrt{\bar{\alpha}_{t-1}}-\gamma_{t}-\delta_{t}\sqrt{\bar{\alpha}_{t}})^{2}}{\sigma_{t}^{2}}, \tag{28}\\
\chi(t) &= \frac{2(\sqrt{\bar{\alpha}_{t-1}}-\gamma_{t}-\delta_{t}\sqrt{\bar{\alpha}_{t}})}{\sigma_{t}^{2}}\gamma_{t}\sqrt{\frac{1}{\bar{\alpha}_{t}}-1}. \tag{29}
\end{align}
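As a sanity check on the coefficients above, the following sketch compares the closed-form gradient $\psi(t)(\mathbf{x}_{0}^{\text{tgt}}-\mathbf{x}_{0}^{\text{src}})+\chi(t)(\hat{\boldsymbol{\epsilon}}^{\text{tgt}}_{t}-\hat{\boldsymbol{\epsilon}}^{\text{src}}_{t})$ against a finite-difference gradient of the single-sample loss in Equation 24, taken with respect to $\mathbf{x}_{0}^{\text{tgt}}$ (that is, before multiplying by $\partial\mathbf{x}_{0}^{\text{tgt}}/\partial\theta$). It assumes the noise predictions are held fixed while differentiating, which is the reading of dropping the U-Net Jacobian under which the sketch reproduces Equation 27. All scalar values, array shapes, and function names are arbitrary placeholders rather than values from our experiments.

```python
import numpy as np

def pds_terms(alpha_bar_prev, alpha_bar_t, gamma_t, delta_t):
    """Scalars multiplying (x0_tgt - x0_src) and (eps_tgt - eps_src) in Eq. 24."""
    a = np.sqrt(alpha_bar_prev) - gamma_t - delta_t * np.sqrt(alpha_bar_t)
    b = gamma_t * np.sqrt(1.0 / alpha_bar_t - 1.0)
    return a, b

def pds_loss(x0_tgt, x0_src, eps_tgt, eps_src, a, b, sigma_t):
    """Single-sample version of the loss in Eq. 24 (no expectation)."""
    residual = a * (x0_tgt - x0_src) + b * (eps_tgt - eps_src)
    return np.sum(residual**2) / sigma_t**2

def pds_grad(x0_tgt, x0_src, eps_tgt, eps_src, a, b, sigma_t):
    """Closed-form gradient w.r.t. x0_tgt using psi (Eq. 28) and chi (Eq. 29)."""
    psi = 2.0 * a**2 / sigma_t**2
    chi = 2.0 * a * b / sigma_t**2
    return psi * (x0_tgt - x0_src) + chi * (eps_tgt - eps_src)

def finite_difference_grad(f, x, h=1e-6):
    """Central finite differences, one entry of x at a time."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        step = np.zeros_like(x)
        step[idx] = h
        grad[idx] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

rng = np.random.default_rng(0)
x0_src, x0_tgt, eps_src, eps_tgt = (rng.normal(size=(3, 3)) for _ in range(4))
a, b = pds_terms(alpha_bar_prev=0.8, alpha_bar_t=0.7, gamma_t=0.3, delta_t=0.5)
sigma_t = 0.2

analytic = pds_grad(x0_tgt, x0_src, eps_tgt, eps_src, a, b, sigma_t)
numeric = finite_difference_grad(
    lambda x: pds_loss(x, x0_src, eps_tgt, eps_src, a, b, sigma_t), x0_tgt
)
gradients_match = np.allclose(analytic, numeric, atol=1e-5)
```

Because the noise predictions are treated as constants here, no backpropagation through the diffusion model is involved in this check.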

In practice, we sample non-consecutive timesteps for $t-1$ and $t$ as in DDIM [48], since the coefficients become $0$ when they are consecutive. Given a sequence of non-consecutive timesteps $[\tau_{i}]_{i=1}^{S}$, a more generalized form of PDS is represented as follows:

\begin{equation}
\nabla_{\theta}\mathcal{L}_{\text{PDS}}=\mathbb{E}_{i,\boldsymbol{\epsilon}_{\tau_{i}},\boldsymbol{\epsilon}_{\tau_{i-1}}}\left[\left(\psi(i)(\mathbf{x}_{0}^{\text{tgt}}-\mathbf{x}_{0}^{\text{src}})+\chi(i)(\hat{\boldsymbol{\epsilon}}_{\tau_{i}}^{\text{tgt}}-\hat{\boldsymbol{\epsilon}}_{\tau_{i}}^{\text{src}})\right)\frac{\partial\mathbf{x}_{0}^{\text{tgt}}}{\partial\theta}\right], \tag{30}
\end{equation}

where

\begin{align}
\psi(i) &= \frac{2(\sqrt{\bar{\alpha}_{\tau_{i-1}}}-\gamma_{\tau_{i}}-\delta_{\tau_{i}}\sqrt{\bar{\alpha}_{\tau_{i}}})^{2}}{\sigma_{\tau_{i}}^{2}}, \tag{31}\\
\chi(i) &= \frac{2(\sqrt{\bar{\alpha}_{\tau_{i-1}}}-\gamma_{\tau_{i}}-\delta_{\tau_{i}}\sqrt{\bar{\alpha}_{\tau_{i}}})}{\sigma_{\tau_{i}}^{2}}\gamma_{\tau_{i}}\sqrt{\frac{1}{\bar{\alpha}_{\tau_{i}}}-1}. \tag{32}
\end{align}

For more details on timestep sampling, refer to the implementation details in the next section.
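The sketch below illustrates why consecutive timesteps give vanishing coefficients. It relies on assumptions that are not stated in this section: a linear $\beta$ schedule from $10^{-4}$ to $2\times10^{-2}$ with $T=1000$, and $\gamma_{t}$, $\delta_{t}$, $\sigma_{t}^{2}$ taken to be the standard DDPM [13] posterior coefficients, for which $\gamma_{t}+\delta_{t}\sqrt{\bar{\alpha}_{t}}=\sqrt{\bar{\alpha}_{t-1}}$. The function names are hypothetical.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)       # assumed linear schedule (not from the paper)
alphas = 1.0 - betas
# alpha_bar[t] for t = 0..T, with alpha_bar[0] = 1 (no noise).
alpha_bar = np.concatenate([[1.0], np.cumprod(alphas)])

def posterior_terms(t):
    """Assumed DDPM posterior coefficients of q(x_{t-1} | x_t, x_0), valid for 2 <= t <= T."""
    beta_t = betas[t - 1]
    gamma = np.sqrt(alpha_bar[t - 1]) * beta_t / (1.0 - alpha_bar[t])
    delta = np.sqrt(alphas[t - 1]) * (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t])
    variance = beta_t * (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t])
    return gamma, delta, variance

def psi_chi(t, t_prev):
    """Coefficients of Eq. 31 and Eq. 32 for the timestep pair (t_prev, t)."""
    gamma, delta, variance = posterior_terms(t)
    a = np.sqrt(alpha_bar[t_prev]) - gamma - delta * np.sqrt(alpha_bar[t])
    b = gamma * np.sqrt(1.0 / alpha_bar[t] - 1.0)
    return 2.0 * a**2 / variance, 2.0 * a * b / variance

# Consecutive pair (499, 500): both coefficients vanish up to rounding error.
psi_consecutive, chi_consecutive = psi_chi(t=500, t_prev=499)
# Non-consecutive pair (498, 500): both coefficients are strictly positive.
psi_skip, chi_skip = psi_chi(t=500, t_prev=498)
```

Under these assumptions the coefficients are nonzero only because $\bar{\alpha}_{\tau_{i-1}}$ is evaluated at an earlier timestep than the one for which $\gamma_{\tau_{i}}$ and $\delta_{\tau_{i}}$ are defined.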

A.3 Implementation Details

In this section, we provide the implementation details of NeRF and SVG editing presented in Section 6.1 and Section 6.2 , respectively.

NeRF Editing.

We run the PDS optimization for $30000$ iterations with classifier-free guidance [12] weights within $[30,100]$ depending on the complexity of editing. As detailed in Section A.2, we sample non-consecutive timesteps $\tau_{i-1}$ and $\tau_{i}$ since the coefficients $\psi(\cdot)$ and $\chi(\cdot)$ become zero when the sampled timesteps are consecutive. For this, we define non-consecutive timesteps $[\tau_{i}]_{i=1}^{S}$, a subset sequence of the total forward process timesteps of the diffusion model, $[1,\ldots,T]$. Specifically, we select these timesteps such that $\tau_{i}=\lfloor{2i}\rfloor$, resulting in a subset sequence length of $S=500$ out of the total $T=1000$ timesteps. We then randomly sample the index $i$ within a ratio range of $[0.02,0.98]$, i.e., $i\sim\mathcal{U}(10,490)$.
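The timestep sampling described above can be sketched as follows. The sketch assumes that $i$ is drawn as an integer uniformly from $\{10,\ldots,490\}$ with both ends included, and the function name `sample_timestep_pair` is hypothetical.

```python
import random

T = 1000          # total forward-process timesteps
S = 500           # length of the non-consecutive subsequence
stride = T // S   # equals 2, so tau_i = 2 * i

# taus[i - 1] holds tau_i for i = 1..S.
taus = [stride * i for i in range(1, S + 1)]

def sample_timestep_pair(rng, lo=0.02, hi=0.98):
    """Sample an index i within the ratio range [lo, hi] and return (i, tau_{i-1}, tau_i)."""
    i = rng.randint(round(lo * S), round(hi * S))  # inclusive on both ends
    tau_i = taus[i - 1]
    tau_prev = taus[i - 2]
    return i, tau_prev, tau_i

rng = random.Random(0)
pairs = [sample_timestep_pair(rng) for _ in range(1000)]
```

Each sampled pair satisfies $\tau_{i}-\tau_{i-1}=2$, so the two timesteps are never consecutive.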

During the refinement stage, we randomly choose and replace $\tilde{I}_{v}$ every $10$ iterations, over a total of $15000$ iterations. We denote an SDEdit [26] operator by $\mathcal{S}(\mathbf{x}_{0};t_{0},\boldsymbol{\epsilon}_{\phi})$, which samples $\mathbf{x}_{t_{0}}\sim\mathcal{N}(\sqrt{\bar{\alpha}_{t_{0}}}\mathbf{x}_{0},(1-\bar{\alpha}_{t_{0}})\mathbf{I})$ and then denoises it starting from $t_{0}$ using $\boldsymbol{\epsilon}_{\phi}$. For the denoising process, we randomly sample $t_{0}$ within a ratio range of $[0,0.2]$ out of total denoising steps $N=20$.
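To make the operator $\mathcal{S}(\mathbf{x}_{0};t_{0},\boldsymbol{\epsilon}_{\phi})$ concrete, here is a minimal sketch of its two phases: noising to $t_{0}$, then denoising back to $t=0$. Several parts are assumptions made for illustration only: the linear $\beta$ schedule, the use of deterministic DDIM [48] updates for the denoising phase, and the mapping of a ratio of $0.2$ to $t_{0}=200$ with $4$ of the $N=20$ denoising steps. The noise predictor is replaced by an oracle that knows the clean image, so the sketch runs without a diffusion model.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 2e-2, T)                       # assumed schedule
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])  # alpha_bar[0] = 1

def sdedit(x0, t0, eps_model, num_steps, rng):
    """Noise x0 to timestep t0, then denoise to t = 0 with deterministic DDIM steps."""
    eps = rng.normal(size=x0.shape)
    x = np.sqrt(alpha_bar[t0]) * x0 + np.sqrt(1.0 - alpha_bar[t0]) * eps
    timesteps = np.linspace(t0, 0, num_steps + 1).round().astype(int)
    for t, t_next in zip(timesteps[:-1], timesteps[1:]):
        eps_hat = eps_model(x, t)
        # Predicted clean image from the current noisy sample.
        x0_hat = (x - np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
        # Deterministic DDIM update towards the next (smaller) timestep.
        x = np.sqrt(alpha_bar[t_next]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_next]) * eps_hat
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 8))  # stand-in for a rendered image

def oracle_eps(x_t, t):
    """Stand-in for a trained noise predictor: recovers the exact noise given x0."""
    return (x_t - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])

ratio, N = 0.2, 20
restored = sdedit(x0, t0=round(ratio * T), eps_model=oracle_eps,
                  num_steps=round(ratio * N), rng=rng)
```

With the oracle predictor the operator returns the input unchanged; with a real text-conditioned predictor, the small $t_{0}$ keeps the output close to the input while allowing fine details to be resampled.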

SVG Editing.

Across all optimizations, SDS [36], DDS [10], and our proposed PDS, we apply the same classifier-free guidance weight of 100. For SDS [36], we sample $t$ within a ratio range of $[0.05,0.95]$ following VectorFusion [17]. For DDS [10], we follow its original setup, sampling $t$ within $[0.02,0.98]$. For PDS, we sample $i$ within a ratio range of $[0.1,0.98]$.
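For reference, the guidance weight mentioned above enters the noise prediction as in the following sketch. It assumes one common convention for classifier-free guidance [12], in which a weight of $1$ recovers the conditional prediction; other implementations parameterize the weight differently, and the function name is hypothetical.

```python
import numpy as np

def classifier_free_guidance(eps_uncond, eps_cond, weight):
    """Extrapolate from the unconditional towards the conditional noise prediction.

    Assumed convention: weight = 0 gives the unconditional prediction and
    weight = 1 gives the conditional prediction.
    """
    return eps_uncond + weight * (eps_cond - eps_uncond)

# Toy predictions (illustrative values only).
eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([0.5, 1.0])
guided = classifier_free_guidance(eps_uncond, eps_cond, weight=100.0)
```

Large weights such as 100 push the prediction far along the direction in which the text condition changes the unconditional estimate.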

A.4 Details of User Studies

We conduct user studies for the human evaluation of NeRF and SVG editing through Amazon’s Mechanical Turk. We collected survey responses only from those participants who passed our vigilance tasks. To design our vigilance tasks, we create examples where, except for the correct answer choice, all other choices are replaced with ones from different scenes or unrelated SVG examples. Screenshots of our NeRF and SVG editing user studies, including examples of vigilance tasks, are displayed in Figure A7 and Figure A8, respectively. In the NeRF and SVG editing user studies, we received 42 and 17 valid responses, respectively.

A.5 Effect of the Refinement Stage

Figure A9 illustrates an ablation study of the refinement stage across various editing methods. As depicted, the desired complex edit (making the man raise his arms) is achieved solely through the optimization of PDS. The overall editing outcomes are realized before the refinement stage, and the refinement stage further enhances the fidelity of the outputs.