
$^{1}$ McGill University

$^{2}$ The Chinese University of Hong Kong

FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing

Youyuan Zhang$^{1}$, Xuan Ju$^{2,\star}$, James J. Clark

$^{\star}$ Corresponding author.

Abstract

Diffusion models have demonstrated remarkable capabilities in text-to-image and text-to-video generation, opening up possibilities for video editing based on textual input. However, the computational cost associated with sequential sampling in diffusion models poses challenges for efficient video editing. Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion, making real-time applications impractical. In this work, we propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs). By leveraging the self-consistency property of CMs, we eliminate the need for time-consuming inversion or additional condition extraction, reducing editing time. Our method enables direct mapping from source video to target video with strong preservation ability utilizing a special variance schedule. This results in improved speed advantages, as fewer sampling steps can be used while maintaining comparable generation quality. Experimental results validate the state-of-the-art performance and speed advantages of FastVideoEdit across evaluation metrics encompassing editing speed, temporal consistency, and text-video alignment.

Keywords:

Video Editing Diffusion Models Consistency Models

Figure 1: Editing Results of *FastVideoEdit*. *FastVideoEdit* offers efficient, consistent, high-quality, and text-aligned editing capabilities for both artificial (left column) and natural (right column) videos. The top row displays the source video, while the second and third rows showcase two edited videos. Each row features a text prompt at the top, with the edited words highlighted in red. The examples cover attribute change, object change, background change, and style change.

1 Introduction

Diffusion models [15, 35, 14, 1] have gained significant attention due to their remarkable capabilities in text-to-image [31, 35, 15, 17] and text-to-video generation [14, 34, 3, 12, 4]. Leveraging the capabilities of these models, it becomes feasible to manipulate videos [4] based on textual input, holding great potential for various applications in areas such as film production and content creation.

However, the computational cost associated with sequential sampling in diffusion models presents a significant challenge for efficient inference, especially in video editing scenarios where a set of frames needs to be processed. Moreover, the absence of high-quality open-source video diffusion models [10, 28] that can generate consistent editing results within a single test-time inference, combined with the constraints on video duration of video diffusion models, has led to the adoption of existing image generation models for achieving accurate video editing [2, 29, 11]. To align the distribution between image and video models and perform accurate video editing, some methods employ test-time one-shot fine-tuning of an inflated image generation model on each input video [38, 33, 23, 37, 26]. However, this fine-tuning makes editing even slower, which renders these methods impractical for real-time applications.

To enable faster video editing, three types of zero-shot methods have been proposed in the literature: (1) Layer-atlas-based methods [2, 21, 7], which edit the video on a flattened texture map and ensure temporal consistency by guaranteeing texture map consistency. However, the absence of a 3D motion prior in the 2D atlas approach results in suboptimal performance. (2) Dual-branch methods [29, 6, 11, 9], which leverage Denoising Diffusion Implicit Models (DDIM) [35] to extract source video features and generate novel content based on the target diffusion branch. The use of DDIM inversion doubles the inference time required for video editing. (3) Methods that incorporate additional conditional constraints [41, 37, 8, 43], which directly add noise to the source video and denoise the noisy video using a conditioned diffusion model, preserving essential content while imposing restrictions on the editing process. While these methods are efficient during diffusion model inference, they require additional information extraction, which slows down the overall process.

To address the issue of long computational times encountered in previous video editing methods, we introduce FastVideoEdit, which is inspired by recent advances in Consistency Models (CMs) [36]. Specifically, FastVideoEdit is a zero-shot video editing approach that not only achieves state-of-the-art performance but also significantly reduces editing time by eliminating the need for time-consuming inversion or additional condition extraction steps. The key insight of our proposed method is that the self-consistency property of CMs enables a special variance schedule that facilitates the editing process, transforming it from a process of adding noise and then denoising to one of a direct mapping from source video to target video. Furthermore, the content preservation capability of CMs enables the use of fewer sampling steps while maintaining comparable generation quality, which results in an improved speed advantage of FastVideoEdit.

To evaluate FastVideoEdit, we consider metrics that encompass editing speed, temporal consistency, and text-video alignment. We compare the performance of FastVideoEdit with previous video editing methods using the TGVE 2023 open-source dataset [39] as our benchmark. The results demonstrate the superior performance of FastVideoEdit in terms of editing quality. Additionally, FastVideoEdit achieves this superior performance while requiring significantly less time for video editing tasks. This shows the efficiency and effectiveness of our approach, making it a standout choice for efficient high-quality video editing.

2 Related Work

2.1 Video Editing with Diffusion Models

The remarkable success of diffusion-based text-to-image [31, 35, 15, 19, 18] and text-to-video generation models [14, 34, 3, 12, 4] has opened up new possibilities for text-based image [13, 17] and video editing [10]. Although editing video directly through video diffusion models [10, 28] shows high temporal consistency, the challenges associated with extensive video model training, unstable generation quality, and limits on video duration make inflated off-the-shelf image generation models a preferable choice for video editing. Inflation here means extending a 2D model to 3D with an additional temporal channel.

Specifically, several works require test-time one-shot fine-tuning of the inflated image generation model on each input video [38, 33, 23, 37, 26], which is time-consuming and too slow for real-time applications. Zero-shot video editing methods [2, 21, 7, 29, 13, 37, 6, 41, 11, 43, 9, 8] leverage training-free editing techniques with specialized modules to enhance temporal consistency across frames, which provides a practical and efficient solution for editing videos without extensive training. Specifically, layer-atlas-based methods [2, 21, 7] edit the video on a flattened texture map; however, the lack of a 3D motion prior in the 2D atlas leads to suboptimal performance. FateZero [29] solves this problem with a two-branch inflated image diffusion model that merges attention features of the structural preservation branch and the editing branch. Similarly, Text2Video-Zero [22] and Pix2Video [6] align the features of the source image and target image via an attention operation. To enhance pixel-level temporal consistency, Rerender A Video [41], TokenFlow [11], and Flatten [9] extract temporal-aware inter-frame features to propagate the edits throughout the video. However, previous zero-shot methods that relied on flattened image diffusion were limited by the need for DDIM inversion or additional conditional constraints (e.g., optical flow), resulting in a long runtime. In contrast, our proposed FastVideoEdit directly incorporates editing into the inference process by leveraging consistency models [36], which ensures both runtime efficiency and effective modifications.

2.2 Efficient Diffusion Models

To tackle the computational time limitations of diffusion models caused by the sequential sampling strategy, faster numerical ODE solvers [35, 42, 24] or distillation techniques [25, 32, 27, 44] have been employed. While these methods can be integrated into existing diffusion-based video editing techniques, they still face the challenge of requiring DDIM inversion or additional conditional constraints for essential content preservation.

Recently, the introduction of Consistency Models (CMs) [36, 40] has enabled faster generation by sampling along a trajectory map, thereby opening up exciting possibilities for more efficient video editing techniques. The few-step sampling strategy is particularly suitable for efficient video editing with a fast sampling speed and strong reconstruction ability. FastVideoEdit leverages the self-consistency characteristic of CMs, where the improved essential content preservation ability eliminates the need for accurate DDIM inversion and additional conditional constraints. Concurrent to our approach, OCD [20] separates diffusion sampling for edited objects and background areas, focusing most denoising steps on the former to enhance efficiency. FastVideoEdit can be directly combined with OCD to further enhance the overall efficiency of video editing.

3 Preliminaries

Diffusion models include a forward process that adds Gaussian noise $\epsilon$ to convert clean sample $z_{0}$ to noise sample $z_{T}$ , and a backward process that iteratively performs denoising from $z_{T}$ to $z_{0}$ , where $T$ represents the total number of timesteps. The denoising process of DDPM sampling [15] at step $t$ can be formulated as:

$$
\begin{aligned}
z_{t-1} = {} & \sqrt{\alpha_{t-1}}\left(\frac{z_{t}-\sqrt{1-\alpha_{t}}\,\varepsilon_{\theta}(z_{t},t)}{\sqrt{\alpha_{t}}}\right) && \text{(predicted } z_{0}\text{)} \\
& + \sqrt{1-\alpha_{t-1}-\sigma_{t}^{2}}\cdot\varepsilon_{\theta}(z_{t},t) && \text{(direction to } z_{t}\text{)} \\
& + \sigma_{t}\varepsilon_{t}, \quad \text{where } \varepsilon_{t}\sim\mathcal{N}(\bm{0},\bm{I}) && \text{(random noise).}
\end{aligned}
\tag{1}
$$

By setting $\sigma_{t}$ to zero, DDIM sampling [35] results in an implicit probabilistic model with a deterministic forward process:

$$
\bar{z}_{0}=f_{\theta}(z_{t},t)=\left(z_{t}-\sqrt{1-\alpha_{t}}\cdot\varepsilon_{\theta}(z_{t},t)\right)/\sqrt{\alpha_{t}}.
\tag{2}
$$

Following DDIM, we can use the function $f_{\theta}$ to predict and reconstruct $\bar{z}_{0}$ given the noisy sample $z_{t}$, where $t\sim\left[1,T\right]$, $\alpha$ is a hyper-parameter, and $\varepsilon_{\theta}$ is a learnable network.

Sampling in CMs [36] is carried out through a sequence of timesteps $\tau_{1:n}\in[t_{0},T]$ . Starting from an initial noise $\hat{z}_{T}$ and $z_{0}^{(T)}=f_{\theta}(\hat{z}_{T},T)$ , at each time-step $\tau_{i}$ , the process samples $\varepsilon\sim\mathcal{N}(\bm{0},\bm{I})$ and iteratively updates the Multistep Consistency Sampling process through the following equation:

$$
\begin{aligned}
\hat{z}_{\tau_{i}} &= z_{0}^{(\tau_{i+1})}+\sqrt{\tau_{i}^{2}-t_{0}^{2}}\,\varepsilon \\
z_{0}^{(\tau_{i})} &= f_{\theta}(\hat{z}_{\tau_{i}},\tau_{i}).
\end{aligned}
\tag{3}
$$

When combined with a condition $c$ with classifier-free guidance [16], sampling in CMs at $\tau_{i}$ starts with $\varepsilon\sim\mathcal{N}(\bm{0},\bm{I})$ and updates through:

$$
\begin{aligned}
\hat{z}_{\tau_{i}} &= \sqrt{\alpha_{\tau_{i}}}\,z_{0}^{(\tau_{i+1})}+\sigma_{\tau_{i}}\varepsilon, \\
z_{0}^{(\tau_{i})} &= f_{\theta}(\hat{z}_{\tau_{i}},\tau_{i},c).
\end{aligned}
\tag{4}
$$

Consider a special case of Eq. 1 where $\sigma_{t}$ is chosen as $\sqrt{1-\alpha_{t-1}}$ at all times $t$ . Then the DDPM forward process naturally aligns with the Multistep Consistency Sampling, and the second term of Eq. 1 vanishes:

$$
\begin{aligned}
z_{t-1} = {} & \sqrt{\alpha_{t-1}}\left(\frac{z_{t}-\sqrt{1-\alpha_{t}}\,\varepsilon_{\theta}(z_{t},t)}{\sqrt{\alpha_{t}}}\right) && \text{(predicted } z_{0}\text{)} \\
& + \sqrt{1-\alpha_{t-1}}\,\varepsilon_{t}, \quad \varepsilon_{t}\sim\mathcal{N}(\bm{0},\bm{I}) && \text{(random noise).}
\end{aligned}
\tag{5}
$$
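The reduction from Eq. (1) to Eq. (5) can be checked numerically. The following is a minimal sketch, assuming NumPy arrays as latents and a random array standing in for the network output $\varepsilon_{\theta}(z_{t},t)$; the function name `ddpm_step` and the values of $\alpha_{t}$ and $\alpha_{t-1}$ are illustrative and not taken from the paper.

```python
import numpy as np

def ddpm_step(z_t, eps_pred, alpha_t, alpha_prev, sigma_t, noise):
    """One reverse step in the form of Eq. (1)."""
    z0_pred = (z_t - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)
    # Clamp at zero to guard against tiny negative values from round-off.
    direction_coeff = np.sqrt(max(1.0 - alpha_prev - sigma_t**2, 0.0))
    return np.sqrt(alpha_prev) * z0_pred + direction_coeff * eps_pred + sigma_t * noise

rng = np.random.default_rng(0)
z_t = rng.standard_normal((4, 8))
eps_pred = rng.standard_normal((4, 8))  # stand-in for eps_theta(z_t, t)
noise = rng.standard_normal((4, 8))
alpha_t, alpha_prev = 0.5, 0.8          # illustrative schedule values

# Special variance choice: sigma_t = sqrt(1 - alpha_{t-1}).
sigma_special = np.sqrt(1.0 - alpha_prev)
step_eq1 = ddpm_step(z_t, eps_pred, alpha_t, alpha_prev, sigma_special, noise)

# Eq. (5) written out directly: predicted z_0 plus fresh noise only.
z0_pred = (z_t - np.sqrt(1.0 - alpha_t) * eps_pred) / np.sqrt(alpha_t)
step_eq5 = np.sqrt(alpha_prev) * z0_pred + np.sqrt(1.0 - alpha_prev) * noise
```

With this choice of $\sigma_{t}$ the two results agree up to floating-point error, because the coefficient of the direction term is zero.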

Consider $f(z_{t},t;z_{0})=\left(z_{t}-\sqrt{1-{\alpha}_{t}}\varepsilon^{\prime}(z_{t},t% ;z_{0})\right)/\sqrt{{\alpha}_{t}}$ , where the initial $z_{0}$ is available and we replace the parameterized noise predictor $\varepsilon_{\theta}$ with $\varepsilon^{\prime}$ more generally. Eq. 5 turns into the following expression:

$$
z_{t-1}=\sqrt{\alpha_{t-1}}\,f(z_{t},t;z_{0})+\sqrt{1-\alpha_{t-1}}\,\varepsilon_{t}
\tag{6}
$$

which is in the same form as the Multistep Consistency Sampling step in Eq 4.

In order to make $f(z_{t},t)$ self-consistent so that it can be considered as a consistency function, i.e., $f(z_{t},t;z_{0})=z_{0}$ , we can directly solve the equation and $\varepsilon^{\prime}$ can be computed without parameterization:

$$
\varepsilon^{\text{cons}}=\varepsilon^{\prime}(z_{t},t;z_{0})=\frac{z_{t}-\sqrt{\alpha_{t}}\,z_{0}}{\sqrt{1-\alpha_{t}}}.
\tag{7}
$$

We arrive at a non-Markovian forward process, in which $z_{t}$ directly points to the ground truth $z_{0}$ without neural prediction, and $z_{t-1}$ does not depend on the previous step $z_{t}$ like a consistency model.

4 Method

The task of video editing can be described as the following: Given an ordered set of $m$ source video frames $\mathcal{I}_{src}=\{I_{src}^{1},I_{src}^{2},...,I_{src}^{m}\}$ and a source prompt $\mathcal{P}_{src}$ describing the source video, we aim to generate an edited video with temporally consistent frames $\mathcal{I}_{edit}=\{I_{edit}^{1},I_{edit}^{2},...,I_{edit}^{m}\}$ according to a target prompt $\mathcal{P}_{tgt}$ .

This paper introduces FastVideoEdit, an end-to-end video editing framework that edits video efficiently while producing high-quality and temporally consistent content. Notably, our method achieves better background preservation than existing methods when editing foreground object-level attributes. Unlike many existing methods that depend on additional estimations such as depth control, edge control, or optical flow, FastVideoEdit requires only the source video frames and prompts as input throughout the editing process.

4.1 Video Reconstruction with Consistency Model

To our knowledge, FastVideoEdit is the first method in video editing that eliminates the need for the DDIM inversion process while simultaneously performing a complete denoising process on individual video frames. To enable direct editing of the source video without the inversion process, we leverage a consistency model inspired by InfEdit [40]. The key idea for reconstructing the source latent is to start with randomly sampled reconstruction noise rather than randomly initialized noisy latents. Following the Multistep Consistency Sampling in Eq. (3), we sample a noise $\varepsilon_{t}^{\text{cons}}$ at each timestep $t$, and the noisy latent $z_{t}^{\text{src}}$ becomes directly tractable because $z_{0}^{\text{src}}$ is given in the editing problem. Instead of denoising the randomly initialized noisy latent $z_{T}^{\text{src}}$, the whole trajectory $\{z_{t}^{\text{src}}\}$ is obtained directly from the sampled noise trajectory $\{\varepsilon_{t}^{\text{cons}}\}$, and in the reverse direction each $\varepsilon_{t}^{\text{cons}}$ can be used to reconstruct $z_{0}^{\text{src}}$ given $z_{t}^{\text{src}}$. The mappings between $z_{t}^{\text{src}}$ and $\varepsilon_{t}^{\text{cons}}$ given $z_{0}^{\text{src}}$ are:

$$
\begin{aligned}
z_{t}^{\text{src}} &= \sqrt{\alpha_{t}}\,z_{0}^{\text{src}}+\sqrt{1-\alpha_{t}}\,\varepsilon_{t}^{\text{cons}} \\
\varepsilon^{\text{cons}}_{t} &= (z_{t}^{\text{src}}-\sqrt{\alpha_{t}}\,z_{0}^{\text{src}})/\sqrt{1-\alpha_{t}}.
\end{aligned}
\tag{8}
$$

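where $\varepsilon_{t}^{\text{cons}}\sim\mathcal{N}(\bm{0},\bm{I})$ is sampled independently at each timestep. As a result, substituting $\varepsilon_{t}^{\text{cons}}$ for the predicted noise in Eq. (2) reconstructs the source latent exactly, i.e., $\bar{z}_{0}=z_{0}^{\text{src}}$ is guaranteed at each timestep.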
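The exact reconstruction can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions: the helper names (`add_noise`, `recover_noise`, `predict_z0`), the latent shape, and the $\alpha_{t}$ values are hypothetical, chosen only to exercise Eq. (8) and the form of Eq. (2).

```python
import numpy as np

def add_noise(z0, eps, alpha_t):
    """First line of Eq. (8): noisy latent from a clean latent and sampled noise."""
    return np.sqrt(alpha_t) * z0 + np.sqrt(1.0 - alpha_t) * eps

def recover_noise(z_t, z0, alpha_t):
    """Second line of Eq. (8): the reconstruction noise, no network involved."""
    return (z_t - np.sqrt(alpha_t) * z0) / np.sqrt(1.0 - alpha_t)

def predict_z0(z_t, eps, alpha_t):
    """Form of Eq. (2) with the noise supplied explicitly."""
    return (z_t - np.sqrt(1.0 - alpha_t) * eps) / np.sqrt(alpha_t)

rng = np.random.default_rng(0)
z0_src = rng.standard_normal((3, 4, 8, 8))  # (frames, channels, height, width), illustrative
alphas = [0.05, 0.3, 0.6, 0.9]              # illustrative schedule values

max_error = 0.0
for alpha_t in alphas:
    eps_cons = rng.standard_normal(z0_src.shape)  # fresh noise at every timestep
    z_t_src = add_noise(z0_src, eps_cons, alpha_t)
    eps_back = recover_noise(z_t_src, z0_src, alpha_t)
    z0_rec = predict_z0(z_t_src, eps_back, alpha_t)
    max_error = max(max_error, float(np.abs(z0_rec - z0_src).max()))
```

The reconstruction error stays at floating-point precision for every timestep, independently of which noise was drawn.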

4.2 Video Editing with Consistency Model

This section describes how to compute $z_{0}^{\text{edit}}$ given $z_{0}^{\text{src}}$. In addition to $z_{t}^{\text{src}}$ and $\varepsilon^{\text{cons}}_{t}$ obtained from Eq. (8), we need to predict the editing noise $\varepsilon_{\theta}(z_{t}^{\text{edit}},t,\mathcal{P}_{\text{tgt}})$ to generate the editing latent $z_{0}^{\text{edit}}$ according to the target prompt $\mathcal{P}_{\text{tgt}}$. Due to the self-consistency property of Latent Consistency Models (LCMs), the gap between $\varepsilon_{\theta}(z_{t}^{\text{edit}},t,\mathcal{P}_{\text{tgt}})$ and the editing reconstruction noise $\varepsilon_{t}^{\text{edit}}$ is small. Therefore, we define the noise calibration $\Delta\varepsilon_{t}^{\text{cons}}$ as the difference between the ground-truth source reconstruction noise $\varepsilon^{\text{cons}}_{t}$ and the prediction $\varepsilon_{\theta}(z_{t}^{\text{src}},t,\mathcal{P}_{\text{src}})$, and use it to estimate the editing reconstruction noise as well as the editing latent $z_{0}^{\text{edit}}$ at each timestep $t$:

$$
\begin{aligned}
\Delta\varepsilon_{t}^{\text{cons}} &= \varepsilon^{\text{cons}}_{t}-\varepsilon_{\theta}(z_{t}^{\text{src}},t,\mathcal{P}_{\text{src}}) \\
\varepsilon_{t}^{\text{edit}} &= \varepsilon_{\theta}(z_{t}^{\text{edit}},t,\mathcal{P}_{\text{tgt}})+\Delta\varepsilon_{t}^{\text{cons}} \\
z_{0}^{\text{edit}} &= \left(z_{t}^{\text{edit}}-\sqrt{1-\alpha_{t}}\cdot\varepsilon_{t}^{\text{edit}}\right)/\sqrt{\alpha_{t}}.
\end{aligned}
\tag{9}
$$

Compared with editing a single frame, we impose the constraints that the initial latent and random noise sampled at each timestep are identical across all frames. Since the forward process of the denoising network $\varepsilon_{\theta}(\cdot,\cdot,\cdot)$ as well as the calibration process of noise and the updating process of latent are all deterministic relative to their inputs, identical initial latents and noise samples at each timestep result in identical output latents when source latents are also identical. In practice, if source latents are temporally consistent and close to each other, the output latents should also maintain good temporal consistency.
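A single calibrated step of Eq. (9) with noise shared across frames might look as follows. This is only a sketch: `toy_eps_theta` is a hypothetical stand-in for the denoiser in which a prompt is reduced to a scalar shift, and all shapes and constants are illustrative assumptions rather than the authors' setup.

```python
import numpy as np

def toy_eps_theta(z, t, prompt_shift):
    # Hypothetical stand-in for eps_theta(z, t, P): the "prompt" is a scalar shift.
    return 0.3 * z + 0.01 * t + prompt_shift

def calibrated_z0_edit(z_src_t, z_edit_t, z0_src, t, alpha_t, p_src, p_tgt):
    """Eq. (9): calibrate the predicted edit noise, then predict z_0^edit."""
    eps_cons = (z_src_t - np.sqrt(alpha_t) * z0_src) / np.sqrt(1.0 - alpha_t)
    delta = eps_cons - toy_eps_theta(z_src_t, t, p_src)
    eps_edit = toy_eps_theta(z_edit_t, t, p_tgt) + delta
    return (z_edit_t - np.sqrt(1.0 - alpha_t) * eps_edit) / np.sqrt(alpha_t)

rng = np.random.default_rng(0)
frame = rng.standard_normal(16)
# Three source frames; frames 0 and 1 are identical on purpose.
z0_src = np.stack([frame, frame, rng.standard_normal(16)])
alpha_t, t = 0.4, 10                       # illustrative values

shared_noise = rng.standard_normal(16)     # one sample, broadcast to all frames
z_src_t = np.sqrt(alpha_t) * z0_src + np.sqrt(1.0 - alpha_t) * shared_noise
z_edit_t = z_src_t.copy()                  # identical initial latents for both branches

same_prompt = calibrated_z0_edit(z_src_t, z_edit_t, z0_src, t, alpha_t, 0.0, 0.0)
new_prompt = calibrated_z0_edit(z_src_t, z_edit_t, z0_src, t, alpha_t, 0.0, 0.5)
```

In this toy setting, an unchanged prompt returns the source latents exactly, and identical source frames yield identical edited latents, which mirrors the determinism argument above.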

4.3 Batch Attention Control

As an end-to-end inference-based editing framework, FastVideoEdit starts by directly denoising the batched latent $\mathcal{Z}_{t}^{\text{edit}}$ according to the target prompt $\mathcal{P}_{\text{tgt}}$. A naive way of editing the target frame latent $z_{0}^{\text{src}}$ by the target prompt is to denoise the DDIM inversion $z_{T}^{\text{inv}}$ of $z_{0}^{\text{src}}$ iteratively through $\varepsilon_{\theta}(z_{t}^{\text{inv}},t,\mathcal{P}_{\text{tgt}})$. In Section 4.2, we introduced consistency model-based batch editing, which leverages the property of LCMs to skip the time-consuming DDIM inversion process and directly denoise a randomly initialized latent while keeping content aligned faithfully with the source frames. However, without additional control, denoising conditioned on a target prompt $\mathcal{P}_{\text{tgt}}$ can still produce editing content distinct from the source content.

Inspired by MasaCtrl [5] and Prompt-to-prompt [13], we propose Cross-Frame Mutual Self-Attention (CF-Masa) and Re-weighted Cross Attention (Re-CA) to allow further attention control when denoising the $z_{t}^{\text{edit}}$ conditioned on $\mathcal{P}_{\text{tgt}}$ . Specifically, we concurrently denoise two batched latents $[\mathcal{Z}_{t}^{\text{src}},\mathcal{Z}_{t}^{\text{edit}}]$ conditioned on $[\mathcal{P}_{\text{src}},\mathcal{P}_{\text{tgt}}]$ respectively. The proposed CF-Masa and Re-CA can be directly applied in the forward process of $\varepsilon_{\theta}([\mathcal{Z}_{t}^{\text{src}},\mathcal{Z}_{t}^{\text{edit% }}],t,[\mathcal{P}_{\text{src}},\mathcal{P}_{\text{tgt}}])$ .

4.3.1 Cross-Frame Mutual Self-Attention

The denoising UNet consists of different size downsample/upsample blocks and a middle block, which have four resolution levels in the latent space. Each resolution level incorporates a 2D convolution layer followed by self-attention and cross-attention layers. The attention mechanism can be formulated as:

$$
\text{attn}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V.
\tag{10}
$$

In self-attention layers, $Q,K,V$ are the query, key, and value features obtained by projecting the same spatial features. Without attention control, the self-attention outputs of the source branch $\text{attn}(Q^{\text{src}},K^{\text{src}},V^{\text{src}})$ and the editing branch $\text{attn}(Q^{\text{edit}},K^{\text{edit}},V^{\text{edit}})$ are computed concurrently and independently of each other. We make two changes to the self-attention layers to preserve content consistency as well as temporal consistency between and within the editing latent and the source latent. In contrast to MasaCtrl [5], the preservation of content consistency in FastVideoEdit is achieved by replacing $Q^{\text{edit}}$ and $K^{\text{edit}}$ with $Q^{\text{src}}$ and $K^{\text{src}}$ after a fixed step $t_{s}$, while the editing branch remains unchanged before $t_{s}$. To further maintain temporal consistency across batched latents within a branch, we concatenate the key features $[K_{1},K_{2},...,K_{m}]$ and value features $[V_{1},V_{2},...,V_{m}]$ along their sequence length dimension. The resulting formulation is:

$$
\begin{aligned}
&\text{CF-Masa}(\{Q_{i}^{\text{edit}},K_{i}^{\text{edit}},V_{i}^{\text{edit}}\},t) \\
&:=\begin{cases}\{Q_{i}^{\text{src}},\text{concat}\{K^{\text{src}}\},\text{concat}\{V^{\text{edit}}\}\}&t\geq t_{s}\\ \{Q_{i}^{\text{edit}},\text{concat}\{K^{\text{edit}}\},\text{concat}\{V^{\text{edit}}\}\}&t<t_{s}\end{cases}.
\end{aligned}
\tag{11}
$$
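The selection rule in Eq. (11) can be sketched with plain NumPy attention. The code below is an illustration under assumptions, not the authors' implementation: it uses a single attention head, tensors of shape (frames, tokens, dim), and hypothetical function names (`attention`, `cf_masa`).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Eq. (10): softmax(Q K^T / sqrt(d)) V."""
    d = q.shape[-1]
    return softmax(q @ np.swapaxes(k, -1, -2) / np.sqrt(d)) @ v

def cf_masa(q_src, k_src, q_edit, k_edit, v_edit, t, t_s):
    """Eq. (11) for inputs of shape (frames, tokens, dim)."""
    m, n, d = q_edit.shape
    # Concatenate keys/values of all frames along the sequence dimension,
    # so that every frame attends to every other frame.
    v_cat = v_edit.reshape(1, m * n, d)
    if t >= t_s:
        q, k_cat = q_src, k_src.reshape(1, m * n, d)
    else:
        q, k_cat = q_edit, k_edit.reshape(1, m * n, d)
    return attention(q, k_cat, v_cat)  # broadcasts to (frames, tokens, dim)

rng = np.random.default_rng(0)
m, n, d = 3, 5, 4
q_src, k_src, q_edit, k_edit, v_edit = (rng.standard_normal((m, n, d)) for _ in range(5))
out_src_qk = cf_masa(q_src, k_src, q_edit, k_edit, v_edit, t=20, t_s=10)   # t >= t_s
out_edit_qk = cf_masa(q_src, k_src, q_edit, k_edit, v_edit, t=5, t_s=10)   # t < t_s
```

In the $t\geq t_{s}$ branch the output no longer depends on the editing queries and keys; only the editing values contribute, which is what ties the layout to the source while keeping the edited content.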

4.3.2 Re-weighted Cross Attention

The forward process of cross-attention can be edited in a similar way to self-attention. In cross-attention layers, $Q$ is the set of query features obtained by projecting the spatial features coming from the self-attention layer, and $K,V$ are obtained from the prompt embeddings. By replacing the cross-attention map of the editing branch with that of the source branch [13], the attention that tokens shared by both prompts place on the source spatial features is carried over to the editing spatial features. To further enhance the effect of the editing token, the corresponding attention map of the editing token can be multiplied by a replace scale $r\geq 1$. The resulting formulation of the Re-weighted Cross Attention is given by:

$$
\begin{aligned}
\text{Refine}(A^{\text{src}},A^{\text{edit}})_{i,j} &= \begin{cases}\left(A^{\text{edit}}\right)_{i,j}&\text{if}\ f_{\mathcal{P}}(j)=\text{None}\\ \left(A^{\text{src}}\right)_{i,f_{\mathcal{P}}(j)}&\text{otherwise}\end{cases} \\
\text{Re-CA}(A^{\text{src}},A^{\text{edit}},t) &:= \begin{cases}r\cdot\text{Refine}(A^{\text{src}},A^{\text{edit}})&t\geq t_{c}\\ A^{\text{edit}}&t<t_{c}\end{cases}
\end{aligned}
\tag{12}
$$

where $f_{\mathcal{P}}(\cdot)$ is the alignment function indicating the source prompt token index of the $j^{th}$ token in the target prompt and None if missing.
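A compact sketch of the refinement and re-weighting follows. It assumes the Prompt-to-Prompt style refinement [13], in which target tokens that have a counterpart in the source prompt take the source attention column, and it follows Eq. (12) literally by scaling the whole refined map by $r$; scaling only the columns of edited tokens, as the prose suggests, would be a small change. The alignment is represented as a hypothetical Python list in which `None` marks a target token missing from the source prompt.

```python
import numpy as np

def refine(a_src, a_edit, alignment):
    """Refine(A_src, A_edit): alignment[j] is the source index of target token j, or None."""
    out = a_edit.copy()
    for j, src_idx in enumerate(alignment):
        if src_idx is not None:
            out[:, j] = a_src[:, src_idx]  # token shared with the source prompt
    return out

def re_ca(a_src, a_edit, alignment, t, t_c, r=1.0):
    """Re-CA of Eq. (12): re-weighted refined map for t >= t_c, editing map otherwise."""
    if t >= t_c:
        return r * refine(a_src, a_edit, alignment)
    return a_edit

# Two spatial positions, three prompt tokens; the middle target token is new.
a_src = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
a_edit = np.array([[0.2, 0.5, 0.3], [0.4, 0.4, 0.2]])
alignment = [0, None, 2]

refined = refine(a_src, a_edit, alignment)
reweighted = re_ca(a_src, a_edit, alignment, t=10, t_c=5, r=2.0)
untouched = re_ca(a_src, a_edit, alignment, t=1, t_c=5, r=2.0)
```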

4.4 Background Preservation via Latent Replacement

There is a trade-off in existing video editing methods between the editing effect on foreground objects and content preservation of the background. Changing the attributes of a foreground object usually also alters the background to be consistent with the change. This is because the control methods applied to the forward process do not strictly constrain the latent space. Therefore, the change of tokens in the target prompt also influences irrelevant regions of the editing latent through attention mechanisms. Compared with state-of-the-art video editing methods, a significant advantage of FastVideoEdit is the accuracy of foreground editing. This is shown in both quantitative and qualitative results in Sec. 5. We achieve this through multiple design choices. Consistent initial latents and noise in the Batch Consistency Sampling algorithm and attention control both provide faithful editing with respect to the source video. In addition, we propose further background preservation strategies to enhance the faithfulness of the edited content to the source content. Specifically, we propose to simultaneously denoise a background branch that maintains the structure information of the editing branch while aligning content with the source branch. Based on the background branch, we additionally propose a latent replacement algorithm that replaces the background part in the editing latent with the corresponding part in the background latent.

4.4.1 Background Branch

By simultaneously denoising a background branch conditioned on $\mathcal{P}_{src}$ and imposing self-attention control from the source branch and editing branch, we expect the background branch to maintain the structure of the editing branch and the content of the source branch. We modify the self-attention process of the background branch as follows:

$$
\begin{aligned}
&\text{Bg-Masa}(\{Q_{i}^{\text{bg}},K_{i}^{\text{bg}},V_{i}^{\text{bg}}\},t) \\
&:=\begin{cases}\{Q_{i}^{\text{src}},\text{concat}\{K^{\text{src}}\},\text{concat}\{V^{\text{src}}\}\}&t\geq t_{bg}\\ \{Q_{i}^{\text{edit}},\text{concat}\{K^{\text{src}}\},\text{concat}\{V^{\text{src}}\}\}&t<t_{bg}.\end{cases}
\end{aligned}
\tag{13}
$$

To maintain the editing structure and source content, we employ an editing approach similar to MasaCtrl [5]: query features from the editing branch are used to maintain structure information, while the key and value features are copied from the source branch to maintain consistency with the source content. Note that the joint attention operates at early timesteps rather than at later timesteps as described in MasaCtrl [5], because we observe that the structure is formed at early steps and content details are refined at later steps.

4.4.2 Latent Replacement

At the end of each denoising step, we employ the latent replacement operation to replace the background region of the editing latent with the corresponding region of the background latent. The region is determined by computing the relative region from a cross-attention map. Specifically, given a cross-attention map $(A^{\text{edit}})_{m\times n}$, we obtain a replacement map $(M^{\text{edit}})_{m}$, where $m$ is the sequence length of the attention map (the size of the feature map) and $n$ is the number of tokens in $\mathcal{P}_{tgt}$. The replacement map is computed as follows:

$$
\begin{aligned}
(\hat{A}^{\text{edit}})_{i} &= \frac{\sum_{j}(A^{\text{edit}})_{i,j}\cdot\mathbf{I}_{f_{\mathcal{P}}(j)\neq\text{None}}}{\sum_{j}(A^{\text{edit}})_{i,j}} \\
(M^{\text{edit}})_{i} &= \mathbf{I}_{(\hat{A}^{\text{edit}})_{i}\geq\text{thresh}_{\text{edit}}}.
\end{aligned}
\tag{14}
$$

Intuitively, the replacement map has $1$ at positions where the edited tokens receive high attention scores among all the tokens, and $0$ anywhere else. In practice, $A^{\text{edit}}$ is obtained by averaging among all the cross-attention maps of the same size in a fixed resolution level. The replaced edited latent at the end of denoising step $t$ is:

$$
z_{t}^{\text{edit}}=M_{t}^{\text{edit}}\odot z_{t}^{\text{edit}}+(1-M_{t}^{\text{edit}})\odot z_{t}^{\text{bg}}.
\tag{15}
$$
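The thresholding and blending steps can be sketched as below. To stay neutral about how tokens are selected, the indicator in the numerator of Eq. (14) is passed in as a boolean array `token_selected`; this argument, the function names, and the example numbers are assumptions made for illustration only.

```python
import numpy as np

def replacement_map(a_edit, token_selected, thresh):
    """Eq. (14): share of attention on the selected tokens, thresholded per position.

    a_edit: (positions, tokens) cross-attention map.
    token_selected: boolean array of length tokens (the indicator in the numerator).
    """
    selected = np.asarray(token_selected, dtype=a_edit.dtype)
    a_hat = (a_edit * selected).sum(axis=1) / a_edit.sum(axis=1)
    return (a_hat >= thresh).astype(a_edit.dtype)

def replace_latent(z_edit, z_bg, mask):
    """Eq. (15): keep the editing latent where mask is 1, the background latent elsewhere."""
    mask = mask[:, None]  # broadcast over the channel dimension
    return mask * z_edit + (1.0 - mask) * z_bg

# Two spatial positions, two tokens; only the second token is selected.
a_edit = np.array([[0.1, 0.9], [0.8, 0.2]])
mask = replacement_map(a_edit, token_selected=[False, True], thresh=0.5)

z_edit = np.ones((2, 4))    # (positions, channels), illustrative values
z_bg = np.zeros((2, 4))
z_replaced = replace_latent(z_edit, z_bg, mask)
```

Here the first position keeps its edited latent and the second falls back to the background latent.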

4.5 Frame Consistency with Tokenflow

Following [11], we apply Tokenflow to improve temporal consistency across frames. Tokenflow is a plug-and-play module that can be applied at each layer of the denoising network. Its idea is to first select and denoise a group of keyframes, and then, when denoising each frame latent, replace the original spatial features with the weighted sum of the two most similar spatial features from two adjacent keyframes. In the first stage, Tokenflow selects a group of keyframes with indices $\kappa$ and, in each layer at each step, stores $\mathbf{T}_{base}=\{\phi(z^{i})\}_{i\in\kappa}$, where $\phi(\cdot)$ maps a latent $z^{i}$ to its spatial features. When computing the features of an arbitrary frame latent $z^{i}$, the method queries its two adjacent keyframe latents with indices $i-$ and $i+$, and gets the closest feature index $\gamma^{i\pm}[p]$ for each of its features indexed by $p$ as follows:

$$
\gamma^{i\pm}[p]=\operatorname*{arg\,min}_{q}\mathcal{D}\left(\phi(z^{i})[p],\phi(z^{i\pm})[q]\right)
\tag{16}
$$

where $\mathcal{D}$ represents the cosine distance between two features. The output weighted spatial features of frame latent $z^{i}$ therefore become:

$$
\mathcal{F}_{\gamma}(\mathbf{T}_{base},i,p)=w_{i}\cdot\phi(z^{i+})[\gamma^{i+}[p]]\;+\;(1-w_{i})\cdot\phi(z^{i-})[\gamma^{i-}[p]].
\tag{17}
$$

In practice, Tokenflow is applied after the self-attention layer. It replaces the original spatial features $\phi(z^{i})$ of the frame latent with the weighted sum of features from the two adjacent keyframes, $\{\mathcal{F}_{\gamma}(\mathbf{T}_{base},i,p)\}_{p}$.
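The nearest-neighbour lookup of Eq. (16) and the blend of Eq. (17) reduce to a few array operations. The sketch below assumes features stored as (tokens, dim) NumPy arrays and uses hypothetical function names; it is not the Tokenflow reference implementation.

```python
import numpy as np

def nearest_indices(feat, key_feat):
    """Eq. (16): for each token p of a frame, the keyframe token q with minimal cosine distance."""
    a = feat / np.linalg.norm(feat, axis=1, keepdims=True)
    b = key_feat / np.linalg.norm(key_feat, axis=1, keepdims=True)
    # Minimal cosine distance is the same as maximal cosine similarity.
    return np.argmax(a @ b.T, axis=1)

def propagate(key_plus_feat, key_minus_feat, gamma_plus, gamma_minus, w):
    """Eq. (17): weighted sum of the matched features from the two adjacent keyframes."""
    return w * key_plus_feat[gamma_plus] + (1.0 - w) * key_minus_feat[gamma_minus]

rng = np.random.default_rng(0)
key_feat = rng.standard_normal((6, 8))   # (tokens, dim) features of one keyframe
perm = np.array([3, 0, 5, 1, 4, 2])
frame_feat = key_feat[perm]              # a frame whose tokens are a shuffled keyframe

gamma = nearest_indices(frame_feat, key_feat)
# Using the same keyframe on both sides makes the blend independent of w.
blended = propagate(key_feat, key_feat, gamma, gamma, w=0.3)
```

In this constructed case the lookup recovers the shuffle, so the propagated features coincide with the frame's own features.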

The overall FastVideoEdit algorithm is shown in Algorithm 1 and Figure 2.

Algorithm 1 FastVideoEdit editing

1: For abbreviation, we denote $\mathcal{A}\sim\mathcal{P}$ as every element in $\mathcal{A}$ has the same value sampled from distribution $\mathcal{P}$

2: Input:

3: Latent Consistency Model $\varepsilon_{\theta}(\cdot,\cdot,\cdot)$

4: Sequence of timesteps $\tau_{1}>\tau_{2}>\cdots>\tau_{N-1}$

5: Batched source latents $\mathcal{Z}_{0}^{\text{src}}=\{z_{0}^{\text{src},(i)}\ |\ 1\leq i\leq m\}$

6: Source and target prompts $\mathcal{P}_{src},\mathcal{P}_{tgt}$

7: Set batch attention control on $\varepsilon_{\theta}(\cdot,\cdot,\cdot)$

8: Set Tokenflow propagation on $\varepsilon_{\theta}(\cdot,\cdot,\cdot)$

9: Initial batched latents $\mathcal{Z}_{\tau_{1}}^{\text{src}}=\mathcal{Z}_{\tau_{1}}^{\text{edit}}=\mathcal{Z}_{\tau_{1}}^{\text{bg}}\sim\mathcal{N}(\bm{0},\bm{I})$

10: Compute $\{\varepsilon^{\text{cons}}_{\tau_{1}}\}$ using Eq. (8)

11: for $n=1$ to $N-1$ do

12: Compute $\mathbf{T}_{\text{base}}^{\text{edit}}$ and $\mathbf{T}_{\text{base}}^{\text{edit}}$

13: Denoise three branches $\{\varepsilon_{\theta}(\{z^{\text{src}}_{\tau_{n}},z^{\text{edit}}_{\tau_{n}},z^{\text{bg}}_{\tau_{n}}\},\tau_{n},\{\mathcal{P}_{src},\mathcal{P}_{tgt},\mathcal{P}_{src}\});\mathbf{T}_{\text{base}}\}$

14: Update $\mathcal{Z}^{\text{src}}_{\tau_{n+1}}$ using Eq. (8)

15: Update $\mathcal{Z}^{\text{edit}}_{0}$ and $\mathcal{Z}^{\text{bg}}_{0}$ using Eq. (9)

16: Sample new reconstruction noise $\{\varepsilon_{\tau_{n+1}}^{\text{cons}}\}\sim\mathcal{N}(\bm{0},\bm{I})$

17: Update $\mathcal{Z}^{\text{edit}}_{\tau_{n+1}}$ and $\mathcal{Z}^{\text{bg}}_{\tau_{n+1}}$ using Eq. (8)

18: Replace latents $\mathcal{Z}^{\text{edit}}_{\tau_{n+1}}$ using Eq. (14) and (15)

19: end for

20: Output: $\mathcal{Z}_{0}^{\textrm{edit}}$
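The sampling skeleton of Algorithm 1 can be sketched as a multi-step loop. The version below is deliberately reduced and rests on several assumptions: it omits attention control, Tokenflow, and the background branch; `toy_eps_theta` is a hypothetical stand-in for the Latent Consistency Model with prompts reduced to scalar shifts; and the ordering of the per-step updates is one plausible reading of steps 13 to 17.

```python
import numpy as np

def toy_eps_theta(z, alpha_t, prompt_shift):
    # Hypothetical stand-in for eps_theta(z, tau, P); ignores the timestep.
    return 0.3 * z + prompt_shift

def toy_fast_edit(z0_src, alphas, p_src, p_tgt, seed=0):
    """Reduced loop: source and editing branches only, noise shared across frames."""
    rng = np.random.default_rng(seed)
    frame_shape = z0_src.shape[1:]
    # Step 9: one initial latent, identical for all frames and both branches.
    init = rng.standard_normal(frame_shape)
    z_src = np.broadcast_to(init, z0_src.shape).copy()
    z_edit = z_src.copy()
    z0_edit = z0_src
    for n, alpha in enumerate(alphas):
        # Steps 10 and 13: reconstruction noise (Eq. 8) and calibrated edit noise (Eq. 9).
        eps_cons = (z_src - np.sqrt(alpha) * z0_src) / np.sqrt(1.0 - alpha)
        delta = eps_cons - toy_eps_theta(z_src, alpha, p_src)
        eps_edit = toy_eps_theta(z_edit, alpha, p_tgt) + delta
        # Step 15: current estimate of the edited clean latent.
        z0_edit = (z_edit - np.sqrt(1.0 - alpha) * eps_edit) / np.sqrt(alpha)
        if n + 1 < len(alphas):
            # Steps 14, 16, 17: fresh shared noise, then re-noise both branches.
            alpha_next = alphas[n + 1]
            noise = rng.standard_normal(frame_shape)
            z_src = np.sqrt(alpha_next) * z0_src + np.sqrt(1.0 - alpha_next) * noise
            z_edit = np.sqrt(alpha_next) * z0_edit + np.sqrt(1.0 - alpha_next) * noise
    return z0_edit

rng = np.random.default_rng(1)
z0_src = rng.standard_normal((4, 16))  # four frames, flattened latents (illustrative)
alphas = [0.1, 0.4, 0.7, 0.95]         # increasing alpha = decreasing noise level
unchanged = toy_fast_edit(z0_src, alphas, p_src=0.0, p_tgt=0.0)
edited = toy_fast_edit(z0_src, alphas, p_src=0.0, p_tgt=0.5)
```

When source and target prompts coincide, the loop returns the source latents at every step count, which is the preservation behaviour the method relies on to use few sampling steps.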

5 Experiments

In this section, we first introduce the evaluation benchmark and evaluation metrics used in our experiment in Sec. 5.1. Following that, we present a quantitative comparison of our methods in Sec. 5.2 and a qualitative comparison in Sec. 5.3.

5.1 Evaluation Benchmark and Metrics

Evaluation Dataset.

For the evaluation of video editing, we use the TGVE 2023 open-source dataset [39] as our benchmark. This dataset consists of 76 videos, each containing 32 frames at a resolution of 480×480 pixels.

Evaluation Metrics.

Following previous work [29, 11], we evaluate temporal consistency using the CLIP similarity [30] among frames (‘Tem-Con’). We also measure frame-wise editing accuracy with two metrics: ‘Txt-Sim’, the CLIP similarity between the text embedding and the image embedding, and ‘Clip-Acc’, the percentage of frames for which the edited image has a higher CLIP similarity to the target prompt than to the source prompt. Finally, to evaluate speed, we measure the time FastVideoEdit and previous methods take to edit a 32-frame video, covering both the inversion and forward processes.
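As a minimal sketch of how these three metrics could be computed from precomputed CLIP embeddings, consider the Python functions below. Several details are assumptions rather than facts from this paper: temporal consistency is averaged over consecutive frame pairs, the scores are scaled by 100 to match the magnitudes in the tables, and the function names are illustrative. Embeddings are plain lists of floats, so no CLIP model is loaded here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def tem_con(frame_embs):
    """Mean similarity between consecutive frames (assumed pairing), times 100."""
    sims = [cosine(a, b) for a, b in zip(frame_embs, frame_embs[1:])]
    return 100.0 * sum(sims) / len(sims)

def txt_sim(frame_embs, target_text_emb):
    """Mean similarity between each edited frame and the target prompt, times 100."""
    sims = [cosine(f, target_text_emb) for f in frame_embs]
    return 100.0 * sum(sims) / len(sims)

def clip_acc(frame_embs, source_text_emb, target_text_emb):
    """Percentage of frames closer to the target prompt than to the source prompt."""
    hits = sum(
        1 for f in frame_embs
        if cosine(f, target_text_emb) > cosine(f, source_text_emb)
    )
    return 100.0 * hits / len(frame_embs)

if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real CLIP features.
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    src_text, tgt_text = [0.0, 1.0], [1.0, 0.0]
    print(tem_con(frames), txt_sim(frames, tgt_text), clip_acc(frames, src_text, tgt_text))
```

Clip-Acc is a thresholded comparison per frame, so it can shift noticeably even when Txt-Sim changes only slightly.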

5.2 Quantitative Comparison

In Tab. 1, we compare FastVideoEdit with two methods that incorporate additional conditional constraints, Rerender [41] and Text2Video-Zero [22], as well as three dual-branch methods, FateZero [29], Pix2Video [6], and TokenFlow [11].

The results show that FastVideoEdit achieves the best per-frame editing accuracy (Txt-Sim 27.7, Clip-Acc 71.1) and competitive temporal consistency (Tem-Con 96.5, tied with TokenFlow and behind only Text2Video-Zero at 96.9), while significantly reducing editing time: its total of 61.7 is less than half that of the next fastest method (131.0). The reduction in runtime comes from two sources: the elimination of inversion and additional condition feature extraction, and the use of fewer sampling steps.

Table 1: Comparison of FastVideoEdit with previous video editing methods. Bold indicates the best. Underline indicates the second best.

Model	Tem-Con $\uparrow$	Txt-Sim $\uparrow$	Clip-Acc $\uparrow$	Inversion time $\downarrow$	Forward time $\downarrow$	Total time $\downarrow$
Rerender [41]	95.7	25.0	48.5	-	174.3	174.3
Text2Video-Zero [22]	96.9	27.1	70.7	-	131.0	131.0
FateZero [29]	95.7	24.9	35.8	233.7	347.0	581.7
Pix2Video [6]	96.0	27.5	68.5	185.3	213.0	399.3
TokenFlow [11]	96.5	25.5	54.7	176.5	115.9	292.4
Ours	96.5	27.7	71.1	-	61.7	61.7
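To make the runtime gap in Tab. 1 concrete, the short Python sketch below divides each baseline's total editing time by that of FastVideoEdit. The numbers are copied from the last column of Tab. 1; the dictionary and function names are illustrative only.

```python
# Total editing time per method, copied from the last column of Tab. 1.
TOTAL_TIME = {
    "Rerender": 174.3,
    "Text2Video-Zero": 131.0,
    "FateZero": 581.7,
    "Pix2Video": 399.3,
    "TokenFlow": 292.4,
    "Ours": 61.7,
}

def speedup_over_baselines(times, ours="Ours"):
    """Return, for each baseline, its total time divided by the time of `ours`."""
    base = times[ours]
    return {name: t / base for name, t in times.items() if name != ours}

if __name__ == "__main__":
    ratios = speedup_over_baselines(TOTAL_TIME)
    for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
        print(f"{name}: {ratio:.1f}x")
```

With these numbers, the ratio ranges from roughly 2.1x (Text2Video-Zero) to roughly 9.4x (FateZero).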

5.3 Qualitative Comparison

Fig. 3 shows a qualitative comparison of FastVideoEdit with previous video editing methods. We compare against two methods that incorporate additional conditional constraints, Rerender [41] and Text2Video-Zero [22], as well as three dual-branch methods, FateZero [29], Pix2Video [6], and TokenFlow [11].

The results show that FastVideoEdit edits videos in line with the text prompt while preserving the essential content of the source video. Through attention control, latent replacement, and the preservation ability of the consistency model, FastVideoEdit edits the foreground while keeping the background intact. Focusing on specific regions of interest and applying latent replacement yields accurate and consistent edits. Notably, FastVideoEdit outperforms the other methods while requiring significantly less time.

5.4 Ablation Study

We ablate the use of Bg-Masa, CF-Masa, Re-CA, and TokenFlow propagation. Quantitative and qualitative results are shown in Tab. 2 and Fig. 4, respectively. Without background preservation, the background dirt is changed. Removing CF-Masa and TokenFlow worsens temporal consistency. Moreover, replacing our attention control with PnP degrades the editing result (see the left rabbit’s ears and the right rabbit’s tail).

Tab. 2 shows that without latent replacement, the temporal consistency and CLIP accuracy metrics rise (Tem-Con 96.7 vs. 96.5, Clip-Acc 72.3 vs. 71.1), which indicates that latent replacement protects the background but does not improve either metric. The improvement in background preservation is evident in the qualitative results but is not reflected in the CLIP metrics. Imposing background preservation prevents the background from adapting to the editing prompt, which lowers CLIP-based similarity scores. However, this negative impact on content editing is hardly noticeable on visual inspection. Apart from the background preservation design, our remaining attention controls match or improve all three metrics, which shows the effectiveness of our proposed methods.

Table 2: Ablation study for architecture design of FastVideoEdit. Bold indicates the best. Underline indicates the second best.

Model	Tem-Con $\uparrow$	Txt-Sim $\uparrow$	Clip-Acc $\uparrow$
Ours	96.5	27.7	71.1
w/o Bg-Masa	96.7	27.5	72.3
w/o CF-Masa	96.3	26.7	69.3
w/ PnP	96.5	25.8	60.0
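The effect of each ablation in Tab. 2 is easier to read as a difference from the full model. The Python sketch below computes these differences from the values in Tab. 2; the variable and function names are illustrative only.

```python
# CLIP metrics (Tem-Con, Txt-Sim, Clip-Acc) copied from Tab. 2.
FULL = (96.5, 27.7, 71.1)
ABLATIONS = {
    "w/o Bg-Masa": (96.7, 27.5, 72.3),
    "w/o CF-Masa": (96.3, 26.7, 69.3),
    "w/ PnP": (96.5, 25.8, 60.0),
}

def deltas(full, variants):
    """Variant score minus full-model score, rounded to one decimal place."""
    return {
        name: tuple(round(v - f, 1) for v, f in zip(values, full))
        for name, values in variants.items()
    }

if __name__ == "__main__":
    for name, diff in deltas(FULL, ABLATIONS).items():
        print(name, diff)
```

Negative entries mean the variant scores lower than the full model. Only the w/o Bg-Masa variant shows positive Tem-Con and Clip-Acc differences.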

6 Conclusion

Conclusion.

In this work, we have introduced FastVideoEdit, a zero-shot video editing approach that addresses the computational challenges faced by previous methods. By leveraging the self-consistency property of Consistency Models (CMs), our method eliminates the need for time-consuming inversion or additional condition extraction steps. We have also introduced a novel approach to background preservation via latent replacement, which simultaneously denoises a background branch while imposing self-attention control from the source and editing branches. Our experimental results demonstrate the superior performance of FastVideoEdit in terms of editing quality while requiring significantly less time for video editing tasks.

Limitations and future work.

FastVideoEdit still has some limitations: (1) FastVideoEdit may require hyperparameter tuning to achieve optimal performance on each video. This dependency adds complexity to the editing process and may require expertise or extensive experimentation to achieve satisfactory results. (2) While FastVideoEdit demonstrates state-of-the-art performance in video editing, there is no guarantee of success for every editing case. The effectiveness of the approach may vary depending on factors such as input data quality, the complexity of the editing task, and the suitability of the chosen hyperparameters. (3) The performance of FastVideoEdit relies on the quality and capabilities of the underlying consistency models. In future work, we aim to address these limitations.

Possible negative social impact.

Video editing approaches may pose privacy risks if used to alter videos without appropriate consent or to create content that invades someone’s privacy. Moreover, the convenience and speed offered by FastVideoEdit may inadvertently encourage irresponsible editing practices, leading to ethical dilemmas in areas such as journalism, entertainment, and personal communication. Addressing these possible negative social impacts and continuously improving our models are key focuses for our future releases.

References

[1] Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B., et al.: ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022)
[2] Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., Dekel, T.: Text2live: Text-driven layered image and video editing. In: European conference on computer vision. pp. 707–723. Springer (2022)
[3] Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22563–22575 (2023)
[4] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators
[5] Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. arXiv preprint arXiv:2304.08465 (2023)
[6] Ceylan, D., Huang, C.H.P., Mitra, N.J.: Pix2video: Video editing using image diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23206–23217 (2023)
[7] Chai, W., Guo, X., Wang, G., Lu, Y.: Stablevideo: Text-driven consistency-aware diffusion video editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23040–23050 (2023)
[8] Chen, W., Wu, J., Xie, P., Wu, H., Li, J., Xia, X., Xiao, X., Lin, L.: Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840 (2023)
[9] Cong, Y., Xu, M., Simon, C., Chen, S., Ren, J., Xie, Y., Perez-Rua, J.M., Rosenhahn, B., Xiang, T., He, S.: Flatten: optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922 (2023)
[10] Esser, P., Chiu, J., Atighehchian, P., Granskog, J., Germanidis, A.: Structure and content-guided video synthesis with diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7346–7356 (2023)
[11] Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373 (2023)
[12] Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Fei-Fei, L., Essa, I., Jiang, L., Lezama, J.: Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662 (2023)
[13] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)
[14] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)
[15] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
[16] Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
[17] Ju, X., Zeng, A., Bian, Y., Liu, S., Xu, Q.: Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv preprint arXiv:2310.01506 (2023)
[18] Ju, X., Zeng, A., Wang, J., Xu, Q., Zhang, L.: Human-art: A versatile human-centric dataset bridging natural and artificial scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 618–629 (2023)
[19] Ju, X., Zeng, A., Zhao, C., Wang, J., Zhang, L., Xu, Q.: Humansd: A native skeleton-guided diffusion model for human image generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15988–15998 (2023)
[20] Kahatapitiya, K., Karjauv, A., Abati, D., Porikli, F., Asano, Y.M., Habibian, A.: Object-centric diffusion for efficient video editing. arXiv preprint arXiv:2401.05735 (2024)
[21] Kasten, Y., Ofri, D., Wang, O., Dekel, T.: Layered neural atlases for consistent video editing. ACM Transactions on Graphics (TOG) 40(6), 1–12 (2021)
[22] Khachatryan, L., Movsisyan, A., Tadevosyan, V., Henschel, R., Wang, Z., Navasardyan, S., Shi, H.: Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439 (2023)
[23] Liu, S., Zhang, Y., Li, W., Lin, Z., Jia, J.: Video-p2p: Video editing with cross-attention control. arXiv preprint arXiv:2303.04761 (2023)
[24] Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems 35, 5775–5787 (2022)
[25] Luhman, E., Luhman, T.: Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388 (2021)
[26] Ma, Y., Cun, X., He, Y., Qi, C., Wang, X., Shan, Y., Li, X., Chen, Q.: Magicstick: Controllable video editing via control handle transformations. arXiv preprint arXiv:2312.03047 (2023)
[27] Meng, C., Rombach, R., Gao, R., Kingma, D., Ermon, S., Ho, J., Salimans, T.: On distillation of guided diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14297–14306 (2023)
[28] Molad, E., Horwitz, E., Valevski, D., Acha, A.R., Matias, Y., Pritch, Y., Leviathan, Y., Hoshen, Y.: Dreamix: Video diffusion models are general video editors. arXiv preprint arXiv:2302.01329 (2023)
[29] Qi, C., Cun, X., Zhang, Y., Lei, C., Wang, X., Shan, Y., Chen, Q.: Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535 (2023)
[30] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML). pp. 8748–8763. PMLR (2021)
[31] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
[32] Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512 (2022)
[33] Shin, C., Kim, H., Lee, C.H., Lee, S.g., Yoon, S.: Edit-a-video: Single video editing with object-aware consistency. arXiv preprint arXiv:2303.07945 (2023)
[34] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)
[35] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
[36] Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: Proceedings of the 40th International Conference on Machine Learning (2023)
[37] Wang, W., Jiang, Y., Xie, K., Liu, Z., Chen, H., Cao, Y., Wang, X., Shen, C.: Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599 (2023)
[38] Wu, J.Z., Ge, Y., Wang, X., Lei, S.W., Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., Shou, M.Z.: Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7623–7633 (2023)
[39] Wu, J.Z., Li, X., Gao, D., Dong, Z., Bai, J., Singh, A., Xiang, X., Li, Y., Huang, Z., Sun, Y., et al.: Cvpr 2023 text guided video editing competition. arXiv preprint arXiv:2310.16003 (2023)
[40] Xu, S., Huang, Y., Pan, J., Ma, Z., Chai, J.: Inversion-free image editing with natural language (2024)
[41] Yang, S., Zhou, Y., Liu, Z., Loy, C.C.: Rerender a video: Zero-shot text-guided video-to-video translation. arXiv preprint arXiv:2306.07954 (2023)
[42] Zhang, Q., Chen, Y.: Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902 (2022)
[43] Zhang, Y., Wei, Y., Jiang, D., Zhang, X., Zuo, W., Tian, Q.: Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077 (2023)
[44] Zheng, H., Nie, W., Vahdat, A., Azizzadenesheli, K., Anandkumar, A.: Fast sampling of diffusion models via operator learning. In: International Conference on Machine Learning. pp. 42390–42402. PMLR (2023)