
License: CC BY 4.0
arXiv:2403.06269v1 [cs.CV] 10 Mar 2024


1 McGill University, 2 The Chinese University of Hong Kong
Email: [email protected], [email protected], [email protected]

FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing

Youyuan Zhang 1, Xuan Ju 2 ⋆, James J. Clark 1 ⋆ (⋆ Corresponding author.)
Abstract

Diffusion models have demonstrated remarkable capabilities in text-to-image and text-to-video generation, opening up possibilities for video editing based on textual input. However, the computational cost associated with sequential sampling in diffusion models poses challenges for efficient video editing. Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion, making real-time applications impractical. In this work, we propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs). By leveraging the self-consistency property of CMs, we eliminate the need for time-consuming inversion or additional condition extraction, reducing editing time. Our method enables direct mapping from source video to target video with strong preservation ability utilizing a special variance schedule. This results in improved speed advantages, as fewer sampling steps can be used while maintaining comparable generation quality. Experimental results validate the state-of-the-art performance and speed advantages of FastVideoEdit across evaluation metrics encompassing editing speed, temporal consistency, and text-video alignment.

Keywords: Video Editing · Diffusion Models · Consistency Models
Figure 1: Editing Results of FastVideoEdit. FastVideoEdit offers efficient, consistent, high-quality, and text-aligned editing capabilities for both artificial (left col) and natural (right col) videos. The top row displays the source video, while the second and third rows showcase two edited videos. Each row features a text prompt at the top, with the edited words highlighted in red. This visual representation effectively demonstrates how our method can successfully achieve desired edits such as attribute change, object change, background change, and style change.

1 Introduction

Diffusion models [15, 35, 14, 1] have gained significant attention due to their remarkable capabilities in text-to-image [31, 35, 15, 17] and text-to-video generation [14, 34, 3, 12, 4]. Leveraging the capabilities of these models, it becomes feasible to manipulate videos [4] based on textual input, holding great potential for various applications in areas such as film production and content creation.

However, the computational cost associated with sequential sampling in diffusion models presents a significant challenge for efficient inference, especially in video editing scenarios where a set of frames needs to be processed. Moreover, the absence of high-quality open-source video diffusion models [10, 28] that can generate consistent editing results within a single test-time inference, combined with the constraints on video duration of video diffusion models, has led to the adoption of existing image generation models for achieving accurate video editing [2, 29, 11]. To align the distribution between image and video models and perform accurate video editing, some methods employ test-time one-shot fine-tuning of an inflated image generation model on each input video [38, 33, 23, 37, 26]. However, this fine-tuning makes the editing process even slower, which makes it impractical for real-time applications.

To enable faster video editing, three types of zero-shot methods have been proposed in the literature: (1) Layer-atlas-based methods [2, 21, 7], which edit the video on a flattened texture map and ensure temporal consistency by guaranteeing texture map consistency. However, the absence of a 3D motion prior in the 2D atlas approach results in suboptimal performance. (2) Dual-branch methods [29, 6, 11, 9], which leverage Denoising Diffusion Implicit Models (DDIM) [35] to extract source video features and generate novel content based on the target diffusion branch. The use of DDIM inversion doubles the inference time required for video editing. (3) Methods incorporating additional conditional constraints [41, 37, 8, 43], which directly add noise to the source video and denoise the noisy video using a conditioned diffusion model, preserving essential content while imposing restrictions on the editing process. While these methods are efficient during diffusion model inference, they require additional information extraction, which slows down the overall process.

To address the issue of long computational times encountered in previous video editing methods, we introduce FastVideoEdit, which is inspired by recent advances in Consistency Models (CMs) [36]. Specifically, FastVideoEdit is a zero-shot video editing approach that not only achieves state-of-the-art performance but also significantly reduces editing time by eliminating the need for time-consuming inversion or additional condition extraction steps. The key insight of our proposed method is that the self-consistency property of CMs enables a special variance schedule that facilitates the editing process, transforming it from a process of adding noise and then denoising to one of a direct mapping from source video to target video. Furthermore, the content preservation capability of CMs enables the use of fewer sampling steps while maintaining comparable generation quality, which results in an improved speed advantage of FastVideoEdit.

To evaluate FastVideoEdit, we consider metrics that encompass editing speed, temporal consistency, and text-video alignment. We compare the performance of FastVideoEdit with previous video editing methods using the TGVE 2023 open-source dataset [39] as our benchmark. The results demonstrate the superior performance of FastVideoEdit in terms of editing quality. Additionally, FastVideoEdit achieves this superior performance while requiring significantly less time for video editing tasks. This shows the efficiency and effectiveness of our approach, making it a standout choice for efficient high-quality video editing.

2 Related Work

2.1 Video Editing with Diffusion Models

The remarkable success of diffusion-based text-to-image [31, 35, 15, 19, 18] and text-to-video generation models [14, 34, 3, 12, 4] has opened up new possibilities for text-based image [13, 17] and video editing [10]. Although editing video directly through video diffusion models [10, 28] shows high temporal consistency, the challenges associated with extensive video model training, unstable generation quality, and limits on video duration make inflated off-the-shelf image generation models a preferable choice for video editing, where inflation extends a 2D model to 3D with an additional temporal channel.

Specifically, several works require test-time one-shot fine-tuning of the inflated image generation model on each input video [38, 33, 23, 37, 26], which is time-consuming and too slow for real-time applications. Zero-shot video editing methods [2, 21, 7, 29, 13, 37, 6, 41, 11, 43, 9, 8] leverage training-free editing techniques with specialized modules to enhance temporal consistency across frames, which provides a practical and efficient solution for editing videos without the need for extensive training. Specifically, layer-atlas-based methods [2, 21, 7] edit the video on a flattened texture map; however, the lack of a 3D motion prior in the 2D atlas leads to suboptimal performance. FateZero [29] solves this problem with a two-branch inflated image diffusion model that merges attention features of the structural preservation branch and editing branch. Similarly, Text2Video-Zero [22] and Pix2Video [6] align the features of the source image and target image via an attention operation. To enhance pixel-level temporal consistency, Rerender A Video [41], TokenFlow [11], and Flatten [9] extract temporal-aware inter-frame features to propagate the edits throughout the video. However, previous zero-shot methods that relied on flattened image diffusion were limited by the need for DDIM inversion or additional conditional constraints (e.g., optical flow), resulting in a long runtime. In contrast, our proposed FastVideoEdit directly incorporates editing into the inference process by leveraging consistency models [36], which ensures both runtime efficiency and effective modifications.

2.2 Efficient Diffusion Models

To tackle the computational time limitations of diffusion models caused by the sequential sampling strategy, faster numerical ODE solvers [35, 42, 24] or distillation techniques [25, 32, 27, 44] have been employed. While these methods can be integrated into existing diffusion-based video editing techniques, they still face the challenge of requiring DDIM inversion or additional conditional constraints for essential content preservation.

Recently, the introduction of Consistency Models (CMs) [36, 40] has enabled faster generation by sampling along a trajectory map, thereby opening up exciting possibilities for more efficient video editing techniques. The few-step sampling strategy is particularly suitable for efficient video editing with a fast sampling speed and strong reconstruction ability. FastVideoEdit leverages the self-consistency characteristic of CMs, where the improved essential content preservation ability eliminates the need for accurate DDIM inversion and additional conditional constraints. Concurrent to our approach, OCD [20] separates diffusion sampling for edited objects and background areas, focusing most denoising steps on the former to enhance efficiency. FastVideoEdit can be directly combined with OCD to further enhance the overall efficiency of video editing.

3 Preliminaries

Diffusion models include a forward process that adds Gaussian noise $\epsilon$ to convert clean sample $z_0$ to noise sample $z_T$, and a backward process that iteratively performs denoising from $z_T$ to $z_0$, where $T$ represents the total number of timesteps. The denoising process of DDPM sampling [15] at step $t$ can be formulated as:

$$
\begin{aligned}
z_{t-1} ={} & \sqrt{\alpha_{t-1}} \left( \frac{z_t - \sqrt{1-\alpha_t}\,\varepsilon_\theta(z_t, t)}{\sqrt{\alpha_t}} \right) && \text{(predicted } z_0\text{)} \\
& + \sqrt{1-\alpha_{t-1}-\sigma_t^2} \cdot \varepsilon_\theta(z_t, t) && \text{(direction to } z_t\text{)} \\
& + \sigma_t \varepsilon_t \quad \text{where } \varepsilon_t \sim \mathcal{N}(\bm{0}, \bm{I}) && \text{(random noise).}
\end{aligned}
\tag{1}
$$

By setting $\sigma_t$ to zero, DDIM sampling [35] results in an implicit probabilistic model with a deterministic forward process:

$$
\bar{z}_0 = f_\theta(z_t, t) = \left( z_t - \sqrt{1-\alpha_t} \cdot \varepsilon_\theta(z_t, t) \right) / \sqrt{\alpha_t}. \tag{2}
$$

Following DDIM, we can use the function $f_\theta$ to predict and reconstruct $\bar{z}_0$ given noise sample $z_t$, where $t \sim [1, T]$, $\alpha$ is the hyper-parameter, $\varepsilon_\theta$ is a learnable network, and $T$ represents the total number of timesteps.
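
As a minimal sketch of Eqs. 1 and 2, the snippet below implements one reverse step on plain Python lists that stand in for latent tensors. The function names, the toy values of $\alpha_t$ and $\alpha_{t-1}$, and the use of the true noise in place of a trained $\varepsilon_\theta$ are illustrative assumptions, not part of the method.

```python
import math
import random

def predict_z0(z_t, eps_pred, alpha_t):
    """Eq. 2: predicted clean sample from a noisy sample and a noise estimate."""
    return [(z - math.sqrt(1.0 - alpha_t) * e) / math.sqrt(alpha_t)
            for z, e in zip(z_t, eps_pred)]

def denoise_step(z_t, eps_pred, alpha_t, alpha_prev, sigma_t, rng):
    """Eq. 1: one reverse step; sigma_t = 0 gives the deterministic DDIM update."""
    z0_pred = predict_z0(z_t, eps_pred, alpha_t)
    # Coefficient of the "direction to z_t" term; clamp tiny negative round-off.
    direction = math.sqrt(max(1.0 - alpha_prev - sigma_t ** 2, 0.0))
    return [math.sqrt(alpha_prev) * z0 + direction * e + sigma_t * rng.gauss(0.0, 1.0)
            for z0, e in zip(z0_pred, eps_pred)]

# Toy check (assumed values): the true noise plays the role of eps_theta.
rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in z0]
alpha_t, alpha_prev = 0.6, 0.8
z_t = [math.sqrt(alpha_t) * x + math.sqrt(1.0 - alpha_t) * e for x, e in zip(z0, eps)]

z0_rec = predict_z0(z_t, eps, alpha_t)                         # Eq. 2
z_prev = denoise_step(z_t, eps, alpha_t, alpha_prev, 0.0, rng)  # Eq. 1, DDIM case
```

With a perfect noise estimate, `z0_rec` matches `z0` up to floating-point error, and the DDIM step lands on the sample at noise level $\alpha_{t-1}$ that shares the same noise.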

Sampling in CMs [36] is carried out through a sequence of timesteps $\tau_{1:n} \in [t_0, T]$. Starting from an initial noise $\hat{z}_T$ and $z_0^{(T)} = f_\theta(\hat{z}_T, T)$, at each time-step $\tau_i$, the process samples $\varepsilon \sim \mathcal{N}(\bm{0}, \bm{I})$ and iteratively updates the Multistep Consistency Sampling process through the following equation:

$$
\begin{aligned}
\hat{z}_{\tau_i} &= z_0^{(\tau_{i+1})} + \sqrt{\tau_i^2 - t_0^2}\,\varepsilon \\
z_0^{(\tau_i)} &= f_\theta(\hat{z}_{\tau_i}, \tau_i).
\end{aligned}
\tag{3}
$$
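
The loop in Eq. 3 can be sketched as follows. This is an illustration under stated assumptions: `toy_consistency_fn` is a hypothetical stand-in for a trained $f_\theta$ (it is the exact consistency function only for a data distribution concentrated on the single point `MU`, with noise level equal to $t$), and the values of $t_0$, $T$, and the timestep list are chosen for illustration.

```python
import math
import random

T0 = 0.002             # smallest time t_0 (illustrative value)
T_MAX = 80.0           # largest time T (illustrative value)
MU = [1.0, -2.0, 0.5]  # toy data distribution: a single point

def toy_consistency_fn(z, t):
    """Maps z at time t to the time-T0 point of its trajectory when all data
    sits at MU; it reduces to the identity at t = T0."""
    return [m + (T0 / t) * (x - m) for x, m in zip(z, MU)]

def multistep_consistency_sampling(f, z_T, t_max, taus, t0, rng):
    """Eq. 3: alternate re-noising to level tau and mapping back with f.
    `taus` lists the timesteps from most to least noisy."""
    z0 = f(z_T, t_max)                      # initial estimate z_0^{(T)}
    for tau in taus:
        scale = math.sqrt(tau ** 2 - t0 ** 2)
        z_hat = [x + scale * rng.gauss(0.0, 1.0) for x in z0]
        z0 = f(z_hat, tau)                  # refined estimate z_0^{(tau)}
    return z0

rng = random.Random(0)
z_T = [T_MAX * rng.gauss(0.0, 1.0) for _ in MU]
sample = multistep_consistency_sampling(
    toy_consistency_fn, z_T, T_MAX, [40.0, 10.0, 1.0], T0, rng)
```

In this toy setting every estimate stays within a few multiples of $t_0$ of `MU`, which reflects the property the method relies on: each call to the consistency function returns a clean estimate in a single evaluation.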

When combined with a condition $c$ with classifier-free guidance [16], sampling in CMs at $\tau_i$ starts with $\varepsilon \sim \mathcal{N}(\bm{0}, \bm{I})$ and updates through:

$$
\begin{aligned}
\hat{z}_{\tau_i} &= \sqrt{\alpha_{\tau_i}}\, z_0^{(\tau_{i+1})} + \sigma_{\tau_i} \varepsilon, \\
z_0^{(\tau_i)} &= f_\theta(\hat{z}_{\tau_i}, \tau_i, c).
\end{aligned}
\tag{4}
$$

Consider a special case of Eq. 1 where $\sigma_t$ is chosen as $\sqrt{1-\alpha_{t-1}}$ at all times $t$. Then the DDPM forward process naturally aligns with the Multistep Consistency Sampling, and the second term of Eq. 1 vanishes:

$$
\begin{aligned}
z_{t-1} ={} & \sqrt{\alpha_{t-1}} \left( \frac{z_t - \sqrt{1-\alpha_t}\,\varepsilon_\theta(z_t, t)}{\sqrt{\alpha_t}} \right) && \text{(predicted } z_0\text{)} \\
& + \sqrt{1-\alpha_{t-1}}\,\varepsilon_t \quad \varepsilon_t \sim \mathcal{N}(\bm{0}, \bm{I}) && \text{(random noise).}
\end{aligned}
\tag{5}
$$
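
The reduction from Eq. 1 to Eq. 5 can be checked numerically. The sketch below uses scalar latents and an explicit noise draw so that both forms receive identical inputs; the function names and numeric values are illustrative assumptions.

```python
import math

def step_eq1(z_t, eps_pred, noise, alpha_t, alpha_prev, sigma_t):
    """Eq. 1 for a scalar latent, with the random noise passed in explicitly."""
    z0_pred = (z_t - math.sqrt(1.0 - alpha_t) * eps_pred) / math.sqrt(alpha_t)
    # Direction-term coefficient; clamp tiny negative round-off before the sqrt.
    direction = math.sqrt(max(1.0 - alpha_prev - sigma_t ** 2, 0.0))
    return math.sqrt(alpha_prev) * z0_pred + direction * eps_pred + sigma_t * noise

def step_eq5(z_t, eps_pred, noise, alpha_t, alpha_prev):
    """Eq. 5: predicted z_0 plus fresh noise, with no direction term."""
    z0_pred = (z_t - math.sqrt(1.0 - alpha_t) * eps_pred) / math.sqrt(alpha_t)
    return math.sqrt(alpha_prev) * z0_pred + math.sqrt(1.0 - alpha_prev) * noise

alpha_t, alpha_prev = 0.5, 0.7
sigma_t = math.sqrt(1.0 - alpha_prev)  # the special variance schedule
full = step_eq1(1.3, -0.4, 0.25, alpha_t, alpha_prev, sigma_t)
reduced = step_eq5(1.3, -0.4, 0.25, alpha_t, alpha_prev)
gap = abs(full - reduced)  # zero up to floating-point round-off
```

Since $1-\alpha_{t-1}-\sigma_t^2 = 0$ under this schedule, the network's noise estimate influences $z_{t-1}$ only through the predicted $z_0$.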

Consider $f(z_t, t; z_0) = \left( z_t - \sqrt{1-\alpha_t}\,\varepsilon'(z_t, t; z_0) \right) / \sqrt{\alpha_t}$, where the initial $z_0$ is available and we replace the parameterized noise predictor $\varepsilon_\theta$ with $\varepsilon'$ more generally. Eq. 5 turns into the following expression:

$$
z_{t-1} = \sqrt{\alpha_{t-1}}\, f(z_t, t; z_0) + \sqrt{1-\alpha_{t-1}}\,\varepsilon_t \tag{6}
$$

which is in the same form as the Multistep Consistency Sampling step in Eq. 4.

In order to make $f(z_t, t)$ self-consistent so that it can be considered as a consistency function, i.e., $f(z_t, t; z_0) = z_0$, we can directly solve the equation and $\varepsilon'$ can be computed without parameterization:

$$
\varepsilon^{\text{cons}} = \varepsilon'(z_t, t; z_0) = \frac{z_t - \sqrt{\alpha_t}\, z_0}{\sqrt{1-\alpha_t}}. \tag{7}
$$

We arrive at a non-Markovian forward process, in which $z_t$ directly points to the ground truth $z_0$ without neural prediction, and $z_{t-1}$ does not depend on the previous step $z_t$ like a consistency model.
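
A short numerical check of this self-consistency property, assuming toy list-valued latents and hypothetical function names: plugging $\varepsilon^{\text{cons}}$ from Eq. 7 into $f$ returns $z_0$ for any $z_t$, so the next latent from Eq. 6 does not depend on $z_t$.

```python
import math
import random

def eps_cons(z_t, z0, alpha_t):
    """Eq. 7: the noise that makes f(z_t, t; z_0) equal to z_0."""
    return [(z - math.sqrt(alpha_t) * x) / math.sqrt(1.0 - alpha_t)
            for z, x in zip(z_t, z0)]

def f_cons(z_t, z0, alpha_t):
    """f(z_t, t; z_0) with eps' set to eps_cons (no neural network involved)."""
    eps = eps_cons(z_t, z0, alpha_t)
    return [(z - math.sqrt(1.0 - alpha_t) * e) / math.sqrt(alpha_t)
            for z, e in zip(z_t, eps)]

rng = random.Random(0)
z0 = [0.2, -0.7, 1.5]
alpha_t = 0.35  # illustrative value

# Two unrelated noisy latents: f maps both back to the same z_0.
z_t_a = [rng.gauss(0.0, 1.0) for _ in z0]
z_t_b = [rng.gauss(0.0, 1.0) for _ in z0]
out_a = f_cons(z_t_a, z0, alpha_t)
out_b = f_cons(z_t_b, z0, alpha_t)
```

Both outputs agree with `z0` up to floating-point error, which is why, in Eq. 6, $z_{t-1}$ is determined by $z_0$ and the fresh noise $\varepsilon_t$ alone.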

4 Method

The task of video editing can be described as the following: Given an ordered set of $m$ source video frames $\mathcal{I}_{src} = \{I_{src}^{1}, I_{src}^{2}, \ldots, I_{src}^{m}\}$ and a source prompt $\mathcal{P}_{src}$ describing the source video, we aim to generate an edited video with temporally consistent frames $\mathcal{I}_{edit} = \{I_{edit}^{1}, I_{edit}^{2}, \ldots, I_{edit}^{m}\}$ according to a target prompt $\mathcal{P}_{tgt}$.

This paper introduces FastVideoEdit, an end-to-end video editing framework that edits video efficiently while producing high-quality and temporally consistent editing content. Notably, our method achieves better background preservation compared with existing methods when editing foreground object-level attributes. Unlike many existing methods that depend on additional estimations such as depth control, edge control, or optical flow, FastVideoEdit requires only the source video frames and prompts as input throughout the editing process.

4.1 Video Reconstruction with Consistency Model

To our knowledge, FastVideoEdit is the first method in video editing that eliminates the need for the DDIM inversion process while simultaneously performing a complete denoising process on individual video frames. To enable direct editing of the source video without the need for the inversion process, we leverage a consistency model inspired by InfEdit [40]. The key idea to reconstruct source latent is to start with randomly sampled reconstruction noise rather than randomly initialized noisy latents. Following the Multistep Consistency Sampling in Eq. 3, we sample a noise $\varepsilon_t^{\text{cons}}$ at each timestep $t$ and the noisy latent $z_t^{\text{src}}$ becomes directly tractable when $z_0^{\text{src}}$ is given in the editing problem.
Instead of denoising the randomly initialized noisy latent $z_T^{\text{src}}$, the whole trajectory of $\{z_t^{\text{src}}\}$ is obtained directly from the sampled noise trajectory $\{\varepsilon_t^{\text{cons}}\}$, and in the reverse direction each $\varepsilon_t^{\text{cons}}$ can be used to reconstruct $z_0^{\text{src}}$ given $z_t^{\text{src}}$. The mappings between $z_t^{\text{src}}$ and $\{\varepsilon_t^{\text{cons}}\}$ given $z_0^{\text{src}}$ are given by:

$$
\begin{aligned}
z_t^{\text{src}} &= \sqrt{\alpha_t}\, z_0^{\text{src}} + \sqrt{1-\alpha_t}\,\varepsilon_t^{\text{cons}} \\
\varepsilon_t^{\text{cons}} &= (z_t^{\text{src}} - \sqrt{\alpha_t}\, z_0^{\text{src}}) / \sqrt{1-\alpha_t},
\end{aligned}
\tag{8}
$$

where $\varepsilon_t^{\text{cons}} \sim \mathcal{N}(\bm{0}, \bm{I})$ is sampled independently at each timestep. As a result, the reconstructed latent $z_t = z_0$ is guaranteed at each timestep using Eq. (2).
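
The inversion-free reconstruction can be sketched as follows. The snippet is illustrative only: the latent is a short Python list, the $\alpha_t$ values are hypothetical rather than taken from any particular noise schedule, and the function names are not from the paper. It draws an independent $\varepsilon_t^{\text{cons}}$ per timestep, forms $z_t^{\text{src}}$ with the first line of Eq. 8, and recovers $z_0^{\text{src}}$ with Eq. 2.

```python
import math
import random

def sample_source_trajectory(z0_src, alphas, rng):
    """First line of Eq. 8: draw an independent eps_t^cons per timestep and
    form z_t^src directly from z_0^src, with no inversion pass."""
    trajectory = []
    for alpha_t in alphas:
        eps = [rng.gauss(0.0, 1.0) for _ in z0_src]
        z_t = [math.sqrt(alpha_t) * x + math.sqrt(1.0 - alpha_t) * e
               for x, e in zip(z0_src, eps)]
        trajectory.append((alpha_t, z_t, eps))
    return trajectory

def reconstruct_source(z_t, eps, alpha_t):
    """Eq. 2 evaluated with the sampled eps_t^cons instead of a network output."""
    return [(z - math.sqrt(1.0 - alpha_t) * e) / math.sqrt(alpha_t)
            for z, e in zip(z_t, eps)]

rng = random.Random(0)
z0_src = [0.3, -1.2, 0.8, 2.0]   # toy source latent
alphas = [0.05, 0.3, 0.6, 0.9]   # hypothetical alpha_t values, noisiest first
trajectory = sample_source_trajectory(z0_src, alphas, rng)

# Largest reconstruction error over all timesteps and latent entries.
max_error = max(
    abs(r - x)
    for alpha_t, z_t, eps in trajectory
    for r, x in zip(reconstruct_source(z_t, eps, alpha_t), z0_src)
)
```

The reconstruction error is at floating-point level for every timestep, so the cost of building the source trajectory is a few elementwise operations per step rather than a full DDIM inversion pass.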

Figure 2: Overview of FastVideoEdit. Our model directly denoises three branches of batch frames using three attention control methods: CF-Masa, Re-CA and Bg-Masa. The model uses batch consistency sampling (BCS) with LCMs to improve efficiency, background latent replacement to align editing content with source video and TokenFlow propagation to further improve temporal consistency.


4.2 Video Editing with Consistency Model

This section introduces the method to compute $z_0^{\text{edit}}$ given $z_0^{\text{src}}$. In addition to $z_t^{\text{src}}$ and $\varepsilon_t^{\text{cons}}$ obtained from Eq. (8), we need to predict the editing noise $\varepsilon_\theta(z_t^{\text{edit}}, t, \mathcal{P}_{\text{tgt}})$ to generate the editing latent $z_0^{\text{edit}}$ according to the target prompt $\mathcal{P}_{\text{tgt}}$. Due to the self-consistency property of LCMs, the gap between $\varepsilon_\theta(z_t^{\text{edit}}, t, \mathcal{P}_{\text{tgt}})$ and $\varepsilon_t^{\text{edit}}$ is small. Therefore, using the noise calibration $\Delta\varepsilon_t^{\text{cons}}$ from $\varepsilon_\theta(z_t^{\text{src}}, t, \mathcal{P}_{\text{src}})$ to the ground-truth source reconstruction noise $\varepsilon_t^{\text{cons}}$, we can estimate the editing reconstruction noise as well as the editing latent $z_0^{\text{edit}}$ at each timestep $t$:

$$
\begin{aligned}
\Delta\varepsilon_t^{\text{cons}} &= \varepsilon_t^{\text{cons}} - \varepsilon_\theta(z_t^{\text{src}}, t, \mathcal{P}_{\text{src}}) \\
\varepsilon_t^{\text{edit}} &= \varepsilon_\theta(z_t^{\text{edit}}, t, \mathcal{P}_{\text{tgt}}) + \Delta\varepsilon_t^{\text{cons}} \\
z_0^{\text{edit}} &= \left(z_t^{\text{edit}} - \sqrt{1-\alpha_t} \cdot \varepsilon_t^{\text{edit}}\right) / \sqrt{\alpha_t}.
\end{aligned}
\tag{9}
$$

Compared with editing a single frame, we impose the constraint that the initial latent and the random noise sampled at each timestep are identical across all frames. Since the forward process of the denoising network $\varepsilon_\theta(\cdot,\cdot,\cdot)$, the noise calibration, and the latent update are all deterministic in their inputs, identical initial latents and noise samples at each timestep yield identical output latents whenever the source latents are also identical. In practice, if the source latents are temporally consistent and close to each other, the output latents should also maintain good temporal consistency.
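The calibration step and the shared-noise constraint can be illustrated with a toy example. This is a minimal sketch under stated assumptions: `toy_denoiser` is a hypothetical deterministic stand-in for $\varepsilon_\theta$, prompts are reduced to scalar shifts, and the timestep and $\alpha_t$ values are arbitrary.

```python
import numpy as np

def toy_denoiser(z_t, t, prompt_shift):
    # Hypothetical stand-in for eps_theta: any deterministic function works here.
    return np.tanh(z_t + prompt_shift) * (t / 1000.0 + 0.5)

def calibrated_edit_step(z0_src, z_t_src, z_t_edit, t, alpha_t, p_src, p_tgt):
    # Eq. (8): exact noise linking the source latent and its noisy version.
    eps_cons = (z_t_src - np.sqrt(alpha_t) * z0_src) / np.sqrt(1.0 - alpha_t)
    # Eq. (9), line 1: gap between the exact noise and the model prediction.
    delta = eps_cons - toy_denoiser(z_t_src, t, p_src)
    # Eq. (9), line 2: apply the same correction to the editing prediction.
    eps_edit = toy_denoiser(z_t_edit, t, p_tgt) + delta
    # Eq. (9), line 3: estimate the clean editing latent.
    return (z_t_edit - np.sqrt(1.0 - alpha_t) * eps_edit) / np.sqrt(alpha_t)

rng = np.random.default_rng(0)
frame = rng.standard_normal((4, 8, 8))
z0_batch = np.stack([frame, frame])              # two identical source frames
shared = rng.standard_normal(frame.shape)        # one noisy latent shared by all frames
z_t = np.broadcast_to(shared, z0_batch.shape)

same_prompt = calibrated_edit_step(z0_batch, z_t, z_t, 500, 0.3, 0.0, 0.0)
edited = calibrated_edit_step(z0_batch, z_t, z_t, 500, 0.3, 0.0, 0.7)
```

With identical prompts the calibration cancels the model error and `same_prompt` reproduces the source latents. With a different target prompt the two frames in `edited` remain identical to each other, which is the batch consistency property argued above.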

4.3 Batch Attention Control

As an end-to-end inference-based editing framework, FastVideoEdit starts by directly denoising the batched latent $\mathcal{Z}_t^{\text{edit}}$ according to the target prompt $\mathcal{P}_{\text{tgt}}$. A naive way of editing a frame latent $z_0^{\text{src}}$ with the target prompt is to iteratively denoise the DDIM inversion $z_T^{\text{inv}}$ of $z_0^{\text{src}}$ through $\varepsilon_\theta(z_t^{\text{inv}}, t, \mathcal{P}_{\text{tgt}})$. In Section 4.2, we introduced consistency model-based batch editing, which leverages the property of LCMs to skip the time-consuming DDIM inversion process and directly denoise a randomly initialized latent while keeping the content faithfully aligned with the source frames. However, without additional control, denoising conditioned on a target prompt $\mathcal{P}_{\text{tgt}}$ can still produce edited content distinct from the source content.

Inspired by MasaCtrl [5] and Prompt-to-prompt [13], we propose Cross-Frame Mutual Self-Attention (CF-Masa) and Re-weighted Cross Attention (Re-CA) to allow further attention control when denoising $z_t^{\text{edit}}$ conditioned on $\mathcal{P}_{\text{tgt}}$. Specifically, we concurrently denoise two batched latents $[\mathcal{Z}_t^{\text{src}}, \mathcal{Z}_t^{\text{edit}}]$ conditioned on $[\mathcal{P}_{\text{src}}, \mathcal{P}_{\text{tgt}}]$ respectively. The proposed CF-Masa and Re-CA can be applied directly in the forward process of $\varepsilon_\theta([\mathcal{Z}_t^{\text{src}}, \mathcal{Z}_t^{\text{edit}}], t, [\mathcal{P}_{\text{src}}, \mathcal{P}_{\text{tgt}}])$.

4.3.1 Cross-Frame Mutual Self-Attention

The denoising UNet consists of downsample/upsample blocks of different sizes and a middle block, spanning four resolution levels in the latent space. Each resolution level incorporates a 2D convolution layer followed by self-attention and cross-attention layers. The attention mechanism can be formulated as:

$$
\text{attn}(Q, K, V) = \text{softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right) V. \tag{10}
$$

In self-attention layers, $Q, K, V$ are the query, key, and value features obtained by projecting the same spatial features. Without attention control, the self-attention outputs of the source branch, $\text{attn}(Q^{\text{src}}, K^{\text{src}}, V^{\text{src}})$, and of the editing branch, $\text{attn}(Q^{\text{edit}}, K^{\text{edit}}, V^{\text{edit}})$, are computed concurrently and independently of each other. We make two changes to the self-attention layers to preserve both content consistency between the editing and source latents and temporal consistency within each of them. In contrast to MasaCtrl [5], content consistency in FastVideoEdit is preserved by replacing $Q^{\text{edit}}$ and $K^{\text{edit}}$ with $Q^{\text{src}}$ and $K^{\text{src}}$ at timesteps $t \geq t_s$, while the editing branch is left unchanged for $t < t_s$. To further maintain temporal consistency across batched latents within a branch, we concatenate the key features $[K_1, K_2, \ldots, K_m]$ and value features $[V_1, V_2, \ldots, V_m]$ along their sequence length dimension, so that the final formulation becomes:

$$
\text{CF-Masa}(\{Q_i^{\text{edit}}, K_i^{\text{edit}}, V_i^{\text{edit}}\}, t) :=
\begin{cases}
\{Q_i^{\text{src}}, \text{concat}\{K^{\text{src}}\}, \text{concat}\{V^{\text{edit}}\}\} & t \geq t_s \\
\{Q_i^{\text{edit}}, \text{concat}\{K^{\text{edit}}\}, \text{concat}\{V^{\text{edit}}\}\} & t < t_s
\end{cases}. \tag{11}
$$
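A minimal NumPy sketch of Eq. (11) is given below. It assumes single-head attention, uses standard scaled dot-product attention, and treats the tensor layout `(frames, tokens, dim)` and the function names as illustrative choices rather than the authors' implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cf_masa(q_src, k_src, q_edit, k_edit, v_edit, t, t_s):
    """Editing-branch self-attention following Eq. (11).

    All inputs have shape (frames, tokens, dim).
    """
    dim = q_edit.shape[-1]
    # For t >= t_s, queries and keys are borrowed from the source branch.
    q, k = (q_src, k_src) if t >= t_s else (q_edit, k_edit)
    # Concatenate keys and values of all frames along the sequence axis.
    k_all = k.reshape(-1, dim)
    v_all = v_edit.reshape(-1, dim)
    weights = softmax(q @ k_all.T / np.sqrt(dim))   # (frames, tokens, frames * tokens)
    return weights @ v_all                          # (frames, tokens, dim)

rng = np.random.default_rng(0)
q_src, k_src, q_edit, k_edit, v_edit = rng.standard_normal((5, 3, 4, 8))
out_early = cf_masa(q_src, k_src, q_edit, k_edit, v_edit, t=800, t_s=500)
out_late = cf_masa(q_src, k_src, q_edit, k_edit, v_edit, t=200, t_s=500)
```

Because every frame attends to the concatenated keys and values of the whole batch, each output token is a mixture of editing-branch values drawn from all frames, which is what couples the frames temporally.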

4.3.2 Re-weighted Cross Attention

The forward process of cross-attention can be edited in a similar way to self-attention. In cross-attention layers, $Q$ is the set of query features obtained by projecting the spatial features coming from the self-attention layer, while $K$ and $V$ are obtained from the prompt embeddings. By replacing the cross-attention map of the editing branch with that of the source branch [13], the way in which tokens shared with the source prompt attend to the source spatial features is carried over to the editing spatial features. To further enhance the effect of the editing token, the corresponding attention map of the editing token can be multiplied by a replace scale $r \geq 1$. The resulting formulation of the Re-weighted Cross Attention is given by:

$$
\begin{aligned}
\text{Refine}(A^{\text{src}}, A^{\text{edit}})_{i,j} &=
\begin{cases}
\left(A^{\text{edit}}\right)_{i,j} & \text{if}\ f_{\mathcal{P}}(j) = \text{None} \\
\left(A^{\text{src}}\right)_{i, f_{\mathcal{P}}(j)} & \text{otherwise}
\end{cases} \\
\text{Re-CA}(A^{\text{src}}, A^{\text{edit}}, t) &:=
\begin{cases}
r \cdot \text{Refine}(A^{\text{src}}, A^{\text{edit}}) & t \geq t_c \\
A^{\text{edit}} & t < t_c
\end{cases}
\end{aligned}
\tag{12}
$$

where $f_{\mathcal{P}}(\cdot)$ is the alignment function that returns the source prompt token index of the $j^{th}$ token in the target prompt, or None if that token is missing from the source prompt.
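The Refine and Re-CA operations can be sketched as follows. Two interpretive assumptions are made: tokens shared with the source prompt take their attention column from the source map, in line with the Prompt-to-prompt replacement described above, and the scale $r$ is applied to the whole refined map as written in Eq. (12). The alignment list, shapes, and prompts are hypothetical.

```python
import numpy as np

def refine(a_src, a_edit, alignment):
    """a_src: (pixels, n_src_tokens), a_edit: (pixels, n_tgt_tokens).

    alignment[j] is the source-token index of target token j, or None.
    """
    out = a_edit.copy()
    for j, src_j in enumerate(alignment):
        if src_j is not None:
            # Shared token: reuse the attention column of the source branch.
            out[:, j] = a_src[:, src_j]
    return out

def re_ca(a_src, a_edit, alignment, t, t_c, r=1.0):
    # Apply the refined, re-weighted map for t >= t_c, else keep the edit map.
    if t >= t_c:
        return r * refine(a_src, a_edit, alignment)
    return a_edit

# Toy prompts: "a cat on grass" -> "a dog on grass"; only token 1 is new.
alignment = [0, None, 2, 3]
rng = np.random.default_rng(0)
a_src = rng.random((16, 4))
a_edit = rng.random((16, 4))
refined = refine(a_src, a_edit, alignment)
controlled = re_ca(a_src, a_edit, alignment, t=800, t_c=500, r=2.0)
```

A variant that multiplies only the columns of newly introduced tokens by $r$ would follow the prose description more closely; which of the two the original implementation uses is not stated here.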

4.4 Background Preservation via Latent Replacement

Existing video editing methods face a trade-off between the editing effect on foreground objects and content preservation in the background. Changing the attributes of a foreground object usually also shifts the background to match the change. This is because the control methods applied in the forward process do not strictly constrain the latent space, so changed tokens in the target prompt also influence irrelevant regions of the editing latent through the attention mechanism. Compared with state-of-the-art video editing methods, a significant advantage of FastVideoEdit is the accuracy of its foreground editing, as shown by both the quantitative and qualitative results in Sec. 5. Several design choices contribute to this: the consistent initial latents and noise in the Batch Consistency Sampling algorithm and the attention control both keep the edit faithful to the source video.

In addition, we propose further background preservation strategies to enhance the faithfulness of the edited content to the source content. Specifically, we propose to simultaneously denoise a background branch that maintains the structure information of the editing branch while aligning its content with the source branch. Based on the background branch, we additionally propose a latent replacement algorithm that replaces the background part of the editing latent with the corresponding part of the background latent.

4.4.1 Background Branch

By simultaneously denoising a background branch conditioned on $\mathcal{P}_{\text{src}}$ and imposing self-attention control from the source branch and the editing branch, we expect the background branch to maintain the structure of the editing branch and the content of the source branch. We modify the self-attention process of the background branch as follows:

$$
\text{Bg-Masa}(\{Q_i^{\text{bg}}, K_i^{\text{bg}}, V_i^{\text{bg}}\}, t) :=
\begin{cases}
\{Q_i^{\text{src}}, \text{concat}\{K^{\text{src}}\}, \text{concat}\{V^{\text{src}}\}\} & t \geq t_{bg} \\
\{Q_i^{\text{edit}}, \text{concat}\{K^{\text{src}}\}, \text{concat}\{V^{\text{src}}\}\} & t < t_{bg}
\end{cases}. \tag{13}
$$

To maintain the editing structure and the source content, we employ an editing approach similar to MasaCtrl [5]: query features from the editing branch carry the structure information, while the key and value features are copied from the source branch to stay consistent with the source content. Note that the joint attention operates at early timesteps rather than at later timesteps as described in MasaCtrl [5], because we observe that structure is formed at early steps and content details are refined at later steps.
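The background-branch attention of Eq. (13) can be sketched in the same style. The single-head layout `(frames, tokens, dim)` and all names are assumptions made for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def bg_masa(q_src, q_edit, k_src, v_src, t, t_bg):
    """Background-branch self-attention following Eq. (13).

    All inputs have shape (frames, tokens, dim).
    """
    dim = q_src.shape[-1]
    # Queries switch between branches; keys and values always come from the source.
    q = q_src if t >= t_bg else q_edit
    k_all = k_src.reshape(-1, dim)   # concatenated over frames
    v_all = v_src.reshape(-1, dim)
    return softmax(q @ k_all.T / np.sqrt(dim)) @ v_all

rng = np.random.default_rng(0)
q_src, q_edit, k_src, v_src = rng.standard_normal((4, 3, 4, 8))
out_src_q = bg_masa(q_src, q_edit, k_src, v_src, t=800, t_bg=500)
out_edit_q = bg_masa(q_src, q_edit, k_src, v_src, t=200, t_bg=500)
```

Since the values always come from the source branch, every output feature is a convex combination of source values; the queries only decide where those values are placed, which matches the stated goal of keeping editing structure with source content.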

4.4.2 Latent Replacement

At the end of each denoising step, we employ the latent replacement operation to replace the background region of the editing latent with the corresponding region of the background latent. The region is determined from a cross-attention map. Specifically, given a cross-attention map $(A^{\text{edit}})_{m \times n}$, we obtain a replacement map $(M^{\text{edit}})_{m}$, where $m$ is the sequence length of the attention map (the size of the feature map) and $n$ is the number of tokens in $\mathcal{P}_{\text{tgt}}$. The replacement map is computed as follows:

$$
\begin{aligned}
(\hat{A}^{\text{edit}})_i &= \frac{\sum_j (A^{\text{edit}})_{i,j} \cdot \mathbf{I}_{f_{\mathcal{P}}(j) \neq \text{None}}}{\sum_j (A^{\text{edit}})_{i,j}} \\
(M^{\text{edit}})_i &= \mathbf{I}_{(\hat{A}^{\text{edit}})_i \geq \text{thresh}_{\text{edit}}}.
\end{aligned}
\tag{14}
$$

Intuitively, the replacement map is $1$ at positions where the edited tokens receive high attention scores among all the tokens, and $0$ everywhere else. In practice, $A^{\text{edit}}$ is obtained by averaging all the cross-attention maps of the same size at a fixed resolution level. The replaced edited latent at the end of denoising step $t$ is:

$$
z_t^{\text{edit}} = M_t^{\text{edit}} \odot z_t^{\text{edit}} + (1 - M_t^{\text{edit}}) \odot z_t^{\text{bg}}. \tag{15}
$$
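Eqs. (14) and (15) amount to a thresholded attention share followed by a masked blend. In the sketch below, the tokens selected by the indicator in Eq. (14) are passed as an explicit boolean vector instead of being derived from $f_{\mathcal{P}}$; that interface, the toy attention values, and the threshold are assumptions.

```python
import numpy as np

def replacement_mask(a_edit, selected_tokens, thresh):
    """a_edit: (m, n) cross-attention map; selected_tokens: boolean vector of length n."""
    # Eq. (14), line 1: share of attention each position gives to the selected tokens.
    share = (a_edit * selected_tokens).sum(axis=1) / a_edit.sum(axis=1)
    # Eq. (14), line 2: binarize with the threshold.
    return (share >= thresh).astype(a_edit.dtype)

def replace_background(z_edit, z_bg, mask):
    """z_edit, z_bg: (channels, h, w); mask: length h * w."""
    m = mask.reshape(z_edit.shape[-2:])
    # Eq. (15): keep the edit where the mask is 1, the background latent elsewhere.
    return m * z_edit + (1.0 - m) * z_bg

a_edit = np.array([[0.1, 0.8, 0.1],
                   [0.6, 0.2, 0.2],
                   [0.2, 0.7, 0.1],
                   [0.5, 0.1, 0.4]])          # 2x2 feature map, 3 prompt tokens
selected = np.array([False, True, False])     # token 1 marks the edited object
mask = replacement_mask(a_edit, selected, thresh=0.5)

z_edit = np.ones((1, 2, 2))
z_bg = np.zeros((1, 2, 2))
z_new = replace_background(z_edit, z_bg, mask)
```

In this toy case the first and third positions attend mostly to the selected token, so they keep the editing latent while the other two positions are overwritten by the background latent.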

4.5 Frame Consistency with TokenFlow

Following [11], we apply TokenFlow to improve temporal consistency across frames. TokenFlow is a plug-and-play module that can be applied at each layer of the denoising network. Its idea is to first select and denoise a group of keyframes and then, when denoising each frame latent, replace the original spatial features with a weighted sum of the two most similar spatial features from the two adjacent keyframes. In the first stage, TokenFlow selects a group of keyframes with indices $\kappa$ and, in each layer at each step, stores $\mathbf{T}_{base} = \{\phi(z^i)\}_{i \in \kappa}$, where $\phi(\cdot)$ maps a latent $z^i$ to its spatial features. When computing the features of an arbitrary frame latent $z^i$, the method queries its two adjacent keyframe latents, with indices $i-$ and $i+$, and finds the closest feature index $\gamma^{i\pm}[p]$ for each of its features, indexed by $p$, as follows:

$$
\gamma^{i\pm}[p] = \operatorname*{arg\,min}_{q} \mathcal{D}\left(\phi(z^i)[p],\ \phi(z^{i\pm})[q]\right) \tag{16}
$$

where $\mathcal{D}$ denotes the cosine distance between two features. The output weighted spatial features of frame latent $z^i$ therefore become:

$$
\mathcal{F}_{\gamma}(\mathbf{T}_{base}, i, p) = w_i \cdot \phi(z^{i+})[\gamma^{i+}[p]] + (1 - w_i) \cdot \phi(z^{i-})[\gamma^{i-}[p]]. \tag{17}
$$

In practice, TokenFlow is a plug-and-play operation that can be applied after the self-attention layer. It replaces the original spatial features $\phi(z^i)$ of the frame latent with the weighted sum of features from the two adjacent keyframes, $\{\mathcal{F}_{\gamma}(\mathbf{T}_{base}, i, p)\}_p$.
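The nearest-neighbour lookup of Eq. (16) and the blend of Eq. (17) can be sketched as below. This follows the description in this section only; the feature shapes and function names are assumptions, and the weight $w_i$ is taken as an input because its computation is not specified here.

```python
import numpy as np

def nearest_index(feat, key_feat):
    """Eq. (16): for each row of feat (P, d), the closest row of key_feat (Q, d)."""
    a = feat / np.linalg.norm(feat, axis=1, keepdims=True)
    b = key_feat / np.linalg.norm(key_feat, axis=1, keepdims=True)
    cos_dist = 1.0 - a @ b.T        # (P, Q) cosine distances
    return cos_dist.argmin(axis=1)

def propagate(feat, key_prev, key_next, w):
    """Eq. (17): blend the nearest features of the two adjacent keyframes."""
    g_prev = nearest_index(feat, key_prev)
    g_next = nearest_index(feat, key_next)
    return w * key_next[g_next] + (1.0 - w) * key_prev[g_prev]

key_prev = np.eye(3)                 # toy keyframe features, one per row
key_next = 2.0 * key_prev            # same directions, different magnitude
feat = key_prev[[2, 0, 1]]           # current frame: a permutation of key_prev
idx = nearest_index(feat, key_prev)
blended = propagate(feat, key_prev, key_next, w=0.5)
```

Because cosine distance ignores magnitude, both keyframes return the same correspondence in this toy case, and the output is a per-token blend of the matched keyframe features rather than the frame's own features.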

The overall FastVideoEdit algorithm is shown in Algorithm 1 and Figure 2.

Algorithm 1 FastVideoEdit editing
1: For abbreviation, we denote $\mathcal{A} \sim \mathcal{P}$ as every element in $\mathcal{A}$ has the same value sampled from distribution $\mathcal{P}$.
2: Input:
3:     Latent Consistency Model $\varepsilon_{\theta}(\cdot,\cdot,\cdot)$
4:     Sequence of timesteps $\tau_{1} > \tau_{2} > \cdots > \tau_{N-1}$
5:     Batched source latents $\mathcal{Z}_{0}^{\text{src}} = \{z_{0}^{\text{src},(i)} \mid 1 \leq i \leq m\}$
6:     Source and target prompts $\mathcal{P}_{src}, \mathcal{P}_{tgt}$
7: Set batch attention control on $\varepsilon_{\theta}(\cdot,\cdot,\cdot)$
8: Set TokenFlow propagation on $\varepsilon_{\theta}(\cdot,\cdot,\cdot)$
9: Initial batched latents $\mathcal{Z}_{\tau_{1}}^{\text{src}} = \mathcal{Z}_{\tau_{1}}^{\text{edit}} = \mathcal{Z}_{\tau_{1}}^{\text{bg}} \sim \mathcal{N}(\bm{0}, \bm{I})$
10: Compute $\{\varepsilon^{\text{cons}}_{\tau_{1}}\}$ using Eq 8
11: for $n = 1$ to $N-1$ do
12:     Compute $\mathbf{T}_{\text{base}}^{\text{edit}}$ and $\mathbf{T}_{\text{base}}^{\text{edit}}$
13:     Denoise three branches $\{\varepsilon_{\theta}(\{z^{\text{src}}_{\tau_{n}}, z^{\text{edit}}_{\tau_{n}}, z^{\text{bg}}_{\tau_{n}}\}, \tau_{n}, \{\mathcal{P}_{src}, \mathcal{P}_{tgt}, \mathcal{P}_{src}\}); \mathbf{T}_{\text{base}}\}$
14:     Update $\mathcal{Z}^{\text{src}}_{\tau_{n+1}}$ using Eq 8
15:     Update $\mathcal{Z}^{\text{edit}}_{0}$ and $\mathcal{Z}^{\text{bg}}_{0}$ using Eq 9
16:     Sample new reconstruction noise $\{\varepsilon_{\tau_{n+1}}^{\text{cons}}\} \sim \mathcal{N}(\bm{0}, \bm{I})$
17:     Update $\mathcal{Z}^{\text{edit}}_{\tau_{n+1}}$ and $\mathcal{Z}^{\text{bg}}_{\tau_{n+1}}$ using Eq 8
18:     Replace latents $\mathcal{Z}^{\text{edit}}_{\tau_{n+1}}$ using Eq 14 and 15
19: end for
20: Output: $\mathcal{Z}_{0}^{\text{edit}}$
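The initialization steps of Algorithm 1 (sampling the initial latents and computing the consistency noise) can be illustrated with a toy example. Eq. 8 and Eq. 9 are not restated in this section, so the sketch assumes the common DDPM-style relation $z_{\tau} = \sqrt{\bar{\alpha}_{\tau}}\, z_{0} + \sqrt{1-\bar{\alpha}_{\tau}}\, \varepsilon$ and its rearrangement for $z_{0}$. The function names and the value of `alpha_bar` are hypothetical. Under that assumption, the sketch shows why no inversion is needed: solving the relation for the noise guarantees that the source branch recovers the source latent from any sampled $z_{\tau}$.

```python
import math
import random
from typing import List


def add_noise(z0: List[float], eps: List[float], alpha_bar: float) -> List[float]:
    """Assumed forward relation: z_tau = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * z + b * e for z, e in zip(z0, eps)]


def solve_noise(z0: List[float], z_tau: List[float], alpha_bar: float) -> List[float]:
    """Solve the assumed relation for eps, given z0 and z_tau."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(zt - a * z) / b for z, zt in zip(z0, z_tau)]


def predict_z0(z_tau: List[float], eps: List[float], alpha_bar: float) -> List[float]:
    """Rearranged relation: z0 = (z_tau - sqrt(1 - a) * eps) / sqrt(a)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(zt - b * e) / a for zt, e in zip(z_tau, eps)]


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bar = 0.3                      # hypothetical schedule value
    z0_src = [0.5, -1.0, 2.0]            # toy "source latent"
    # Sample the initial latent from N(0, I), as in the initialization step.
    z_tau = [rng.gauss(0.0, 1.0) for _ in z0_src]
    # Consistency noise: the noise that maps z0_src to the sampled z_tau.
    eps_cons = solve_noise(z0_src, z_tau, alpha_bar)
    # Denoising with eps_cons recovers the source latent (up to rounding).
    print(predict_z0(z_tau, eps_cons, alpha_bar))
```

In the full method the noise used for the editing and background branches also depends on the model predictions and on the attention control, which this toy example does not model.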

5 Experiments

In this section, we first introduce the evaluation benchmark and evaluation metrics used in our experiment in Sec. 5.1. Following that, we present a quantitative comparison of our methods in Sec. 5.2 and a qualitative comparison in Sec. 5.3.

5.1 Evaluation Benchmark and Metrics

Evaluation Dataset.

For the evaluation of video editing, we utilize the TGVE 2023 open-source dataset [39] as our benchmark. This dataset consists of 76 videos, each containing 32 frames with a resolution of 480x480 pixels.

Evaluation Metrics.

Following previous work [29, 11], we evaluate the temporal consistency of our approach using the CLIP similarity [30] among frames (‘Tem-Con’). Additionally, we measure frame-wise editing accuracy through two metrics: ‘Txt-Sim’, the CLIP similarity between the text and image embeddings, and ‘Clip-Acc’, the percentage of frames in which the edited image has a higher CLIP similarity to the target prompt than to the source prompt. Finally, to evaluate speed, we measure the time that FastVideoEdit and previous methods take to edit a 32-frame video, in both the inversion and forward processes.
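The three CLIP-based metrics can be sketched as follows, given precomputed embeddings. The sketch assumes that ‘Tem-Con’ averages the cosine similarity over consecutive frame pairs and that all scores are reported on a 0 to 100 scale; both choices, as well as the function names, are assumptions for illustration and may differ from the evaluation code used for the reported numbers. Embeddings would normally come from a CLIP image or text encoder; here they are plain lists of floats.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def tem_con(frames: List[Vector]) -> float:
    """Mean similarity between consecutive frame embeddings (assumed pairing)."""
    pairs = list(zip(frames[:-1], frames[1:]))
    return 100.0 * sum(cosine(a, b) for a, b in pairs) / len(pairs)


def txt_sim(frames: List[Vector], target_text: Vector) -> float:
    """Mean similarity between each edited frame and the target prompt."""
    return 100.0 * sum(cosine(f, target_text) for f in frames) / len(frames)


def clip_acc(frames: List[Vector], source_text: Vector, target_text: Vector) -> float:
    """Percentage of frames closer to the target prompt than to the source prompt."""
    hits = sum(1 for f in frames if cosine(f, target_text) > cosine(f, source_text))
    return 100.0 * hits / len(frames)


if __name__ == "__main__":
    # Toy 2-D embeddings (hypothetical values, not real CLIP outputs).
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    src, tgt = [0.0, 1.0], [1.0, 0.0]
    print(tem_con(frames), txt_sim(frames, tgt), clip_acc(frames, src, tgt))
```

Note that ‘Clip-Acc’ is a per-frame binary decision, so it is more sensitive to a few poorly edited frames than the averaged ‘Txt-Sim’ score.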

5.2 Quantitative Comparison

In Tab. 1 we compare FastVideoEdit with two methods that incorporate additional conditional constraints, Rerender [41] and Text2Video-Zero [22], as well as three dual-branch methods: FateZero [29], Pix2Video [6], and TokenFlow [11].

The results show that FastVideoEdit achieves the best per-frame editing accuracy (Txt-Sim and Clip-Acc) and the second-best temporal consistency, behind Text2Video-Zero and on par with TokenFlow, while significantly reducing the time required for editing. Our method is more efficient than both the methods that incorporate additional conditional constraints and the dual-branch methods. The reduction in runtime originates from two aspects: the elimination of inversion and additional condition feature extraction, and the use of fewer sampling steps.

Table 1: Comparison of FastVideoEdit with previous video editing methods. Bold indicates the best. Underline indicates the second best. Tem-Con, Txt-Sim, and Clip-Acc are CLIP metrics (higher is better); Inversion, Forward, and Sum are times (lower is better).

| Model | Tem-Con ↑ | Txt-Sim ↑ | Clip-Acc ↑ | Inversion ↓ | Forward ↓ | Sum ↓ |
|---|---|---|---|---|---|---|
| Rerender [41] | 95.7 | 25.0 | 48.5 | - | 174.3 | 174.3 |
| Text2Video-Zero [22] | 96.9 | 27.1 | 70.7 | - | 131.0 | 131.0 |
| FateZero [29] | 95.7 | 24.9 | 35.8 | 233.7 | 347.0 | 581.7 |
| Pix2Video [6] | 96.0 | 27.5 | 68.5 | 185.3 | 213.0 | 399.3 |
| TokenFlow [11] | 96.5 | 25.5 | 54.7 | 176.5 | 115.9 | 292.4 |
| Ours | 96.5 | 27.7 | 71.1 | - | 61.7 | 61.7 |
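To make the runtime gap concrete, the total times reported in Table 1 can be turned into speed-up factors. The short script below is only a worked calculation on the reported "Sum" values; the dictionary and function names are introduced here for illustration.

```python
# Total editing times copied from the "Sum" column of Table 1.
total_time = {
    "Rerender [41]": 174.3,
    "Text2Video-Zero [22]": 131.0,
    "FateZero [29]": 581.7,
    "Pix2Video [6]": 399.3,
    "TokenFlow [11]": 292.4,
    "Ours": 61.7,
}


def speedup_over(baseline: str, ours: str = "Ours") -> float:
    """Ratio of a baseline's total time to the total time of our method."""
    return total_time[baseline] / total_time[ours]


if __name__ == "__main__":
    for name in total_time:
        if name != "Ours":
            print(f"{name}: {speedup_over(name):.2f}x")
```

By this calculation FastVideoEdit is about 2.1 times faster than the fastest baseline, Text2Video-Zero, and about 4.7 times faster than TokenFlow.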

5.3 Qualitative Comparison

A qualitative comparison of FastVideoEdit with previous video editing methods is shown in Fig. 3. We compare with two methods that incorporate additional conditional constraints, Rerender [41] and Text2Video-Zero [22], as well as three dual-branch methods: FateZero [29], Pix2Video [6], and TokenFlow [11].

Figure 3: Qualitative comparison of FastVideoEdit with previous video editing methods. The top row displays the source video, while the following rows show videos edited by previous methods and by FastVideoEdit. The source and target text prompts are shown at the top, with the edited words highlighted in red.

The results show that FastVideoEdit performs video editing aligned with the text prompt while preserving the essential content of the source video. Through attention control, latent replacement, and the preservation ability of the consistency model, FastVideoEdit edits the foreground elements of the video while keeping the background intact. By selectively focusing on the regions of interest and applying latent replacement, it produces accurate and consistent edits that maintain the integrity of the background content. It is worth noting that FastVideoEdit achieves these results while requiring significantly less time than the other methods.

5.4 Ablation Study

Figure 4: Illustration of ablation on model architecture.

We ablate the use of Bg-Masa, CF-Masa, Re-CA, and TokenFlow propagation. Quantitative and qualitative results are shown in Tab. 2 and Fig. 4, respectively. Without background preservation, the dirt in the background is changed. Removing CF-Masa and TokenFlow results in worse temporal consistency. Moreover, replacing our attention control with PnP results in a worse editing effect (see the left rabbit’s ears and the right rabbit’s tail).

Tab. 2 shows that without latent replacement the temporal consistency and CLIP accuracy metrics rise, which indicates that latent replacement protects the background but does not improve either temporal consistency or CLIP accuracy. The improvement in background preservation is evident in the qualitative results but is not reflected in the CLIP metrics. Imposing background preservation prevents the background from adapting to the editing prompt, which is penalized by CLIP-based similarity evaluation, although its negative impact on content editing is hardly visible on inspection. Apart from the background preservation design, our proposed attention controls match or improve all three metrics relative to their ablated variants, which shows the effectiveness of our proposed methods.

Table 2: Ablation study for architecture design of FastVideoEdit. Bold indicates the best. Underline indicates the second best. All columns are CLIP metrics (higher is better).

| Model | Tem-Con ↑ | Txt-Sim ↑ | Clip-Acc ↑ |
|---|---|---|---|
| Ours | 96.5 | 27.7 | 71.1 |
| w/o Bg-Masa | 96.7 | 27.5 | 72.3 |
| w/o CF-Masa | 96.3 | 26.7 | 69.3 |
| w/ PnP | 96.5 | 25.8 | 60.0 |

6 Conclusion

Conclusion.

In this work, we have introduced FastVideoEdit, a zero-shot video editing approach that addresses the computational challenges faced by previous methods. By leveraging the self-consistency property of Consistency Models (CMs), our method eliminates the need for time-consuming inversion or additional condition extraction steps. We have also introduced a novel approach for maintaining background preservation via latent replacement, which simultaneously denoises a background branch while imposing self-attention control from the source and editing branches. Our experimental results demonstrate the superior performance of FastVideoEdit in terms of editing quality while requiring significantly less time for video editing tasks.

Limitations and future work.

FastVideoEdit still has some limitations: (1) FastVideoEdit may require tuning its hyperparameters to achieve optimal performance on each video. This dependency on hyperparameter adjustment adds complexity to the editing process and may require expertise or extensive experimentation to achieve satisfactory results. (2) While FastVideoEdit demonstrates state-of-the-art performance in video editing, there is no guarantee of success for every editing case. The effectiveness of the approach may vary depending on factors such as input data quality, the complexity of the editing task, and the suitability of the chosen hyperparameters. (3) The performance of FastVideoEdit relies on the quality and capabilities of the underlying consistency models. We plan to address these limitations in future work.

Possible negative social impact.

Video editing approaches may pose privacy risks if used to alter videos without appropriate consent or to create content that invades someone’s privacy. Moreover, the convenience and speed offered by FastVideoEdit may inadvertently encourage irresponsible editing practices, leading to ethical dilemmas in areas such as journalism, entertainment, and personal communication. Addressing these possible negative social impacts and continuously improving our models are key focuses for our future releases.

References

  • [1] Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B., et al.: ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324 (2022)
  • [2] Bar-Tal, O., Ofri-Amar, D., Fridman, R., Kasten, Y., Dekel, T.: Text2live: Text-driven layered image and video editing. In: European conference on computer vision. pp. 707–723. Springer (2022)
  • [3] Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22563–22575 (2023)
  • [4] Brooks, T., Peebles, B., Holmes, C., DePue, W., Guo, Y., Jing, L., Schnurr, D., Taylor, J., Luhman, T., Luhman, E., Ng, C., Wang, R., Ramesh, A.: Video generation models as world simulators (2024), https://openai.com/research/video-generation-models-as-world-simulators
  • [5] Cao, M., Wang, X., Qi, Z., Shan, Y., Qie, X., Zheng, Y.: Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. arXiv preprint arXiv:2304.08465 (2023)
  • [6] Ceylan, D., Huang, C.H.P., Mitra, N.J.: Pix2video: Video editing using image diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23206–23217 (2023)
  • [7] Chai, W., Guo, X., Wang, G., Lu, Y.: Stablevideo: Text-driven consistency-aware diffusion video editing. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23040–23050 (2023)
  • [8] Chen, W., Wu, J., Xie, P., Wu, H., Li, J., Xia, X., Xiao, X., Lin, L.: Control-a-video: Controllable text-to-video generation with diffusion models. arXiv preprint arXiv:2305.13840 (2023)
  • [9] Cong, Y., Xu, M., Simon, C., Chen, S., Ren, J., Xie, Y., Perez-Rua, J.M., Rosenhahn, B., Xiang, T., He, S.: Flatten: optical flow-guided attention for consistent text-to-video editing. arXiv preprint arXiv:2310.05922 (2023)
  • [10] Esser, P., Chiu, J., Atighehchian, P., Granskog, J., Germanidis, A.: Structure and content-guided video synthesis with diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7346–7356 (2023)
  • [11] Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373 (2023)
  • [12] Gupta, A., Yu, L., Sohn, K., Gu, X., Hahn, M., Fei-Fei, L., Essa, I., Jiang, L., Lezama, J.: Photorealistic video generation with diffusion models. arXiv preprint arXiv:2312.06662 (2023)
  • [13] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)
  • [14] Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D.P., Poole, B., Norouzi, M., Fleet, D.J., et al.: Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303 (2022)
  • [15] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
  • [16] Ho, J., Salimans, T.: Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598 (2022)
  • [17] Ju, X., Zeng, A., Bian, Y., Liu, S., Xu, Q.: Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv preprint arXiv:2310.01506 (2023)
  • [18] Ju, X., Zeng, A., Wang, J., Xu, Q., Zhang, L.: Human-art: A versatile human-centric dataset bridging natural and artificial scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 618–629 (2023)
  • [19] Ju, X., Zeng, A., Zhao, C., Wang, J., Zhang, L., Xu, Q.: Humansd: A native skeleton-guided diffusion model for human image generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15988–15998 (2023)
  • [20] Kahatapitiya, K., Karjauv, A., Abati, D., Porikli, F., Asano, Y.M., Habibian, A.: Object-centric diffusion for efficient video editing. arXiv preprint arXiv:2401.05735 (2024)
  • [21] Kasten, Y., Ofri, D., Wang, O., Dekel, T.: Layered neural atlases for consistent video editing. ACM Transactions on Graphics (TOG) 40(6), 1–12 (2021)
  • [22] Khachatryan, L., Movsisyan, A., Tadevosyan, V., Henschel, R., Wang, Z., Navasardyan, S., Shi, H.: Text2video-zero: Text-to-image diffusion models are zero-shot video generators. arXiv preprint arXiv:2303.13439 (2023)
  • [23] Liu, S., Zhang, Y., Li, W., Lin, Z., Jia, J.: Video-p2p: Video editing with cross-attention control. arXiv preprint arXiv:2303.04761 (2023)
  • [24] Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems 35, 5775–5787 (2022)
  • [25] Luhman, E., Luhman, T.: Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388 (2021)
  • [26] Ma, Y., Cun, X., He, Y., Qi, C., Wang, X., Shan, Y., Li, X., Chen, Q.: Magicstick: Controllable video editing via control handle transformations. arXiv preprint arXiv:2312.03047 (2023)
  • [27] Meng, C., Rombach, R., Gao, R., Kingma, D., Ermon, S., Ho, J., Salimans, T.: On distillation of guided diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14297–14306 (2023)
  • [28] Molad, E., Horwitz, E., Valevski, D., Acha, A.R., Matias, Y., Pritch, Y., Leviathan, Y., Hoshen, Y.: Dreamix: Video diffusion models are general video editors. arXiv preprint arXiv:2302.01329 (2023)
  • [29] Qi, C., Cun, X., Zhang, Y., Lei, C., Wang, X., Shan, Y., Chen, Q.: Fatezero: Fusing attentions for zero-shot text-based video editing. arXiv preprint arXiv:2303.09535 (2023)
  • [30] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning (ICML). pp. 8748–8763. PMLR (2021)
  • [31] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  • [32] Salimans, T., Ho, J.: Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512 (2022)
  • [33] Shin, C., Kim, H., Lee, C.H., Lee, S.g., Yoon, S.: Edit-a-video: Single video editing with object-aware consistency. arXiv preprint arXiv:2303.07945 (2023)
  • [34] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, S., Hu, Q., Yang, H., Ashual, O., Gafni, O., et al.: Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792 (2022)
  • [35] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
  • [36] Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: Proceedings of the 40th International Conference on Machine Learning (2023)
  • [37] Wang, W., Jiang, Y., Xie, K., Liu, Z., Chen, H., Cao, Y., Wang, X., Shen, C.: Zero-shot video editing using off-the-shelf image diffusion models. arXiv preprint arXiv:2303.17599 (2023)
  • [38] Wu, J.Z., Ge, Y., Wang, X., Lei, S.W., Gu, Y., Shi, Y., Hsu, W., Shan, Y., Qie, X., Shou, M.Z.: Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7623–7633 (2023)
  • [39] Wu, J.Z., Li, X., Gao, D., Dong, Z., Bai, J., Singh, A., Xiang, X., Li, Y., Huang, Z., Sun, Y., et al.: Cvpr 2023 text guided video editing competition. arXiv preprint arXiv:2310.16003 (2023)
  • [40] Xu, S., Huang, Y., Pan, J., Ma, Z., Chai, J.: Inversion-free image editing with natural language (2024)
  • [41] Yang, S., Zhou, Y., Liu, Z., Loy, C.C.: Rerender a video: Zero-shot text-guided video-to-video translation. arXiv preprint arXiv:2306.07954 (2023)
  • [42] Zhang, Q., Chen, Y.: Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902 (2022)
  • [43] Zhang, Y., Wei, Y., Jiang, D., Zhang, X., Zuo, W., Tian, Q.: Controlvideo: Training-free controllable text-to-video generation. arXiv preprint arXiv:2305.13077 (2023)
  • [44] Zheng, H., Nie, W., Vahdat, A., Azizzadenesheli, K., Anandkumar, A.: Fast sampling of diffusion models via operator learning. In: International Conference on Machine Learning. pp. 42390–42402. PMLR (2023)