MAVIN: Multi-Action Video Generation with Diffusion Models via Transition Video Infilling

Bowen Zhang, Xiaofei Xie, Haotian Lu, Na Ma, Tianlin Li, Qing Guo
Singapore Management University, Singapore; Tsinghua University, China; Yonsei University, Korea; Nanyang Technological University, Singapore; IHPC and CFAR, Agency for Science, Technology and Research (A*STAR), Singapore. Corresponding authors: [email protected] and [email protected]
Abstract

Diffusion-based video generation has achieved significant progress, yet generating multiple actions that occur sequentially remains a formidable task. Directly generating a video with sequential actions can be extremely challenging due to the scarcity of fine-grained action annotations and the difficulty in establishing temporal semantic correspondences and maintaining long-term consistency. To tackle this, we propose an intuitive and straightforward solution: splicing multiple single-action video segments sequentially. The core challenge lies in generating smooth and natural transitions between these segments given the inherent complexity and variability of action transitions. We introduce MAVIN (Multi-Action Video INfilling model), designed to generate transition videos that seamlessly connect two given videos, forming a cohesive integrated sequence. MAVIN incorporates several innovative techniques to address challenges in the transition video infilling task. Firstly, a consecutive noising strategy coupled with variable-length sampling is employed to handle large infilling gaps and varied generation lengths. Secondly, boundary frame guidance (BFG) is proposed to address the lack of semantic guidance during transition generation. Lastly, a Gaussian filter mixer (GFM) dynamically manages noise initialization during inference, mitigating train-test discrepancy while preserving generation flexibility. Additionally, we introduce a new metric, CLIP-RS (CLIP Relative Smoothness), to evaluate temporal coherence and smoothness, complementing traditional quality-based metrics. Experimental results on horse and tiger scenarios demonstrate MAVIN’s superior performance in generating smooth and coherent video transitions compared to existing methods. Code will be available at https://github.com/18445864529/MAVIN.

1 Introduction

The evolution of video generation models has been significantly shaped by the advent of diffusion-based techniques, offering unprecedented fidelity and temporal coherence in video synthesis [1, 2, 3, 4, 5]. However, these models often struggle to generate videos that encompass multiple actions or adhere to complex instructions, and typically produce relatively short clips, limiting their use in scenarios requiring longer, multi-action sequences.

Generating multi-action videos directly presents numerous unresolved challenges. Firstly, the lack of fine-grained action-level annotations in existing large-scale video datasets hampers model training. Secondly, multi-action sequences, involving extended durations and significant motion ranges, challenge models to maintain spatiotemporal consistency throughout the video. The structural characteristics of video U-Net models further complicate complex temporal semantic correspondence modeling. To circumvent these challenges, in this paper, we propose an innovative approach for generating multi-action videos by integrating several single-action video clips. This process entails two fundamental steps: first, the production of various video clips featuring the same subject engaging in distinct actions; second, the concatenation of these clips through action transitions. While the first step has been facilitated by recent advancements in text-conditioned image-to-video (TI2V) generation [6, 7, 8, 9, 10, 11], the second step remains understudied. To this end, we introduce MAVIN (Multi-Action Video INfilling model), a transition model designed to infill an intermediate video clip between two adjoining clips, ensuring a fluid and seamless transition.

This task requires meticulous attention to overall motion consistency and smoothness. Therefore, MAVIN is trained with consistent conditioning on the reference videos. To manage the potential for substantial motion gaps and the requirement for flexible infilling lengths, we utilize a variable-length sampling strategy. The performance of MAVIN is further enhanced by boundary frame guidance (BFG) and a Gaussian filter mixer (GFM). BFG leverages high-level semantic features from the boundary frames of input videos to guide the video infilling process, ensuring visual coherence throughout the transition. Meanwhile, GFM dynamically manages the introduction and modulation of noise during inference, improving generation fidelity while maintaining flexibility. Our method is trained in a self-supervised manner, eliminating the need for finely annotated video-text transcriptions.

Moreover, existing metrics for evaluating video generation primarily focus on visual quality and often overlook temporal coherence, which is crucial for assessing action transitions. To address this gap, we introduce a new metric, CLIP-RS (CLIP Relative Smoothness), specifically designed to evaluate the temporal consistency and smoothness of transition videos. This metric complements traditional quality-based metrics and provides a comprehensive evaluation of our model’s performance. Experimental results conducted on two distinct animal scenarios—horses and tigers—demonstrate our method’s superior performance in generating smooth and natural video transitions over existing methods, both in qualitative and quantitative assessments.

2 Related Work

Text-to-Video Generation. Text-to-Video (T2V) studies have shifted their focus from GAN-based models [12, 13, 14, 15] and auto-regressive models [16, 17, 18, 19] to diffusion models [20, 21, 3, 22, 23, 24, 25, 2], attributed to their superiority in generation quality, training stability, and condition flexibility. Foundation T2V models such as ModelScopeT2V [26] and VideoCrafter [27, 28] are trained on large-scale captioned datasets, possessing rich motion priors and text-motion correspondences. Nevertheless, challenges persist in generating actions that fully adhere to complex text descriptions.

Image-to-Video Generation. Generating videos solely from text prompts leads to a high degree of randomness in the appearance of each generation, thereby limiting its range of applications. Image-to-Video (I2V) generation, on the other hand, animates a user-input image by leveraging the motion priors learned from video-only datasets [1, 29, 30, 31] and has demonstrated the ability to produce high-fidelity and aesthetically pleasing videos. However, I2V models often exhibit limitations in the form of minor and uncontrollable motion patterns. Considering these issues, many studies have begun to focus on text-conditioned image-to-video (TI2V) synthesis, which involves generating videos from a reference image, coupled with a text prompt indicating how the image should be animated. Videos generated in this manner typically use the provided image as the initial frame [7, 6, 32, 9, 33] or retain its appearance identity and characteristics [11, 10, 8], while performing the motion described in the text. There has also been a stream of works that further specialize in motion controllability by integrating extra controlling signals [34, 35, 36, 37].

Generative Video Interpolation. Diffusion models have also gained momentum in video interpolation, challenging traditional methods that rely on optical flow computation and frame blending techniques. MCVD [38] and RaMViD [39] adopt diffusion-based models with random frame masking, making them capable of handling a range of video generative modeling tasks, including video prediction and interpolation. LDMVFI [40] claims to be the first effort to solve video interpolation using latent diffusion models and has achieved superior perceptual quality compared to traditional models. However, these works are primarily centered on standard video frame interpolation tasks, where the motions are straightforward and less ambiguous. A concurrent work [41] delves into large and challenging motions by interpolating 7 frames with an approximate stride of 3. It utilizes a cascaded framework where the interpolation occurs at a low-resolution pixel level and is subsequently upsampled with a super-resolution model. However, because neither the model nor the data is openly available, further evaluation under our task scenario is infeasible. SEINE [42] explores diffusion-based scene transition, where the model can generate a smooth transition from the start image depicting one scene to the end image representing another.

3 Methodology

3.1 Preliminaries

Latent diffusion model (LDM) [43] is a diffusion model (DM) [44] variant that operates on the compressed latent space instead of the pixel space, and has exhibited strong efficacy in image generation. LDM first encodes an input image sample $x_0$ into a clean latent code $z_0 = \mathcal{E}(x_0)$ using a VAE [45, 46] encoder $\mathcal{E}(\cdot)$. The latent code then undergoes a forward diffusion process, where it is incrementally perturbed with Gaussian noise following a Markov chain

$$q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t I\right), \tag{1}$$

where $t \in \{1, \ldots, T\}$, and $T$ is the total number of forward diffusion steps. $\beta_t$ controls the noise strength at each step. By defining $\bar{\alpha}_t := \prod_{i=1}^{t}(1-\beta_i)$, this formula can be simplified as

$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I). \tag{2}$$
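The step from Eq. 1 to Eq. 2 can be made explicit with a short derivation. The shorthand $\alpha_t := 1 - \beta_t$ and the merged noise term $\bar{\epsilon}$ are introduced here for illustration only; the argument relies on the standard fact that a sum of independent zero-mean Gaussians is Gaussian with the summed variance.

```latex
\begin{aligned}
z_t &= \sqrt{\alpha_t}\, z_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, z_{t-2}
       + \sqrt{\alpha_t (1-\alpha_{t-1})}\,\epsilon_{t-1}
       + \sqrt{1-\alpha_t}\,\epsilon_t \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, z_{t-2}
       + \sqrt{1-\alpha_t \alpha_{t-1}}\,\bar{\epsilon},
       \qquad \bar{\epsilon} \sim \mathcal{N}(0, I) \\
    &\;\;\vdots \\
    &= \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,
       \qquad \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i .
\end{aligned}
```

The third line uses $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1 - \alpha_t\alpha_{t-1}$; repeating the same merge down to $z_0$ gives Eq. 2.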

A U-Net [47] model parameterized with $\theta$ works as a noise prediction function $\epsilon_\theta(\cdot)$ to predict the added noise $\epsilon$ given the time step $t$ and condition $c$ (e.g., a text prompt). The training objective can be formulated as

$$\arg\min_{\theta}\ \mathbb{E}_{z_0, \epsilon, t, c}\, \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|_2^2. \tag{3}$$
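As a rough illustration of Eqs. 2 and 3, the following NumPy sketch draws $z_t$ in closed form and evaluates the squared-error objective for one sample. The linear $\beta$ schedule, its endpoints, the tensor shape, and all function names are assumptions made for this example; the text above does not specify them, and no network is involved.

```python
import numpy as np

def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_i) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alpha_bar, eps):
    """Eq. 2: z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps, t in {1..T}."""
    a = alpha_bar[t - 1]  # alpha_bar is 0-indexed, t is 1-indexed
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def noise_prediction_loss(eps, eps_pred):
    """Eq. 3 for a single sample: squared L2 distance between true and predicted noise."""
    return float(np.sum((eps - eps_pred) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(num_steps=1000)
z0 = rng.standard_normal((4, 8, 8))      # hypothetical latent of shape (c, h, w)
eps = rng.standard_normal(z0.shape)
zt = q_sample(z0, t=500, alpha_bar=alpha_bar, eps=eps)
```

In training, `eps_pred` would come from the U-Net $\epsilon_\theta(z_t, t, c)$; here the loss function is only shown in isolation.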

Video latent diffusion model (VLDM) [48, 5, 49, 50] inflates the U-Net model into a 3D architecture by inserting temporal modules, making it capable of handling video data. Given an encoded video latent representation $z \in \mathbb{R}^{n \times h \times w \times c}$, where $n$ is the number of frames in the video, $h$ and $w$ denote the height and width of the latent code, and $c$ is the dimension of the latent space, the model performs spatial operations over the $h \times w$ space and temporal operations along the $n$ axis. The spatiotemporal structure empowers the model to manage spatial and temporal dependencies in a coordinated manner, facilitating the generation of coherent and high-quality video sequences.
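One common way to realize this split, shown here only as a sketch, is to reshape the same latent so that spatial layers see each frame as a sequence of $h \times w$ tokens while temporal layers see each spatial location as a sequence of $n$ frames. The function names and the concrete sizes are hypothetical; the actual layers (convolution, attention) are omitted.

```python
import numpy as np

def spatial_view(z):
    """(n, h, w, c) -> (n, h*w, c): frames act as the batch, spatial positions as tokens."""
    n, h, w, c = z.shape
    return z.reshape(n, h * w, c)

def temporal_view(z):
    """(n, h, w, c) -> (h*w, n, c): spatial positions act as the batch, frames as tokens."""
    n, h, w, c = z.shape
    return z.reshape(n, h * w, c).transpose(1, 0, 2)

# Hypothetical sizes: 6 frames, 4x5 latent grid, 8 channels.
z = np.arange(6 * 4 * 5 * 8, dtype=np.float32).reshape(6, 4, 5, 8)
print(spatial_view(z).shape)   # shape seen by spatial modules
print(temporal_view(z).shape)  # shape seen by temporal modules
```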

3.2 Problem Formulation and Challenges

Problem formulation. The proposed transition video infilling task is a specialized form of video interpolation that deals with long ranges and large motions, with the input being two videos. The objective of this task is to generate a transition video given two videos, one preceding and one following, thereby seamlessly connecting the two. Given an encoded preceding video latent $z_0^{\mathcal{P}} = \{z_0^0, \ldots, z_0^s\}$ and a following video latent $z_0^{\mathcal{F}} = \{z_0^e, \ldots, z_0^{L-1}\}$, the model aims to generate an intermediate latent $z_0^{\mathcal{I}} = \{z_0^{s+1}, \ldots, z_0^{e-1}\}$, where $s$ is the end frame index of the preceding video and $e$ is the start frame index of the following video. We term these two frames the boundary frames for simplicity and clarity. $L$ is the length of the integrated video after infilling.

Challenges and remedies. The novel nature of this task presents new challenges. In this section, we briefly outline the challenges and our solutions, with a detailed elaboration to follow in the next section.

The first challenge lies in temporal dependency modeling, which should support generating a transition video with potentially large motion gaps while maintaining motion consistency. Existing works [42, 39] typically adopt a BERT-like masking strategy for conditional modeling. However, such approaches are not effective for learning long-span motion patterns as prediction targets and references appear alternately on the temporal axis. To address this issue, we propose to consistently apply noise to a consecutive subsequence of training data. This method allows for natural conditioning on reference videos in temporal modules while creating large motion gaps in the middle, enforcing the capture of long-term temporal dependencies. Nevertheless, denoising only the middle part of training data can restrict data utilization and model robustness. To overcome this, we implement a variable-length sampling strategy to optimize data usage and simultaneously enhance the flexibility in generation length.

Furthermore, in text-conditioned generations, text prompts guide the generation direction in spatial modules, where each frame is processed independently, and the temporal modules align them, as vividly illustrated in [48]. However, in this task, the model operates in a self-supervised fashion and there is no text providing content or semantic guidance to the model. Without such guidance, spatial modules can generate incoherent images, placing extreme burdens on temporal modules to align them. We propose boundary frame guidance (BFG) in spatial modules to mitigate this issue.

Lastly, as revealed in [51, 52], a train-test noise initialization discrepancy hinders VLDM from generating high-quality videos. While previous solutions in I2V generation [31, 6] generally involve a shared noise strategy, it is not optimal for this task because the preserved condition frame signal throughout the generation sequence can limit motion range, discouraging synthesis of distinct transition states. To better serve the transition video infilling scenario, we propose a Gaussian filter mixer (GFM) module to balance initialization discrepancy and generation flexibility.

3.3 Model Architecture

Figure 1: Model architecture. The input sequence is divided into three clips using variable-length sampling. Noise is added exclusively to the latent of the intermediate clip, with length embedded as extra information. Boundary frames are encoded with a CLIP vision encoder as content guidance for spatial transformers. During inference, a Gaussian filter mixer (GFM) is used for noise initialization.

The overall model architecture is depicted in Figure 1. In the model training stage, we simulate transition video infilling by dividing a training video into three segments. An entire video sample is first encoded into $z_0 = \{z_0^0, \ldots, z_0^{L-1}\}$, and only the intermediate clip is corrupted by $t$-step Gaussian noise according to Eq. 2, resulting in $z_t^{\mathcal{I}}$. The input to the U-Net model hence becomes $z_t = \{z_0^{\mathcal{P}}, z_t^{\mathcal{I}}, z_0^{\mathcal{F}}\}$, and the model is optimized to predict $\{\epsilon_t^{s+1}, \ldots, \epsilon_t^{e-1}\}$ as per Eq. 3. Loss is computed only on noised frames. Since this approach does not involve extra mask or condition frame concatenation to the channel dimension, it enables the utilization of most pre-trained foundation models that accept 3-channel RGB inputs.
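The consecutive noising scheme can be sketched as follows: only frames $s+1, \ldots, e-1$ are noised with Eq. 2, and the loss is restricted to those frames. This is a minimal NumPy illustration under assumptions, not the authors' implementation; the function names, the tensor layout `(L, c, h, w)`, the fixed $\bar{\alpha}_t$ value, and the use of a mean rather than a sum in the loss are all choices made for the example.

```python
import numpy as np

def build_training_input(z0, s, e, alpha_bar_t, rng):
    """Noise only frames s+1..e-1; frames 0..s and e..L-1 stay clean."""
    L = z0.shape[0]
    assert 0 <= s < e - 1 and e <= L - 1, "need at least one frame to infill"
    eps = rng.standard_normal(z0.shape)
    mask = np.zeros(L, dtype=bool)
    mask[s + 1:e] = True                       # True marks the intermediate clip
    zt = z0.copy()
    zt[mask] = (np.sqrt(alpha_bar_t) * z0[mask]
                + np.sqrt(1.0 - alpha_bar_t) * eps[mask])
    return zt, eps, mask

def masked_loss(eps, eps_pred, mask):
    """Mean squared error over noised frames only (normalization is an assumption)."""
    return float(np.mean((eps[mask] - eps_pred[mask]) ** 2))

rng = np.random.default_rng(0)
z0 = rng.standard_normal((16, 4, 8, 8))        # hypothetical (L, c, h, w) latent
zt, eps, mask = build_training_input(z0, s=5, e=11, alpha_bar_t=0.5, rng=rng)
```

Because the clean and noised frames share one tensor with the original channel count, no extra mask channel is needed, which matches the remark above about reusing pre-trained models.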

Variable-length sampling. To improve data utilization and generation flexibility, we employ variable-length sampling by randomly shifting the start and end points of the infilling clip. Concretely, at each training step, we draw the boundary frame indices $s$ and $e$ randomly from two independent uniform distributions: $s \sim \mathcal{U}(a_s, b_s)$; $e \sim \mathcal{U}(a_e, b_e)$, where $0 < a_s < b_s < a_e < b_e < L-1$. The resulting length of the noised clip $l := e - s - 1$ thereby follows a triangular distribution with the upper and lower limits $l_{upper} = b_e - a_s$, $l_{lower} = a_e - b_s$. Particularly, when $b_s - a_s = b_e - a_e$, the PDF of the distribution is symmetric and has the mode $l_{mode} = (l_{lower} + l_{upper})/2$. To avoid confusion stemming from variable-length sampling, we equip the model with an awareness of the generation length it is handling. This is achieved by incorporating a length embedding using sinusoidal encoding followed by an MLP. The length embedding is subsequently added to the timestep encoding and collectively processed through another MLP into the spatial convolution module. This approach improves the model’s capacity to leverage training samples by accommodating predictions at varying positions and lengths. It effectively allows for generating at various lengths and accepting reference videos of diverse durations, thereby bolstering the model’s robustness and flexibility.
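A minimal sketch of the sampling step and of a sinusoidal length encoding is given below. It assumes integer indices drawn with inclusive endpoints, a transformer-style sinusoidal formula, and hypothetical range values; the MLPs that follow the encoding are omitted. Under the inclusive-integer assumption the sampled $l$ spans $a_e - b_s - 1$ to $b_e - a_s - 1$; how the endpoints are treated in the original implementation is not stated in the text.

```python
import math
import numpy as np

def sample_boundaries(a_s, b_s, a_e, b_e, rng):
    """Draw integer s in [a_s, b_s] and e in [a_e, b_e] independently (inclusive, assumed)."""
    s = int(rng.integers(a_s, b_s + 1))
    e = int(rng.integers(a_e, b_e + 1))
    return s, e, e - s - 1                     # l = e - s - 1 noised frames

def sinusoidal_embedding(value, dim=8, max_period=10000.0):
    """Sinusoidal encoding of a scalar; dim must be even."""
    half = dim // 2
    freqs = np.exp(-math.log(max_period) * np.arange(half) / half)
    args = value * freqs
    return np.concatenate([np.sin(args), np.cos(args)])

rng = np.random.default_rng(0)
# Hypothetical ranges for a 16-frame clip: 0 < 2 < 5 < 10 < 13 < 15.
samples = [sample_boundaries(2, 5, 10, 13, rng) for _ in range(1000)]
lengths = [l for _, _, l in samples]
emb = sinusoidal_embedding(lengths[0], dim=8)  # would be fed to an MLP, then added to the timestep encoding
```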

Dynamic boundary frame guidance. Boundary frames play a pivotal role in guiding the model’s generation as they provide explicit information about the gap the model is tasked to bridge. Therefore, we propose boundary frame guidance (BFG) to compensate for the lack of guidance in transition video generation. Most popular frame conditioning strategies entail extending the keys and values of spatial self-attention layers to include those of the condition frames [53, 9, 54]. However, empirical experiments did not prove these approaches effective for this task, and the low-level visual information sometimes restricted the freedom of generation, resulting in synthesized frames copying too much information from existing ones. Instead, we inject the guidance signal into the cross-attention layers via a higher-level CLIP representation. Concretely, we encode the pixel-level boundary frames $x_0^s$ and $x_0^e$ using a CLIP vision encoder and concatenate the representations along the sequence dimension. This integrated representation serves both as content and semantic guidance due to the nature of CLIP representations, informing the model about the generation direction. A short text prompt briefly describing the subject, such as “horse movement”, can be optionally leveraged to help classify the action subject or extract knowledge from a pre-trained foundation model. The combined use of the CLIP encoders and the concatenation operation provides the model with a consistent understanding of the integrated condition signal.
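The conditioning path can be pictured with the sketch below, which concatenates two token sequences along the sequence dimension and uses the result as the context of a cross-attention step. Everything concrete here is an assumption for illustration: random arrays stand in for CLIP vision features, the token count (5) and width (32) are hypothetical, and the attention has a single head with no learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    ex = np.exp(x)
    return ex / ex.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head attention over the context, without learned projections (simplification)."""
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d), axis=-1)
    return weights @ context, weights

rng = np.random.default_rng(0)
feat_s = rng.standard_normal((5, 32))   # stand-in for encoder tokens of boundary frame s
feat_e = rng.standard_normal((5, 32))   # stand-in for encoder tokens of boundary frame e
guidance = np.concatenate([feat_s, feat_e], axis=0)   # concatenated along the sequence axis

frame_tokens = rng.standard_normal((64, 32))          # spatial tokens of one generated frame
out, weights = cross_attention(frame_tokens, guidance)
```

Each spatial token can attend to features from both boundary frames at once, which is the property the concatenation is meant to provide; an optional text embedding could be appended to `guidance` in the same way.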

Gaussian filter mixer for inference-time noise initialization. We propose a dynamic inference-time noise mixing strategy tailored for the transition video infilling task. Since the infilling video functions as a bridge, its first few frames should resemble the preceding video, while the last few frames approach the following video. The frames in the middle should be granted the flexibility to display transition states that are significantly distinct from any reference frames. Inspired by FreeInit [55], we propose a Gaussian filter mixer (GFM) module that dynamically retains a certain amount of information from the closest boundary frame latent. This is accomplished by keeping the low-frequency component of the diffused boundary latent, which offers a rough layout guidance to the denoising process. The preserved information gradually diminishes as the frame position moves away from the boundaries, allowing for greater freedom in generation. It is then mixed with individual Gaussian noise at each frame, resulting in the mixed inference-time noise initialization $\tilde{z}_t^n$ at frame $n$ as

$$\mathcal{F}^{low}(n) = \begin{cases} \mathcal{FFT}_{3D}(z_t^s) \odot \mathcal{G}(f_S(n), f_T(n)) & \text{if } n \leq \frac{s+e}{2}, \\ \mathcal{FFT}_{3D}(z_t^e) \odot \mathcal{G}(f_S(n), f_T(n)) & \text{if } n > \frac{s+e}{2}, \end{cases} \tag{4}$$

$$\mathcal{F}^{high}(n) = \mathcal{FFT}_{3D}(\epsilon_t^n) \odot \left(1 - \mathcal{G}(f_S(n), f_T(n))\right), \tag{5}$$

$$\tilde{z}_t^n = GFM(n) = \mathcal{IFFT}_{3D}\left(\mathcal{F}^{low}(n) + \mathcal{F}^{high}(n)\right), \tag{6}$$

where $s$ and $e$ are the indices of the boundary frames; $\mathcal{FFT}_{3D}(\cdot)$ and $\mathcal{IFFT}_{3D}(\cdot)$ represent the discrete fast Fourier transform and its inverse, performed over three dimensions; $f_S(\cdot)$ and $f_T(\cdot)$ are functions that adjust the spatial and temporal stop frequencies, respectively; and $\mathcal{G}(\cdot, \cdot)$ is a 3D Gaussian low-pass filter taking both spatial and temporal stop frequencies as control parameters.
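To make the frequency mixing in Eqs. 4 to 6 concrete, the sketch below implements a deliberately simplified, spatial-only (2D, per-frame) variant: low frequencies are taken from a noised boundary latent and high frequencies from fresh Gaussian noise. It is not the 3D filter described above; the temporal stop frequency is dropped, and the Gaussian mask parameterization, the normalization of frequencies to $[-1, 1)$, and all names are assumptions for this example.

```python
import numpy as np

def gaussian_low_pass_2d(h, w, stop_freq):
    """Centered 2D Gaussian low-pass mask; a stop frequency of 0 yields an all-zero mask."""
    if stop_freq <= 0:
        return np.zeros((h, w))
    fy = np.fft.fftshift(np.fft.fftfreq(h)) * 2.0   # normalized to [-1, 1)
    fx = np.fft.fftshift(np.fft.fftfreq(w)) * 2.0
    d2 = fy[:, None] ** 2 + fx[None, :] ** 2
    return np.exp(-d2 / (2.0 * stop_freq ** 2))

def centered_fft2(x):
    """2D FFT over the last two axes with the zero frequency shifted to the center."""
    return np.fft.fftshift(np.fft.fft2(x), axes=(-2, -1))

def mix_frame(boundary_noised, eps, stop_freq):
    """Low frequencies from the noised boundary latent, high frequencies from eps."""
    h, w = eps.shape[-2:]
    g = gaussian_low_pass_2d(h, w, stop_freq)
    mixed = centered_fft2(boundary_noised) * g + centered_fft2(eps) * (1.0 - g)
    return np.fft.ifft2(np.fft.ifftshift(mixed, axes=(-2, -1))).real

rng = np.random.default_rng(0)
boundary_noised = rng.standard_normal((4, 16, 16))  # stand-in for a diffused boundary latent
eps = rng.standard_normal((4, 16, 16))              # per-frame Gaussian noise
init = mix_frame(boundary_noised, eps, stop_freq=0.25)
```

With a stop frequency of zero the result is pure noise, and larger values retain more of the boundary latent's coarse layout, which mirrors the intended trade-off between fidelity and flexibility.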

Eq. 4 first ensures that each intermediate frame refers to its closest boundary frame. Subsequently, the adjusting functions progressively reduce the stop frequency values as the distance to the selected boundary increases. We opt for a straightforward linear decreasing function with a scaling coefficient $\lambda$ to realize such control. The stop frequency for both $f_S(\cdot)$ and $f_T(\cdot)$ is computed as

$$f(n) = \max\left(0,\ f_0 - \lambda \cdot \min(|n-s|, |n-e|) \cdot f_0\right), \tag{7}$$

where $f_0$ is the initial stop frequency of the low-pass filter. Here, $f_0$ determines the maximum layout information we aim to retain from the boundary frames, and $\lambda$ regulates the rate at which such information decreases as the synthesis target moves away from the referred boundary.
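The boundary selection of Eq. 4 and the schedule of Eq. 7 translate directly into a few lines of Python. The function names and the numeric values of $s$, $e$, $f_0$, and $\lambda$ below are hypothetical and chosen only to show how the stop frequency decays toward the middle of the gap.

```python
def closest_boundary(n, s, e):
    """Case split of Eq. 4: refer to frame s if n <= (s + e) / 2, otherwise to frame e."""
    return s if n <= (s + e) / 2 else e

def stop_frequency(n, s, e, f0, lam):
    """Eq. 7: linear decay with the distance to the nearest boundary, clipped at zero."""
    dist = min(abs(n - s), abs(n - e))
    return max(0.0, f0 - lam * dist * f0)

s, e, f0, lam = 4, 11, 0.25, 0.2   # hypothetical values for illustration
for n in range(s + 1, e):
    print(n, closest_boundary(n, s, e), round(stop_frequency(n, s, e, f0, lam), 4))
```

Frames adjacent to a boundary keep the largest stop frequency, and once $\lambda \cdot \min(|n-s|, |n-e|) \geq 1$ the filter passes nothing, so those frames start from pure Gaussian noise.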

4 Experiments

4.1 Experimental Setup

Table 1: Quantitative comparison with other generative transition models.

Columns 2-7: test-manual; columns 8-13: test-auto.
Method | MS-SSIM↑ | PSNR↑ | LPIPS↓ | FVD↓ | CLIP-RS↑ | CLIPSIM↑ | MS-SSIM↑ | PSNR↑ | LPIPS↓ | FVD↓ | CLIP-RS↑ | CLIPSIM↑
Horse - 8 frames
DynamiCrafter [8] | 0.564 | 17.51 | 0.204 | 802.0 | 0.746 | 0.478 | 0.355 | 15.24 | 0.253 | 355.5 | 0.715 | 0.368
DynamiCrafter-Vid | 0.609 | 17.75 | 0.198 | 1073.6 | 0.830 | 0.468 | 0.430 | 15.96 | 0.264 | 814.2 | 0.790 | 0.380
SEINE [42] | 0.710 | 19.79 | 0.129 | 635.4 | 0.866 | 0.546 | 0.481 | 16.90 | 0.184 | 271.7 | 0.813 | 0.439
SEINE-Vid | 0.588 | 17.54 | 0.179 | 571.7 | 0.719 | 0.480 | 0.434 | 16.23 | 0.224 | 165.6 | 0.700 | 0.395
MAVIN (Ours) | 0.724 | 19.97 | 0.128 | 479.5 | 0.844 | 0.517 | 0.560 | 18.24 | 0.162 | 147.7 | 0.819 | 0.453
Horse - 12 frames
DynamiCrafter [8] | 0.452 | 16.05 | 0.252 | 866.6 | 0.740 | 0.408 | 0.317 | 14.75 | 0.280 | 367.8 | 0.747 | 0.340
DynamiCrafter-Vid | 0.553 | 17.09 | 0.212 | 820.6 | 0.811 | 0.456 | 0.374 | 15.34 | 0.267 | 419.7 | 0.776 | 0.362
SEINE [42] | 0.591 | 17.78 | 0.176 | 743.6 | 0.815 | 0.487 | 0.383 | 15.48 | 0.236 | 289.6 | 0.782 | 0.379
SEINE-Vid | 0.357 | 14.15 | 0.330 | 1096.0 | 0.522 | 0.349 | 0.283 | 13.97 | 0.327 | 343.1 | 0.570 | 0.294
MAVIN (Ours) | 0.666 | 19.12 | 0.148 | 559.9 | 0.844 | 0.491 | 0.458 | 16.68 | 0.211 | 208.8 | 0.798 | 0.400
Tiger - 8 frames
DynamiCrafter [8] | 0.383 | 15.58 | 0.251 | 856.8 | 0.733 | 0.447 | 0.260 | 13.81 | 0.309 | 377.9 | 0.691 | 0.415
DynamiCrafter-Vid | 0.477 | 16.35 | 0.224 | 1177.3 | 0.822 | 0.477 | 0.408 | 15.22 | 0.239 | 619.5 | 0.836 | 0.491
SEINE [42] | 0.613 | 18.23 | 0.152 | 612.5 | 0.860 | 0.553 | 0.417 | 15.29 | 0.211 | 297.3 | 0.844 | 0.506
SEINE-Vid | 0.580 | 17.83 | 0.177 | 447.5 | 0.766 | 0.528 | 0.525 | 16.49 | 0.182 | 232.7 | 0.798 | 0.544
MAVIN (Ours) | 0.678 | 19.17 | 0.137 | 536.8 | 0.846 | 0.530 | 0.635 | 17.87 | 0.139 | 245.3 | 0.869 | 0.562
Tiger - 12 frames
DynamiCrafter [8] | 0.346 | 15.13 | 0.276 | 834.3 | 0.763 | 0.423 | 0.221 | 13.50 | 0.336 | 395.7 | 0.744 | 0.385
DynamiCrafter-Vid | 0.396 | 15.51 | 0.253 | 873.1 | 0.793 | 0.443 | 0.290 | 14.23 | 0.288 | 371.5 | 0.803 | 0.434
SEINE [42] | 0.504 | 16.86 | 0.196 | 707.4 | 0.859 | 0.500 | 0.361 | 14.61 | 0.245 | 356.3 | 0.847 | 0.472
SEINE-Vid | 0.390 | 15.49 | 0.268 | 733.5 | 0.703 | 0.425 | 0.277 | 13.72 | 0.314 | 423.6 | 0.675 | 0.405
MAVIN (Ours) | 0.595 | 18.08 | 0.167 | 689.7 | 0.835 | 0.500 | 0.513 | 16.23 | 0.183 | 310.9 | 0.852 | 0.512

Datasets. For our experiments, we focus on two distinct animal species to verify the effectiveness of the proposed method: horses and tigers. We use the AnimalKingdom dataset [56] for training the horse model and the TigDog dataset [57] for the tiger model. The AnimalKingdom dataset encompasses a diverse range of species and tasks, but we only utilize the videos labeled as “Horse” from the action_recognition task for training. However, we noticed that the video clips in action_recognition are generally too short and correspond to only single actions, resulting in insufficient action transition patterns for model training. Therefore, we supplemented the horse training data with additional long-take web videos that capture horse movements. Consequently, the total duration of the training data for each dataset is approximately 45 minutes under 30 FPS.

Testing clips for the transition video infilling task should ideally contain action transitions or large motions to effectively evaluate the model’s efficacy. Such data, however, is challenging to source from existing datasets, prompting us to construct our own. We collect videos from the Internet and generate the test data in two ways: manual cutting, which yields high-quality samples, and automatic generation, which produces a large number of test clips. We refer to the test sets generated in these ways as test-manual and test-auto, respectively. For test-manual, we meticulously cut web videos into 32-frame clips, ensuring that significant movements or posture changes (e.g., transitioning from grazing to standing upright) occur in the intermediate clips. We curated 34 such test samples for each animal class. For test-auto, we employ an optical flow estimator, RAFT [58], to estimate the motion intensity between the two reference clips. Concretely, we select video clips based on the average optical flow magnitude between the boundary frames. Since small magnitudes suggest minor motions and excessively large values typically result from dramatic camera movements, only clips whose values fall within a certain range are used. To formalize this, the boundary frame indices $s$ and $e$ for test-auto are selected using the following equation:

$$\{(s,e)\} = \Bigl\{(s,e) \,\Big|\, \frac{1}{h \cdot w}\sum_{i=1}^{h}\sum_{j=1}^{w} \bigl\| \mathcal{E}_{flow}(x_s, x_e)_{i,j} \bigr\|_2 \in (T_{lower}, T_{upper}),\; e - s - 1 = l_{test} \Bigr\}, \tag{8}$$

where $T_{lower}$ and $T_{upper}$ are the lower and upper thresholds; $h$ and $w$ are the height and width of the estimated optical flows; and $l_{test}$ is the length of the generation sequence we want to test. We tuned the thresholds and obtained 113 visually satisfactory test samples for the horse class and 104 for the tiger class by setting $l_{test}$ to 12.
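The selection rule in Eq. 8 can be sketched as below. This is an illustration under stated assumptions rather than the authors' pipeline: `flow_fn` is a hypothetical placeholder for an optical-flow estimator such as RAFT, the function names are invented here, and `frames` is assumed to be already subsampled at the desired rate.

```python
import numpy as np

def mean_flow_magnitude(flow):
    """Average L2 norm over an (h, w, 2) optical-flow field."""
    return float(np.linalg.norm(flow, axis=-1).mean())

def select_boundaries(frames, flow_fn, l_test, t_lower, t_upper):
    """Return every (s, e) pair that satisfies Eq. 8.

    flow_fn(x_s, x_e) is a placeholder for a flow estimator and must
    return an array of shape (h, w, 2).
    """
    pairs = []
    for s in range(len(frames) - l_test - 1):
        e = s + l_test + 1  # exactly l_test frames lie strictly between s and e
        mag = mean_flow_magnitude(flow_fn(frames[s], frames[e]))
        if t_lower < mag < t_upper:  # open interval, as in Eq. 8
            pairs.append((s, e))
    return pairs
```

The thresholds are left as arguments because the text reports that they were tuned until the resulting samples were visually satisfactory, without stating their values.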

Implementation details. We initialize our model from ModelScopeT2V-1.7b [26] and fine-tune it with the proposed framework for 40K steps. The optimization is carried out using an AdamW optimizer [59], with a constant learning rate of 5e-6 and a batch size of 1. Training videos are randomly sampled into 32-frame clips at a sample rate of 2, and pre-processed to eliminate potential shot transitions by excluding clips where any SSIM [60] value between consecutive frames falls below 0.1. Videos shorter than 32 frames are discarded. Experiments are conducted at a resolution of 256×256. Training and inference require around 40 and 12 GB vRAM, respectively. All training is performed on a single NVIDIA L40 GPU, with each trial taking approximately one day. The length range of random intermediate clips is set to $l_{lower}=2$, $l_{upper}=22$. The GFM parameters employed are $f_0=0.6$, $\lambda=0.1$.

Evaluation Metrics. We present the following evaluation metrics: multi-scale structural similarity (MS-SSIM) [61], peak signal-to-noise ratio (PSNR), LPIPS [62], FVD [63], and CLIP similarity [64]. However, most of these metrics primarily assess reconstruction quality and similarity between generated and ground-truth videos, considering each frame independently. They do not adequately account for temporal coherence, which reflects the smoothness of generated motions. To address this gap, we propose a CLIP-similarity-based inter-frame consistency measurement to quantify the relative smoothness with respect to the ground truth video clip. We term this measure the CLIP Relative Smoothness score (CLIP-RS), computed as follows:

$$\text{CLIP-RS} = \frac{1}{L-1}\sum_{i=1}^{L-1} \frac{\min\bigl(\text{CLIPSIM}(p_{i-1}, p_i),\, \text{CLIPSIM}(q_{i-1}, q_i)\bigr)}{\max\bigl(\text{CLIPSIM}(p_{i-1}, p_i),\, \text{CLIPSIM}(q_{i-1}, q_i)\bigr)}, \tag{9}$$

where $p_i$ is the $i$-th generated frame and $q_i$ is the corresponding ground truth frame. $L$ is the length of the generated video. $\text{CLIPSIM}(p_i, p_j)$ denotes the cosine similarity between the CLIP representations of images $p_i$ and $p_j$. Each summation term quantifies the relative frame change in the generated video compared to the actual change in the ground truth. Either a relatively drastic or subtle change results in a low score. For example, if the oracle transition occurs at a steady rate while the synthesized video initially remains stationary and then abruptly changes to complete the transition, the differences in transition pace will be captured, leading to low relative smoothness values.
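A minimal sketch of Eq. 9 is given below, assuming the per-frame CLIP image features have already been extracted into arrays of shape (L, d). The names `cosine_sim` and `clip_rs` are illustrative, and the ratio presumes positive adjacent-frame similarities, which is an assumption of this sketch rather than a statement from the text.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_rs(gen_emb, gt_emb):
    """Eq. 9 on per-frame embeddings of shape (L, d).

    gen_emb[i] and gt_emb[i] stand in for the CLIP features of the
    generated frame p_i and the ground-truth frame q_i.
    """
    gen_emb = np.asarray(gen_emb, dtype=float)
    gt_emb = np.asarray(gt_emb, dtype=float)
    assert gen_emb.shape == gt_emb.shape and len(gen_emb) >= 2
    ratios = []
    for i in range(1, len(gen_emb)):
        sp = cosine_sim(gen_emb[i - 1], gen_emb[i])  # adjacent change, generated
        sq = cosine_sim(gt_emb[i - 1], gt_emb[i])    # adjacent change, ground truth
        ratios.append(min(sp, sq) / max(sp, sq))     # 1.0 when the pace matches
    return float(np.mean(ratios))
```

Comparing a video with itself yields exactly 1, while a static generated video scored against a moving ground truth yields the mean adjacent-frame similarity of the ground truth, which is below 1.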

Table 2: CLIP-RS responds to temporal changes and is not sensitive to visual aesthetics.
SSIM↑ PSNR↑ LPIPS↓ CLIPSIM↑ CLIP-RS↑
Original (self-comparison) 1.00 inf 0.00 1.00 1.00
Decrease luminance by 50% 0.71 13.6 0.21 0.66 0.97
Increase contrast by 50% 0.69 19.9 0.08 0.82 0.97
Zeroing-out red channel 0.67 11.3 0.29 0.66 0.96
Zeroing-out red&green channels 0.33 8.44 0.60 0.49 0.95
Replicating 1st frame as video 0.79 20.7 0.06 0.77 0.87

CLIP-RS is a metric calculated along the frame axis, measuring the degree of changes between adjacent frames. Although it references the ground truth video, it does not engage in any direct frame-to-frame comparisons between the two videos. This characteristic renders this metric indifferent to the quality of the generated images or their resemblance to the original video. We demonstrate this by manipulating a 12-frame video clip and computing the metrics with the original video. As shown in Table 2, when the video’s visual aesthetics are perturbed (rows 2-5), metrics based on predicted-actual similarity are significantly impacted despite the structural content and motion effect of the video remaining unchanged, whereas CLIP-RS maintains a score close to 1. In contrast, when the video’s temporal property is altered (the last row, where the new video is comprised of a 12-time repetition of the original video’s first frame), similarity-based and quality-based metrics yield superior results compared to when visual aesthetics were disturbed, while CLIP-RS can identify such smoothness discrepancies. This is in direct contrast to the use of absolute smoothness measurement [42], where a static video can achieve a perfect smoothness score of 1, which contradicts our objective. The $\text{CLIPSIM}(\cdot,\cdot)$ function in CLIP-RS can also be substituted with other similarity measurements such as SSIM, averaged optical flow momentum, etc.

4.2 Results

Figure 2: Qualitative comparison of MAVIN with baseline models. The top two rows are input reference videos, with glowing frames marking the boundaries. MAVIN demonstrates smoother and more natural transitions and superior spatiotemporal consistency compared to baseline models.

Comparison with existing methods. For our comparative analysis, we selected two open-source diffusion-based generative models: DynamiCrafter [8] and SEINE [42]. Both models are capable of generating transition videos from two condition images. However, to ensure a fairer and more relevant comparison with our work, we also conducted experiments where these models were conditioned on video inputs. We refer to these modified versions as DynamiCrafter-Vid and SEINE-Vid.

We conducted experiments with two infilling length settings: (i) generating 8 frames given 12-frame condition clips on each side, and (ii) generating 12 frames given 10-frame references. The total input length is 32, matching our test set samples. In particular, we found that DynamiCrafter, which was trained to generate fixed 16-frame videos, performed poorly when this length was altered. Therefore, for DynamiCrafter, we use a 4-frame reference on each side for 8-frame infilling, and 2 for 12-frame infilling, maintaining a total length of 16. For image-conditioned generations, where DynamiCrafter generates 14 frames (16 minus 2 reference images), we evenly sampled 8 and 12 frames from the 14 for metric computations. All metrics were calculated only on intermediate clips, except for FVD, which was compared with the entire input sequence.
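The frame bookkeeping for these settings can be sketched as follows. Both helper names are hypothetical, and the rounding rule used for "evenly sampled" is an assumption, since the text does not specify how the 8 or 12 frames are picked from the 14.

```python
import numpy as np

def split_indices(total=32, infill=8):
    """Indices of the left reference, the infilled gap, and the right reference."""
    ref = (total - infill) // 2  # reference frames on each side
    idx = np.arange(total)
    return idx[:ref], idx[ref:ref + infill], idx[ref + infill:]

def evenly_sample(num_available, num_keep):
    """Pick num_keep roughly evenly spaced indices out of num_available.

    Rounding linearly spaced positions is one possible reading of
    'evenly sampled'; other schemes would be equally valid.
    """
    return np.round(np.linspace(0, num_available - 1, num_keep)).astype(int)

left, gap, right = split_indices(32, 8)   # 12 + 8 + 12 frames
sampled = evenly_sample(14, 8)            # 8 of the 14 image-conditioned frames
```

Calling `split_indices(16, 8)` and `split_indices(16, 12)` reproduces the 4-frame and 2-frame references used for the fixed 16-frame DynamiCrafter setting.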

We show quantitative results in Table 1 and qualitative results in Figure 2. MAVIN outperforms the generative baselines on most metrics, especially when the motion is difficult. SEINE is the most competitive transition generation model, but as the number of infilling frames increases, the gap between MAVIN and SEINE widens. The test-auto set is generated at a sample rate of 4, which makes it rather challenging: infilling 12 frames on test-auto is equivalent to bridging a 48-frame gap. The performance gap further increases under this setting, showing the effectiveness of the proposed method in infilling videos with large and complex motions.

As the only existing generative model trained for transition purposes, SEINE adopts a BERT-like masking strategy for masked modeling, where each frame is corrupted by chance independently, resulting in an intermittent corruption pattern. Although this method enhances data utilization and robustness, it falls short in generating long-term temporally cohesive videos because the corruption pattern allows the model to rely on nearby clean frames for predictions. In contrast, our method consistently applies continuous corruption up to a maximum length of 22 frames, compelling the model to capture long-term motion dependencies.

Ablation Study. We ablate the two key components of the proposed framework and present the quantitative results in Table 3. Results were obtained on Horse test-manual by predicting 12 frames under the same experimental setup. Boundary frame guidance (BFG) offers important content direction during model training, and the Gaussian filter mixer (GFM) helps stabilize the generation by providing essential information to address the discrepancy between training and inference phases. Ablating either BFG or GFM degrades performance across all metrics. When both components are removed together, the model experiences severe degradation, demonstrating the effectiveness and necessity of these components for high-quality video generation. (See supplementary materials.)

Table 3: Ablation study on boundary frame guidance (BFG) during training and Gaussian filter mixer (GFM) noise initialization during inference.

Method | MS-SSIM↑ | PSNR↑ | LPIPS↓ | FVD↓ | CLIP-RS↑ | CLIPSIM↑
MAVIN (Proposed Method) | 0.666 | 19.12 | 0.148 | 559.9 | 0.844 | 0.491
--Boundary Frame Guidance | 0.651 (-2.3%) | 19.00 (-0.6%) | 0.153 (-3.4%) | 570.9 (-2.0%) | 0.833 (-1.3%) | 0.483 (-1.6%)
--Gaussian Filter Mixer | 0.647 (-2.9%) | 18.03 (-5.7%) | 0.167 (-12.8%) | 627.2 (-12.0%) | 0.815 (-3.4%) | 0.475 (-3.3%)
--BFG --GFM | 0.606 (-9.0%) | 17.78 (-7.0%) | 0.189 (-27.7%) | 672.2 (-20.1%) | 0.781 (-7.5%) | 0.443 (-9.8%)
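The percentages in Table 3 appear to follow a sign convention in which a negative value always denotes degradation, including for metrics where lower is better (LPIPS, FVD). The helper below is a small sketch of that reading; the function name and the `higher_is_better` flag are assumptions introduced for illustration.

```python
def relative_change(baseline, ablated, higher_is_better=True):
    """Signed percentage change; negative means the ablated model is worse."""
    change = (ablated - baseline) / baseline * 100.0
    # For lower-is-better metrics an increase is a degradation, so flip the sign.
    return change if higher_is_better else -change

ms_ssim_drop = relative_change(0.666, 0.651)                       # MS-SSIM without BFG
lpips_drop = relative_change(0.148, 0.167, higher_is_better=False)  # LPIPS without GFM
```

Rounded to one decimal place, these two values correspond to the -2.3% and -12.8% entries in the table.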

Figure 3: Application of the transition video infilling model. It connects multiple TI2V-generated single-action video clips into a cohesive extended video with smooth and natural action transitions.

4.3 Application for Multi-Action Generation

We achieve multi-action generations by connecting single-action videos with MAVIN. This work does not focus on optimizing the single-action models; instead, we employ existing TI2V models that animate an input image through text control, as discussed in Section 2. In our empirical experiments, directly generating large motions or non-continuous actions using pre-trained TI2V models [6, 9] led to failure. Therefore, we fine-tune these models for the single-action generation purpose. To integrate the synthesized single-action videos, we insert noise of the desired length between two videos and use MAVIN to infill a transition video. Alternatively, instead of inserting noise, we can concatenate single-action videos and replace the junction frames with noise to regenerate the transition parts.

We initialize the action model from AnimateAnything [6]. The training data for single actions is also derived from the AnimalKingdom and TigDog datasets, except that we collect additional data for training the action “horse jumping”. We train one model per animal species. A fixed action prompt, such as “horse is jumping”, is tied to each action and serves as the text condition during training.

To create a multi-action video, we first use a single image, controlled by multiple action prompts, to generate multiple single-action videos separately. These videos are then concatenated into a longer sequence, arranged in the desired order. We regenerate the junction frames using the infilling model for smooth action transitions. Figure 3 illustrates an example of such an application. In this example, we generate 20 frames for each single action and refine 12 frames centered around each junction, resulting in a 60-frame-long video that contains three actions: jump, stand, and run. The first, third, and fifth rows depict the single-action videos generated by the action model, while the second and fourth rows are generated by the infilling model to connect them. This approach generates highly temporally cohesive examples with great flexibility.
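The junction-refinement layout in this example can be sketched as below. It only illustrates which frames are marked for regeneration; the function names are hypothetical, noise is inserted directly into a frame array for simplicity, and the actual denoising by the infilling model is not shown.

```python
import numpy as np

def junction_mask(num_clips=3, clip_len=20, refine=12):
    """Boolean mask over the concatenated video; True marks frames to regenerate."""
    mask = np.zeros(num_clips * clip_len, dtype=bool)
    for k in range(1, num_clips):
        junction = k * clip_len          # index of the first frame of clip k
        start = junction - refine // 2   # window centered on the junction
        mask[start:start + refine] = True
    return mask

def insert_noise(video, mask, rng):
    """Replace the masked frames of a (T, ...) array with Gaussian noise."""
    out = video.copy()
    out[mask] = rng.standard_normal(out[mask].shape)
    return out

mask = junction_mask()  # 60 frames, windows at 14..25 and 34..45
```

With three 20-frame clips and 12-frame windows, 24 of the 60 frames are regenerated and the remaining 36 are kept from the single-action videos.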

5 Conclusion

In conclusion, this study has presented a novel approach to generative video infilling, specifically targeting the generation of transition clips in multi-action sequences, leveraging the capabilities of diffusion models. Our model, MAVIN, demonstrates a significant improvement over existing methods by generating smoother and more natural transition videos across complex motion sequences. This research lays the groundwork for future advancements in unsupervised motion pre-training, large-motion video interpolation, and multi-action video generation. While this technique enables new applications, it is also crucial to establish guidelines and implement safeguards to prevent its potential misuse in creating fake content, which raises ethical and security concerns.

Limitations. Due to computational limitations and the proprietary nature of the widely used video training dataset WebVid-10M [65], our experiments were conducted only under specific scenarios and initialized from existing foundation models. Further exploration of the task might require training at scale. Moreover, while we did not concentrate on optimizing the single-action (TI2V) models, a notable trade-off between visual quality and motion intensity persists even after fine-tuning, highlighting an area for further research. The failure cases include the single-action model’s inability to follow the action prompt and the inconsistency in appearance in later frames for actions involving large motions.

References

  • Blattmann et al. [2023a] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a.
  • Ho et al. [2022a] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
  • Singer et al. [2022] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data. In The Eleventh International Conference on Learning Representations, 2022.
  • Wang et al. [2023a] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023a.
  • Ho et al. [2022b] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022b.
  • Dai et al. [2023] Zuozhuo Dai, Zhenghao Zhang, Yao Yao, Bingxue Qiu, Siyu Zhu, Long Qin, and Weizhi Wang. Animateanything: Fine-grained open domain image animation with motion guidance. arXiv e-prints, pages arXiv–2311, 2023.
  • Girdhar et al. [2023] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
  • Xing et al. [2023] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, and Ying Shan. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023.
  • Ren et al. [2024] Weiming Ren, Harry Yang, Ge Zhang, Cong Wei, Xinrun Du, Stephen Huang, and Wenhu Chen. Consisti2v: Enhancing visual consistency for image-to-video generation. arXiv preprint arXiv:2402.04324, 2024.
  • Zhang et al. [2024] David Junhao Zhang, Dongxu Li, Hung Le, Mike Zheng Shou, Caiming Xiong, and Doyen Sahoo. Moonshot: Towards controllable video generation and editing with multimodal conditions. arXiv preprint arXiv:2401.01827, 2024.
  • Wei et al. [2023] Yujie Wei, Shiwei Zhang, Zhiwu Qing, Hangjie Yuan, Zhiheng Liu, Yu Liu, Yingya Zhang, Jingren Zhou, and Hongming Shan. Dreamvideo: Composing your dream videos with customized subject and motion. arXiv preprint arXiv:2312.04433, 2023.
  • Fox et al. [2021] Gereon Fox, Ayush Tewari, Mohamed Elgharib, and Christian Theobalt. Stylevideogan: A temporal generative model using a pretrained stylegan. In The 32nd British Machine Vision Conference. BMVA Press, 2021.
  • Brooks et al. [2022] Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei Efros, and Tero Karras. Generating long videos of dynamic scenes. Advances in Neural Information Processing Systems, 35:31769–31781, 2022.
  • [14] Yu Tian, Jian Ren, Menglei Chai, Kyle Olszewski, Xi Peng, Dimitris N Metaxas, and Sergey Tulyakov. A good image generator is what you need for high-resolution video synthesis.
  • Shen et al. [2023] Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Mostgan-v: Video generation with temporal motion styles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5652–5661, 2023.
  • Ge et al. [2022] Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer. In European Conference on Computer Vision, pages 102–118. Springer, 2022.
  • Hong et al. [2022] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. In The Eleventh International Conference on Learning Representations, 2022.
  • Le Moing et al. [2021] Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. Ccvs: Context-aware controllable video synthesis. Advances in Neural Information Processing Systems, 34:14042–14055, 2021.
  • Yan et al. [2021] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
  • Zhang et al. [2023] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.
  • Jeong et al. [2023] Hyeonho Jeong, Geon Yeong Park, and Jong Chul Ye. Vmc: Video motion customization using temporal attention adaption for text-to-video diffusion models. arXiv preprint arXiv:2312.00845, 2023.
  • Ge et al. [2023] Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22930–22941, 2023.
  • He et al. [2022] Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, and Qifeng Chen. Latent video diffusion models for high-fidelity long video generation. arXiv preprint arXiv:2211.13221, 2022.
  • Zhou et al. [2022] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.
  • Yang et al. [2023] Ruihan Yang, Prakhar Srivastava, and Stephan Mandt. Diffusion probabilistic modeling for video generation. Entropy, 25(10):1469, 2023.
  • Wang et al. [2023b] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023b.
  • Chen et al. [2023a] Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation, 2023a.
  • Chen et al. [2024] Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models, 2024.
  • Guo et al. [2023] Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  • Jin et al. [2024] Yang Jin, Zhicheng Sun, Kun Xu, Liwei Chen, Hao Jiang, Quzhe Huang, Chengru Song, Yuliang Liu, Di Zhang, Yang Song, et al. Video-lavit: Unified video-language pre-training with decoupled visual-motional tokenization. arXiv preprint arXiv:2402.03161, 2024.
  • Wu et al. [2023a] Ruiqi Wu, Liangyu Chen, Tong Yang, Chunle Guo, Chongyi Li, and Xiangyu Zhang. Lamp: Learn a motion pattern for few-shot-based video generation. arXiv preprint arXiv:2310.10769, 2023a.
  • Zeng et al. [2023] Yan Zeng, Guoqiang Wei, Jiani Zheng, Jiaxin Zou, Yang Wei, Yuchen Zhang, and Hang Li. Make pixels dance: High-dynamic video generation. arXiv preprint arXiv:2311.10982, 2023.
  • Gong et al. [2024] Litong Gong, Yiran Zhu, Weijie Li, Xiaoyang Kang, Biao Wang, Tiezheng Ge, and Bo Zheng. Atomovideo: High fidelity image-to-video generation. arXiv preprint arXiv:2403.01800, 2024.
  • Chen et al. [2023b] Xi Chen, Zhiheng Liu, Mengting Chen, Yutong Feng, Yu Liu, Yujun Shen, and Hengshuang Zhao. Livephoto: Real image animation with text-guided motion control. arXiv preprint arXiv:2312.02928, 2023b.
  • Kandala et al. [2024] Hitesh Kandala, Jianfeng Gao, and Jianwei Yang. Pix2gif: Motion-guided diffusion for gif generation. arXiv preprint arXiv:2403.04634, 2024.
  • Shi et al. [2024] Xiaoyu Shi, Zhaoyang Huang, Fu-Yun Wang, Weikang Bian, Dasong Li, Yi Zhang, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, et al. Motion-i2v: Consistent and controllable image-to-video generation with explicit motion modeling. arXiv preprint arXiv:2401.15977, 2024.
  • Ma et al. [2024] Yue Ma, Yingqing He, Hongfa Wang, Andong Wang, Chenyang Qi, Chengfei Cai, Xiu Li, Zhifeng Li, Heung-Yeung Shum, Wei Liu, et al. Follow-your-click: Open-domain regional image animation via short prompts. arXiv preprint arXiv:2403.08268, 2024.
  • Voleti et al. [2022] Vikram Voleti, Alexia Jolicoeur-Martineau, and Chris Pal. Mcvd-masked conditional video diffusion for prediction, generation, and interpolation. Advances in neural information processing systems, 35:23371–23385, 2022.
  • Höppe et al. [2022] Tobias Höppe, Arash Mehrjou, Stefan Bauer, Didrik Nielsen, and Andrea Dittadi. Diffusion models for video prediction and infilling. arXiv preprint arXiv:2206.07696, 2022.
  • Danier et al. [2024] Duolikun Danier, Fan Zhang, and David Bull. Ldmvfi: Video frame interpolation with latent diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 1472–1480, 2024.
  • Jain et al. [2024] Siddhant Jain, Daniel Watson, Eric Tabellion, Aleksander Hołyński, Ben Poole, and Janne Kontkanen. Video interpolation with diffusion models. arXiv preprint arXiv:2404.01203, 2024.
  • Chen et al. [2023c] Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction. In The Twelfth International Conference on Learning Representations, 2023c.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Esser et al. [2021] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  • Blattmann et al. [2023b] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023b.
  • Esser et al. [2023] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan Granskog, and Anastasis Germanidis. Structure and content-guided video synthesis with diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7346–7356, 2023.
  • Wang et al. [2024] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability. Advances in Neural Information Processing Systems, 36, 2024.
  • Lin et al. [2024] Shanchuan Lin, Bingchen Liu, Jiashi Li, and Xiao Yang. Common diffusion noise schedules and sample steps are flawed. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5404–5411, 2024.
  • Chen [2023] Ting Chen. On the importance of noise scheduling for diffusion models. arXiv preprint arXiv:2301.10972, 2023.
  • Wu et al. [2023b] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7623–7633, 2023b.
  • Henschel et al. [2024] Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. arXiv preprint arXiv:2403.14773, 2024.
  • Wu et al. [2023c] Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, and Ziwei Liu. Freeinit: Bridging initialization gap in video diffusion models. arXiv preprint arXiv:2312.07537, 2023c.
  • Ng et al. [2022] Xun Long Ng, Kian Eng Ong, Qichen Zheng, Yun Ni, Si Yong Yeo, and Jun Liu. Animal kingdom: A large and diverse dataset for animal behavior understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19023–19034, 2022.
  • Del Pero et al. [2015] Luca Del Pero, Susanna Ricco, Rahul Sukthankar, and Vittorio Ferrari. Articulated motion discovery using pairs of trajectories. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2151–2160, 2015.
  • Teed and Deng [2020] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020.
  • Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • Wang et al. [2003] Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • Unterthiner et al. [2019] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. 2019.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Bain et al. [2021] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021.