CV-VAE: A Compatible Video VAE for Latent Generative Video Models

Sijie Zhao Yong Zhang

Xiaodong Cun Shaoshu Yang Muyao Niu

Xiaoyu Li Wenbo Hu Ying Shan

Tencent AI Lab

https://github.com/AILab-CVC/CV-VAE

Abstract

Spatio-temporal compression of videos, utilizing networks such as Variational Autoencoders (VAE), plays a crucial role in OpenAI’s SORA and numerous other video generative models. For instance, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQVAE framework, while most diffusion-based video models capture the distribution of continuous latents extracted by 2D VAEs without quantization. In the latter case, temporal compression is simply realized by uniform frame sampling, which results in unsmooth motion between consecutive frames. Currently, the research community lacks a commonly used continuous video (3D) VAE for latent diffusion-based video models. Moreover, since current diffusion-based approaches are often implemented using pre-trained text-to-image (T2I) models, directly training a video VAE without considering compatibility with existing T2I models results in a latent space gap between them, and bridging that gap takes huge computational resources even with the T2I models as initialization. To address this issue, we propose a method for training a video VAE for latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE, e.g., the image VAE of Stable Diffusion (SD). The compatibility is achieved by the proposed novel latent space regularization, which involves formulating a regularization loss using the image VAE. Benefiting from the latent space compatibility, video models can be trained seamlessly from pre-trained T2I or video models in a truly spatio-temporally compressed latent space, rather than simply sampling video frames at equal intervals. To improve the training efficiency, we also design a novel architecture for the video VAE. With our CV-VAE, existing video models can generate four times more frames with minimal finetuning. Extensive experiments are conducted to demonstrate the effectiveness of the proposed video VAE.

Corresponding author

1 Introduction

Video generation has gained significant public attention, especially after the announcement of OpenAI SORA [22]. Current popular video models can be divided into two categories based on the modeling space, i.e., pixel space and latent space. Imagen Video [14], Make-A-Video [27], and Show-1 [39] are representative video diffusion models that directly learn the distribution of pixels. On the other hand, Phenaki [31], MAGVIT [38], VideoCrafter [4], AnimateDiff [12], VideoPoet [16], and SORA are representative latent generative video models that are trained in the latent space formed using variational autoencoders (VAEs). The latter category is more prevalent due to its training efficiency.

Furthermore, latent video generative models can be classified into two groups according to the type of VAE they utilize: LLM-like and diffusion-based video models. LLM-like models train a transformer on discrete tokens extracted by a 3D VAE with a quantizer within the VQ-VAE framework [30]. For example, VideoGPT [37] initially trains a 3D-VQVAE and subsequently an autoregressive transformer in the latent space. The 3D-VQVAE is inflated from the 2D-VQVAE [30] used in image generation. TATS [11] and MAGVIT [38] use 3D-VQGAN for better visual quality by employing discriminators, while Phenaki [31] utilizes a transformer-based encoder and decoder, namely CViViT.

However, recent latent diffusion-based video models typically exploit 2D VAEs, rather than 3D VAEs, to generate continuous latents to train a UNet or DiT [23]. The commonly used 2D VAE is the image VAE [26] from Stable Diffusion, as training a video model from scratch can be quite challenging. Almost all high-performing latent video models are trained with the SD image model [26] as initialization for the inflated UNet or DiT. Examples include Align-your-latent [3], VideoCrafter1 [4], AnimateDiff [12], SVD [2], Modelscope [32], LaVie [33], MagicVideo [41], Latte [21], etc. Temporal compression is simply achieved by uniform frame sampling while ignoring the motion information between frames (see Fig. 2). Consequently, the trained video models may not fully understand smooth motion, even when FPS is set as a condition. When projecting a sampled latent sequence to a video using the decoder of the 2D VAE, the generated video exhibits a low FPS and lacks visual smoothness.

Figure 1: Temporal compression difference between an image VAE and our video one.

Currently, the research community lacks a commonly used 3D video VAE for generating continuous latent variables with spatio-temporal compression for latent video models. Training a high-quality video VAE without considering compatibility with existing pretrained image and video models might not be too difficult. However, even though an independently trained video VAE, such as the video VAE of Open-Sora-Plan [17], exhibits low reconstruction errors, a gap exists between its learned latent space and the one used by pretrained models. Bridging this gap requires significant computational resources and extensive training time, even when using pre-trained models as initialization. One example is shown in Fig. 2. When training a video VAE independently without considering compatibility, the sampled latent of SVD [2] cannot be projected into the pixel space correctly due to the latent space gap, as shown in Fig. 2(a). After finetuning the SVD model in the new latent space on 16 A100 GPUs for 58K iterations, the quality of the generated video is still poor (see Fig. 2(b)). In contrast, our video VAE achieves promising results with the pretrained SVD even without finetuning the UNet, as shown in Fig. 2(c).

In this work, we propose a novel method to train a video VAE to extract continuous latents for generative video models, which is compatible with existing pretrained image and video models, e.g. Stable Diffusion [26] and SVD [2]. We also inflate the SD image VAE to form a video VAE by adding 3D convolutions to both encoder and decoder of the 2D VAE, which allows us to train video models efficiently with the pretrained models as initialization in a truly spatio-temporally compressed latent space, instead of uniform frame sampling for temporal compression (see Fig. 2). Consequently, the generated videos will be smoother and have a higher FPS than those produced using a 2D VAE.

To ensure latent space compatibility between 2D and 3D VAEs, we propose a latent space regularization to avoid distribution shifts. We examine the effectiveness of using either the encoder or decoder of the 2D VAE to form constraints and explore four types of mapping functions to design regularization. Moreover, to improve video VAE efficiency, we investigate its architecture and partially integrate 3D convolutions instead of exploiting 3D convolution in all blocks. The proposed video VAE can be used not only for training new video models with pretrained ones as initialization but also as a frame interpolator for existing video models with slight finetuning.

Our main contributions are summarized as follows: (1) We propose a video VAE that provides a truly spatio-temporally compressed continuous space for training latent generative video models, which is compatible with existing image and video models, greatly reducing the expense of training or finetuning video models. (2) We propose a latent space regularization to avoid distribution shifts and design an efficient architecture for the video VAE. (3) Extensive experiments are conducted to demonstrate the effectiveness of the proposed video VAE.

2 Related Work

Variational Autoencoder.

Variational Autoencoders (VAEs), introduced by [15], have been widely used in two-stage generative models. The first stage involves compressing the pixels into a lower-dimensional latent representation, followed by a second stage that generates pixels from this latent space. VAEs can be divided into two groups according to the token, i.e., discrete and continuous latent. The difference between the two types of VAEs is the quantization. Continuous VAEs have no quantization, while discrete VAEs learn a codebook for quantization and use it to convert the continuous latent features to discrete indices, called VQVAE [30]. When training discrete VAEs, some methods exploit a discriminator to improve the visual image quality, called VQGAN [9].

In video generation, 2D VAEs are typically inflated into 3D ones by injecting 3D Convs or temporal attention. 3D Convs are used for CNN-based VAEs, e.g., 3D-VQVAE [37] and 3D-VQGAN [11, 38]. Attention is used for transformer-based VAEs, e.g., CViViT [31]. Although there are several discrete 3D VAEs for video generation, there are no commonly used continuous 3D VAEs.

Video Generative models.

Video generation has achieved remarkable progress in recent years. The announcement of Imagen Video [14] and Make-A-Video [27] made researchers see the hope of purely AI-generated videos. Then, the launch of OpenAI SORA [22] brought the enthusiasm of researchers in academia and industry to a climax. Many video generation models [14, 27, 39] directly learn the distribution of pixels while some others [3, 4, 41, 32, 33, 21, 38, 11, 31, 37, 2, 12] learn the distribution of tokens in a latent space. The tokens are always extracted by a variational autoencoder [8]. Latent video generation models can be categorized into two groups according to whether the token is discrete or continuous. TATS [11], MAGVIT [38], VideoGPT [37], and Phenaki [31] are representative models trained with discrete tokens extracted by a 3D VAE within the VQVAE framework [30]. A codebook is learned jointly with the VAE for quantization. SVD [2], AnimateDiff [12], VideoCrafter [4], etc., are video models trained with continuous latent extracted by a 2D VAE without quantization, rather than a 3D VAE. SD image VAE is the commonly used 2D VAE. One reason is that video models are difficult to train from scratch and they are always initialized with the weights of a pretrained T2I model such as Stable Diffusion UNet [26]. Hence, the corresponding image VAE is used to extract latents from a video. Since the image VAE can only perform spatial compression, the temporal compression is realized by uniform frame sampling. This strategy ignores the motion between key frames.

A video VAE that is compatible with pretrained T2I or video models is still lacking. Although it is not difficult to train a video VAE (3D VAE) independently with high reconstruction accuracy, doing so results in a latent space gap between the learned video VAE and the existing pre-trained image and video models that are always used as initialization. The Open-Sora-Plan project [17] offers a video VAE; however, it is not compatible with existing image or video models. Large computational resources and a long training time are required to bridge the gap. In this work, we propose a latent space regularization method to train a video VAE whose latent space is compatible with pretrained models.

3 Method

We propose a latent space regularization method for training a video VAE that is compatible with pre-trained image and video models. We examine multiple strategies for implementing the regularization, focusing on either the encoder or the decoder of the image VAE. Additionally, we explore four types of mapping functions to develop the regularization loss. To enhance the efficiency of the video VAE, we introduce an architecture that employs different inflation strategies in distinct blocks, instead of incorporating 3D convolutions in all blocks.

3.1 Latent Space Regularization

We inflate a 2D VAE into a 3D VAE, initializing it with the 2D VAE’s weights. The 3D VAE is designed to be capable of encoding both images and videos (see details in Sec. 3.2). The key to building a compatible video VAE is the latent space alignment between the video VAE and the image VAE.

Notations.

Let $x\in\mathbb{R}^{H\times W\times 3}$ denote an image in RGB space and $X\in\mathbb{R}^{(T+1)\times H\times W\times 3}$ denote a video with $T+1$ frames. When $T=0$, $X$ degenerates into an image and the video VAE processes it with temporal padding. $z\in\mathbb{R}^{h\times w\times c}$ denotes the latent tokens of an image, extracted by either the image VAE or the video VAE, and $Z\in\mathbb{R}^{(t+1)\times h\times w\times c}$ denotes the latent tokens of a video, extracted by the video VAE. $\rho_{s}=H/h=W/w$ and $\rho_{t}=T/t$ are the spatial and temporal compression rates. Let $\mathcal{E}_{i}$ and $\mathcal{D}_{i}$ denote the encoder and decoder of the image VAE, and $\mathcal{E}_{v}$ and $\mathcal{D}_{v}$ those of the video VAE. Then, we have $z=\mathcal{E}_{i}(x)$, $Z=\mathcal{E}_{v}(X)$, and $z=\mathcal{E}_{v}(x)$. The image and video reconstructed from the latent tokens are $\tilde{x}=\mathcal{D}_{i}(z)=\mathcal{D}_{i}(\mathcal{E}_{i}(x))$, $\tilde{X}=\mathcal{D}_{v}(Z)=\mathcal{D}_{v}(\mathcal{E}_{v}(X))$, and $\tilde{x}=\mathcal{D}_{v}(z)=\mathcal{D}_{v}(\mathcal{E}_{v}(x))$.

Regularization.

We assume the latents of the image VAE follow a distribution, i.e., $z\sim p^{i}(z)$. The joint distribution of $t+1$ independent frames is $p^{i}(Z)=\prod_{k=1}^{t+1}p^{i}(z_{k})$. The latent distribution of the video VAE can be denoted as $Z\sim p^{v}(Z)$. To achieve the alignment between the latent spaces of the image and video VAEs, we have to build mappings between $p^{i}(Z)$ and $p^{v}(Z)$. Since neither distribution has an analytic form, a distance metric between distributions cannot be applied directly.

Here, we let the image VAE and the video VAE cooperate to construct a reconstruction loss for latent space alignment. When exploiting the encoder of the image VAE for alignment, the latent extracted by the image encoder should be correctly decoded by the decoder of the video VAE, i.e., $\tilde{X}_{i}^{v}=\mathcal{D}_{v}(\mathcal{E}_{i}(\psi(X)))$. The illustration is shown in Fig. 3(a). For a given input video $X\in\mathbb{R}^{(T+1)\times H\times W\times 3}$, we use a mapping function $\psi$ to sample $\psi(X)\in\mathbb{R}^{(T/\rho_{t}+1)\times H\times W\times 3}$. Encoding these $T/\rho_{t}+1=t+1$ frames with the image encoder yields $t+1$ latent frames, which the video decoder expands back to $T+1$ frames. Thus the reconstructed video $\tilde{X}_{i}^{v}$ has the same shape as $X$. Then, the reconstruction loss of using the image encoder can be defined as

$$L_{\text{reg}}^{\text{en}}=\|X-\tilde{X}_{i}^{v}\|^{2}. \tag{1}$$

When exploiting the decoder of the image VAE, the latent extracted by the video encoder should be correctly decoded by the decoder of the image VAE, i.e., $\tilde{X}_{v}^{i}=\mathcal{D}_{i}(\mathcal{E}_{v}(X))$. The illustration is shown in Fig. 3(b). For a given input video $X\in\mathbb{R}^{(T+1)\times H\times W\times 3}$, the reconstructed video is $\tilde{X}_{v}^{i}\in\mathbb{R}^{(T/\rho_{t}+1)\times H\times W\times 3}$. Then, the reconstruction loss of using the image decoder can be defined as

$$L_{\text{reg}}^{\text{dec}}=\|\psi(X)-\tilde{X}_{v}^{i}\|^{2}. \tag{2}$$

Mapping Functions.

To bridge the dimension gap between $\tilde{X}_{v}^{i}$ or $\tilde{X}_{i}^{v}$ and $X$, we investigate four types of mapping functions $\psi$ as follows. 1) First frame. We compare only the first frame of the input video and the reconstructed one. The regularization loss degenerates to measuring the difference between the input and the reconstruction of a single image. 2) Slice. $\psi$ samples one frame every $\rho_{t}$ frames to form a shorter video. Sampling starts from the second frame, and the first frame is always kept. 3) Average. $\psi$ computes the average of every $\rho_{t}$ frames, starting from the second frame. 4) Random. $\psi$ randomly samples one frame from every $\rho_{t}$ frames, starting from the second frame.
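The four mapping functions can be sketched in NumPy as follows. This is a minimal illustration rather than the released implementation: the function names are hypothetical, $T$ is assumed to be divisible by $\rho_{t}$, and for "Slice" the text does not say which frame of each group is taken, so the sketch assumes the last one.

```python
import numpy as np

def psi_first_frame(X):
    """'First frame': keep only frame 0, shape (1, H, W, 3)."""
    return X[:1]

def _groups(X, rho_t):
    """Split frames 1..T into T/rho_t groups of rho_t consecutive frames."""
    T = X.shape[0] - 1
    if T % rho_t != 0:
        raise ValueError("T must be divisible by rho_t")
    return X[1:].reshape(T // rho_t, rho_t, *X.shape[1:])

def psi_slice(X, rho_t=4):
    """'Slice': frame 0 plus one fixed frame per group.
    Assumption: the last frame of each group is the one that is kept."""
    g = _groups(X, rho_t)
    return np.concatenate([X[:1], g[:, -1]], axis=0)

def psi_average(X, rho_t=4):
    """'Average': frame 0 plus the mean of each group of rho_t frames."""
    g = _groups(X, rho_t)
    return np.concatenate([X[:1], g.mean(axis=1)], axis=0)

def psi_random(X, rho_t=4, rng=None):
    """'Random': frame 0 plus one uniformly chosen frame per group."""
    rng = np.random.default_rng() if rng is None else rng
    g = _groups(X, rho_t)
    idx = rng.integers(0, rho_t, size=g.shape[0])
    return np.concatenate([X[:1], g[np.arange(g.shape[0]), idx]], axis=0)

# A toy video with T + 1 = 9 frames maps to T / rho_t + 1 = 3 frames.
video = np.zeros((9, 16, 16, 3), dtype=np.float32)
for psi in (psi_slice, psi_average, psi_random):
    print(psi.__name__, psi(video).shape)
```

Apart from "First frame", each function returns $T/\rho_{t}+1$ frames, which is the shape that the image decoder produces in Eq. 2.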

Training Objective.

Following the training of the 2D VAE in LDM [26], our basic objective is a combination of a reconstruction loss [40], an adversarial loss [9], and a KL regularization [15], i.e.,

$$L_{\text{AE}}=\min_{\mathcal{E}_{v},\mathcal{D}_{v}}\max_{D_{v}}\; L_{\text{rec}}(X,\mathcal{D}_{v}(\mathcal{E}_{v}(X)))-L_{\text{adv}}(\mathcal{D}_{v}(\mathcal{E}_{v}(X)))+\log D_{v}(X)+L_{\text{KL}}(X;\mathcal{E}_{v},\mathcal{D}_{v}),$$

where the first term is the reconstruction loss, the second and third are the adversarial loss, and the last is the KL regularization. $D_{v}$ is the discriminator that differentiates original videos from reconstructed ones. It is inflated from the image discriminator in LDM by injecting 3D convolutions. Then, for latent space alignment, our full training objective is:

$$L_{\text{AE}}^{\text{align}}=L_{\text{AE}}+\lambda_{1}L_{\text{reg}}^{\text{dec}}+\lambda_{2}L_{\text{reg}}^{\text{en}}, \tag{3}$$

where $\lambda_{1}$ and $\lambda_{2}$ are trade-off parameters. We explore different settings of $\lambda_{1}$ and $\lambda_{2}$ and find that using the decoder only achieves the best performance. The framework of CV-VAE is shown in Fig. 3(c) and evaluations between different regularization methods can be found in Tab. 4.

3.2 Architecture Design of Video VAE

We design the architecture of the video VAE according to the image VAE in LDM [26]. The detailed architecture is presented in Appendix A.1. We explain the key modifications as follows.

Model Inflation.

Considering the latent space compatibility and the convergence speed of the video VAE, we make full use of the pretrained weights of the image VAE for initialization, instead of training from scratch. We inflate the image VAE into the video VAE by replacing 2D convolutions with 3D convolutions. 3D convolutions are used to model the temporal dynamics among frames. To initialize the 3D convolutions, we copy the weights of the 2D Conv kernel to the corresponding positions in the 3D Conv kernel and set the remaining parameters to zero. We set the stride to achieve temporal downsampling and increase the number of 3D kernels by a factor of $s$ to achieve $s\times$ temporal upsampling. To enable the video VAE to handle both image and video, given $T+1$ frames as input, we use reflection padding in the temporal dimension for the first frame. By initializing the video VAE using the above operations, we can reconstruct images without training, significantly accelerating the training convergence speed on video datasets.
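The weight-copy initialization can be illustrated with a small NumPy sketch. It is written under stated assumptions and is not the authors' code: the helper name `inflate_conv_kernel` is hypothetical, and since the text does not specify which temporal position receives the 2D weights, the sketch exposes it as `t_index` and defaults to the middle tap.

```python
import numpy as np

def inflate_conv_kernel(w2d, k_t=3, t_index=None):
    """Inflate a 2D kernel (out_ch, in_ch, k_h, k_w) into a 3D kernel
    (out_ch, in_ch, k_t, k_h, k_w): the 2D weights go to one temporal tap,
    every other tap is zero."""
    out_ch, in_ch, k_h, k_w = w2d.shape
    if t_index is None:
        t_index = k_t // 2  # assumption: middle tap
    if not 0 <= t_index < k_t:
        raise ValueError("t_index must lie inside the temporal kernel")
    w3d = np.zeros((out_ch, in_ch, k_t, k_h, k_w), dtype=w2d.dtype)
    w3d[:, :, t_index] = w2d
    return w3d

rng = np.random.default_rng(0)
w2d = rng.standard_normal((4, 3, 3, 3))    # (out_ch, in_ch, k_h, k_w)
w3d = inflate_conv_kernel(w2d)
patch = rng.standard_normal((3, 3, 3, 3))  # (in_ch, k_t, k_h, k_w)

# Response of the 3D kernel on a 3-frame window vs. the 2D kernel on the
# single frame that sits under the non-zero tap.
resp_3d = np.einsum("oithw,ithw->o", w3d, patch)
resp_2d = np.einsum("oihw,ihw->o", w2d, patch[:, 1])
print(np.allclose(resp_3d, resp_2d))
```

Because the remaining taps are zero, the inflated layer initially responds to one frame exactly as the 2D layer does, which is why images can be reconstructed before any video training.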

Efficient 3D Architecture.

Expanding 2D Convs to 3D Convs (e.g., $k\times k\to k\times k\times k$ ) results in $k\times$ parameters and computational complexity. To improve the computational efficiency of the model, we adopt a 2D+3D network structure. Specifically, we retain half of the convolutions in the ResBlock as 2D Convs and set the other half as 3D Convs. We find that, compared to setting all Convs to 3D, the number of parameters and the computational complexity are reduced by roughly 30%, while the reconstruction performance remains nearly the same. See Sec. 4.2 for experimental comparisons.
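The two numbers in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes that the two figures in the Params column of Tab. 1 can simply be summed to obtain a model total; the channel width of 64 is an arbitrary illustrative value, and only the parameter count (not the computational cost) is checked.

```python
def conv_weight_count(c_in, c_out, k, ndim):
    """Number of weights in a conv layer with a k^ndim kernel (bias ignored)."""
    return c_in * c_out * k ** ndim

K = 3
# Going from a k x k kernel to a k x k x k kernel multiplies the weights by k.
growth = conv_weight_count(64, 64, K, 3) / conv_weight_count(64, 64, K, 2)

# Params column of Tab. 1, in millions.
params_full_3d = 100 + 156  # Ours (3D)
params_hybrid = 68 + 114    # Ours (2D+3D)
reduction = 1 - params_hybrid / params_full_3d

print(f"3D/2D weight ratio for k={K}: {growth:.1f}")
print(f"parameter reduction of 2D+3D vs. 3D: {reduction:.1%}")
```

The reduction evaluates to about 28.9%, in line with the "roughly 30%" stated above.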

Temporal Tiling for Arbitrary Video Length

Existing image VAEs employ spatial tiling on high-resolution images to achieve memory-friendly processing, but spatial tiling alone cannot handle long videos. We therefore introduce temporal tiling. During encoding, the video $X$ is divided into $[X_{1},X_{2},\ldots,X_{n}]$, where $X_{i}\in\mathbb{R}^{(1+f\cdot\rho_{t})\times H\times W\times 3}$ and $f$ is a parameter controlling the size of each block. $X_{i}$ and $X_{i+1}$ have a one-frame overlap in the temporal dimension. After encoding each $X_{i}$ to obtain $Z_{i}$, we discard the first frame of $Z_{i}$ when $i>1$ and concatenate all $Z_{i}$ in the temporal dimension to obtain $Z$. The decoding process is handled similarly to the encoding process. By combining our method with 2D tiling, we can encode videos with arbitrary resolution and length.
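The index bookkeeping of temporal tiling can be sketched as below. The helper names are hypothetical, the sketch assumes $T$ is divisible by $\rho_{t}$, and it lets the last block be shorter than the others, which the text does not discuss. The default $f=4$ gives 17-frame blocks with $\rho_{t}=4$.

```python
def temporal_tiles(num_frames, rho_t=4, f=4):
    """Return half-open frame ranges (start, stop) of blocks that hold
    1 + f * rho_t frames and overlap their neighbour by one frame."""
    T = num_frames - 1
    if T < 0 or T % rho_t != 0:
        raise ValueError("expected 1 + m * rho_t frames")
    if T == 0:
        return [(0, 1)]  # a single image
    step = f * rho_t
    tiles, start = [], 0
    while start < T:
        end = min(start + step, T)      # index of the last frame in the block
        tiles.append((start, end + 1))
        start = end                     # next block re-uses this frame
    return tiles

def latent_length(tiles, rho_t=4):
    """Latent frames after dropping the first latent frame of every block
    except the first one."""
    total = 0
    for i, (start, stop) in enumerate(tiles):
        n = 1 + (stop - start - 1) // rho_t
        total += n if i == 0 else n - 1
    return total

tiles = temporal_tiles(97)
print(tiles)
print(latent_length(tiles))
```

For a 97-frame video this yields six 17-frame blocks and 25 latent frames in total, the same as $1+96/\rho_{t}$ for an untiled encoding.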

4 Experiments

4.1 Experimental Setups

Datasets and Metrics. We evaluate our CV-VAE on the COCO2017 [18] validation dataset and the Webvid [1] validation dataset which includes 1024 videos. Both images and videos are resized and cropped to a resolution of $256\times 256$ . Each video is sampled with 33 frames and a frame stride of 3. We evaluate the reconstruction performance of CV-VAE on images and videos using metrics such as PSNR, SSIM [34], and LPIPS scores [40]. We employ 3D tiled processing to encode and decode videos with arbitrary resolution and length within a limited memory footprint. During inference, we allow a single video block size of $17\times 576\times 576$ . We evaluate the video generation quality of our model using 2048 randomly sampled videos from UCF101 [28] and MSR-VTT [36]. Videos are resized and cropped to a resolution of $576\times 1024$ to fit the SVD [2]. We use Frechet Video Distance (FVD) [29], Kernel Video Distance (KVD) [29], and Perceptual Input Conformity (PIC) [35] metrics to evaluate video generation quality. For evaluating image generation quality, we use 2048 samples from the COCO2017 validation dataset and employ FID [13], CLIP score [24], and PIC score metrics.

Training Details. We train our CV-VAE model using image datasets including LAION-COCO [7] and Unsplash [20], as well as the video dataset Webvid-10M [1]. For image datasets, we employ two resolutions, i.e., $256\times 256$ and $512\times 512$. For video datasets, we use two settings of frames and resolutions: $9\times 256\times 256$ and $17\times 192\times 192$. The batch sizes for these four settings are 8, 2, 1, and 1, with sampling ratios of 40%, 10%, 25%, and 25%, respectively. We employ the AdamW optimizer [19] with a learning rate of 1e-4 and cosine learning rate decay. To avoid numerical overflow, we train CV-VAE using float32 precision, and the training is carried out on 16 A100 GPUs for 200K steps. To fine-tune SVD on CV-VAE, we use in-house data with a frame count and resolution of $97\times 576\times 1024$. We employ DeepSpeed stage 2 [25] and gradient checkpointing [6], and train with bfloat16 precision. We use a constant learning rate of 1e-5 with the AdamW [19] optimizer and only optimize the last layer of the U-Net. The training is carried out on 16 A100 GPUs for 5K steps.

4.2 Image and Video Reconstruction

We evaluated the reconstruction quality of various VAE models on image and video test sets. The comparison group includes: (1) VAE-SD2.1 [26], which is widely used in the community for image and video generation models. (2) VQGAN [9], which encodes pixels into discrete latents. We use the f8-8192 version for comparison. (3) TATS [11]: a 3D VQGAN designed for video generation. (4) VAE-OSP [17]: a 3D VAE from Open-Sora-Plan which is initialized from VAE-SD2.1 and trained with video data. (5) Our CV-VAE (2D+3D): retains half of the 2D convolutions to reduce computational overhead. (6) Our CV-VAE (3D): utilizes only 3D convolutions.

Method	Params	FCR	Comp.	COCO-Val			Webvid-Val
				PSNR( $\uparrow$ )	SSIM( $\uparrow$ )	LPIPS( $\downarrow$ )	PSNR( $\uparrow$ )	SSIM( $\uparrow$ )	LPIPS( $\downarrow$ )
VAE-SD2.1 [26]	34M + 49M	1x	-	26.6	0.773	0.127	28.9	0.810	0.145
VQGAN [9]	26M + 38M	1x	$\times$	22.7	0.678	0.186	24.6	0.718	0.179
TATS [11]	7M + 16M	4x	$\times$	23.4	0.741	0.287	24.1	0.729	0.310
VAE-OSP [17]	94M + 135M	4x	$\times$	27.0	0.791	0.142	26.7	0.781	0.166
Ours(2D+3D)	68M + 114M	4x	$\checkmark$	27.6	0.805	0.136	28.5	0.817	0.143
Ours(3D)	100M + 156M	4x	$\checkmark$	27.7	0.805	0.135	28.6	0.819	0.145

Table 1: Quantitative evaluation on image and video reconstruction. FCR represents the frame compression rate, and Comp. indicates compatibility with existing generative models.

As illustrated in Tab. 1, we present the parameter count (Params), Frame Compression Ratio (FCR), and compatibility with existing diffusion models (Comp.) for various VAE models. Thanks to the latent constraint, our model is compatible with current diffusion models, compresses videos by 4 $\times$ in the temporal dimension, and achieves top-tier image and video reconstruction quality. This enables the generation of longer videos under roughly the same computational resources.

We also conducted a qualitative comparison of the reconstruction results for different VAE models, as shown in Fig. 4. In the top row, we reconstructed images with a resolution of $512\times 512$ and compared them with image VAE models. All three models compressed the images to a latent size of $64\times 64$. Our results were close to those of VAE-SD2.1, while VQGAN had the worst performance. In the bottom row, we reconstructed videos with a resolution of $33\times 512\times 512$ and compared them with video VAE models. All three models compressed the videos to a latent size of $9\times 64\times 64$. Comparing the decoded videos at the same frames, our model achieved the best results. See Appendix A.2 and A.3 for more reconstruction results.

4.3 Compatibility with Existing Models

Image Models

We tested the compatibility of our CV-VAE by integrating it into the pretrained SD2.1 [26], replacing the original 2D VAE without any finetuning. We evaluated it on the COCO-Val [18] dataset and compared the results with the SD2.1 model using FID, CLIP score, and PIC metrics. The data (see Tab. 2) suggest that both models perform similarly in text-to-image generation.

We also visualized the text-to-image generation results of both models in Fig. 5. In each pair, the left side depicts the results of SD2.1, while the right side shows the results generated by our CV-VAE, which replaced the original VAE, using the same random seed and prompt. The results show that both models generate images with almost identical content and texture, with only slight differences in color. This further validates the feasibility of building a compatible VAE via latent constraint.

Figure 6: Comparison between the image VAE and our video VAE on image-to-video generation of SVD [2]. ‘SVD’ means using the image VAE. ‘SVD + CV-VAE’ means using our video VAE and tuning the output layer of SVD.

	Trainable	COCO2017-Val
		FID( $\downarrow$ )	CLIP( $\uparrow$ )	PIC( $\uparrow$ )
SD2.1 [26]	$\times$	57.3	0.312	0.354
SD2.1+CV-VAE	$\times$	57.6	0.311	0.360

Table 2: Quantitative results of text-to-image generation.

Video Models

The primary objective of CV-VAE is to train a model that can compress both time and space, while also being compatible with the existing 2D VAE. In this section, we validate the compatibility of CV-VAE with existing video generation models. We integrate CV-VAE into SVD [2], replacing the original VAE, and decode the generated video latents. CV-VAE offers the flexibility to decode either in image mode (CV-VAE-I) or video mode (CV-VAE-V); the former decodes $n$ frames of latent into $n$ frames of video, while the latter decodes $n$ frames of latent into $1+(n-1)\times 4$ frames of video. We tested the video generation quality of both modes. Furthermore, we fine-tune SVD for better alignment.
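The frame counts of the two decoding modes follow directly from the formula above. A minimal sketch, with hypothetical helper names and the temporal compression rate of 4 passed as a parameter:

```python
def decoded_frames(n_latent, rho_t=4, video_mode=True):
    """Frames produced from n_latent latent frames.
    Image mode (CV-VAE-I): n -> n. Video mode (CV-VAE-V): n -> 1 + (n - 1) * rho_t."""
    if n_latent < 1:
        raise ValueError("need at least one latent frame")
    return 1 + (n_latent - 1) * rho_t if video_mode else n_latent

def latent_frames(n_frames, rho_t=4):
    """Inverse direction: latent frames obtained from 1 + m * rho_t video frames."""
    if n_frames < 1 or (n_frames - 1) % rho_t != 0:
        raise ValueError("expected 1 + m * rho_t frames")
    return 1 + (n_frames - 1) // rho_t

print(decoded_frames(25, video_mode=False), decoded_frames(25), latent_frames(33))
```

With the 25 latent frames generated by SVD this gives 25 frames in image mode and 97 frames in video mode, matching the Frames column of Tab. 3; conversely, a 33-frame clip maps to 9 latent frames.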

Method	Trainable	FCR	Frames	UCF-101			MSR-VTT
				FVD( $\downarrow$ )	KVD( $\downarrow$ )	PIC( $\uparrow$ )	FVD( $\downarrow$ )	KVD( $\downarrow$ )	PIC( $\uparrow$ )
SVD [2]	$\times$	1x	25	402	8.20	0.791	310	1.30	0.588
SVD+CV-VAE-I	$\times$	1x	25	419	6.73	0.763	262	1.67	0.609
SVD+CV-VAE-V	$\times$	4x	97	762	15.7	0.791	319	3.31	0.696
SVD+CV-VAE-V	Output layer	4x	97	681	13.1	0.858	295	2.26	0.734

Table 3: Evaluation results of image-to-video generation. FCR denotes the frame compression rate.

Constraint	COCO-Val			Webvid-Val
	PSNR( $\uparrow$ )	SSIM( $\uparrow$ )	LPIPS( $\downarrow$ )	PSNR( $\uparrow$ )	SSIM( $\uparrow$ )	LPIPS( $\downarrow$ )
2D Enc.	26.0	0.759	0.205	26.0	0.748	0.222
2D Dec.	27.5	0.801	0.151	28.0	0.803	0.158
2D Enc. + Dec.	27.9	0.808	0.150	27.6	0.795	0.176

Table 4: Comparison of different regularization types.

Mapping Function	COCO-Val			Webvid-Val
	PSNR( $\uparrow$ )	SSIM( $\uparrow$ )	LPIPS( $\downarrow$ )	PSNR( $\uparrow$ )	SSIM( $\uparrow$ )	LPIPS( $\downarrow$ )
1st Frame	27.3	0.797	0.156	26.6	0.771	0.191
Average	27.5	0.801	0.151	27.9	0.801	0.172
Slice	27.5	0.802	0.152	27.7	0.799	0.168
Random	27.6	0.803	0.138	28.4	0.811	0.153

Table 5: Comparison of different mapping functions.

As shown in Tab. 3, incorporating CV-VAE-I into a frozen SVD immediately yields video generation quality comparable to the original VAE. Using CV-VAE in video mode can also decode videos generated by SVD, and further improvements in video decoding quality can be achieved by fine-tuning only the output layer (approximately 12k parameters). One of the reasons for the noticeable gap in test metrics between SVD+CV-VAE-I and SVD+CV-VAE-V is that they use different numbers of frames, making a direct comparison challenging.

In Fig. 6, we also display the comparison results with SVD [2]. The top row shows the generated results by SVD, and the bottom row shows the generated results after inserting CV-VAE into SVD and fine-tuning the output layer. We use the first frame as a condition and generate with the same random seed. The U-Net generates 25 frames of latent, which are decoded by CV-VAE into a 97-frame video. As can be seen, compared to the original SVD, our results exhibit smoother motion. It is worth noting that both models have the same computational complexity during the diffusion process, which means that our model is more scalable.

4.4 Ablation Study

Influence of Regularization Type

We evaluated the impact of three types of latent regularization: (1) 2D Enc., i.e., $\lambda_{1}=0$ and $\lambda_{2}=1$ in Eq. 3; (2) 2D Dec., i.e., $\lambda_{1}=1$ and $\lambda_{2}=0$ in Eq. 3; (3) 2D Enc. + Dec., i.e., $\lambda_{1}=1$ and $\lambda_{2}=1$ in Eq. 3.

Tab. 4 shows the impact of various latent regularization methods. Using the 2D decoder for latent regularization results in better reconstruction for both image and video test sets compared to the 2D encoder. This is likely because the gradient backpropagation through the 2D decoder provides better guidance for the 3D VAE’s learning, while the frozen 2D encoder does not propagate gradients. The ‘2D Enc. + Dec.’ method performs slightly better on the image test set but worse on the video test set compared to ‘2D Dec.’ Since our main goal is video reconstruction, and for simplicity, we use the 2D decoder for regularization.

Influence of Mapping Functions

The 2D decoder decodes $n$ frames of latents into $n$ frames of video, while the 3D decoder decodes the same $n$ frames of latents into $1+(n-1)\times 4$ frames of video. Therefore, we need to map the input video to $n$ frames to calculate the regularization loss in Eq. 2. We evaluated the four mapping functions described in Sec. 3.1.

As shown in Tab. 5, the four methods have similar effects on image reconstruction, with the main differences being in video reconstruction. The ‘1st Frame’ approach yields the worst video reconstruction results due to the lack of regularization and guidance for subsequent frames. The ‘Slice’ method results in poor reconstruction quality for the three unsampled middle frames. The ‘Average’ method is inferior to ‘Random’ in video reconstruction, primarily because calculating the mean for multiple consecutive frames leads to motion blur in the target.

5 Conclusion and Limitations

We propose a novel method to train a video VAE that is compatible with existing image and video models trained with the SD image VAE. The video VAE provides a truly spatio-temporally compressed latent space for latent generative video models, as opposed to uniform frame sampling. Due to the latent space compatibility, a new video model can be trained efficiently with pretrained image or video models as initialization. In addition, existing video models such as SVD can generate smoother videos with four times more frames using our video VAE by slightly fine-tuning a few parameters. Extensive experiments demonstrate the effectiveness of the proposed VAE.

Limitations.

The performance of the proposed video VAE relies on the channel dimension of the latent space. A higher dimension may yield better reconstruction accuracy. Since we pursue latent space compatibility with existing image and video models trained with the SD image VAE, the channel dimension of our video VAE is constrained to match that of the image VAE. This can be improved if an image VAE with a higher channel dimension becomes available, e.g., the VAE of SD3 [10].

References

Bain et al. [2021] M. Bain, A. Nagrani, G. Varol, and A. Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In IEEE International Conference on Computer Vision, 2021.
Blattmann et al. [2023a] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023a.
Blattmann et al. [2023b] A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023b.
Chen et al. [2023] H. Chen, M. Xia, Y. He, Y. Zhang, X. Cun, S. Yang, J. Xing, Y. Liu, Q. Chen, X. Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.
Chen et al. [2024] H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y. Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024.
Chen et al. [2016] T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
Christoph et al. [2022] S. Christoph, K. Andreas, V. Richard, C. Theo, and B. Romain. Laion-coco: 600m synthetic captions from laion2b-en, 2022. URL https://laion.ai/blog/laion-coco/.
Doersch [2016] C. Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
Esser et al. [2021] P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
Esser et al. [2024] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024.
Ge et al. [2022] S. Ge, T. Hayes, H. Yang, X. Yin, G. Pang, D. Jacobs, J.-B. Huang, and D. Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer. In European Conference on Computer Vision, pages 102–118, 2022.
Guo et al. [2023] Y. Guo, C. Yang, A. Rao, Y. Wang, Y. Qiao, D. Lin, and B. Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
Heusel et al. [2017] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
Ho et al. [2022] J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
Kingma and Welling [2013] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
Kondratyuk et al. [2023] D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang, R. Hornung, H. Adam, H. Akbari, Y. Alon, V. Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.
Lab and etc. [2024] P.-Y. Lab and T. A. etc. Open-sora-plan, 2024. URL https://doi.org/10.5281/zenodo.10948109.
Lin et al. [2014] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
Loshchilov and Hutter [2017] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
Luke Chesser [2023] A. Z. Luke Chesser, Timothy Carbone. Unsplash. https://github.com/unsplash/datasets, 2023.
Ma et al. [2024] X. Ma, Y. Wang, G. Jia, X. Chen, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024.
OpenAI [2024] OpenAI. Openai sora. Accessed May 15, 2024 [Online], 2024. URL https://openai.com/index/video-generation-models-as-world-simulators/.
Peebles and Xie [2023] W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763, 2021.
Rajbhandari et al. [2020] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16, 2020.
Rombach et al. [2022] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
Singer et al. [2022] U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, et al. Make-a-video: Text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792, 2022.
Soomro et al. [2012] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
Unterthiner et al. [2018] T. Unterthiner, S. Van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
Van Den Oord et al. [2017] A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
Villegas et al. [2022] R. Villegas, M. Babaeizadeh, P.-J. Kindermans, H. Moraldo, H. Zhang, M. T. Saffar, S. Castro, J. Kunze, and D. Erhan. Phenaki: Variable length video generation from open domain textual descriptions. In International Conference on Learning Representations, 2022.
Wang et al. [2023a] J. Wang, H. Yuan, D. Chen, Y. Zhang, X. Wang, and S. Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023a.
Wang et al. [2023b] Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023b.
Wang et al. [2004] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
Xing et al. [2023] J. Xing, M. Xia, Y. Zhang, H. Chen, X. Wang, T.-T. Wong, and Y. Shan. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023.
Xu et al. [2016] J. Xu, T. Mei, T. Yao, and Y. Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016.
Yan et al. [2021] W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
Yu et al. [2023] L. Yu, Y. Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang, Y. Hao, I. Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10459–10469, 2023.
Zhang et al. [2023] D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y. Gu, D. Gao, and M. Z. Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.
Zhang et al. [2018] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
Zhou et al. [2022] D. Zhou, W. Wang, H. Yan, W. Lv, Y. Zhu, and J. Feng. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.

Appendix A Appendix

A.1 CV-VAE Model Architecture

As illustrated in Fig. 7, we introduce the structure of the CV-VAE. The architecture of CV-VAE is primarily derived from the VAE in Stable Diffusion [26], with several notable differences: (1) Some or all 2D convolutions within the network are transformed into 3D convolutions, while retaining their weights. (2) Temporal downsampling is executed in the encoder through the use of strides. (3) Temporal upsampling is accomplished by increasing the output channel number of 3D convolutions by a specific factor. (4) A discriminator, comprising 3D convolutions, is utilized. The main differences are marked in red text in Fig. 7.
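To make difference (1) concrete, the sketch below shows one common way to turn a 2D convolution kernel into a 3D one while retaining its weights: the 2D kernel is placed in the centre temporal slice and the other slices are set to zero, so the inflated kernel initially acts on each frame exactly like the original 2D kernel. This particular initialization is an assumption made for illustration; the text above only states that the weights are retained. The helper names `inflate_kernel`, `conv2d`, and `conv3d` are hypothetical, and the code handles a single channel with plain Python lists.

```python
def inflate_kernel(kernel_2d, k_t=3):
    """Inflate a kH x kW kernel to k_t x kH x kW (assumed centre-slice init)."""
    k_h, k_w = len(kernel_2d), len(kernel_2d[0])
    kernel_3d = [[[0.0] * k_w for _ in range(k_h)] for _ in range(k_t)]
    kernel_3d[k_t // 2] = [row[:] for row in kernel_2d]
    return kernel_3d

def conv2d(frame, kernel):
    """Single-channel 'valid' cross-correlation of one frame."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = len(frame) - k_h + 1, len(frame[0]) - k_w + 1
    return [[sum(kernel[i][j] * frame[y + i][x + j]
                 for i in range(k_h) for j in range(k_w))
             for x in range(out_w)] for y in range(out_h)]

def conv3d(video, kernel_3d):
    """'Valid' in space, zero-padded in time so the frame count is kept."""
    k_t = len(kernel_3d)
    pad = k_t // 2
    out = []
    for t in range(len(video)):
        acc = None
        for dt in range(k_t):
            src = t + dt - pad
            if 0 <= src < len(video):  # frames outside the clip count as zeros
                part = conv2d(video[src], kernel_3d[dt])
                if acc is None:
                    acc = part
                else:
                    acc = [[a + b for a, b in zip(ra, rb)]
                           for ra, rb in zip(acc, part)]
        out.append(acc)
    return out

# Four 3x3 frames and an arbitrary 2x2 kernel.
video = [[[float(t + 3 * y + x) for x in range(3)] for y in range(3)]
         for t in range(4)]
kernel = [[1.0, -1.0], [0.5, 2.0]]
same = conv3d(video, inflate_kernel(kernel)) == [conv2d(f, kernel) for f in video]
print(same)
```

With this kind of initialization, the 3D network starts from the behaviour of the 2D network on each frame, and the temporal slices are learned during training.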

A.2 Qualitative Examples of Image Reconstruction

In Fig. 8, we showcase additional image reconstruction results using CV-VAE. We use the ‘2D + 3D’ version. These images are sourced from the COCO2017 [18] dataset with a resolution of $512\times 512$. The reconstructed images closely match the colors and textures of the originals, demonstrating the high fidelity of our CV-VAE in encoding and reconstructing images. Interestingly, in Fig. 5, slight color differences can be observed between the images decoded by the Image VAE and CV-VAE, given the same latent generated by the Image Diffusion Model. This suggests that there is still a minor discrepancy between the latent spaces of the video VAE trained with latent regularization and the Diffusion Model. This gap can be bridged with minimal additional training.

A.3 Qualitative Examples of Video Reconstruction

As shown in Fig. 9, we present the reconstruction results of 4 consecutive frames from a video clip ($33\times 576\times 1024$) using CV-VAE. The reconstructed video frames maintain consistency in color, structure, and motion with the ground truth. In CV-VAE, these consecutive frames are compressed into a single latent frame, signifying that even a single latent frame encapsulates motion information.

A.4 Compatibility with Existing Text-to-video Model

We tested the compatibility of CV-VAE with existing text-to-video diffusion models, such as VideoCrafter2 (VC2) [5], which also employs the 2D VAE from SD as its first-stage model. We adopted a strategy similar to the training of ‘SVD + CV-VAE’ in Sec. 4.3, fine-tuning the last layer of the U-Net in VC2 to adapt it to CV-VAE. We fine-tuned the model using in-house data at a resolution of $61\times 320\times 512$, which is equivalent to a latent size of $16\times 40\times 64$.
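The relation between video size and latent size quoted above can be checked with a small helper. This is a minimal sketch: the function names are hypothetical, the temporal factor of 4 follows the $1+(n-1)\times 4$ rule from Sec. 4.4, and the spatial factor of 8 is an assumption inferred from the $320\times 512$ to $40\times 64$ mapping stated here.

```python
def latent_shape(frames, height, width, t_factor=4, s_factor=8):
    """Latent size (frames, height, width) for a video of the given size."""
    if (frames - 1) % t_factor or height % s_factor or width % s_factor:
        raise ValueError("video size is not compatible with the compression factors")
    return (1 + (frames - 1) // t_factor, height // s_factor, width // s_factor)

def decoded_frames(latent_frames, t_factor=4):
    """Number of video frames decoded from `latent_frames` latent frames."""
    return 1 + (latent_frames - 1) * t_factor

print(latent_shape(61, 320, 512))  # (16, 40, 64), the VC2 fine-tuning setting
print(decoded_frames(25))          # 97, the SVD setting in Sec. 4.3
```

The first frame is treated separately, which is why frame counts of the form $4k+1$ (33, 61, 97) appear throughout the experiments.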

As shown in Fig. 10, compared to the original VC2, ‘VC2 + CV-VAE’ generates videos approximately four times longer, resulting in smoother motion. This further validates the feasibility of obtaining a compatible video VAE through latent regularization, thereby avoiding the massive computational power required to train a video diffusion model from scratch.

Appendix B Societal Impacts

The CV-VAE can be seamlessly integrated into existing diffusion models, replacing the original 2D VAE for image or video generation, which may result in potential societal implications. While it proves beneficial in fields such as entertainment and advertising, by providing more realistic and immersive content, it also raises ethical and safety concerns. The ease of generating high-quality synthetic images and videos could lead to a surge in the production of harmful or misleading content, such as deepfakes, potentially exacerbating issues of misinformation and privacy invasion. We condemn the misuse of generative AI that harms individuals or spreads misinformation.