OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Junke Wang1,2, Yi Jiang3♠, Zehuan Yuan3, Binyue Peng3, Zuxuan Wu1,2†♠, Yu-Gang Jiang

1Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
2Shanghai Collaborative Innovation Center on Intelligent Visual Computing, 3Bytedance Inc.

Code available at https://github.com/FoundationVision/OmniTokenizer

Abstract

The tokenizer, which maps intricate visual data into a compact latent space, lies at the core of visual generative models. Based on the finding that existing tokenizers are tailored to either image or video inputs, this paper presents OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. OmniTokenizer is designed with a spatial-temporal decoupled architecture, which integrates window attention and causal attention for spatial and temporal modeling, respectively. To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data at a fixed resolution to develop its spatial encoding capacity and then jointly trained on image and video data at multiple resolutions to learn the temporal dynamics. OmniTokenizer, for the first time, handles both image and video inputs within a unified framework and demonstrates that their synergy can be realized. Extensive experiments demonstrate that OmniTokenizer achieves state-of-the-art (SOTA) reconstruction performance on various image and video datasets, e.g., 1.11 reconstruction FID on ImageNet and 42 reconstruction FVD on UCF-101, beating the previous SOTA methods by 13% and 26%, respectively. We also show that, when integrated with OmniTokenizer, both language model-based approaches and diffusion models can realize advanced visual synthesis performance, underscoring the superiority and versatility of our method.

♠: project leaders, †: corresponding author.

1 Introduction

The development of generative models [25, 52, 14, 17, 10, 39] has been one of the most exhilarating advances in artificial intelligence, offering the potential to revolutionize the way we generate visual content. In recent years, two dominant paradigms for visual generation have emerged: language model-based methods [52, 12, 64, 46] and diffusion models [17, 43]. The former exploits the superior sequence modeling capability of language models (LMs) [34, 35, 50] for visual generation by formulating it as a next-token prediction process, while the latter gradually transforms noise into coherent visual structures through a carefully crafted reverse diffusion process.

Core to both approaches is the tokenizer, which translates visual signals into latent representations, with LM tokenizers, also known as VQVAE, discretizing inputs into sequences of latent codes [12, 62, 64], and diffusion tokenizers, i.e., VAE, modeling their probability distributions within a latent space [25, 39]. Analogous to the role of the lexicon in a written language, tokenizers for visual synthesis dictate the upper bound of the generative models, thus attracting increasing attention in the community [12, 61, 19].

Existing tokenizers are designed specifically for either image [12, 62] or video [61, 13, 64] inputs, which inherently limits their application flexibility and data scalability for the downstream generative models. Although MAGVITv2 [65] has explored causal 3D convolution to process both modalities, it still has to train separate models for image and video data, without achieving synergy between them. This work highlights the critical need for a joint image-video tokenizer, with two primary considerations. First, a joint image-video tokenizer enables joint learning from image and video data [56, 58], which mitigates the scarcity of data in a single modality (particularly video data) and helps the tokenizer learn more general representations. Second, a unified tokenization framework inherently enjoys better versatility and scalability. For instance, its performance can be improved by incorporating data from either modality for training. This further promotes the efficacy of generative models tailored to image or video generation.

With this in mind, we present OmniTokenizer, a transformer-based tokenizer for joint image-video tokenization. Intuitive as it may seem, simply unifying image and video data does not by itself yield reciprocal benefits between the two modalities. To address this challenge, we turn to a spatial-temporal decoupled architecture [2], where window attention [27] is employed in the spatial dimension owing to its local aggregation capacity and efficiency, and causal attention is used in the temporal dimension to capture the motion in videos and ensure temporal coherence. Complementing the model design, we introduce a progressive training strategy that begins with image pretraining at a fixed resolution to establish a fundamental understanding of static visual information. After this, we integrate video data for joint training at variable resolutions to capture the dynamics in more complex scenes. The progressive training strategy allows our method to bridge the gap between disparate forms of visual input and capitalize on the rich spectrum of visual data.

To empirically validate the effectiveness of the proposed method, we separately implement the LM and diffusion tokenizers, i.e., OmniTokenizer-VQVAE and OmniTokenizer-VAE, and conduct experiments on a wide range of datasets including ImageNet [9], CelebA-HQ [21], FFHQ [22], UCF-101 [44], Kinetics-600 [6], etc. The results demonstrate that our model outperforms existing methods in terms of reconstruction quality on both image datasets (e.g., 1.11 rFID for OmniTokenizer-VQVAE and 0.69 rFID for OmniTokenizer-VAE on ImageNet) and video datasets (e.g., 42 rFVD for OmniTokenizer-VQVAE and 23 rFVD for OmniTokenizer-VAE on UCF-101). In addition, employing our approach for tokenization, we show that both language model-based generative models and diffusion models can achieve competitive results on class-conditional generation, unconditional generation, and frame prediction tasks.

In summary, our work makes the following key contributions:

  • We introduce OmniTokenizer, a transformer-based tokenizer for joint image and video tokenization. For the first time, OmniTokenizer employs a shared framework and shared weights to handle both types of visual data.

  • We propose a progressive training strategy that begins with image pre-training at a fixed resolution and then transitions to image-video joint training at multiple resolutions. Such an approach capitalizes on the synergies between image and video data, enabling OmniTokenizer to achieve better performance than image-only or video-only training.

  • We conduct extensive experiments across various datasets, including ImageNet, CelebA-HQ, FFHQ, UCF-101, and Kinetics-600. The results showcase the state-of-the-art reconstruction performance of OmniTokenizer on both image and video datasets. Furthermore, equipped with OmniTokenizer, both language model-based generative models and diffusion models can achieve superior generation results.

2 Related Work

2.1 Language Models for Visual Generation

Language models have emerged as powerful contenders in the visual generation field, drawing inspiration from their unparalleled success in natural language processing [34, 35, 49, 50] and visual understanding [11, 5, 47, 57, 55]. These methods [12, 7, 13, 64] recast visual synthesis as a sequence prediction problem, similar to constructing sentences in human language.

Depending on whether the tokens are predicted sequentially or in parallel, LM-based methods can be further categorized into autoregressive models [12, 63] and non-autoregressive models [7, 65]. Autoregressive (AR) models were the initial foray into visual generation, utilizing the inherent sequential nature of language models to generate images [62, 63] and videos [61, 13] in a step-wise fashion. These models, such as DALL-E [37] and its preceding variants, typically work by predicting one token at a time and are characterized by their high-quality outputs and precise control over the generation process. VAR [46] redefines the autoregressive learning framework on images as a coarse-to-fine "next-scale prediction" paradigm. Non-autoregressive (Non-AR) models, on the other hand, have been developed to allow for a faster generation process by predicting multiple tokens independently and in parallel. Models like MaskGIT [7] leverage this parallelism to significantly reduce generation time while maintaining high fidelity in synthesized images. The non-AR approaches have also demonstrated promise in video generation, as exemplified by the MAGVIT series [64, 65]. Both AR and non-AR methods have significantly advanced the field of visual generation, offering novel methods to synthesize high-quality images and videos.

2.2 Diffusion Models for Visual Generation

Diffusion models [17, 31, 3, 60] represent an alternative avenue for visual generation, relying on a probabilistic process that iteratively denoises a random signal into structured images or videos. These models stand out for their flexibility in generating visual outputs that not only exhibit coherent global structures but are also rich with intricate textures [30, 32]. Unlike language models that discretize visual inputs as latent codes, diffusion models directly generate visual samples in continuous pixel space [43, 10]. While effective, this approach demands significant computational resources given the high dimensionality of visual data.

The advent of latent diffusion models (LDMs) [39] seeks to mitigate these issues by compressing the high-dimensional visual data into latent space with a pretrained Variational Autoencoder (VAE) [25, 39]. LDM preserves the desirable properties of pixel-space diffusion models, such as high-quality image synthesis and the ability to incorporate conditional information, while drastically reducing the training and sampling overhead. After that, the rise of LDMs [69, 33, 32, 28] continues to push visual generation toward higher quality, larger resolution, and more complex scenes.

3 Methodology

3.1 Joint Image and Video Tokenization

Figure 1: Architecture of OmniTokenizer, which consists of patch embedding layers, and separate spatial-temporal attention blocks. To obtain the latent representations, OmniTokenizer-VQVAE looks up a codebook to quantize the encoder embeddings, while OmniTokenizer-VAE samples from a Gaussian distribution. We omit the decoder and only show the tokenization process.

We aim to enable image and video tokenization in a unified framework and achieve mutual benefits between them. To accomplish this, we employ a transformer-based architecture with decoupled spatial and temporal blocks (Sec. 3.1.1). Complementing this, we also propose a progressive training strategy consisting of two consecutive stages to learn the visual encoding in an incremental way (Sec. 3.1.2). The overall framework of our method is illustrated in Figure 1.

3.1.1 Space-Time Transformer

Patchify. Given a visual input $x\in\mathbb{R}^{(1+T)\times H\times W\times 3}$, where $(1+T)$ is the number of frames ($T=0$ for image) and $H\times W$ denotes the spatial resolution, we always process the first frame $x_{0}\in\mathbb{R}^{1\times H\times W\times 3}$ and following frames $x_{1:T}\in\mathbb{R}^{T\times H\times W\times 3}$ separately for the joint encoding of videos and static images [65]. Specifically, both $x_{0}$ and $x_{1:T}$ are split into non-overlapping patches, with a patch size of $p\times p$ and $t\times p\times p$, respectively. After that, we project the image and video patches with two linear layers, obtaining the patch embeddings $e_{0}\in\mathbb{R}^{L_{1}\times c}$ and $e_{1:T}\in\mathbb{R}^{L_{2}\times c}$, where $L_{1}=\frac{H}{p}\times\frac{W}{p}$ and $L_{2}=\frac{T}{t}\times\frac{H}{p}\times\frac{W}{p}$. $e_{0}$ and $e_{1:T}$ are then concatenated along the sequence dimension as the spatial-temporal embedding $e$. In this way, we compress the input resolution from $(1+T)\times H\times W$ to $(1+\frac{T}{t})\times\frac{H}{p}\times\frac{W}{p}$.
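
As a rough illustration of the shape arithmetic above, the following sketch computes the token counts $L_1$ and $L_2$ and the compressed latent grid. The function name `latent_grid` is hypothetical, and the values p = 8 and t = 4 are assumptions inferred from the 32 × 32 latent shape reported for 256 × 256 images and the 4 × 8 × 8 video compression listed in the experiments; the 17-frame clip length is taken from the implementation details.

```python
def latent_grid(T, H, W, p=8, t=4):
    """Token counts for an input with (1 + T) frames of size H x W.

    p and t are the spatial and temporal patch sizes (assumed values).
    Returns (L1, L2, grid), where grid is the compressed resolution
    (1 + T/t, H/p, W/p).
    """
    if H % p or W % p or T % t:
        raise ValueError("H, W must be divisible by p and T by t")
    h, w = H // p, W // p
    L1 = h * w                 # tokens from the first frame x_0
    L2 = (T // t) * h * w      # tokens from the remaining frames x_{1:T}
    return L1, L2, (1 + T // t, h, w)


# A single 256 x 256 image (T = 0) versus a 17-frame clip (T = 16).
print(latent_grid(0, 256, 256))    # (1024, 0, (1, 32, 32))
print(latent_grid(16, 256, 256))   # (1024, 4096, (5, 32, 32))
```

An image is simply the T = 0 case, which is why the same patch embedding path can serve both modalities.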

Encoder and Decoder. To achieve better compatibility with image and video inputs, we adopt a spatial-temporal factorized encoder consisting of separate spatial and temporal blocks. In the spatial dimension, window attention [27] is employed as it exhibits superior local aggregation capability and efficiency. In the temporal dimension, we use causal attention to align with the autoregressive visual generation in the second stage. Next, the latent code $z$ is obtained by looking up a codebook [52] for the LM tokenizer (i.e., quantization in VQVAE), or by sampling from a Gaussian distribution for the diffusion tokenizer.
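
The two attention patterns can be sketched as simple masks and groupings, as below. This is a minimal illustration rather than the released implementation: `causal_mask` and `window_ids` are hypothetical helper names, non-overlapping square windows of 8 × 8 tokens are assumed, and any further details of the window attention in [27] are omitted.

```python
def causal_mask(num_frames):
    """mask[i][j] is True when temporal position i may attend to position j."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def window_ids(h, w, window=8):
    """Assign each token on an h x w grid to a non-overlapping window.

    Tokens attend only to tokens that share the same window id.
    """
    if h % window or w % window:
        raise ValueError("grid size must be divisible by the window size")
    per_row = w // window
    return [[(r // window) * per_row + (c // window) for c in range(w)]
            for r in range(h)]


mask = causal_mask(5)        # five temporal positions
ids = window_ids(32, 32)     # 32 x 32 latent grid, 8 x 8 windows (assumed)
print(mask[2])               # [True, True, True, False, False]
print(len({i for row in ids for i in row}))  # 16
```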

The architecture of the decoder is symmetric with the encoder. Finally, we map the spatial-temporal tokens to the pixel space with two linear projection layers without any activation function.

Figure 2: Illustration of the proposed progressive training paradigm. With this, OmniTokenizer could tokenize both image and video inputs with the same architecture and weight.

3.1.2 Progressive Training

Unlike existing image tokenizers that train on image data only [12, 62] or video tokenizers that use image counterparts as initialization [64, 65], we leverage a progressive training paradigm that involves two consecutive stages of VQ training to facilitate spatial-temporal representation learning of our LM tokenizer, OmniTokenizer-VQVAE. After this, it can be fine-tuned as a diffusion tokenizer, OmniTokenizer-VAE, with KL fine-tuning.

Two-stage VQ Training, as depicted in Figure 2, aims to learn visual reconstruction with discrete latent codes. It includes two stages: the initial stage focuses on fixed-resolution image data to lay a foundation for spatial understanding. Building upon this, the second stage introduces video data to learn the modeling of temporal dynamics alongside static image features. This image-video joint training stage is critical for the model to learn a universal embedding that accurately captures both the spatial intricacies of individual frames and the temporal relationships of sequential video data.

During both stages, the model is trained with the vector-quantization objective:

$$\mathcal{L}_{VQ}=\lambda_{1}\|\mathrm{sg}[E(e)]-z_{q}\|_{2}^{2}+\lambda_{2}\|E(e)-\mathrm{sg}[z_{q}]\|_{2}^{2}, \tag{1}$$

where $\mathrm{sg}$ denotes the stop-gradient operation, $\lambda_{1}$ and $\lambda_{2}$ are the balancing hyperparameters, and $E$ and $z_{q}$ represent the encoder of OmniTokenizer and the codebook vectors, respectively. Factorized codes and $l_{2}$-normalized codes [62] are also used to boost the codebook usage.
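
To make Eq. (1) concrete, the sketch below performs a nearest-codeword lookup and evaluates the forward value of the objective for a single embedding. It is written in plain Python for readability and is only an illustration under stated assumptions: `sq_dist`, `quantize`, and `vq_loss` are hypothetical names, the codebook is a toy one, and the stop-gradient is described in comments because it changes gradients rather than forward values.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embedding, codebook):
    """Return (index, codeword) of the nearest codebook entry."""
    index = min(range(len(codebook)), key=lambda k: sq_dist(embedding, codebook[k]))
    return index, codebook[index]


def vq_loss(embedding, codeword, lam1=1.0, lam2=1.0):
    """Forward value of Eq. (1) for a single embedding.

    In an autodiff framework the first term would treat the encoder output
    as a constant (updating only the codebook) and the second term would
    treat the codeword as a constant (updating only the encoder). Both terms
    have the same forward value.
    """
    d = sq_dist(embedding, codeword)
    return lam1 * d + lam2 * d


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]   # toy codebook
idx, z_q = quantize([0.9, 1.2], codebook)
print(idx, z_q, vq_loss([0.9, 1.2], z_q))
```

The default weights of 1 for both terms mirror the values of $\lambda_1$ and $\lambda_2$ given in the implementation details.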

KL fine-tuning. After the VQ training, we further fine-tune our model as a diffusion tokenizer (i.e., OmniTokenizer-VAE) by replacing the above $\mathcal{L}_{VQ}$ with the Kullback-Leibler (KL) loss:

$$\mathcal{L}_{KL}=\lambda_{3}D_{KL}(Q(z|x)\,\|\,P(z)), \tag{2}$$

where $P(z)$ is a Gaussian distribution and $Q(z|x)$ represents the inferred posterior of the latent code given the observed input.
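
For a diagonal Gaussian posterior and a standard normal prior, the KL term in Eq. (2) has a well-known closed form, sketched below. The text only states that $P(z)$ is Gaussian, so the standard normal prior and the mean/log-variance parameterization are assumptions; `kl_loss` is a hypothetical name, and the default weight follows the $\lambda_3$ = 1e-6 reported in the implementation details.

```python
import math


def kl_loss(mu, logvar, lam3=1e-6):
    """lam3 * KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))
    return lam3 * kl


# A posterior that already matches the prior incurs no penalty.
print(kl_loss([0.0, 0.0], [0.0, 0.0]))   # 0.0
# Moving the mean away from zero increases the penalty.
print(kl_loss([1.0, -1.0], [0.0, 0.0]))
```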

Besides $\mathcal{L}_{VQ}$ or $\mathcal{L}_{KL}$, both VQ training and KL fine-tuning also employ an $L_{2}$ reconstruction loss $\mathcal{L}_{recon}$ and a GAN loss $\mathcal{L}_{GAN}$.

3.2 Visual Generation

As mentioned in Sec. 3.1.2, after the progressive training and KL fine-tuning, we can obtain two tokenizers: OmniTokenizer-VQVAE and OmniTokenizer-VAE which separately encode the visual inputs into latent codes in a discrete codebook or the continuous latent space. With this, we further train language models or diffusion models for visual generation.

Language model-based generation approaches formulate visual synthesis as a token prediction problem. Specifically, after OmniTokenizer-VQVAE tokenizes image or video inputs into a sequence of discrete latent codes, we first flatten them in raster order [8, 12] to obtain the code indices $y$. Then a transformer language model [34] is trained to maximize the log-likelihood between the predicted tokens $\hat{y}$ and the target tokens $y$ with a cross-entropy loss:

$$\mathrm{maximize}\sum_{i=1}^{L}\log P(\hat{y}_{i}\,|\,c,y_{1:i-1};\theta), \tag{3}$$

where $c$ represents the condition (e.g., the label for class-conditional image and video generation), $\theta$ denotes the learnable parameters of the language model, and $P$ and $L$ denote the softmax probability and the length of $y$, respectively. During inference, we predict each token according to the model likelihood.
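
The objective in Eq. (3) can be illustrated with a small, framework-free sketch: a grid of code indices is flattened in raster order, and the summed log-probability of the target tokens is computed from per-position logits. The names `raster_flatten`, `log_softmax`, and `sequence_log_likelihood` are hypothetical, and the logits are toy values rather than the outputs of a transformer.

```python
import math


def raster_flatten(grid):
    """Flatten a 2D grid of code indices row by row (raster order)."""
    return [idx for row in grid for idx in row]


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sequence_log_likelihood(logits_per_step, targets):
    """Sum over i of log P(y_i | c, y_{1:i-1}); logits_per_step[i] is assumed
    to have been computed from the condition and the previous tokens."""
    return sum(log_softmax(logits)[y] for logits, y in zip(logits_per_step, targets))


y = raster_flatten([[2, 0], [1, 2]])            # [2, 0, 1, 2]
uniform = [[0.0, 0.0, 0.0]] * len(y)            # a model with no preference
print(y, sequence_log_likelihood(uniform, y))   # log-likelihood is 4 * log(1/3)
```

Minimizing the cross-entropy loss is equivalent to maximizing this sum, up to normalization by the sequence length.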

Latent diffusion models (LDMs) [39] perform the diffusion process in the latent space to enable high-quality image synthesis with improved computational efficiency. Specifically, with the 2D latent representation from OmniTokenizer-VAE, the forward diffusion process gradually applies Gaussian noise to the latent code to generate a perturbed sample, while a diffusion model is trained to predict the noise that has been added. During inference, the trained diffusion model can generate a coherent visual sample from noise by iteratively reversing the noising process.
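
The noising step can be sketched as follows. The text does not specify the noise schedule at this point, so the DDPM-style linear beta schedule, the number of steps, and the helper names `alpha_bars` and `add_noise` are assumptions made purely for illustration.

```python
import math
import random


def alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    out, prod = [], 1.0
    for s in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
        out.append(prod)
    return out


def add_noise(z0, t, abar, rng=random):
    """Sample z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I).

    Returns (z_t, eps); eps is the regression target for a noise-prediction model.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a, b = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    return [a * z + b * e for z, e in zip(z0, eps)], eps


abar = alpha_bars()
z_t, eps = add_noise([0.5, -1.0, 2.0], t=500, abar=abar, rng=random.Random(0))
print(abar[0], abar[-1])   # close to 1 at the first step, close to 0 at the last
```

Here the latent is a short list of floats; in practice it would be the latent representation produced by OmniTokenizer-VAE.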

Table 1: Reconstruction FID on ImageNet validation split, CelebA-HQ, and FFHQ. denotes models trained with Gumbel-Softmax reparameterization [37]. For our method, the results that are jointly trained with UCF-101 are reported.

| Method | Dataset | Lat. shape | Codebook | rFID |
| --- | --- | --- | --- | --- |
| ViT-VQGAN [62] | CelebAHQ | 32 × 32 | 8192 | 4.66 |
| Ours-VQVAE | CelebAHQ | 32 × 32 | 8192 | 1.93 |
| ViT-VQGAN [62] | FFHQ | 32 × 32 | 8192 | 3.13 |
| Ours-VQVAE | FFHQ | 32 × 32 | 8192 | 1.91 |
| DALL-E [37] | ImageNet | 32 × 32 | 8192 | 32.01 |
| VQGAN [12] | ImageNet | 32 × 32 | 8192 | 1.49 |
| ViT-VQGAN [62] | ImageNet | 32 × 32 | 8192 | 1.28 |
| Ours-VQVAE | ImageNet | 32 × 32 | 8192 | 1.11 |
| Ours-VAE | ImageNet | 32 × 32 | ∞ | 0.69 |

Table 2: Reconstruction FVD on UCF-101 and Moments-in-Time val. split. denotes training image tokenizer with video loss.

| Method | Type | UCF | MiT |
| --- | --- | --- | --- |
| MaskGIT [7] | Img | 240 | - |
| VQGAN [12] | Img | 299 | 306 |
| ViT-VQGAN [62] | Img | - | 167 |
| ViT-VQGAN [62] | Img | - | 173 |
| CViViT [53] | Vid | - | 66 |
| TATS [13] | Vid | 162 | - |
| MAGVIT [64] | Vid | 58 | - |
| Ours-VQVAE | Joint | 42 | 20 |
| Ours-VAE | Joint | 23 | 13 |

4 Experiments

Datasets. We evaluate the visual tokenization performance of OmniTokenizer on both image and video datasets, including ImageNet [9], CelebA-HQ [21], FFHQ [22], Kinetics [23, 6], UCF-101 [44], Moments-in-Time (MiT) [29], and Something-Something v2 (SSV2) [15]. We adopt a subset of the above datasets for visual generation to compare with previous works [12, 62, 53, 13].

Implementation Details. OmniTokenizer adopts a decoupled spatial-temporal architecture consisting of 4 window attention-based spatial layers (window size = 8) and 4 causal attention-based temporal layers. The hidden dimension is 512 and the latent dimension is 8, following ViT-VQGAN [62]. $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are set to 1, 1, and 1e-6, respectively. As mentioned in Sec. 3.1.2, the training of OmniTokenizer follows a progressive training strategy, where both stages last 500K iterations. The learning rate is warmed up to 1e-3 and decayed to 0 using a cosine scheduler. Adam [24] is employed for optimization ($\beta_{1}$ = 0.9 and $\beta_{2}$ = 0.99). During the image training stage, we train the model with a fixed image resolution of 256×256. For the joint training stage, we feed the model image and video data iteratively, with a video sequence length of 17 frames. The spatial resolutions are randomly chosen from 128, 192, 256, 320, and 384. Only random horizontal flipping is adopted for data augmentation. We train our model using 8 NVIDIA A100 GPUs for 2 weeks. Unless otherwise stated, the results reported in this paper are obtained with models jointly trained on ImageNet and UCF-101.
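
The learning-rate schedule and the multi-resolution sampling described above might look like the sketch below. The peak rate (1e-3), the 500K iterations per stage, the 17-frame clips, and the resolution set come from the text; the warmup length, the linear warmup shape, the strict image/video alternation, and the names `learning_rate` and `joint_batch_spec` are assumptions.

```python
import math
import random

RESOLUTIONS = (128, 192, 256, 320, 384)


def learning_rate(step, total_steps=500_000, peak=1e-3, warmup_steps=5_000):
    """Linear warmup to `peak`, then cosine decay to 0 (warmup length assumed)."""
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak * (1.0 + math.cos(math.pi * min(1.0, progress)))


def joint_batch_spec(step, rng=random):
    """Describe the batch for one joint-training step (alternation assumed)."""
    modality = "image" if step % 2 == 0 else "video"
    frames = 1 if modality == "image" else 17
    return modality, frames, rng.choice(RESOLUTIONS)


print(learning_rate(4_999), learning_rate(500_000))
print(joint_batch_spec(1, random.Random(0)))
```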

We try both the language models and diffusion models for visual generation with OmniTokenizer as the tokenizer. The configuration for the language model follows VQGAN [12], and for a fair comparison with previous methods, we also scale up the model size by increasing the hidden dimension to 1535, following ViT-VQGAN [62]. The training of image and video diffusion transformers follows DiT [32] and Latte [28], respectively.

4.1 Visual Tokenization

We first evaluate the visual tokenization capability of OmniTokenizer on ImageNet and two high-quality face datasets, CelebA-HQ and FFHQ. Reconstruction FID is used following previous methods [12, 62]. We can observe from Table 1 that, with the same compression rate and codebook size, OmniTokenizer outperforms existing methods by a large margin on all these datasets. In particular, OmniTokenizer-VQVAE achieves 1.11 FID on ImageNet, beating ViT-VQGAN, the previous state-of-the-art method, by 13%. When fine-tuned as OmniTokenizer-VAE, the FID is further reduced to 0.69. We hypothesize that the improvement arises because KL training provides smoother gradients than VQ training and avoids the loss of information in the quantization process.
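
The relative improvement quoted above follows directly from the ImageNet rFID values in Table 1, as the short calculation below shows (`relative_reduction` is a hypothetical helper name).

```python
def relative_reduction(baseline, ours):
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    return (baseline - ours) / baseline


# ImageNet rFID: ViT-VQGAN 1.28 -> OmniTokenizer-VQVAE 1.11
print(f"{relative_reduction(1.28, 1.11):.1%}")   # 13.3%
```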

In addition, we also conduct video reconstruction experiments and report the results in Table 2. We can see that on both UCF-101 and Moments-in-Time datasets, OmniTokenizer achieves the best results. The video reconstruction results on more datasets can be found in the ablation study.

Table 3: Comparisons of class-conditional results on ImageNet 256×256 using language models. “↓” (“↑”) indicates lower (higher) is better. Metrics include Fréchet inception distance (FID) and inception score (IS). NAR and AR: non-autoregressive and autoregressive. : taken from MaskGIT [7].

| Type | Method | #Param | FID↓ | IS↑ |
| --- | --- | --- | --- | --- |
| AR | VQGAN [12] | 227M | 18.65 | 80.4 |
| AR | RQ-Transformer [26] | 488M | 15.72 | 86.8 |
| AR | Ours | 227M | 10.13 | 94.5 |
| AR | VQVAE-2 [38] | 13.5B | 31.11 | ∼45 |
| AR | VQGAN [12] | 1.4B | 15.78 | 74.3 |
| AR | RQ-Transformer [26] | 821M | 13.11 | 104.3 |
| AR | ViT-VQGAN [62] | 650M | 8.81 | 110.8 |
| AR | Ours | 650M | 7.45 | 146.7 |

Table 4: Comparisons of class-conditional generation results on UCF-101 and frame prediction results on Kinetics-600. Fréchet video distance (FVD) is reported.

| Type | Method | #Param | FVD↓ (UCF) | FVD↓ (K600) |
| --- | --- | --- | --- | --- |
| NAR | Phenaki [53] | 227M | - | 36.4 |
| NAR | MAGVIT [64] | 306M | 76 | 9.9 |
| NAR | MAGVITv2 [65] | 307M | 58 | 4.3 |
| AR | LVT [36] | 50M | - | 224.7 |
| AR | ViTrans [59] | 373M | - | 170.0 |
| AR | CogVideo [19] | 9.4B | 626 | 109.2 |
| AR | ViVQVAE [54] | NA | - | 64.3 |
| AR | TATS [13] | 321M | 332 | - |
| AR | Ours | 227M | 314 | 34.2 |
| AR | Ours | 650M | 191 | 32.9 |

Table 5: Class-conditional results on ImageNet 256×256 using GAN and diffusion models.

| Method | FID↓ | IS↑ | Prec↑ | Rec↑ |
| --- | --- | --- | --- | --- |
| BigGAN [4] | 6.95 | 171.4 | 0.87 | 0.28 |
| StyleGAN-XL [40] | 2.30 | 265.12 | 0.78 | 0.53 |
| ADM [10] | 10.94 | 100.98 | 0.69 | 0.63 |
| LDM-4 | 10.56 | 103.49 | 0.71 | 0.62 |
| CDM [18] | 4.88 | 158.71 | - | - |
| DiT-XL/2 [32] | 9.62 | 121.50 | 0.67 | 0.67 |
| DiT-XL/2-CFG [32] | 2.27 | 278.24 | 0.83 | 0.57 |
| Ours-DiT-XL/2 | 12.25 | 109.94 | 0.73 | 0.64 |
| Ours-DiT-XL/2-CFG | 3.48 | 244.23 | 0.89 | 0.52 |

Table 6: Comparisons of unconditional results on UCF-101 256×256 using GAN and diffusion models.

| Method | Lat. Comp. | FVD↓ |
| --- | --- | --- |
| MoCoGAN [51] | - | 2886.9 |
| VideoGPT [61] | 4 × 4 × 4 | 2880.6 |
| MoCoGAN-HD [48] | - | 1729.6 |
| DIGAN [67] | - | 1630.2 |
| StyleGAN-V [42] | - | 1431.0 |
| PVDM [66] | 1 × 4 × 4 | 1141.9 |
| MoStGAN-V [41] | - | 1380.3 |
| Latte [28] | 1 × 8 × 8 | 478.0 |
| Ours-Latte | 4 × 8 × 8 | 209.2 |

4.2 Visual Generation with AutoRegressive Transformers

Using OmniTokenizer-VQVAE for tokenization, we train language models to predict latent code indices in the codebook in an autoregressive manner for image and video synthesis. The class-conditional 256×256 generation results on ImageNet, presented in Table 3, demonstrate that our model surpasses existing autoregressive image generation methods by significant margins. Remarkably, with a model comprising only 227M parameters, we achieve 10.13 FID and 94.5 IS, outperforming the VQGAN [12] model of the same size (18.65 FID and 80.4 IS) by about 46% and 18%, respectively. Upon scaling up to a larger model with 650M parameters, the FID is further reduced to 7.45.

In the domain of video generation, as illustrated in Table 4, our model beats the previous state-of-the-art autoregressive model, TATS [13], for class-conditional video generation on UCF-101 with a lower FVD (314 vs. 332 for our 227M model, and 191 for our 650M model). Moreover, for frame prediction tasks on the Kinetics-600 dataset, our model not only achieves the best performance compared to other autoregressive models but also surpasses Phenaki [53], a non-autoregressive method.

4.3 Visual Generation with Diffusion Models

In parallel to language model-based methods, diffusion models [17, 43, 10], especially latent diffusion models [39], are another promising technique for visual synthesis. Therefore, we also evaluate the effectiveness of our method on diffusion model-based image and video generation with OmniTokenizer-VAE as the tokenizer. Here we employ the same architectures as DiT [32] and Latte [28] and replace their VAE [1] with OmniTokenizer-VAE. DiT [32] first applies the transformer architecture to latent diffusion models and exhibits appealing scalability properties. Following this, Latte [28] extends the transformer to the latent video diffusion model by alternating spatial and temporal attention blocks.

The experimental results in Table 5 indicate that, when equipped with OmniTokenizer-VAE, DiT-XL/2 with classifier-free guidance (CFG) achieves a competitive inception score of 244.23 and the highest precision (0.89) among the compared methods, underscoring the efficacy of our tokenizer within diffusion model frameworks for image synthesis. For unconditional video generation on the UCF-101 dataset (Table 6), our method not only offers the advantage of reduced training costs by realizing a higher compression rate, but also exhibits a much lower FVD than previous methods.

4.4 Ablation Study

Training Paradigms. To verify the effect of the proposed progressive training paradigm, we compare different training strategies and show the results in Table 7. The results in lines 3-4 and line 6 indicate that joint training outperforms video training on all video datasets remarkably, demonstrating the importance of image pre-training for the following video training. In addition, although joint training on a fixed resolution (line 5) could achieve much better results on video datasets than video training, the reconstruction FID on ImageNet gets worse, i.e., from 1.28 to 1.35. Comparatively, the progressive training paradigm leads to the best performance on video datasets and surprisingly improves the image reconstruction performance.

Table 7: Comparison of rFID on ImageNet and rFVD on various video datasets.

| # | Method | ImageNet 256 | K600 128 | K600 256 | UCF 128 | UCF 256 | MiT 128 | MiT 256 | SSV2 128 | SSV2 256 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Ours-Image (Fix) | 1.28 | - | - | - | - | - | - | - | - |
| 2 | Ours-Image (Multi) | 1.44 | - | - | - | - | - | - | - | - |
| 3 | Ours-Video (Fix) | - | 211.51 | 48.89 | 214.83 | 118.52 | 211.07 | 64.47 | 162.53 | 22.82 |
| 4 | Ours-Video (Multi) | - | 194.51 | 54.89 | 211.83 | 114.52 | 238.07 | 26.47 | 193.35 | 38.82 |
| 5 | Ours-Joint (Fix) | 1.35 | 113.51 | 26.89 | 186.83 | 62.52 | 140.07 | 21.47 | 108.35 | 20.82 |
| 6 | Ours-Joint (Multi) | 1.11 | 84.38 | 25.97 | 107.80 | 42.35 | 59.47 | 19.87 | 84.78 | 20.30 |

Architecture and Efficiency Analysis. In Table 8, we compare the inference cost (GFLOPs, i.e., giga floating-point operations, a hardware-independent metric) and reconstruction FID of different architectures on ImageNet. Compared to spatial-temporal joint attention (JointAttn) and decoupled plain attention (DePlainAttn), our decoupled architecture with spatial window attention and temporal causal attention leads to the lowest inference overhead and the best rFID.
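
A back-of-the-envelope count of query-key pairs helps explain why the decoupled design is cheaper. The sketch below is not a GFLOPs computation and will not reproduce the reported costs; it only counts attention pairs for one layer, assuming a 5 × 32 × 32 latent grid (a 17-frame 256 × 256 clip) and windows of 8 × 8 tokens. All function names are hypothetical.

```python
def joint_pairs(frames, tokens_per_frame):
    """Full attention over every space-time token."""
    n = frames * tokens_per_frame
    return n * n


def decoupled_plain_pairs(frames, tokens_per_frame):
    """Full spatial attention per frame plus full temporal attention per location."""
    return frames * tokens_per_frame ** 2 + tokens_per_frame * frames ** 2


def window_causal_pairs(frames, tokens_per_frame, window_tokens=64):
    """Windowed spatial attention plus causal temporal attention."""
    spatial = frames * tokens_per_frame * window_tokens
    temporal = tokens_per_frame * frames * (frames + 1) // 2
    return spatial + temporal


F, S = 5, 32 * 32
print(joint_pairs(F, S), decoupled_plain_pairs(F, S), window_causal_pairs(F, S))
# 26214400 5268480 343040
```

The ordering of these counts matches the qualitative trend in the comparison: joint attention is the most expensive, and window plus causal attention is the cheapest.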

Table 8: Inference cost and rFID. iGFLOPs/vGFLOPs: GFLOPs for image and video inputs.
Method       iGFLOPs  vGFLOPs   rFID
VQGAN        167      167 × 17  1.49
JointAttn    72       358       1.89
DePlainAttn  72       299       1.33
Ours         51       262       1.28
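As a quick check of the cost comparison in Table 8, the sketch below expands the "167 × 17" entry and expresses each method's video cost relative to ours. We assume the factor of 17 means that the image-only VQGAN is run once per frame on a 17-frame clip; the table does not state this explicitly, so treat it as our reading.

```python
# Entries copied from Table 8: (image GFLOPs, video GFLOPs, rFID).
# Assumption: "167 x 17" means VQGAN is applied independently to 17 frames.
FRAMES_PER_CLIP = 17  # assumed from the "x 17" factor in the table

table8 = {
    "VQGAN": (167, 167 * FRAMES_PER_CLIP, 1.49),
    "JointAttn": (72, 358, 1.89),
    "DePlainAttn": (72, 299, 1.33),
    "Ours": (51, 262, 1.28),
}

ours_video_gflops = table8["Ours"][1]

# Video inference cost of each method as a multiple of our video cost.
video_cost_ratio = {
    name: v_gflops / ours_video_gflops
    for name, (_, v_gflops, _) in table8.items()
}

for name, (i_gflops, v_gflops, rfid) in table8.items():
    print(
        f"{name:<12} iGFLOPs={i_gflops:>4} vGFLOPs={v_gflops:>5} "
        f"rFID={rfid:.2f} video cost vs. ours={video_cost_ratio[name]:.2f}x"
    )
```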
Figure 3: Ablation on compression rate / latent dimension.

Latent Dimension and Compression Rate. Figure 3 shows the reconstruction FID with different compression rates and latent dimensions. We can observe that increasing the compression rate always hurts the reconstruction performance since more information is lost during the encoding process. Moreover, latent dimension = 8 leads to the best trade-off between rFID and codebook usage.

4.5 Visualizations

Visual Reconstruction. We visualize the reconstruction results by OmniTokenizer, VQGAN [12] and TATS [13] in Figure 4. Our method works significantly better than baselines for face and text reconstruction, which are typically regarded as the most challenging reconstruction cases.

Figure 4: Image and video reconstruction results of VQGAN [12], TATS [13], and our method.

Class-conditional Image and Video Generation. The class-conditional generation results are shown in Figures 5-8. Our model synthesizes visually coherent and contextually accurate images and videos, showcasing the strengths of OmniTokenizer in facilitating generative tasks.

Figure 5: Class-conditional ImageNet generation results using language models, with OmniTokenizer-VQVAE as tokenizer.
Figure 6: Class-cond. ImageNet generation using diffusion models, OmniTokenizer-VAE as tokenizer.
Figure 7: Class-cond. UCF-101 generation using LMs, OmniTokenizer-VQVAE as tokenizer.
Figure 8: Unconditional UCF-101 generation using diffusion models (and OmniTokenizer-VAE).

Frame Prediction and Arbitrary Long Video Generation. The frame prediction results by our method are presented in Figure 9, from which we can see that our model could forecast subsequent frames with high clarity and temporal coherence. Moreover, we exhibit the potential of our method for generating videos of arbitrary lengths by employing a cyclical process, where each newly generated frame is recursively used as a condition for the subsequent frame generation.
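The cyclical process described above can be sketched as a simple rollout loop. This is a minimal illustration rather than our released implementation: `predict_next` is a hypothetical stand-in for the tokenizer plus generator, and the `window` parameter (how many of the most recent frames are fed back as the condition) is an assumption, since the text does not fix it.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def rollout(
    context: Sequence[Frame],
    predict_next: Callable[[Sequence[Frame]], Frame],
    num_new_frames: int,
    window: int,
) -> List[Frame]:
    """Extend `context` by `num_new_frames`, one frame at a time.

    `predict_next` (hypothetical) maps the conditioning frames to the next
    frame. Each generated frame is appended and becomes part of the condition
    for the following step, so the video can be extended indefinitely.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if not context:
        raise ValueError("context must contain at least one frame")

    frames = list(context)
    for _ in range(num_new_frames):
        condition = frames[-window:]  # most recent frames only (assumed)
        frames.append(predict_next(condition))
    return frames


# Toy usage: integers stand in for frames, and the "model" adds 1 to the
# last conditioning frame.
toy_video = rollout([0, 1, 2], lambda cond: cond[-1] + 1, num_new_frames=5, window=2)
print(toy_video)
```

Limiting the condition to a fixed window keeps the per-step cost constant regardless of how long the generated video becomes.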

Figure 9: Visualization of the frame prediction results by OmniTokenizer. The frames marked in red are given during inference, while the following frames are generated.

5 Conclusion and Discussion of Broader Impact

This paper presented OmniTokenizer, a transformer-based tokenizer for joint image-video tokenization. OmniTokenizer adopts a spatial-temporal decoupled architecture, employing window attention in the spatial dimension and causal attention in the temporal dimension. To realize the synergy between image and video data, we proposed a progressive training strategy that starts with image training on a fixed resolution to acquire the spatial encoding capability and then incorporates video data for multi-resolution joint training to learn temporal modeling. Extensive experimental results substantiate the state-of-the-art performance of OmniTokenizer in visual reconstruction tasks. Further, when equipped with OmniTokenizer, both language model-based methods and diffusion models achieve superior visual generation results.

Previous literature [20, 16, 68, 46, 45] has revealed that the performance of transformer models improves significantly as the model size increases, a trend known as the scaling law. In the future, we will explore scaling the model capacity of OmniTokenizer for more advanced tokenization performance.

References

  • [1] S. AI. Stable diffusion v1-4. https://huggingface.co/CompVis/stable-diffusion-v1-4, 2022.
  • [2] G. Bertasius, H. Wang, and L. Torresani. Is space-time attention all you need for video understanding? In ICML, 2021.
  • [3] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
  • [4] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. In ICLR, 2019.
  • [5] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • [6] J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
  • [7] H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman. Maskgit: Masked generative image transformer. In CVPR, 2022.
  • [8] M. Chen, A. Radford, R. Child, J. Wu, H. Jun, D. Luan, and I. Sutskever. Generative pretraining from pixels. In ICML, 2020.
  • [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [10] P. Dhariwal and A. Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, 2021.
  • [11] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  • [12] P. Esser, R. Rombach, and B. Ommer. Taming transformers for high-resolution image synthesis. In CVPR, 2021.
  • [13] S. Ge, T. Hayes, H. Yang, X. Yin, G. Pang, D. Jacobs, J.-B. Huang, and D. Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer. In ECCV, 2022.
  • [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. Communications of the ACM, 2020.
  • [15] R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, 2017.
  • [16] T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.
  • [17] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  • [18] J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans. Cascaded diffusion models for high fidelity image generation. JMLR, 2022.
  • [19] W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. In ICLR, 2023.
  • [20] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  • [21] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • [22] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In CVPR, 2019.
  • [23] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [24] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [25] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
  • [26] D. Lee, C. Kim, S. Kim, M. Cho, and W.-S. Han. Autoregressive image generation using residual quantization. In CVPR, 2022.
  • [27] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
  • [28] X. Ma, Y. Wang, G. Jia, X. Chen, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048, 2024.
  • [29] M. Monfort, A. Andonian, B. Zhou, K. Ramakrishnan, S. A. Bargal, T. Yan, L. Brown, Q. Fan, D. Gutfreund, C. Vondrick, et al. Moments in time dataset: one million videos for event understanding. TPAMI, 2019.
  • [30] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. PMLR, 2022.
  • [31] A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In ICML, 2021.
  • [32] W. Peebles and S. Xie. Scalable diffusion models with transformers. In CVPR, 2023.
  • [33] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  • [34] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al. Improving language understanding by generative pre-training. OpenAI Blog, 2018.
  • [35] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 2019.
  • [36] R. Rakhimov, D. Volkhonskiy, A. Artemov, D. Zorin, and E. Burnaev. Latent video transformer. arXiv preprint arXiv:2006.10704, 2020.
  • [37] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever. Zero-shot text-to-image generation. In ICML, 2021.
  • [38] A. Razavi, A. Van den Oord, and O. Vinyals. Generating diverse high-fidelity images with vq-vae-2. In NeurIPS, 2019.
  • [39] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  • [40] A. Sauer, K. Schwarz, and A. Geiger. Stylegan-xl: Scaling stylegan to large diverse datasets. In ACM SIGGRAPH, 2022.
  • [41] X. Shen, X. Li, and M. Elhoseiny. Mostgan-v: Video generation with temporal motion styles. In CVPR, 2023.
  • [42] I. Skorokhodov, S. Tulyakov, and M. Elhoseiny. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In CVPR, 2022.
  • [43] J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In ICLR, 2021.
  • [44] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [45] P. Sun, Y. Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
  • [46] K. Tian, Y. Jiang, Z. Yuan, B. Peng, and L. Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. arXiv preprint arXiv:2404.02905, 2024.
  • [47] R. Tian, Z. Wu, Q. Dai, H. Hu, Y. Qiao, and Y.-G. Jiang. Resformer: Scaling vits with multi-resolution training. In CVPR, 2023.
  • [48] Y. Tian, J. Ren, M. Chai, K. Olszewski, X. Peng, D. N. Metaxas, and S. Tulyakov. A good image generator is what you need for high-resolution video synthesis. In ICLR, 2021.
  • [49] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • [50] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • [51] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. In CVPR, 2018.
  • [52] A. Van Den Oord, O. Vinyals, et al. Neural discrete representation learning. In NeurIPS, 2017.
  • [53] R. Villegas, M. Babaeizadeh, P.-J. Kindermans, H. Moraldo, H. Zhang, M. T. Saffar, S. Castro, J. Kunze, and D. Erhan. Phenaki: Variable length video generation from open domain textual descriptions. In ICLR, 2022.
  • [54] J. Walker, A. Razavi, and A. v. d. Oord. Predicting video with vqvae. arXiv preprint arXiv:2103.01950, 2021.
  • [55] J. Wang, D. Chen, C. Luo, B. He, L. Yuan, Z. Wu, and Y.-G. Jiang. Omnivid: A generative framework for universal video understanding. In CVPR, 2024.
  • [56] J. Wang, D. Chen, Z. Wu, C. Luo, L. Zhou, Y. Zhao, Y. Xie, C. Liu, Y.-G. Jiang, and L. Yuan. Omnivl: One foundation model for image-language and video-language tasks. NeurIPS, 2022.
  • [57] J. Wang, Z. Wu, J. Chen, X. Han, A. Shrivastava, S.-N. Lim, and Y.-G. Jiang. Objectformer for image manipulation detection and localization. In CVPR, 2022.
  • [58] R. Wang, D. Chen, Z. Wu, Y. Chen, X. Dai, M. Liu, Y.-G. Jiang, L. Zhou, and L. Yuan. Bevt: Bert pretraining of video transformers. In CVPR, 2022.
  • [59] D. Weissenborn, O. Täckström, and J. Uszkoreit. Scaling autoregressive video models. In ICLR, 2020.
  • [60] Z. Xing, Q. Dai, H. Hu, Z. Wu, and Y.-G. Jiang. Simda: Simple diffusion adapter for efficient video generation. In CVPR, 2024.
  • [61] W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
  • [62] J. Yu, X. Li, J. Y. Koh, H. Zhang, R. Pang, J. Qin, A. Ku, Y. Xu, J. Baldridge, and Y. Wu. Vector-quantized image modeling with improved vqgan. In ICLR, 2022.
  • [63] J. Yu, Y. Xu, J. Y. Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. In ICLR, 2024.
  • [64] L. Yu, Y. Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang, Y. Hao, I. Essa, et al. Magvit: Masked generative video transformer. In CVPR, 2023.
  • [65] L. Yu, J. Lezama, N. B. Gundavarapu, L. Versari, K. Sohn, D. Minnen, Y. Cheng, A. Gupta, X. Gu, A. G. Hauptmann, et al. Language model beats diffusion–tokenizer is key to visual generation. In ICLR, 2024.
  • [66] S. Yu, K. Sohn, S. Kim, and J. Shin. Video probabilistic diffusion models in projected latent space. In CVPR, 2023.
  • [67] S. Yu, J. Tack, S. Mo, H. Kim, J. Kim, J.-W. Ha, and J. Shin. Generating videos with dynamics-aware implicit generative adversarial networks. In ICLR, 2022.
  • [68] X. Zhai, A. Kolesnikov, N. Houlsby, and L. Beyer. Scaling vision transformers. In CVPR, 2022.
  • [69] L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In CVPR, 2023.