What Matters in Detecting AI-Generated Videos like Sora?

Chirui Chang1    Zhengzhe Liu2    Xiaoyang Lyu1    Xiaojuan Qi1   
1The University of Hong Kong    2The Chinese University of Hong Kong   
Abstract

Recent advancements in diffusion-based video generation have showcased remarkable results, yet the gap between synthetic and real-world videos remains under-explored. In this study, we examine this gap from three fundamental perspectives: appearance, motion, and geometry, comparing real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion. To achieve this, we train three classifiers using 3D convolutional networks, each targeting distinct aspects: vision foundation model features for appearance, optical flow for motion, and monocular depth for geometry. Each classifier exhibits strong performance in fake video detection, both qualitatively and quantitatively. This indicates that AI-generated videos are still easily detectable, and a significant gap between real and fake videos persists. Furthermore, using Grad-CAM, we pinpoint systematic failures of AI-generated videos in appearance, motion, and geometry. Finally, we propose an Ensemble-of-Experts model that integrates appearance, optical flow, and depth information for fake video detection, resulting in enhanced robustness and generalization ability. Our model is capable of detecting videos generated by Sora with high accuracy, even without exposure to any Sora videos during training. This suggests that the gap between real and fake videos generalizes across various video generative models.

1 Introduction

Video diffusion models, such as Stable Video Diffusion (SVD) [1] and Sora [3], have recently demonstrated the ability to generate photorealistic videos with remarkable temporal consistency, even capable of deceiving human observers. Consequently, Sora, a leading video generation model, has been referred to as a “world simulator”. However, this raises the question of whether video generation models can truly function as world simulators, prompting us to study and analyze the gap between videos generated by video diffusion models and real videos.

In our investigation, we utilize the state-of-the-art open-source video diffusion model, Stable Video Diffusion [1], to generate synthetic videos and collect real-world in-the-wild videos from Pexels [31] for comparison. These videos are used to examine the limitations of current video generation techniques in accurately simulating real-world videos. Since videos essentially are projections of a dynamic 3D world onto 2D image planes, three aspects of our visual world are reflected in videos: appearance (describing texture, color, and lighting environment), motion (showing object movement in space), and geometry (reflecting scale, shape, and spatial structures). Our study analyzes the gaps between synthetic and real videos across these three dimensions. As it is impossible to isolate these aspects in a video, we use raw video frames and features from visual foundation models [29], optical flow from RAFT [44], and monocular depth from Marigold [20] and UniDepth [32] as proxies for appearance, motion, and geometry, respectively.

We construct video classifiers based on a 3D convolutional neural network using these three aspects of proxies separately. Then, we employ Grad-CAM [42] to analyze the cues that the classifier relies on for classification, allowing us to identify meaningful areas for better understanding. As shown in Table 2, we find these classifiers exhibit impressive performance in detecting fake videos, achieving an accuracy of more than 90% with appearance and geometry cues in the in-domain setting. They can even be directly transferred to detect Sora-generated videos with an accuracy of 70% without having seen such videos during training. This reveals that AI-generated videos still fall short of accurately replicating real-world videos across these three critical dimensions, which remain consistent across different generative models. Through an in-depth investigation using Grad-CAM, we reveal several intriguing findings related to the disparities in appearance, motion, and geometry between real and AI-generated videos: (1) Generated videos continue to struggle with color inconsistency and texture distortions across frames (see Figure 1 and supplementary videos); (2) Unnatural motions are commonly observed in AI-generated videos (see Figure 2 and supplementary videos); (3) Real-world geometric rules are often violated in generated videos, such as the occlusion order between objects and relative object scales changing across frames (see Figure 3 and supplementary videos); (4) Some of the phenomena persist in videos generated by Pika Labs [23], Runway [10], and Sora [3]. Building upon our analysis and findings, we further develop an advanced fake video detector by ensembling the predictions from these three aspects to leverage their complementarity, focusing on enhancing its generalization ability to identify generated videos unseen during training. 
Remarkably, even though our model is trained exclusively on fake videos from SVD, our approach attains an accuracy of over 80% in detecting videos from stronger video generators, including Sora [3], Pika [23], and Runway-Gen2 [10], without any exposure to these videos during training.

Besides, the promising performance also implies the identified gaps are transferable across different video generation models, warranting future exploration to advance video generation toward world simulators.

2 Related Work

Diffusion models for video generation.

The recent success of diffusion models [15, 43] in the field of image synthesis [5, 38, 11, 28, 39, 34] has inspired a series of studies on video generation [2, 9, 12, 14, 16, 21, 26, 50, 49, 51]. Stable Video Diffusion [1], Runway Gen-2, and Pika Labs offer latent video diffusion models trained on large amounts of data for high-quality video generation. More recently, Sora [3] from OpenAI has demonstrated the potential to simulate human beings, animals, and environments in the physical world, suggesting that video generation could be a promising path towards a world simulator. In this work, we aim to help the community gain more understanding about diffusion-generated videos, especially the gap between real-world videos and generated videos.

Generated image detection.

The problem of detecting generated images has been widely investigated over the past years. Early research focused on images generated by GAN-based models. Wang et al. [46] train a simple classifier on images generated by ProGAN [8], which can generalize to images generated by other GAN-based models. Liu et al. [24] also find that a simple CNN-based classifier can achieve plausible performance on fake face detection and generalize to other GAN-based models. With the advance of diffusion models, more recent works [4, 47, 27, 25, 48, 7] have shifted towards detecting images generated by diffusion models. Corvi et al. [4] find that previous state-of-the-art GAN-based detectors cannot generalize to diffusion-generated images. Some subsequent works [47, 27, 25] leverage the reconstruction error from the diffusion model to detect diffusion-generated images in a generalizable way. Sarkar et al. [41] leverage projective geometry for generated image detection. However, despite the significant advancement of video generative models, robust and general detection of diffusion-generated videos remains underexplored. Different from the above, we analyze diffusion-generated videos and propose a simple yet effective framework for robust and generalized generated video detection.

Generated video evaluation.

Many metrics are used to evaluate videos produced by generative models, such as Inception Score (IS) [40], Fréchet Inception Distance (FID) [13], Fréchet Video Distance (FVD) [45], Kernel Video Distance (KVD) [45], Perceptual Input Conformity (PIC) [49], and CLIPSIM [35]. Huang et al. [17] introduce VBench, a benchmark suite for evaluating video generation model performance with fine-grained evaluation dimensions and human alignment. In contrast, our work takes a data-driven approach to identify the differences between real and generated videos using a simple yet effective method. Additionally, we provide low-level statistics analysis as a verification measure to investigate the crucial factors contributing to the suitability of existing video generation as a world simulator.

Video feature extraction.

In the realm of video analysis and generation, extracting and analyzing appearance, motion, and geometry information is crucial for evaluating the realism and authenticity of generated content. Recently, many vision foundation models [22, 29] have shown promising capabilities in representing image appearance. For motion feature extraction, optical flow is a highly important representation. Recently, neural-network-based optical flow methods [6, 18, 44] have demonstrated precise and robust capabilities. Rich geometric information is embedded in the depth cues of monocular images, such as occlusion and relative size. Metric depth can further provide scale information for different objects in the scene, while maintaining video consistency. In the past years, neural-network-based monocular depth estimation [37, 36] has received widespread attention, with recent works like Marigold [20] demonstrating plausible details, and UniDepth [32] achieving precise metric depth estimation while exhibiting excellent consistency in videos. In this work, we leverage these tools to efficiently and effectively acquire the corresponding information to assist our research.

Table 1: Statistical overview of the dataset. ID refers to in-domain, CD stands for cross-domain, Synth. denotes synthetic, and Res. represents resolution.
Source           | Synth. | Split        | # Videos | # Frames | Res.     | Prompt Type
Pexels [31]      | No     | Training Set | 6946     | 173650   | Variable | -
SVD [1]          | Yes    | Training Set | 6946     | 173650   | 1024*576 | I2V
Pexels [31]      | No     | ID Test Set  | 698      | 17450    | Variable | -
SVD [1]          | Yes    | ID Test Set  | 698      | 17450    | 1024*576 | I2V
Pixabay [33]     | No     | CD Test Set  | 175      | 59400    | Variable | -
Sora [3]         | Yes    | CD Test Set  | 75       | 34175    | Variable | Hybrid
Pika Labs [23]   | Yes    | CD Test Set  | 50       | 3500     | Variable | Hybrid
Runway-Gen2 [10] | Yes    | CD Test Set  | 50       | 3750     | 1280*720 | Hybrid

3 Empirical Studies and Analysis

3.1 Data Curation

To better understand AI-generated fake videos, we first study the differences between real and fake videos. To do this, we set up a dataset consisting of real-world videos and AI-generated videos as shown in Table 1. Specifically, we first collect real-world in-the-wild videos from Pexels [31], covering a wide variety of tags, including animals, people, natural scenes, city scenes, indoor scenes, and so on. Then, to collect fake videos generated by AI, we adopt the state-of-the-art open-source video generation model Stable Video Diffusion (SVD) [1] to study the in-domain setting, where our detection model is trained and tested on fake videos from the same AI model. To further study the more challenging cross-domain setting, where our model is trained on fake videos from one AI model (e.g., SVD) and evaluated on other AI models, we augment our test set with videos generated by Sora [3], Pika Labs [23], and Runway-Gen2 [10]. The cross-domain setting aims to simulate the challenging real-world scenario without prior knowledge of the AI model adopted to generate the testing video. This is also employed to examine whether the observed gaps between real and fake videos are transferable across different AI models.

In Table 1, we present the details of our real and synthetic video datasets. Given a real-world video from Pexels [31], we conduct keyframe extraction and then segment it into short clips. To create fake videos, we use the keyframes as the conditioning images to generate synthetic videos with SVD [1]. This strategy reduces the content gap between real and fake videos, which prevents our model from relying on the video content to make predictions. We randomly sample 90% of the videos for training and use the remainder for in-domain testing. To avoid the trivial solution of predicting all the testing videos to be fake, we construct the in-domain test set with an equal number of real video clips from Pexels and synthetic videos generated by SVD. As for cross-domain testing videos, we incorporate videos generated by Sora [3], Pika Labs [23], and Runway-Gen2 [10] with hybrid prompts, including Image2video (I2V) and Text2video (T2V), into our test set. Similar to the in-domain setting, we incorporate the same number of real videos from Pixabay [33] with diversified video contents into the cross-domain test sets.
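The split described above can be sketched as follows. This is only an illustrative sketch: the function names, the `real_`/`svd_` identifiers, and the choice to split at the keyframe level (so that a real clip and the SVD clip generated from its keyframe always land in the same split) are assumptions, not details confirmed by the text.

```python
import random


def split_keyframes(keyframe_ids, train_fraction=0.9, seed=0):
    """Shuffle keyframe ids and split them into train and test lists.

    Hypothetical helper: splitting by keyframe is an assumption made here
    so that paired real and generated clips never straddle the two splits.
    """
    ids = list(keyframe_ids)
    random.Random(seed).shuffle(ids)
    cut = int(round(train_fraction * len(ids)))
    return ids[:cut], ids[cut:]


def paired_samples(ids):
    """Build (clip_name, label) pairs: label 0 = real, label 1 = generated."""
    samples = []
    for k in ids:
        samples.append((f"real_{k}", 0))  # real clip around keyframe k
        samples.append((f"svd_{k}", 1))   # clip generated from keyframe k
    return samples


train_ids, test_ids = split_keyframes(range(100))
test_set = paired_samples(test_ids)
num_real = sum(1 for _, label in test_set if label == 0)
num_fake = sum(1 for _, label in test_set if label == 1)
```

Because every keyframe contributes exactly one real and one generated clip, the resulting test set is balanced by construction, which is the property the paragraph relies on to rule out the trivial "always fake" predictor.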

Table 2: Quantitative results (detection accuracy, %) for the appearance, motion, and geometry classifiers in the in-domain and cross-domain settings.
Input               | In-domain: SVD | Cross-domain: Sora | Cross-domain: Pika | Cross-domain: Runway-Gen2
DINOv2 (Appearance) | 96.77          | 72.00              | 81.00              | 77.00
Flow (Motion)       | 86.60          | 64.67              | 54.00              | 54.00
Depth (Geometry)    | 92.26          | 80.00              | 80.00              | 70.00

3.2 Understanding Differences between Real and Generated Videos

To study the difference between fake and real videos, we conduct an empirical study on them in three key aspects, appearance, motion, and geometry.

  • Appearance: Appearance refers to the visual attributes of the video frames, including color, texture, and lighting. In our study, we analyze how these attributes differ between real and generated videos, focusing on factors such as color consistency, texture realism, and the handling of lighting conditions.

  • Motion: Motion encompasses the dynamics and temporal changes within videos. In real-world videos, objects should exhibit consistent motion over time and space, devoid of sudden stops, jitter, or discontinuous movements. We examine the differences in motion patterns, tracking the fluidity and realism of movements in real versus generated videos.

  • Geometry: Geometry pertains to the spatial structure and shape of objects within the video frames. This includes aspects such as the consistency of object scale and spatial relationships. Our study evaluates how well the geometric properties are preserved or distorted in generated videos compared to their real counterparts.

To allow the decomposition of a video into individual components, we design a comprehensive video representation (CVR) for video analysis and fake video detection.

3.2.1 Comprehensive Video Representation

We decompose a video into a new representation CVR, which is composed of three components and detailed as follows.

  • Appearance representation: Instead of simply using the original RGB information, we additionally extract visual feature information from each frame with DINOv2 [29], a vision foundation model, to extract rich high-level features as the appearance representation. DINOv2 features are capable of cross-image dense and sparse matching, which enhances the potential to capture differences between video frames in terms of visual attributes.

  • Motion representation: We leverage optical flow to study the motion patterns between synthetic and real videos due to its ability to capture subtle variations in pixel movement, enabling precise analysis of dynamics within the video sequences. Specifically, we employ RAFT [44], a state-of-the-art model, to extract optical flow between the adjacent frames.

  • Geometry representation: To investigate the geometric properties of generated videos, we estimate the depth map of each frame. The depth map conveys many 2.5D geometric cues, such as occlusion, spatial relationships, scale, and so on. To do so, we leverage both relative depth and metric depth. Specifically, we trained geometry classifiers using the relative depth from Marigold [20] and the metric depth from UniDepth [32] as inputs. Relative depth measures the perceived or comparative distance between features or points within an object or scene, which is not consistent across frames. It does not provide an absolute measure but instead gives a sense of which features are closer or farther away relative to each other. In contrast, metric depth has a uniform scale and provides better consistency across videos, which aids the network in perceiving changes in the geometric structure of the video.
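The difference between relative and metric depth noted in the geometry bullet can be illustrated with a toy example. The sketch below uses per-frame min-max normalization as a simplified stand-in for relative depth; this is an assumption made for illustration only, and actual estimators such as Marigold are more involved. The depth values and the helper name `to_relative` are hypothetical.

```python
def to_relative(depths):
    """Map one frame's depths to [0, 1] by per-frame min-max normalization.

    Simplified stand-in for relative depth (an assumption, not the exact
    output convention of any particular estimator).
    """
    lo, hi = min(depths), max(depths)
    return [(d - lo) / (hi - lo) for d in depths]


# Hypothetical metric depths (in meters) of three points in two frames.
# The middle point stays at 2 m; only the farthest point changes.
frame_1 = [1.0, 2.0, 5.0]
frame_2 = [1.0, 2.0, 11.0]

rel_1 = to_relative(frame_1)
rel_2 = to_relative(frame_2)

# The static point keeps its metric depth across frames, but its
# relative depth shifts because the scene's depth range changed.
metric_change = frame_2[1] - frame_1[1]
relative_change = rel_2[1] - rel_1[1]
```

In this toy case the static point has zero metric change but a non-zero relative change, which is one way to see why a uniform metric scale makes temporal changes in geometry easier for a classifier to interpret.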

3.2.2 How do AI-generated Videos Differ from Real Ones?

Our comprehensive video representation (CVR) decomposes a video into three key components, allowing us to study whether AI models can faithfully simulate the real world in terms of each component and how fake videos differ from real ones. To do this, we adopt 3D ConvNets [19] to predict whether the video is real or fake with only one of the components of our CVR as input, namely, appearance, motion, and geometry.

The 3D ConvNet is composed of multiple 3D convolutional layers interleaved with ReLU activation functions to introduce non-linearity. After the convolutional layers, the feature maps are flattened and passed through fully connected layers for final classification. The network outputs a probability score using a sigmoid activation function, enabling binary real/fake classification.
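A minimal PyTorch sketch of such a classifier is shown below. It follows only the structure stated above (3D convolutions with ReLU, flattening, fully connected layers, sigmoid output); the class name `Tiny3DConvNet`, the layer counts, channel widths, the pooling layers, and the input size are all assumptions chosen to keep the example small, not the configuration used in the paper.

```python
import torch
import torch.nn as nn


class Tiny3DConvNet(nn.Module):
    """Hypothetical small 3D ConvNet for binary real/fake clip classification."""

    def __init__(self, in_channels: int = 2):
        super().__init__()
        # 3D convolutions interleaved with ReLU; pooling is an added assumption
        # used here to keep the flattened feature vector at a fixed small size.
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 8, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((2, 4, 4)),
        )
        # Flatten, then fully connected layers producing one logit per clip.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 2 * 4 * 4, 32),
            nn.ReLU(inplace=True),
            nn.Linear(32, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, frames, height, width).
        return self.classifier(self.features(x)).squeeze(1)


# Example: 4 clips of 24 two-channel flow fields at a toy 32x32 resolution.
model = Tiny3DConvNet(in_channels=2)
clip = torch.randn(4, 2, 24, 32, 32)
with torch.no_grad():
    logits = model(clip)
    probs = torch.sigmoid(logits)  # probability score per clip
```

The two-channel input corresponds to a horizontal and vertical flow component per pixel; an appearance or depth variant would only change `in_channels`.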

The results in Table 2 show that even a simple 3D ConvNet with only one of the three input components can achieve decent performance for fake video detection in the in-domain setting (see the “in-domain” column of Table 2), demonstrating that the state-of-the-art open-source video generative model SVD cannot fully simulate the real world in terms of appearance, motion, and geometry, despite its decent generative quality.

More importantly, the above approach can even generalize to detect unseen AI-generated videos, including Sora, Pika, and Runway-Gen2; see the “cross-domain” columns in Table 2. Specifically, our model consistently achieves at least 70% fake video detection accuracy with either the appearance or the geometry input in the challenging cross-domain setting. This reveals that existing video generation models, including the SOTA model Sora, share some common limitations in replicating real-world videos. Even the most advanced AI models still struggle to accurately simulate these three critical dimensions of our real world.

Figure 1: Appearance analysis on generated videos. The figure presents the qualitative results of the appearance classifier on three videos. For each video, we sampled 5 frames (first row) and their corresponding Grad-CAM visualizations (second row) to illustrate the results. The top video is an SVD-generated video from our in-domain test set. The bottom two are from our cross-domain test set, generated by Runway-Gen2 and Pika Labs. Note: In the first video, Grad-CAM identifies that the face of the man becomes blurred, the water bottle and books on the desk become distorted, and an unnatural groove appears on the desk surface; in the second video, Grad-CAM identifies the inconsistent color changes of the grass around the dog and the unnatural changes in the appearance of the dog’s head; in the third video, Grad-CAM identifies abrupt changes in both the texture and color of the ground around the car. Please refer to our videos in the supplementary material.

3.2.3 Analysis on Appearance, Geometry, and Motion of AI-Generated Videos

Further, to deepen our comprehension of the intrinsic difference between real and fake videos, we employ Grad-CAM [42] to identify the most discriminative regions that our 3D ConvNet detector leverages as evidence for AI-generated video detection. Note that our Grad-CAM-based strategy can visualize the regions both spatially and temporally, facilitating the diagnosis in each aspect, i.e., appearance, geometry, and motion.
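For reference, the standard Grad-CAM formulation extends naturally to 3D feature maps by averaging gradients over the temporal axis as well as the two spatial axes. The text does not state the exact variant used, so the following should be read as the usual formulation under that assumption; the symbols $y$ (the classifier's output score), $A^{k}$ (the $k$-th feature map of the chosen convolutional layer, of size $T \times H \times W$), $w_k$, and $M$ are notation introduced here.

```latex
w_k = \frac{1}{T\,H\,W} \sum_{t=1}^{T} \sum_{i=1}^{H} \sum_{j=1}^{W}
      \frac{\partial y}{\partial A^{k}_{t,i,j}},
\qquad
M_{t,i,j} = \mathrm{ReLU}\!\left( \sum_{k} w_k \, A^{k}_{t,i,j} \right)
```

The resulting map $M$ has one value per time step and spatial location of the feature map, which is what allows the highlighted regions to be localized both spatially and temporally once $M$ is upsampled to the input clip size.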

In the following, we outline the various sub-studies that comprise our approach.

Analysis on Appearance

First, we investigate how appearance differs between fake and real videos. As shown in Figure 1, AI-generated fake videos suffer from blurry objects (face of the man in the first video), inconsistent color and textures among frames (grass in the second video and the sandy ground in the third video). Please refer to the videos in the supplementary material for better visualization.

With the above extensive study, we make the following findings regarding appearance: Generated videos suffer from color inconsistency and texture distortion.

Analysis on Motion

Our motion classifier utilizes optical flow to assess the motion dynamics within the videos. Again, we visualize the key region that our model leverages to detect fake videos based on motions. As shown in Figure 2, the AI-generated videos cannot fully reproduce the motion in our real world (see the head of the animal in the first example), and may produce irregular motion artifacts on some objects (see the wolves in the second example). Please refer to the videos in the supplementary material for better visualization. Based on the above observations, we have come up with another finding: Video generation models cannot fully reproduce real-world motion patterns and may create unrealistic motion patterns.

Figure 2: Motion dynamics analysis on generated videos. The figure presents the qualitative results of the motion classifier on two videos. For each video, we present the sampled 5-frame RGB images (first row), the optical flow visualizations (second row), and the corresponding Grad-CAM results (third row). The first video is from an SVD-generated video in our in-domain test set. The second video is from a Sora-generated video in our cross-domain test set. Note: In the first video, Grad-CAM identifies an irregular shaking pattern of the animal’s head, which defies physical rules; in the video generated by Sora, Grad-CAM identifies unnatural pixel movements in some detailed structure regions (e.g., the legs of the wolves) caused by an abnormal cloning phenomenon. Please refer to our videos in the supplementary material.

Analysis on Geometry

Further, we investigate the differences in geometry between real and fake videos. As shown in Figure 3, the AI-generated videos can produce intersections that do not conform to physical constraints (see the first example in Figure 3) and create objects/humans with size mutations (see the bottom example in Figure 3). Please refer to the videos in the supplementary material for better visualization. Based on the above analysis, we present the following findings regarding geometry: AI-generated videos still cannot fully follow real-world geometry rules with unreal occlusion patterns and inconsistent object scales.

Figure 3: Geometry analysis on generated videos. The figure illustrates the qualitative results of the geometry classifier on two videos. Each video showcases sampled 5-frame RGB images (first row), metric depth visualizations (second row), and corresponding Grad-CAM results (third row). The first video is from an SVD-generated video in our in-domain test set. The second video is from a Sora-generated video in our cross-domain test set. Note: For the first video, the abrupt changes in the occlusion relationships of the two characters in the video cause unnatural variations in metric depth estimation results between frames in some areas (e.g., the head of the farther person, the elbow of the closer person, and the luggage). The network captures these abnormal changes, and Grad-CAM identifies the corresponding regions. In the second video, the scale of individuals in different regions of the scene varies significantly, resulting in changes in monocular metric depth estimation between frames in areas with larger-scale individuals. The regions where such abnormal and sudden changes occurred are highlighted by Grad-CAM. Please refer to our videos in supplementary.

4 Ensembled Experts for Cross-Domain AI-generated Video Detection

4.1 Ensembled-Experts Model

Despite the decent performance of our 3D ConvNet detectors, each based on a single input modality, the cross-domain performance is still not very satisfactory. As shown in the “cross-domain” columns of Table 2, the detection accuracy drops by roughly 12 to 33 percentage points when the testing videos are generated by AI models different from the one used for training.

This is partially due to the limitation of each detection model, which focuses on only one input modality and can address only particular synthetic video patterns. To develop a more robust fake video detector and further improve the cross-domain fake video detection performance, we ensemble all three expert models into one, referred to as the ensembled-experts model, in a simple yet efficient manner. As illustrated in Figure 4, each CVR classifier assesses the input video in terms of its corresponding representation. Subsequently, each classifier produces logits indicating its confidence in determining the authenticity of the input videos. Then, we combine the logits from the appearance, motion, and geometry classifiers to obtain the final logits $l_{\text{final}}$:

$$l_{\text{final}} = \alpha_{\text{a}} l_{\text{a}} + \alpha_{\text{m}} l_{\text{m}} + \alpha_{\text{g}} l_{\text{g}}, \tag{1}$$

where $\alpha_{\text{a}}$, $\alpha_{\text{m}}$, and $\alpha_{\text{g}}$ represent the combination weights for appearance, motion, and geometry, respectively.
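Equation (1) can be sketched in a few lines of Python. The weight values are not given in the text, so equal weights are used here purely as an assumption, as is the convention that a positive logit means "fake"; the function names and the example logits are hypothetical.

```python
import math


def ensemble_logit(l_a, l_m, l_g, alpha_a=1 / 3, alpha_m=1 / 3, alpha_g=1 / 3):
    """Weighted sum of appearance, motion, and geometry logits (Equation 1).

    The default equal weights are an assumption for illustration only.
    """
    return alpha_a * l_a + alpha_m * l_m + alpha_g * l_g


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical logits: appearance and geometry lean "fake", motion leans "real".
l_final = ensemble_logit(l_a=2.0, l_m=-0.5, l_g=1.5)
p_fake = sigmoid(l_final)
is_fake = p_fake > 0.5
```

The example shows the intended complementarity: a single expert that is unsure or wrong (here the motion logit) can be outvoted when the other two experts are confident.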

Figure 4: An overview of the Ensembled-Experts Model framework. Each CVR classifier assesses the authenticity of the input video independently. The logits produced by the classifiers are ensembled to construct the ensembled-experts model.

Table 3: Results of our Ensembled-Experts Model. Top two rows: comparison with existing works. Bottom four rows: ablation studies.
Method                  | Input  | Sora  | Pika Labs | Runway-Gen2 | Average
CNNDetection [46]       | Images | 50.00 | 50.00     | 50.00       | 50.00
DIRE [47]               | Images | 34.67 | 53.00     | 53.00       | 46.89
DINOv2 classifier       | Videos | 72.00 | 81.00     | 77.00       | 76.67
DINOv2+Flow classifier  | Videos | 74.00 | 83.00     | 79.00       | 78.67
DINOv2+Depth classifier | Videos | 79.33 | 83.00     | 77.00       | 79.77
Ours                    | Videos | 81.33 | 83.00     | 82.00       | 82.11

4.2 Results

Results in Table 3 demonstrate the superior fake video detection capability of our Ensembled-Experts Model. In the following, we will compare our Ensembled-Experts Model with existing works and conduct ablation studies on each input modality.

Implementation Details

We implement our 3D ConvNet with PyTorch [30]. We train the models with a learning rate of 1e-4 and a batch size of 20. We utilize 8 NVIDIA A100 GPUs to generate videos with SVD-XT [1] and 4 NVIDIA 4090 GPUs to train the CVR classifiers. In the testing process, considering that videos from different sources have varying numbers of frames, we segment the test videos into multiple clips, each containing 25 frames. For each video clip, we extract optical flow between adjacent frames, obtaining a 24-frame optical flow sequence as input to the motion classifier. For the appearance and geometry classifiers, we select the first 24 frames of each clip to extract DINOv2 features and perform depth estimation, keeping the input length consistent among the CVR classifiers. During evaluation, if the proportion of clips identified as fake in an input video exceeds $\varepsilon = 0.05$, we classify the entire input video as AI-generated; otherwise, we predict that the video is real.
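The clip segmentation and the video-level decision rule described above can be sketched as follows. The function names are hypothetical, and two details are assumptions not stated in the text: clips are taken as non-overlapping windows, and trailing frames that do not fill a complete 25-frame clip are dropped.

```python
def split_into_clips(num_frames, clip_len=25):
    """Return (start, end) frame ranges of non-overlapping clips.

    Assumption: leftover frames that do not fill a whole clip are dropped.
    """
    return [(s, s + clip_len) for s in range(0, num_frames - clip_len + 1, clip_len)]


def flow_pairs(start, end):
    """Adjacent-frame index pairs inside one clip (25 frames give 24 pairs)."""
    return [(i, i + 1) for i in range(start, end - 1)]


def classify_video(clip_is_fake, epsilon=0.05):
    """Label a video 'fake' if the fraction of fake clips exceeds epsilon."""
    if not clip_is_fake:
        raise ValueError("at least one clip prediction is required")
    fake_ratio = sum(clip_is_fake) / len(clip_is_fake)
    return "fake" if fake_ratio > epsilon else "real"


clips = split_into_clips(60)            # a hypothetical 60-frame video
pairs = flow_pairs(*clips[0])           # flow inputs for the first clip
borderline = classify_video([False] * 19 + [True])   # ratio exactly 0.05
flagged = classify_video([False] * 9 + [True])       # ratio 0.10
```

With a threshold this low, a single clip flagged as fake is enough to label any video of fewer than 20 clips as AI-generated, so the rule favors catching generated videos over avoiding false alarms on real ones.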

Comparison with Existing Works

As shown in Table 3, existing works [46, 47] can hardly detect AI-generated videos in the challenging cross-domain setting, and their results are close to random guessing (50% accuracy). This is because these image-based approaches struggle to capture the artifacts common to AI-generated videos in appearance, motion, and geometry. In contrast, thanks to the effectiveness of our appearance, motion, and geometry classifiers and the Ensembled-Experts strategy, our approach (“Ours” in Table 3) consistently outperforms existing works [46, 47] by a large margin (more than 30%).

Ablation Studies

In the ablation study, we evaluate how each classifier contributes to our Ensembled-Experts Model. As shown in Table 3, the DINOv2 classifier alone, which takes RGB frames as its input modality (see the row “DINOv2 classifier”), already achieves decent performance in cross-domain fake video detection. Ensembling the appearance and motion classifiers (“DINOv2+Flow classifier”) consistently improves accuracy by 2% on all three generators. Ensembling the appearance and geometry classifiers (“DINOv2+Depth classifier”) improves accuracy by about 7% on Sora and 2% on Pika Labs, while the Runway-Gen2 result is unchanged. These experiments show that the appearance classifier cannot fully capture motion and geometry information from the RGB input through DINOv2 features alone; thus, the motion and geometry modalities are still necessary for robust fake video detection.
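The per-generator gains can be recomputed directly from the accuracies reported in Table 3. The short script below is only a bookkeeping aid; the dictionary layout and function names are illustrative, while the numbers are copied from the table.

```python
# Accuracies (%) copied from Table 3.
table3 = {
    "DINOv2": {"Sora": 72.00, "Pika Labs": 81.00, "Runway-Gen2": 77.00},
    "DINOv2+Flow": {"Sora": 74.00, "Pika Labs": 83.00, "Runway-Gen2": 79.00},
    "DINOv2+Depth": {"Sora": 79.33, "Pika Labs": 83.00, "Runway-Gen2": 77.00},
    "Ours": {"Sora": 81.33, "Pika Labs": 83.00, "Runway-Gen2": 82.00},
}


def gains(method, baseline="DINOv2"):
    """Accuracy gain of `method` over `baseline` for each generator."""
    return {
        gen: round(table3[method][gen] - table3[baseline][gen], 2)
        for gen in table3[baseline]
    }


def average(method):
    """Mean accuracy of `method` over the three generators."""
    scores = table3[method].values()
    return round(sum(scores) / len(scores), 2)


flow_gain = gains("DINOv2+Flow")
depth_gain = gains("DINOv2+Depth")
full_gain = gains("Ours")
ours_avg = average("Ours")
```

Read this way, the motion expert contributes a uniform gain across generators, the geometry expert contributes most on Sora, and the full ensemble is the only configuration that improves the Runway-Gen2 result beyond what the motion expert alone provides.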

Finally, we ensemble all three classifiers, and the results (“Ours” in Table 3) demonstrate further improvement in accuracy. This shows that our Ensembled-Experts Model can effectively combine the strengths of each expert model and detect AI-generated videos, including those from the state-of-the-art video generative model Sora, achieving state-of-the-art performance in AI-generated video detection.

4.3 Limitations

While our study provides valuable insights into the differences between AI-generated and real videos, we acknowledge the following limitations. First, although we selected some of the most advanced and widely used models available, our findings may not generalize to future models due to the rapid pace of advancements in AI and video generation techniques. Second, because current video generation models cannot produce videos of comparable length to real ones, our model struggles to utilize long-range information for fake video detection. This highlights an important direction for future development in video generation models.

5 Conclusion and Discussion

We conduct a comprehensive study on the gap between AI-generated videos and real videos, examining three fundamental aspects: appearance, motion, and geometry. To achieve this, we develop fake/real classifiers and utilize Grad-CAM to identify discriminative regions that influence the classifiers’ decisions. Our study reveals the following findings: 1) AI-generated videos still cannot deceive a simple 3D CNN detector in an in-domain setting, indicating gaps between AI-generated videos and real ones. 2) AI-generated videos exhibit inconsistencies in appearance and distortions, unrealistic motion patterns, and geometric clues that violate real-world rules. 3) The trained fake/real detectors can generalize to detect videos from leading generative models with an accuracy of 70%, suggesting that some of the identified gaps are transferable across different generative models, as also demonstrated in visual illustrations. Lastly, by employing a simple ensemble strategy to capitalize on the complementary appearance, motion, and geometry cues, we develop a real/fake detector capable of detecting videos unseen during training with an accuracy of over 80%. We hope our investigation serves as a catalyst for future research focused on developing diffusion models for real-world simulations and designing deepfake detectors to enhance social security.

References

  • [1] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
  • [2] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
  • [3] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024.
  • [4] Riccardo Corvi, Davide Cozzolino, Giada Zingarini, Giovanni Poggi, Koki Nagano, and Luisa Verdoliva. On the detection of synthetic images generated by diffusion models. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
  • [5] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • [6] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 2758–2766, 2015.
  • [7] David C Epstein, Ishan Jain, Oliver Wang, and Richard Zhang. Online detection of ai-generated images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 382–392, 2023.
  • [8] Hongchang Gao, Jian Pei, and Heng Huang. Progan: Network embedding via proximity generative adversarial network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1308–1316, 2019.
  • [9] Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. Preserve your own correlation: A noise prior for video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22930–22941, 2023.
  • [10] Runway Gen-2. https://research.runwayml.com/gen2, 2023.
  • [11] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022.
  • [12] Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023.
  • [13] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • [14] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
  • [15] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [16] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022.
  • [17] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. arXiv preprint arXiv:2311.17982, 2023.
  • [18] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2462–2470, 2017.
  • [19] Shuiwang Ji, Wei Xu, Ming Yang, and Kai Yu. 3d convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence, 35(1):221–231, 2012.
  • [20] Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. arXiv preprint arXiv:2312.02145, 2023.
  • [21] Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023.
  • [22] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  • [23] Pika Labs. https://pika.art/, 2023.
  • [24] Zhengzhe Liu, Xiaojuan Qi, and Philip HS Torr. Global texture enhancement for fake face detection in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8060–8069, 2020.
  • [25] Yunpeng Luo, Junlong Du, Ke Yan, and Shouhong Ding. LaRE^2: Latent reconstruction error based method for diffusion-generated image detection. arXiv preprint arXiv:2403.17465, 2024.
  • [26] Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10209–10218, 2023.
  • [27] Ruipeng Ma, Jinhao Duan, Fei Kong, Xiaoshuang Shi, and Kaidi Xu. Exposing the fake: Effective diffusion-generated images detection. arXiv preprint arXiv:2307.06272, 2023.
  • [28] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • [29] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  • [30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • [31] Pexels. https://www.pexels.com/, 2023.
  • [32] Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. arXiv preprint arXiv:2403.18913, 2024.
  • [33] Pixabay. https://pixabay.com/, 2023.
  • [34] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  • [35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [36] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021.
  • [37] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020.
  • [38] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [39] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
  • [40] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
  • [41] Ayush Sarkar, Hanlin Mai, Amitabh Mahapatra, Svetlana Lazebnik, David A Forsyth, and Anand Bhattad. Shadows don’t lie and lines can’t bend! generative models don’t know projective geometry… for now. arXiv preprint arXiv:2311.17138, 2023.
  • [42] Ramprasaath R Selvaraju, Abhishek Das, Ramakrishna Vedantam, Michael Cogswell, Devi Parikh, and Dhruv Batra. Grad-cam: Why did you say that? arXiv preprint arXiv:1611.07450, 2016.
  • [43] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
  • [44] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer, 2020.
  • [45] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. 2019.
  • [46] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. Cnn-generated images are surprisingly easy to spot… for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8695–8704, 2020.
  • [47] Zhendong Wang, Jianmin Bao, Wengang Zhou, Weilun Wang, Hezhen Hu, Hong Chen, and Houqiang Li. Dire for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22445–22455, 2023.
  • [48] Haiwei Wu, Jiantao Zhou, and Shile Zhang. Generalizable synthetic image detection via language-guided contrastive learning. arXiv preprint arXiv:2305.13800, 2023.
  • [49] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Xintao Wang, Tien-Tsin Wong, and Ying Shan. Dynamicrafter: Animating open-domain images with video diffusion priors. arXiv preprint arXiv:2310.12190, 2023.
  • [50] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.
  • [51] Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. Magicvideo: Efficient video generation with latent diffusion models. arXiv preprint arXiv:2211.11018, 2022.