Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

Mingxiang Liao¹, Hannan Lu², Xinyu Zhang³,⁴, Fang Wan¹, Tianyu Wang¹, Yuzhong Zhao¹, Wangmeng Zuo², Qixiang Ye¹, **gdong Wang⁴
¹University of Chinese Academy of Sciences  ²Harbin Institute of Technology  ³The University of Adelaide  ⁴Baidu Inc
Equal contribution. † Corresponding Author.
Abstract

Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness of video content and its faithfulness to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models. For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video. Based on the new benchmark and the dynamics scores, we assess T2V models with three metrics: dynamics range, dynamics controllability, and dynamics-based quality. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to advance T2V generation models. Code is available at github.com/MingXiangL/DEVIL.

1 Introduction

With the rapid progress of video generation technology, the demand for comprehensive evaluation of model performance continues to grow. Recent benchmarks [26, 23] have included various metrics, e.g., generation quality, video-text alignment degree, and video content continuity, to evaluate text-to-video (T2V) generation models. Despite these efforts, a fundamental characteristic of video, namely dynamics, remains overlooked.

Dynamics refers to the degree of visual change and interaction in the content of videos over time, encompassing object motion, action diversity, scene transitions, etc. It is a crucial index for evaluating video generation models for the following two reasons: (i) In practical applications, the dynamics of generated video content should be faithful to the text prompts. For example, dramatic text prompts are expected to result in videos with high dynamics. (ii) Generated videos usually show negative correlations between dynamics and quality scores [23, 26], i.e., videos with higher dynamics tend to receive lower quality scores. This allows T2V models to “cheat” and achieve high quality scores by generating low-dynamics video content in many cases.

Figure 1: Flowchart to calculate dynamics metrics based on dynamics scores and text prompts.

To fully reveal the dynamics of generated videos, in this paper, we introduce a new evaluation protocol, named DEVIL. DEVIL treats dynamics as the primary dimension for evaluating the performance of T2V models. Here, we consider three types of metrics to represent dynamics: (i) Dynamics Range, which measures the extent of variations in video content that the model can generate; (ii) Dynamics Controllability, which assesses the model’s ability to manipulate video dynamics in response to text prompts; and (iii) Dynamics-based Quality, which evaluates the visual quality of videos with varying dynamics generated by the model.

To produce the evaluation, we first establish a benchmark comprising text prompts categorized by multiple dynamics grades. These text prompts are collected from commonly used datasets [7, 6, 45, 39] and categorized according to their dynamics using a Large Language Model (LLM), GPT-4 [29], followed by further manual refinement. Based on the constructed text prompt benchmark, we calculate an overall dynamics score for each generated video, which is defined as a weighted sum of a series of dynamics scores at different temporal granularities.

The prompt benchmark and the overall dynamics scores of all generated videos are then utilized to evaluate T2V models with three dynamics metrics. This evaluation goes beyond simply maximizing dynamics scores for each video; it emphasizes the model’s ability to produce high-quality videos across various dynamics following the instructions from text prompts. (i) Dynamics Range is calculated as the range of dynamics scores over all generated videos, indicating the ability of T2V models to generate videos with both subtle and dramatic temporal variations. (ii) For Dynamics Controllability, we adopt a ranking consistency-based methodology to check whether the dynamics scores of generated videos align with the dynamics of their corresponding text prompts. (iii) Dynamics-based Quality is defined by integrating several quality metrics with dynamics scores. It avoids biases caused by negative correlations between video dynamics and video quality [23, 26], resulting in a more comprehensive evaluation of video quality. Finally, noting that video naturalness decreases with increasing dynamics, we also propose a naturalness metric based on a multimodal large language model, i.e., Gemini-1.5 Pro [1].

Using DEVIL, we evaluate and revisit state-of-the-art T2V models and find that: (i) Existing datasets have a biased dynamics distribution, so current generation models (especially top-ranking models like GEN-2 [2]) typically generate slow-motion videos to obtain high quality scores. (ii) Existing training datasets contain text prompts that are biased with respect to dynamics. Training on these prompts inevitably limits the dynamics controllability of T2V models. (iii) Statistical analyses of the dynamics scores, especially the naturalness metric, show that existing methods have limited real-world simulation ability. Based on these findings, we believe that more carefully constructed training data, together with better methods, will improve T2V performance on both quality and dynamics scores.

In summary, our contributions are:

  1. We propose a novel evaluation protocol, termed DEVIL, which benchmarks T2V generation models by integrating dynamics metrics. Together with existing evaluation metrics, DEVIL builds a more comprehensive evaluation protocol.

  2. We establish a new text prompt benchmark w.r.t. dynamics grades as well as a set of metrics to evaluate video dynamics across temporal granularities, facilitating the assessment of dynamics range, dynamics controllability, and dynamics-based quality.

  3. Extensive evaluation of existing T2V generation models allows us to thoroughly analyze the capabilities of T2V models through the proposed protocol and benchmarks. The results could inspire more sophisticated T2V generation methods.

2 Related Work

2.1 Text-to-Video Generation Model

As a recent breakthrough in artificial intelligence, diffusion models have pushed video generation technology to a new height. Earlier studies [21, 20] explored the 3D U-Net and cascaded models for diffusion within pixel space. Recent solutions [12, 32] employed latent diffusion models to efficiently manage the diffusion process within a compressed latent space. Following these studies, a variety of approaches [38, 9, 25, 41, 15, 40, 46, 28, 24] updated and improved this paradigm. Building on these advancements, subsequent methods further explored generating videos of higher quality and extended duration. The Videocrafter approach [13] pursued high-quality video generation through disentangling spatial and temporal learning and tuning spatial modules using high-quality images. In a similar way, commercial models such as Pika [4] and GEN-2 [2] demonstrated substantial improvements, showcasing videos with exceptional visual clarity. For longer video generation, Gen-L-Video [37] aggregated short clips generated by base T2V models using temporal co-denoising to enhance continuity. Freenoise [30] extended pre-trained T2V models through rescheduling noise for longer-duration video inference. StreamingT2V [18] enhanced long-term content consistency by integrating short-term and long-term memory blocks.

The rapid development of T2V models poses a growing demand for quality evaluation protocols. Unfortunately, existing protocols primarily focus on temporal consistency and content continuity, yet largely ignore temporal dynamics. This hinders the assessment of both the vividness of video content and its faithfulness to text prompts.

2.2 Evaluation Protocol

Early evaluation protocols [34] primarily relied on class labels to evaluate the performance of T2V generation models. For example, they commonly used video clips from the UCF-101 dataset and human-annotated video captions from the MSR-VTT [45] dataset as the evaluation data. For a more specific assessment, FETV [27] assigned fine-grained category labels to prompts and calculated the CLIP-SIM score for each category.

However, conventional quality assessment metrics such as Inception Score (IS) [33], Fréchet Inception Distance (FID) [19], Fréchet Video Distance (FVD) [35], and CLIP-SIM typically operate on a single dimension and cannot provide a comprehensive evaluation. To address this limitation, EvalCrafter [31] expanded both the prompt scale and the number of evaluation metrics so that the text-video alignment degree and the quality of generated videos can be better evaluated. Additionally, VBench [23] proposed a multi-dimensional, multi-category evaluation suite that not only considered the diversity of prompts but also encompassed a variety of assessment metrics.

Despite the evolution of evaluation metrics, we argue that an essential characteristic of video, i.e., dynamics, remains ignored. In this study, we introduce the dynamics dimension to evaluate T2V generation models and to enhance the completeness of existing metrics.

3 Dynamics Evaluation Protocol

Table 1: Symbol Definitions.
Symbol | Definition
$\mathbf{D}_{<name>}$ | $<name>$-type dynamic metric of T2V models.
$\mathcal{T}$ | Text prompt benchmark of our DEVIL.
$M$ | The number of text prompts in $\mathcal{T}$.
$T^{i}$ | $i$-th text prompt in $\mathcal{T}$, where $i\in\{1,\cdots,M\}$.
$G^{i}$ | Dynamic grade of text prompt $T^{i}$.
$S^{i}$ | Dynamics score of video generated by $T^{i}$.
$f_{j}$ | The $j$-th video frame.
$F_{j}$ | Feature of $j$-th video frame.
$N$ | The number of video frames.

In this section, we first provide an overview of the proposed DEVIL protocol in Section 3.1 and then introduce the dynamics metrics proposed within DEVIL in Section 3.2. Finally, we detail the prompt benchmark (Section 3.3) and the dynamics scores (Sections 3.4 and 3.5) constructed to evaluate the dynamics metrics of T2V generation models.

3.1 Overview

Figure 2: Distributions of video quantity and quality scores along the dynamics score for various video generation models including: GEN-2 [2], Pika [4], VideoCrafter2(VC-2) [13], Open-Sora(OS) [22], StreamingT2V [18] and FreeNoise-Lavie(FN) [30]. Subplot (a) shows video quantity distribution. Subplots (b) display the distribution of quality score of generated videos in terms of Background Consistency, Motion Smoothness, and Naturalness, respectively. All videos are generated based on our text prompt benchmark.

Fig. 1 shows the evaluation workflow of the DEVIL protocol. We aim to calculate three dynamics metrics, namely dynamics range ($\mathbf{D}_{range}$), dynamics controllability ($\mathbf{D}_{control}$), and dynamics-based quality ($\mathbf{D}_{quality}$), for each T2V model. To achieve this, we establish a text prompt benchmark $\mathcal{T}=\{(T^{i},G^{i})\}_{i=1}^{M}$, where each prompt $T^{i}$ has a dynamic grade $G^{i}$, classified by GPT-4 [29] and then manually refined. $M$ is the number of prompts; our benchmark contains around 800. Subsequently, we generate videos using $\mathcal{T}$ and assess the dynamics of each generated video using an overall dynamics score $S$. To calculate $S$, we define a series of dynamics scores at different temporal granularities, including the inter-frame, inter-segment, and video levels, to reveal the video characteristics at multiple temporal levels, as shown in Table 3. These scores are combined to obtain $S$ using weights derived from fitting human ratings. The dynamics scores of all generated videos are then used to calculate the three dynamics metrics, which represent the overall performance of T2V models. For ease of reference, the symbol definitions are provided in Table 1.
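As an illustration of combining per-granularity scores into an overall score $S$, the following is a minimal sketch that fits the weights by ordinary least squares. The choice of least squares, the function names `fit_weights` and `overall_score`, and the toy numbers are assumptions made for this example; the text above only states that the weights are derived from fitting human ratings.

```python
import numpy as np

def fit_weights(sub_scores, human_ratings):
    """Least-squares weights w such that sub_scores @ w approximates the ratings.

    Assumption: a plain linear fit; the fitting procedure is not specified here.
    """
    a = np.asarray(sub_scores, dtype=float)     # shape (num_videos, num_scores)
    y = np.asarray(human_ratings, dtype=float)  # shape (num_videos,)
    w, *_ = np.linalg.lstsq(a, y, rcond=None)
    return w

def overall_score(sub_scores, weights):
    """Weighted sum of the individual dynamics scores, one value per video."""
    return np.asarray(sub_scores, dtype=float) @ weights

# Toy data (hypothetical): two dynamics scores for three videos.
toy_sub_scores = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_ratings = [0.3, 0.7, 1.0]
toy_weights = fit_weights(toy_sub_scores, toy_ratings)
toy_overall = overall_score(toy_sub_scores, toy_weights)
```

In this toy case the ratings are an exact linear combination of the two scores, so the fitted weights reproduce them.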

3.2 Dynamics Metrics

We introduce three key metrics, dynamics range ($\mathbf{D}_{range}$), dynamics controllability ($\mathbf{D}_{control}$), and dynamics-based quality ($\mathbf{D}_{quality}$), to evaluate T2V models from the perspective of dynamics. Each of these metrics is computed over the entire benchmark (described in Section 3.3) from the per-video dynamics scores (detailed in Sections 3.4 and 3.5).

(i) Dynamics Range demonstrates the model’s versatility in handling both subtle and dramatic changes. An ideal T2V generation model is expected to display a large dynamics range, reflecting various temporal variations described in text prompts.

In detail, we determine the dynamics range metric $\mathbf{D}_{range}$ by identifying the extremes of the dynamics scores over the benchmark, while excluding the top and bottom 1% of scores to mitigate the influence of outliers. This is formulated as

$$\mathbf{D}_{range}=\mathbf{Q}_{0.99}-\mathbf{Q}_{0.01}, \tag{1}$$

where $\mathbf{Q}_{0.99}$ and $\mathbf{Q}_{0.01}$ denote the 99th and 1st percentile values of the dynamics scores for videos generated with our proposed text prompt benchmark, respectively. This metric reflects a realistic spread of dynamics, excluding atypical extremes.
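The percentile-based range of Eq. (1) can be sketched in a few lines. The function name `dynamics_range` and the toy scores are illustrative, and linear interpolation between observed scores is an assumption, since the text does not specify how percentiles are interpolated.

```python
from statistics import quantiles

def dynamics_range(scores):
    """Return Q_0.99 - Q_0.01 of the per-video dynamics scores (Eq. 1)."""
    # quantiles(..., n=100) returns the 99 cut points Q_0.01, ..., Q_0.99.
    # method="inclusive" interpolates linearly between observed scores
    # (an assumption; the interpolation rule is not stated in the text).
    cuts = quantiles(scores, n=100, method="inclusive")
    return cuts[98] - cuts[0]

# Toy example: 101 evenly spaced scores 0.00, 0.01, ..., 1.00.
toy_scores = [k / 100 for k in range(101)]
toy_range = dynamics_range(toy_scores)
```

Because the lowest and highest 1% are trimmed, a single extreme video has little effect on the reported range.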

(ii) Dynamics Controllability assesses the ability of T2V models to manipulate video dynamics with text prompts. Objectively, it is challenging to obtain an exact correspondence between text prompts and videos. Therefore, we adopt a ranking consistency-based methodology to derive a Dynamics Controllability metric $\mathbf{D}_{control}$.

Specifically, for two text prompts $T^{i}$ and $T^{j}$ in benchmark $\mathcal{T}=\{(T^{i},G^{i})\}$, their corresponding generated videos have dynamics scores $S^{i}$ and $S^{j}$ (the dynamics scores are detailed in Section 3.5). Provided that the dynamics grades are ranked $G^{i}>G^{j}$, the dynamics scores should be consistently ranked $S^{i}>S^{j}$. Accordingly, we calculate $\mathbf{D}_{control}$ as follows:

$$\mathbf{D}_{control}=\frac{1}{M}\sum_{i=1}^{M}\frac{1}{M-M^{i}}\sum_{j:G^{j}\neq G^{i}}\mathbb{I}\big((S^{i}-S^{j})(G^{i}-G^{j})\big), \tag{2}$$

where $M$ is the number of all text prompts and $M^{i}$ denotes the number of prompts with a dynamics grade of $G^{i}$, so that $M-M^{i}$ counts the prompts compared against $T^{i}$. $\mathbb{I}(\cdot)$ denotes the indicator function, which equals 1 when its argument is positive, i.e., when the scores and grades of a pair are ranked consistently, and 0 otherwise.

Table 2: Correlation between the overall dynamics score and the existing quality metrics, including Naturalness (Nat), Visual Quality [44] (VQ), Motion Smoothness (MS) [23], Subject Consistency (SC) [23] and Background Consistency (BC) [23]. “PC” denotes Pearson’s correlation, and “KC” denotes Kendall’s correlation.
Evaluation Metrics | PC | KC
Naturalness (Nat) | -51.8 | -44.2
Visual Quality (VQ) | -24.8 | -18.6
Motion Smoothness (MS) | -64.0 | -54.6
Subject Consistency (SC) | -88.9 | -74.9
Background Consistency (BC) | -79.4 | -61.4
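The ranking-consistency computation of Eq. (2) can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `dynamics_controllability` is hypothetical, $M^{i}$ is taken as the count of prompts sharing the grade of prompt $i$, and the indicator is assumed to be 1 only for a strictly positive product.

```python
from collections import Counter

def dynamics_controllability(grades, scores):
    """Ranking-consistency score following the form of Eq. (2)."""
    m = len(grades)
    per_grade = Counter(grades)  # number of prompts sharing each grade
    total = 0.0
    for i in range(m):
        others = m - per_grade[grades[i]]  # M - M^i: prompts with another grade
        if others == 0:
            continue  # no comparable pair for this prompt
        # Count pairs whose score order agrees with their grade order.
        # Assumption: ties in score count as disagreement.
        agree = sum(
            1
            for j in range(m)
            if grades[j] != grades[i]
            and (scores[i] - scores[j]) * (grades[i] - grades[j]) > 0
        )
        total += agree / others
    return total / m

# Toy example: two prompts of grade 1 and one of grade 5.
toy_control = dynamics_controllability([1, 1, 5], [0.2, 0.8, 0.5])
```

In the toy example the first prompt is ranked consistently against the grade-5 prompt, the second is not, and the grade-5 prompt agrees with one of its two comparisons, giving (1 + 0 + 0.5) / 3.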

(iii) Dynamics-based Quality. Existing evaluations of generated visual quality do not account for the dynamics of the videos. Previous studies [23, 26] have shown that videos with higher dynamics tend to receive lower quality scores. In Table 2, we calculate the correlation between the overall dynamics score of each generated video (as detailed in Section 3.5) and its quality metrics. In detail, quality metrics such as Naturalness (Nat, elaborated later in this section), Motion Smoothness (MS) [23], Subject Consistency (SC) [23], and Background Consistency (BC) [23] exhibit a strong negative correlation with dynamics. This indicates that T2V models can “cheat” on these metrics by generating low-dynamics videos for most text prompts, as shown in Fig. 2.

To address this, we propose the Dynamics-based Quality metric $\mathbf{D}_{quality}$, which assesses generated visual quality while accounting for dynamics. For each video, we compute a composite quality score by averaging the scores of the quality metrics identified as correlated with dynamics (i.e., Nat, MS, SC, and BC). We then divide the entire range of dynamics scores into $L=12$ equal intervals and assign videos to their corresponding intervals based on their dynamics scores. Within each interval $l$, we calculate the average of the composite quality scores, denoted as $C_{l}$. Ultimately, the dynamics-based quality is defined as the overall average of these interval averages:

$$\mathbf{D}_{quality}=\frac{1}{L}\sum_{l=1}^{L}C_{l}. \tag{3}$$

In addition to dynamics-based quality over the entire range of dynamics scores, we also evaluate dynamics-based quality at high, medium, and low dynamics levels by modifying the range of intervals, for a more comprehensive evaluation (refer to Section 4.3). To score highly on dynamics-based quality, the generated videos should cover all dynamics intervals, which implies a large dynamics range. For detailed results that integrate the dynamics score with individual metrics, please refer to Appendix C.
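The interval averaging of Eq. (3) can be sketched as below. Several details are assumptions made for illustration: dynamics scores are taken to lie in a known range `[lo, hi]` (defaulting to [0, 1]), an interval that receives no videos contributes $C_{l}=0$ (consistent with the remark that videos should cover all intervals to score highly), and the name `dynamics_based_quality` is hypothetical.

```python
def dynamics_based_quality(dyn_scores, quality_scores, lo=0.0, hi=1.0, num_bins=12):
    """Average of per-interval mean quality over equal-width dynamics intervals."""
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for s, q in zip(dyn_scores, quality_scores):
        if not lo <= s <= hi:
            continue  # video falls outside the evaluated dynamics range
        # A score exactly at the top edge is assigned to the last interval.
        idx = min(int((s - lo) / width), num_bins - 1)
        sums[idx] += q
        counts[idx] += 1
    # Assumption: an empty interval contributes C_l = 0, so a model that never
    # reaches some dynamics levels receives a lower overall score.
    means = [sums[k] / counts[k] if counts[k] else 0.0 for k in range(num_bins)]
    return sum(means) / num_bins

# Toy example with 4 intervals: two videos land in interval 0, one in interval 2.
toy_quality = dynamics_based_quality([0.1, 0.2, 0.6], [1.0, 0.5, 0.8], num_bins=4)
```

Restricting `lo` and `hi` to a sub-range gives the low, medium, or high dynamics variants under the same assumptions.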

Naturalness. We propose a Naturalness metric to evaluate the ability of T2V models to generate realistic videos. In video generation, increased video dynamics often lead to unnatural phenomena, such as a cat with an extra leg or water flowing uphill. Existing metrics focus on visual effects and ignore video naturalness. However, a model’s ability to generate natural videos reflects its ability to simulate the real world. To assess this, we use the multi-modal model Gemini 1.5 Pro [1] to grade each video’s naturalness into five levels (see Appendix E for details): “Almost Real”, “Slightly Unrealistic”, “Moderately Unrealistic”, “Noticeably Unrealistic”, and “Completely Fictitious”. The overall naturalness is the average score of all videos. Experiments (see Table 4) show a high correlation between our scores and human ratings, validating the metric’s effectiveness.

3.3 Text Prompt Benchmark

To evaluate the proposed dynamics metrics, we need a benchmark consisting of text prompts that fully represent multiple dynamic grades. Existing benchmarks [23, 26] cannot explicitly reflect various dynamics. To this end, we establish a new benchmark. Let $\mathcal{T}=\{(T^{i},G^{i})\}_{i=1}^{M}$ denote the benchmark, where each text prompt $T^{i}$ is assigned a dynamic grade $G^{i}\in\{1,2,3,4,5\}$ on a coarse five-level scale. The dynamic grades are defined based on the level of dynamics described in the text prompts: "1" represents Static video, where the video content is nearly stationary; "2" represents Low dynamics, indicating slow and slight changes in the video content; "3" represents Medium dynamics, characterized by noticeable activity and changes but relatively smooth overall; "4" represents High dynamics, with fast actions and changes; and "5" represents Very high dynamics, indicating extremely rapid and frequent changes in the video content.

Figure 3: Dynamics distribution and word cloud of text prompts from DEVIL, VBench [23], and EvalCrafter [26].

In the coarse categorization step, we collect about 50,000 text prompts from existing benchmarks, including VidProm [39], WebVid [8], MSR-VTT [45], and DiDeMo [17]. The initial dynamic grade of each text prompt $T^{i}$ is assigned by GPT-4. In the post-processing step, we recruit six human annotators to refine these grades. Finally, we sample 800 text prompts evenly across the dynamic grades to ensure a uniform distribution.
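The final balanced sampling step can be sketched as follows, assuming the refined prompts are available as `(text, grade)` pairs. Uniform random sampling within each grade, the function name `sample_balanced`, and the toy pool are assumptions for illustration; the text only states that prompts are sampled evenly across grades.

```python
import random
from collections import defaultdict

def sample_balanced(prompts, total=800, num_grades=5, seed=0):
    """Draw the same number of prompts from every dynamics grade."""
    by_grade = defaultdict(list)
    for text, grade in prompts:
        by_grade[grade].append(text)
    per_grade = total // num_grades  # 160 per grade for 800 prompts, 5 grades
    rng = random.Random(seed)        # fixed seed so the draw is reproducible
    benchmark = []
    for grade in range(1, num_grades + 1):
        pool = by_grade[grade]
        if len(pool) < per_grade:
            raise ValueError(f"grade {grade} has only {len(pool)} prompts")
        benchmark += [(text, grade) for text in rng.sample(pool, per_grade)]
    return benchmark

# Toy pool (hypothetical): 50 graded prompts, 10 per grade.
toy_pool = [(f"prompt {k}", k % 5 + 1) for k in range(50)]
toy_benchmark = sample_balanced(toy_pool, total=20)
```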

Fig. 3(b) shows the statistics of the DEVIL benchmark, which contains approximately 800 text prompts, and each dynamics grade includes 19 object categories and 4 scene categories. For comparison, we further assign dynamic grades to the text prompts from existing benchmarks [23, 26] following the same procedure. As shown in Fig. 3(a), these benchmarks are heavily skewed towards lower dynamic content, while our benchmark demonstrates a more balanced distribution across all dynamic grades. Unless otherwise specified, all experiments in this paper are conducted on the DEVIL benchmark.

Table 3: Formulations of dynamics scores at different temporal granularities.
Granularity | Dynamics scores | Formulation
Inter-frame | Optical Flow Strength | $S_{ofs}=\frac{1}{N-1}\sum_{i=1}^{N-1}\text{FLOW}(f_{i})$
Inter-frame | Structural Dynamics Score | $S_{sd}=1-\frac{1}{N-1}\sum_{i=1}^{N-1}\text{SSIM}(f_{i},f_{i+1})$
Inter-frame | Perceptual Dynamics Score | $S_{pd}=\frac{1}{N-1}\sum_{i=1}^{N-1}\text{PHASHD}(f_{i},f_{i+1})$
Inter-segment | Patch-level Aperiodicity | $S_{pa}=1-\frac{1}{HW}\sum_{h,w}\mathbf{ACF}(\{F_{i,h,w}\}_{i=1}^{N})$
Inter-segment | Global Aperiodicity | $S_{ga}=1-\frac{1}{\lfloor rN\rfloor}\sum_{i=1}^{\lfloor rN\rfloor}\sum_{j\neq i}\mathbf{SIM}(F_{i}^{r},F_{j}^{r})$
Video | Temporal Entropy | $S_{te}=\mathbf{H}(f_{1},f_{2},\cdots,f_{N}|f_{1})$
Video | Temporal Semantic Diversity | $S_{tsd}=\frac{1}{N}\sum_{i=1}^{N}\|F_{i}-\bar{F}\|^{2}$

3.4 Dynamics Scores for Generated Videos

Figure 4: Video dynamics at different temporal granularities: (a) Inter-frame Dynamics, (b) Inter-segment Dynamics, and (c) Video-level Dynamics.

To evaluate the proposed dynamics metrics, we generate videos using the text prompts from $\mathcal{T}=\{(T^{i},G^{i})\}_{i=1}^{M}$ and assess the dynamics of each generated video using a set of dynamics scores designed at different temporal granularities. Specifically, we evaluate dynamics at three levels: inter-frame, inter-segment, and the entire video. By combining these evaluations, we derive an overall dynamics score. For simplicity, we omit the superscripts from the dynamics scores in this section.

(i) Inter-frame Dynamics Scores. These scores describe variations between successive frames and are further divided into: optical flow strength, structural dynamics, and perceptual dynamics.

Optical flow strength. We first employ RAFT [48] to estimate the optical flow between successive frames. For each frame, the optical flow magnitudes are averaged over all pixels to obtain the optical flow strength of that frame. Averaging these values over all frames gives the optical flow strength $S_{ofs}$ of the video, as

$$S_{ofs}=\frac{1}{N-1}\sum_{i=1}^{N-1}\text{FLOW}(f_i), \tag{4}$$

where $\text{FLOW}(f_i)$ calculates the mean optical flow magnitude of frame $f_i$.
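As a minimal sketch of Eq. 4, the snippet below assumes the flow fields have already been estimated by an external model such as RAFT and are stored as an array of per-pixel displacements. The function name `optical_flow_strength` and the array layout `(N-1, H, W, 2)` are illustrative assumptions, not part of the released code.

```python
import numpy as np


def optical_flow_strength(flows) -> float:
    """Mean flow magnitude over pixels, then over frame pairs (Eq. 4).

    flows: array of shape (N-1, H, W, 2) holding (dx, dy) per pixel.
    The flow estimator itself (e.g. RAFT) is assumed to run elsewhere.
    """
    flows = np.asarray(flows, dtype=np.float64)
    if flows.ndim != 4 or flows.shape[-1] != 2:
        raise ValueError("expected flows of shape (N-1, H, W, 2)")
    magnitude = np.linalg.norm(flows, axis=-1)  # (N-1, H, W)
    per_frame = magnitude.mean(axis=(1, 2))     # FLOW(f_i) for each frame
    return float(per_frame.mean())              # S_ofs
```

A uniform displacement of (3, 4) pixels in every frame gives a strength of 5, which is a convenient sanity check for the averaging order.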

Structural dynamics score. While optical flow excels at capturing motion, it is less effective at detecting structural changes such as variations in lighting conditions. To capture such information, we compute the structural similarity index measure (SSIM) [43] between each pair of consecutive frames and average over all pairs to quantify the inter-frame structural variation of the video, as

$$S_{sd}=1-\frac{1}{N-1}\sum_{i=1}^{N-1}\text{SSIM}(f_i,f_{i+1}). \tag{5}$$
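The structure of Eq. 5 can be sketched as follows. To keep the example self-contained, SSIM is replaced by a simplified single-window (global) variant computed over the whole frame; standard SSIM uses local sliding windows, so the numbers will differ from a full implementation. The function names and the grayscale `(N, H, W)` input layout are assumptions made for illustration.

```python
import numpy as np


def global_ssim(x, y, data_range=255.0) -> float:
    """Single-window SSIM over the whole frame (a simplification)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(num / den)


def structural_dynamics(frames, data_range=255.0) -> float:
    """S_sd = 1 - mean SSIM over consecutive frame pairs (Eq. 5).

    frames: grayscale array of shape (N, H, W) with N >= 2.
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.shape[0] < 2:
        raise ValueError("need at least two frames")
    sims = [global_ssim(frames[i], frames[i + 1], data_range)
            for i in range(frames.shape[0] - 1)]
    return float(1.0 - np.mean(sims))
```

A perfectly static clip yields a score of zero, and the score grows as consecutive frames become less similar.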

Perceptual dynamics. The human visual system is sensitive to changes in low-frequency regions of video frames. To reflect this characteristic, we introduce a perceptual dynamics score that measures the difference between the perceptual hashes [36] of consecutive frames. The perceptual dynamics score $S_{pd}$ is defined as the mean perceptual hash distance over all consecutive frame pairs, as

$$S_{pd}=\frac{1}{N-1}\sum_{i=1}^{N-1}\text{PHASHD}(f_i,f_{i+1}), \tag{6}$$

where $\text{PHASHD}(f_i,f_{i+1})$ denotes the Hamming distance [16] between the perceptual hashes of $f_i$ and $f_{i+1}$.
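The following sketch illustrates Eq. 6 with a common DCT-based perceptual hash: each frame is downsampled, transformed with a 2D DCT, and the low-frequency block is thresholded at its median to give a 64-bit code. The hash size, the nearest-neighbour downsampling, and all function names are assumptions; the exact hash variant used in [36] may differ.

```python
import numpy as np


def _dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0] /= np.sqrt(2.0)
    return basis * np.sqrt(2.0 / n)


def phash(gray, size: int = 32, keep: int = 8) -> np.ndarray:
    """Boolean hash of keep*keep bits from the low-frequency DCT block."""
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape
    rows = (np.arange(size) * h) // size  # nearest-neighbour sampling
    cols = (np.arange(size) * w) // size
    small = gray[np.ix_(rows, cols)]
    d = _dct_matrix(size)
    low = (d @ small @ d.T)[:keep, :keep]
    return (low > np.median(low)).ravel()


def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two hashes (PHASHD)."""
    return int(np.count_nonzero(a != b))


def perceptual_dynamics(frames) -> float:
    """S_pd: mean hash distance over consecutive frame pairs (Eq. 6)."""
    hashes = [phash(f) for f in frames]
    if len(hashes) < 2:
        raise ValueError("need at least two frames")
    dists = [hamming(hashes[i], hashes[i + 1])
             for i in range(len(hashes) - 1)]
    return float(np.mean(dists))
```

Because only the low-frequency DCT coefficients enter the hash, small high-frequency changes such as fine noise tend to leave the code unchanged, which matches the motivation given above.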

(ii) Inter-segment Dynamics Scores. These scores refer to the changes between video segments, each containing multiple frames. They capture the patterns of video content changes and are further categorized into patch-level aperiodicity and global aperiodicity, which measure the dynamics between video segments.

Patch-level aperiodicity. We first calculate inter-segment dynamics at the patch level using the auto-correlation factor ($\mathbf{ACF}$) [10] to evaluate the scene and temporal pattern dynamics. The auto-correlation factor measures the feature similarity of a time series, revealing periodicity and changing trends of features. Given features at position $(h,w)$ across $N$ frames, $\{F_{i,h,w}\}_{i=1}^{N}$, the auto-correlation factor of the features is defined as

$$\mathbf{ACF}(\{F_{i,h,w}\}_{i=1}^{N})=\frac{1}{N-K_0}\sum_{k=K_0}^{N}\sum_{i=1}^{k}\frac{1}{k}\mathbf{SIM}(F_{i,h,w},F_{N-k+i,h,w}), \tag{7}$$

where $K_0$ is the minimal segment length, empirically set to $\lfloor N/8\rfloor$ because most generated videos have more than 8 frames, and $\mathbf{SIM}$ denotes the cosine similarity between two feature vectors. Let $H$ and $W$ be the height and width of the feature map, respectively. With the auto-correlation factors of all patches, we define the patch-level aperiodicity of the video, as

$$S_{pa}=1-\frac{1}{HW}\sum_{h,w}\mathbf{ACF}(\{F_{i,h,w}\}_{i=1}^{N}). \tag{8}$$
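A literal reading of Eqs. 7 and 8 can be sketched as below, assuming patch features are already available as an array of shape `(N, H, W, D)`. The feature extractor, the function names, and the guard `max(1, N // 8)` for very short clips are assumptions introduced for this illustration.

```python
import numpy as np


def cosine_sim(a, b, eps: float = 1e-8):
    """Row-wise cosine similarity between two (k, D) arrays."""
    num = (a * b).sum(axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return num / den


def acf(series, k0=None) -> float:
    """Auto-correlation factor of one patch's feature series (Eq. 7).

    series: array of shape (N, D). For each overlap length k, the first
    k features are compared with the last k features.
    """
    series = np.asarray(series, dtype=np.float64)
    n = series.shape[0]
    if k0 is None:
        k0 = max(1, n // 8)  # assumption: avoid k0 = 0 for short clips
    if n <= k0:
        raise ValueError("need more frames than the minimal segment length")
    total = 0.0
    for k in range(k0, n + 1):
        head = series[:k]      # F_1 ... F_k
        tail = series[n - k:]  # F_{N-k+1} ... F_N
        total += cosine_sim(head, tail).mean()
    return float(total / (n - k0))


def patch_aperiodicity(features, k0=None) -> float:
    """S_pa = 1 - mean ACF over all spatial positions (Eq. 8)."""
    features = np.asarray(features, dtype=np.float64)
    _, h, w, _ = features.shape
    values = [acf(features[:, y, x], k0)
              for y in range(h) for x in range(w)]
    return float(1.0 - np.mean(values))
```

Under this literal reading the sum has $N-K_0+1$ terms while the divisor is $N-K_0$, so a perfectly static patch gives an ACF marginally above 1. An implementation may prefer to normalize by the number of terms instead.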

Global aperiodicity. In addition to patch-level dynamics, we employ a global aperiodicity score to measure the diversity of patterns between video segments. Specifically, we divide the video into segments. Each segment has a length $rN$, where $r$ is a proportion factor, empirically set to 0.25. We use ViCLIP [42] to extract the spatial-temporal features for each segment. The features are denoted as $\{F_i^r\}_{i=1}^{\lfloor rN\rfloor}$. We then calculate the similarity of these features to assess the variation in spatial-temporal patterns across segments, as

$$S_{ga}=1-\frac{1}{\lfloor rN\rfloor}\sum_{i=1}^{\lfloor rN\rfloor}\sum_{j\neq i}\mathbf{SIM}(F_i^r,F_j^r). \tag{9}$$
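Eq. 9 can be sketched as follows, assuming the segment features have already been extracted (for example by ViCLIP) into an array of shape `(M, D)`, where `M` stands in for the number of segment features. The function name and the input layout are illustrative assumptions.

```python
import numpy as np


def global_aperiodicity(segment_feats, eps: float = 1e-8) -> float:
    """S_ga = 1 - (1/M) * sum_i sum_{j != i} cos(F_i, F_j) (Eq. 9).

    segment_feats: array of shape (M, D), one feature per segment.
    The normalizer M follows the equation as printed.
    """
    f = np.asarray(segment_feats, dtype=np.float64)
    m = f.shape[0]
    if m < 2:
        raise ValueError("need at least two segment features")
    unit = f / (np.linalg.norm(f, axis=1, keepdims=True) + eps)
    sim = unit @ unit.T                   # pairwise cosine similarities
    off_diag = sim.sum() - np.trace(sim)  # drop the i == j terms
    return float(1.0 - off_diag / m)
```

With two segments, identical features give a score of 0 and orthogonal features give 1. For more than two segments the printed normalizer no longer bounds the score to that interval, so a per-pair average is a plausible alternative.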

(iii) Video-level Dynamics Scores. These scores encompass the overall content diversity and the frequency of changes throughout the video. The dynamics scores at video-level are defined by temporal entropy and temporal semantic dynamics.

Temporal entropy. To evaluate the dynamics at the video level, we first measure the temporal information of each video. The temporal information $\mathbf{H}$ is defined as the conditional entropy of the entire video sequence given the first frame, as

$$S_{te}=\mathbf{H}(f_1,f_2,\cdots,f_N \mid f_1). \tag{10}$$

To estimate the conditional entropy $S_{te}$, we employ the video encoding toolbox FFmpeg [14].

Temporal Semantic Dynamics. Beyond low-level dynamics, we further introduce a semantic diversity score to assess high-level dynamics across the whole video. The semantic diversity score $S_{tsd}$ is computed to reflect semantic-level dynamics and is defined as the variance of the DINO [11] features $\{F_i\}_{i=1}^{N}$ of the frames, as

$$S_{tsd}=\frac{1}{N}\sum_{i=1}^{N}\|F_i-\bar{F}\|^2, \tag{11}$$

where $\bar{F}=\frac{1}{N}\sum_{i=1}^{N}F_i$ denotes the mean feature vector of all frames.
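Eq. 11 reduces to a few lines once per-frame features are available. The sketch below assumes the DINO features have been extracted elsewhere and stacked into an `(N, D)` array; the function name is illustrative.

```python
import numpy as np


def temporal_semantic_diversity(frame_feats) -> float:
    """S_tsd: mean squared distance of frame features to their mean (Eq. 11).

    frame_feats: array of shape (N, D), one feature vector per frame.
    """
    f = np.asarray(frame_feats, dtype=np.float64)
    mean = f.mean(axis=0, keepdims=True)  # \bar{F}
    return float((np.linalg.norm(f - mean, axis=1) ** 2).mean())
```

For two frames with features (0, 0) and (2, 0), the mean is (1, 0) and each frame lies at squared distance 1 from it, so the score is 1.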

3.5 Overall Dynamics Score

To establish a reliable and robust assessment, we integrate the dynamics scores into a single score through a human alignment procedure (Fig. 1) that refines the empirically defined dynamics scores. The procedure uses human ratings as ground truth, on which we fit a linear regression model at each temporal granularity, as

$$S_f=\mathbf{Linear}_{\theta_f}(S_{ofs},S_{sd},S_{pd}), \tag{12}$$
$$S_s=\mathbf{Linear}_{\theta_s}(S_{pa},S_{ga}), \tag{13}$$
$$S_v=\mathbf{Linear}_{\theta_v}(S_{te},S_{tsd}), \tag{14}$$

where $\theta_f$, $\theta_s$, and $\theta_v$ respectively denote the model parameters of the linear regression at each granularity. The overall dynamics score of the video is then defined as the average of the aligned dynamics scores from all three levels, as

$$S=\frac{1}{3}(S_f+S_s+S_v). \tag{15}$$

Through this learnable human alignment procedure, the empirically defined dynamics scores are more consistent with human perception, as validated in Sec. 4.1.
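The alignment procedure of Eqs. 12 to 15 can be sketched with ordinary least squares. The snippet assumes one row of raw scores per video and one human rating per video; the inclusion of an intercept term and the function names are assumptions, since the exact regression setup is not specified here.

```python
import numpy as np


def fit_alignment(scores, ratings) -> np.ndarray:
    """Least-squares fit of ratings ~ scores @ w + b.

    scores: (n_videos, n_scores) raw dynamics scores at one granularity.
    ratings: (n_videos,) human ratings. Returns [w_1, ..., w_k, b].
    """
    scores = np.asarray(scores, dtype=np.float64)
    ratings = np.asarray(ratings, dtype=np.float64)
    design = np.column_stack([scores, np.ones(scores.shape[0])])
    theta, *_ = np.linalg.lstsq(design, ratings, rcond=None)
    return theta


def apply_alignment(theta, scores) -> np.ndarray:
    """Aligned score (S_f, S_s, or S_v) for each video."""
    scores = np.asarray(scores, dtype=np.float64)
    design = np.column_stack([scores, np.ones(scores.shape[0])])
    return design @ theta


def overall_dynamics(s_f, s_s, s_v):
    """Eq. 15: average of the three aligned scores."""
    return (s_f + s_s + s_v) / 3.0
```

One model is fitted per granularity, for example on the three inter-frame scores for $S_f$, and the three aligned outputs are then averaged per video.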

Table 4: Human alignment by correlation between dynamics scores and human ratings on the proposed DEVIL benchmark. Video generation is based on text prompts in DEVIL. “PC” denotes Pearson’s correlation, “KC” Kendall’s correlation, and “WR” the win ratio.

| Granularity | Score | PC ↑ | KC ↑ | WR ↑ |
|---|---|---|---|---|
| Inter-frame | $S_{ofs}$ | 93.1 | 89.9 | 79.2 |
| Inter-frame | $S_{sd}$ | 91.7 | 88.0 | 78.1 |
| Inter-frame | $S_{pd}$ | 96.4 | 93.2 | 86.1 |
| Inter-frame | $S_f$ | 96.5 | 93.5 | 86.5 |
| Inter-segment | $S_{pa}$ | 95.1 | 94.3 | 87.0 |
| Inter-segment | $S_{ga}$ | 94.6 | 93.0 | 85.6 |
| Inter-segment | $S_s$ | 95.8 | 94.8 | 87.7 |
| Video level | $S_{te}$ | 96.4 | 93.7 | 83.5 |
| Video level | $S_{tsd}$ | 97.7 | 96.4 | 90.5 |
| Video level | $S_v$ | 98.0 | 97.2 | 91.4 |
| Naturalness | | 79.0 | 75.5 | 52.4 |

4 Experiments

4.1 Human Alignment Assessment

To evaluate the plausibility of the proposed dynamics metrics and the naturalness metric, we conduct the following human alignment experiments.

Ground-truth Annotation. We first generate videos using six state-of-the-art (SOTA) T2V models, namely GEN-2 [2], Pika [4], VideoCrafter2 [13], Open-Sora [22], StreamingT2V [18], and FreeNoise-Lavie [30], with the DEVIL text prompts. For the generated videos, we collect human-evaluated dynamics and naturalness as the ground truth. Six annotators are recruited to assess each video’s grade of dynamics at three temporal levels (frame, segment, and video). For each dynamics metric, annotators rate the grade of dynamics from “static” to “very high dynamics”, as defined in Section 3.3. To guide the annotation process, we provide specific prompts for each temporal level (see Appendix F for details). The evaluation of the naturalness metric follows the same process, where a higher human-assigned grade indicates a greater degree of naturalness.

Evaluation of Scores. We calculate dynamics grades and naturalness for the generated videos on the proposed DEVIL benchmark. The dynamics scores at each temporal level are integrated using the linear regression models of Sec. 3.5 (Eqs. 12–14). Each linear regression model takes the human evaluation results as ground truth, is trained on a randomly selected 75% of the videos, and is tested on the remaining 25%. During testing, human alignment is measured by the agreement between predicted and human-evaluated dynamics grades, e.g., Pearson’s and Kendall’s correlation coefficients and the win ratio. The win ratio compares each video against others with different grades of dynamics. For instance, a video rated as “high dynamics” by evaluators should score lower in dynamics than any video rated as “very high dynamics” but higher than those rated as “static”.
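The agreement measures can be sketched as below. The win ratio here is one plausible reading of the description: among all video pairs whose human grades differ, it is the fraction whose predicted scores are ordered the same way. Function names and the handling of ties are assumptions.

```python
import numpy as np


def pearson(pred, human) -> float:
    """Pearson correlation between predicted scores and human grades."""
    return float(np.corrcoef(pred, human)[0, 1])


def win_ratio(pred, human) -> float:
    """Fraction of differently graded pairs that are ranked correctly.

    Pairs with equal human grades are skipped; a pair with tied
    predictions counts as a loss (an assumption of this sketch).
    """
    wins = total = 0
    n = len(pred)
    for i in range(n):
        for j in range(i + 1, n):
            if human[i] == human[j]:
                continue
            total += 1
            if (pred[i] - pred[j]) * (human[i] - human[j]) > 0:
                wins += 1
    return wins / total if total else float("nan")
```

For human grades `[0, 1, 2, 2]` and predictions `[0.1, 0.5, 0.9, 0.4]`, five pairs have different grades and four of them are ordered correctly, giving a win ratio of 0.8.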

Table 4 shows the assessment results of the six T2V generation models. It can be seen that the dynamics metrics and the naturalness metric exhibit a strong alignment with human evaluation. The improved metrics ($S_f$, $S_s$, and $S_v$, defined in Sec. 3.5) further enhance the alignment with human evaluations.

Table 5: Evaluation of T2V models on dynamics range ($\mathbf{D}_{range}$), dynamics controllability ($\mathbf{D}_{control}$), and dynamics quality ($\mathbf{D}_{quality}$) using our text prompt benchmark. All metrics are normalized with maximum values of 100% and minimum values of 0%; higher scores indicate better performance. Dynamics quality is also assessed at low ($\mathbf{D}_{quality}^{L}$), medium ($\mathbf{D}_{quality}^{M}$), and high ($\mathbf{D}_{quality}^{H}$) levels.

| T2V models | $\mathbf{D}_{range}$ | $\mathbf{D}_{control}$ | $\mathbf{D}_{quality}$ | $\mathbf{D}_{quality}^{L}$ | $\mathbf{D}_{quality}^{M}$ | $\mathbf{D}_{quality}^{H}$ |
|---|---|---|---|---|---|---|
| GEN-2 [2] | 30.8 | 82.5 | 43.6 | 93.4 | 45.4 | 0.0 |
| Pika [4] | 43.2 | 72.0 | 52.1 | 90.0 | 66.4 | 0.0 |
| VideoCrafter2 [13] | 34.1 | 57.0 | 43.6 | 89.1 | 41.7 | 0.0 |
| OpenSora [22] | 61.2 | 62.4 | 63.7 | 84.4 | 84.5 | 22.2 |
| StreamingT2V [18] | 65.9 | 62.8 | 60.8 | 61.1 | 80.8 | 40.6 |
| FreeNoise-Lavie [30] | 66.9 | 58.7 | 66.3 | 65.9 | 87.7 | 45.5 |
| Hotshot-XL [3] | 34.7 | 58.9 | 52.2 | 92.8 | 63.9 | 0.0 |
| Show-1 [47] | 45.1 | 73.9 | 57.7 | 92.6 | 80.3 | 0.0 |
| ModelScope [38] | 52.9 | 63.6 | 62.6 | 91.2 | 79.1 | 17.5 |
| ZeroScope [5] | 26.4 | 66.4 | 44.8 | 90.9 | 43.6 | 0.0 |

4.2 Dynamic-Quality Bi-variate Analysis

To investigate the relationship between video dynamics and quality, we calculated the correlation coefficients between various quality metrics and the overall dynamics score ($S$), as well as the distribution of video quality scores along $S$. As shown in Table 6, Naturalness, Motion Smoothness, Subject Consistency, and Background Consistency all have Pearson correlation coefficients above 50% with $S$, indicating the significant impact of dynamics on these metrics. Fig. 2 shows the distribution of video quantity and quality scores along $S$. Most models, especially high-ranking ones like GEN-2 [2], Pika [4], and VideoCrafter2 [13], generate videos concentrated in low dynamic regions. As dynamics increase, quality metrics significantly decline. This suggests that models can improve benchmark quality scores by generating low-dynamic videos. In conclusion, video dynamics significantly impact quality evaluation, and quality metrics design should account for dynamics.

4.3 Evaluation of Dynamics Metrics

We evaluate the dynamics range $\mathbf{D}_{range}$, dynamics controllability $\mathbf{D}_{control}$, and dynamics quality $\mathbf{D}_{quality}$ of T2V models on our text prompt benchmark. All metrics are normalized with maximum values of 100% and minimum values of 0%. To assess dynamics quality, we consider low, medium, and high dynamics levels, obtaining $\mathbf{D}_{quality}^{L}$, $\mathbf{D}_{quality}^{M}$, and $\mathbf{D}_{quality}^{H}$. The score ranges for these levels are [0, 33.3%], [33.4%, 66.7%], and [66.8%, 100%], respectively, where higher scores indicate better performance. The results are shown in Table 5. In addition to the six annotated models, we also evaluate another five SOTA T2V models to provide a comprehensive comparison of the latest models. It can be observed that GEN-2 [2] and Pika [4] achieve high dynamics controllability scores but low dynamics range scores. This is because these methods generate videos with low dynamics. In contrast, FreeNoise-Lavie [30] and StreamingT2V [18] achieve a high dynamics range but low dynamics controllability scores, indicating that they generate video dynamics misaligned with the text prompts (see Appendix A for details).

Original Quality Metric vs. Dynamics-based Quality Metric. Fig. 5 compares the original quality metric with various dynamics-based quality metrics, including the overall dynamics-based quality metric ($\mathbf{D}_{quality}$) and the metrics at low ($\mathbf{D}_{quality}^{L}$), medium ($\mathbf{D}_{quality}^{M}$), and high ($\mathbf{D}_{quality}^{H}$) dynamics levels. It shows that the original quality metric aligns closely with $\mathbf{D}_{quality}^{L}$, indicating that it primarily reflects quality in low-dynamics scenarios. Moreover, T2V models typically lack the ability to generate high-dynamics videos, resulting in lower scores for $\mathbf{D}_{quality}^{H}$.

Figure 5: Bar chart illustrating the original quality metric, the overall dynamics-based quality metric ($\mathbf{D}_{quality}$), and the dynamics-based quality metrics at low ($\mathbf{D}_{quality}^{L}$), medium ($\mathbf{D}_{quality}^{M}$), and high ($\mathbf{D}_{quality}^{H}$) dynamics levels. (Best viewed in color)
Figure 6: Video quantity density w.r.t. dynamics score of the WebVid-2M dataset.

4.4 Insights from Video Dynamics Analysis

Existing datasets have biased dynamics distribution. The dynamics distribution of video datasets (such as WebVid2M [8]) is biased. The statistics are shown in Fig. 6. It can be seen that most of the videos have a small dynamics score ($\leq 0.4$). The limited number of videos with high dynamics scores restricts a model’s ability to generate dynamics-rich videos, which are common in practical applications. Therefore, existing datasets should be expanded in terms of dynamics, and the proposed metrics can provide guidance for this expansion.

Existing datasets have biased text prompts on dynamics for training. We use the dynamics controllability metric to evaluate two popular datasets, i.e., WebVid2M [8] and MSR-VTT [45], using their ground-truth text prompts and videos. They achieve dynamics controllability scores of only 36.31% and 52.98%, respectively. These low scores indicate that the text prompts of the two datasets cannot provide sufficient guidance on dynamics when training video generation models. To train better video generation models, the text prompts of these datasets need to be elaborated with respect to dynamics.

Existing T2V methods have limited real-world simulation ability. As shown in Fig. 2, we performed a statistical analysis of video quantity distribution, visual quality, motion smoothness, and naturalness metric scores for SOTA methods based on the distribution of the dynamics score. When the dynamics score is small, the videos generated by these SOTA models score highly on the aforementioned four metrics. As the dynamics score increases, these scores (especially naturalness) decrease significantly. This might be because these models primarily focus on optimizing the generation of simple, slow-motion content, while dynamics are ignored by the evaluation metrics. Therefore, T2V models should be optimized over a wide range of dynamics to faithfully simulate the real world.

5 Conclusion

We proposed DEVIL, a comprehensive and constructive evaluation protocol for T2V generation models. In the protocol, we defined a set of dynamics metrics corresponding to multiple temporal granularities, and a new benchmark of text prompts under multiple levels of dynamics. Based on the distribution of dynamics scores over the benchmark, we assessed the generation capacity of T2V models, characterized by the dynamics range and the degree of T2V alignment. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to be a powerful tool for advancing T2V generation models.

Limitations. At present, the grades of dynamics remain coarse and should be refined into more fine-grained grades. Furthermore, only a limited number of T2V models are evaluated using the proposed protocol. A more comprehensive evaluation of T2V models is left to future work.

Social impacts. The positive impact can be that the proposed evaluation protocol may promote the development of T2V models. The negative impact can be a risk that advanced T2V models could be misused to create realistic but misleading video content, such as deepfakes.

References

  • gem [2024] Gemini. https://gemini.google.com/, 2024. Accessed: 2024-05-21.
  • gen [2024] Gen-2. https://research.runwayml.com/gen2, 2024. Accessed: 2024-05-21.
  • hot [2024] Hotshot-xl. https://huggingface.co/hotshotco/Hotshot-XL, 2024. Accessed: 2024-05-21.
  • pik [2024] Pika labs. https://pika.art, 2024. Accessed: 2024-05-21.
  • zer [2024] Zeroscope. https://huggingface.co/cerspense/zeroscope_v2_576w, 2024. Accessed: 2024-05-21.
  • Anne Hendricks et al. [2017] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pages 5803–5812, 2017.
  • Bain et al. [2021a] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021a.
  • Bain et al. [2021b] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021b.
  • Blattmann et al. [2023] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
  • Box et al. [2015] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
  • Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In IEEE ICCV, pages 9630–9640, 2021.
  • Chen et al. [2023] Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.
  • Chen et al. [2024] Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024.
  • Developers [2024] FFmpeg Developers. Ffmpeg: A complete, cross-platform solution to record, convert and stream audio and video, 2024. URL https://ffmpeg.org/. Accessed: 2024-05-21.
  • Girdhar et al. [2023] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
  • Hamming [1950] Richard W Hamming. Error detecting and error correcting codes. The Bell system technical journal, 29(2):147–160, 1950.
  • Hendricks et al. [2018] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with temporal language. In Empirical Methods in Natural Language Processing (EMNLP), 2018.
  • Henschel et al. [2024] Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. arXiv preprint arXiv:2403.14773, 2024.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • Ho et al. [2022a] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
  • Ho et al. [2022b] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022b.
  • HPC-AI Technology Inc. [2023] HPC-AI Technology Inc. Open-sora: Democratizing efficient video production for all. https://github.com/hpcaitech/Open-Sora, 2023.
  • Huang et al. [2023] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. arXiv preprint arXiv:2311.17982, 2023.
  • Li et al. [2021] Yuntao Li, Bei Chen, Qian Liu, Yan Gao, Jian-Guang Lou, Yan Zhang, and Dongmei Zhang. Keep the structure: A latent shift-reduce parser for semantic parsing. In IJCAI, pages 3864–3870, 2021.
  • Lin et al. [2024] Han Lin, Jaemin Cho, Abhay Zala, and Mohit Bansal. Ctrl-adapter: An efficient and versatile framework for adapting diverse controls to any diffusion model. arXiv preprint arXiv:2404.09967, 2024.
  • Liu et al. [2023] Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. arXiv preprint arXiv:2310.11440, 2023.
  • Liu et al. [2024] Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. Advances in Neural Information Processing Systems, 36, 2024.
  • Mei and Patel [2023] Kangfu Mei and Vishal Patel. Vidm: Video implicit diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9117–9125, 2023.
  • OpenAI [2023] OpenAI. Chatgpt: A large language model. https://www.openai.com/chatgpt, 2023. Accessed: 2024-05-21.
  • Qiu et al. [2023] Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling. arXiv preprint arXiv:2310.15169, 2023.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
  • Su et al. [2009] Danying Su, Zhiqiang Su, Jiaye Wang, Shanshan Yang, and Jing Ma. Ucf-101, a novel omi/htra2 inhibitor, protects against cerebral ischemia/reperfusion injury in rats. The Anatomical Record: Advances in Integrative Anatomy and Evolutionary Biology, 292(6):854–861, 2009.
  • Unterthiner et al. [2019] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. 2019.
  • Venkatesan et al. [2000] Ramarathnam Venkatesan, S-M Koon, Mariusz H Jakubowski, and Pierre Moulin. Robust image hashing. In Proceedings 2000 International Conference on Image Processing (Cat. No. 00CH37101), volume 3, pages 664–666. IEEE, 2000.
  • Wang et al. [2023a] Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising. arXiv preprint arXiv:2305.18264, 2023a.
  • Wang et al. [2023b] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023b.
  • Wang and Yang [2024] Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. arXiv preprint arXiv:2403.06098, 2024.
  • Wang et al. [2023c] Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, and Jiaying Liu. Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation. arXiv preprint arXiv:2305.10874, 2023c.
  • Wang et al. [2023d] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023d.
  • Wang et al. [2023e] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023e.
  • Wang et al. [2004] Zhou Wang, Alan Conrad Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13:600–612, 2004. URL https://api.semanticscholar.org/CorpusID:207761262.
  • Wu et al. [2023] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In International Conference on Computer Vision (ICCV), 2023.
  • Xu et al. [2016] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016.
  • Yu et al. [2023] Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18456–18466, 2023.
  • Zhang et al. [2023] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.
  • Zhang et al. [2024] Tianjun Zhang, Shishir G Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, and Joseph E Gonzalez. Raft: Adapting language model to domain specific rag. arXiv preprint arXiv:2403.10131, 2024.

Appendix

Appendix A Dynamics Scores

Figure 7: Evaluation of state-of-the-art models using the dynamics scores proposed in the main text.

Figure 7 presents detailed results of the T2V models on the dynamics scores proposed in the main text. ModelScope [38] excels at generating rapid inter-frame motion, while StreamingT2V [18] performs well across most dynamics scores. StreamingT2V also scores highly on the inter-segment dynamics scores at video levels, which indicates a significant advantage in generating complex dynamic content. In contrast, GEN-2 [2] and VideoCrafter2 [13] perform poorly on several dynamics scores, highlighting their deficiencies in dynamics.

Appendix B Correlation Between Existing Metrics and Dynamics

In Section 3.2, we introduce a bivariate analysis to measure how existing metrics relate to the dynamics metrics. Here we report the detailed per-model results of that analysis. Table 6 lists the Pearson correlation coefficients between the dynamics scores and the existing metrics: aesthetic score, technical score, visual quality, motion smoothness, subject consistency, background consistency, and naturalness.

The results indicate a clear trade-off between video dynamics and various existing metrics in T2V models. As dynamic complexity increases, there tends to be a decline in motion smoothness, subject consistency, background consistency, and naturalness. The aesthetic, technical, and visual quality metrics show relatively low correlation, which can be attributed to the fact that these metrics evaluate video frames independently, ignoring temporal relationships between frames.

Table 6: Pearson correlation coefficients between the dynamics metrics and the existing metrics, including aesthetic score [44], technical score [44], visual quality [44], motion smoothness [23], subject consistency [23], background consistency [23], and our naturalness.

| Model | Aesthetic Score | Technical Score | Visual Quality | Motion Smoothness | Subject Consistency | Background Consistency | Naturalness |
|---|---|---|---|---|---|---|---|
| GEN-2 [2] | -0.19 | -0.09 | -0.12 | -0.54 | -0.88 | -0.73 | -0.50 |
| Pika [4] | -0.40 | -0.20 | -0.28 | -0.65 | -0.88 | -0.78 | -0.47 |
| VideoCrafter2 [13] | -0.25 | -0.20 | -0.24 | -0.59 | -0.87 | -0.76 | -0.36 |
| OpenSora [22] | -0.20 | -0.27 | -0.26 | -0.70 | -0.90 | -0.83 | -0.43 |
| StreamingT2V [18] | -0.15 | -0.21 | -0.23 | -0.57 | -0.89 | -0.81 | -0.36 |
| FreeNoise-Lavie [30] | -0.37 | -0.31 | -0.35 | -0.75 | -0.91 | -0.86 | -0.48 |
| Average | -0.26 | -0.21 | -0.25 | -0.63 | -0.81 | -0.79 | -0.43 |
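As a rough illustration of how one entry of Table 6 could be obtained, the sketch below computes a Pearson correlation coefficient between per-video dynamics scores and per-video values of an existing metric. The function name `pearson` and the sample values are assumptions made for this example; they are not taken from the DEVIL code base or from the reported data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Unnormalized covariance and variances; the normalizers cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-video values, invented for illustration only.
dynamics = [0.1, 0.3, 0.5, 0.7, 0.9]
subject_consistency = [0.98, 0.95, 0.90, 0.80, 0.72]

r = pearson(dynamics, subject_consistency)
print(f"Pearson r = {r:.2f}")
```

A strongly negative value, as in this toy data, would correspond to the trade-off described above: videos with higher dynamics tend to receive lower consistency scores.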

Appendix C Detail of Dynamics-based Quality

Let $S^{(i)}$ denote a score of generated video $i$. Existing metrics simply average the scores of all videos to obtain the metric score $S$ of the T2V model:

$$S = \frac{1}{|T|} \sum_{i=1}^{|T|} S^{(i)}, \tag{16}$$

where $|T|$ is the total number of generated videos. Considering that some existing metrics show a considerable negative correlation with the video's dynamics score, they fail to prevent models from generating low-dynamic videos.

To address this issue, we enhance existing metrics by integrating human-aligned dynamics scores, preventing models from attaining high scores by producing low-dynamic videos. Specifically, we first equally divide the human-aligned dynamics score into $L = 12$ intervals. We then calculate the mean scores $S_l$ at each interval $l$. The improved metric $S^{*}$ is defined as the average of $S_l$ across all intervals:

$$S^{*} = \frac{1}{L} \sum_{l=1}^{L} S_l. \tag{17}$$
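The difference between Eq. (16) and Eq. (17) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the released implementation: the dynamics score is assumed to be normalized to [0, 1], an interval that contains no videos is assumed to contribute a mean of 0 (which matches the 0.0 entries in the High columns of Table 7 but is not stated explicitly in the text), and the function names are hypothetical.

```python
def plain_average(scores):
    """Eq. (16): mean of the per-video scores."""
    return sum(scores) / len(scores)


def dynamics_based_quality(scores, dynamics, num_intervals=12, lo=0.0, hi=1.0):
    """Eq. (17): mean of the per-interval mean scores.

    Assumptions (not from the paper): dynamics scores lie in [lo, hi],
    and an interval without any video contributes a mean of 0.
    """
    if len(scores) != len(dynamics):
        raise ValueError("scores and dynamics must have the same length")
    sums = [0.0] * num_intervals
    counts = [0] * num_intervals
    width = (hi - lo) / num_intervals
    for score, dyn in zip(scores, dynamics):
        # Map the dynamics score to an interval index, clamping the ends.
        idx = int((dyn - lo) / width)
        idx = min(max(idx, 0), num_intervals - 1)
        sums[idx] += score
        counts[idx] += 1
    interval_means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return sum(interval_means) / num_intervals


# Hypothetical model A: high per-video scores, but only near-static videos.
scores_a = [0.9] * 6
dynamics_a = [0.01, 0.02, 0.03, 0.03, 0.04, 0.05]

# Hypothetical model B: lower per-video scores, one video per interval.
scores_b = [0.7] * 12
dynamics_b = [(l + 0.5) / 12 for l in range(12)]

print(plain_average(scores_a), dynamics_based_quality(scores_a, dynamics_a))
print(plain_average(scores_b), dynamics_based_quality(scores_b, dynamics_b))
```

Under these assumptions, model A wins on the plain average (0.9 against 0.7) but falls to 0.9 / 12 = 0.075 on the interval-averaged metric, while model B keeps roughly 0.7 because it covers every interval.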

Table 7 presents the scores of various models across four quality metrics: Motion Smoothness, Naturalness, Subject Consistency, and Background Consistency. FreeNoise and StreamingT2V achieve high overall scores due to their strong performance across a wide dynamic range. In contrast, GEN-2 and Pika excel in the low dynamic range, but their inability to generate high-dynamic videos results in lower overall scores.

Table 7: Integrating dynamics scores with quality metrics, including Motion Smoothness, Naturalness, Subject Consistency, and Background Consistency. The table details scores across multiple models, with metrics divided into Overall, Low, Mid, and High categories based on modified dynamic intervals to achieve a comprehensive evaluation.

| T2V Model | Motion Smoothness (Overall) | Motion Smoothness (Low) | Motion Smoothness (Mid) | Motion Smoothness (High) | Naturalness (Overall) | Naturalness (Low) | Naturalness (Mid) | Naturalness (High) |
|---|---|---|---|---|---|---|---|---|
| FreeNoise [30] | 71.7 | 71.7 | 95.4 | 47.9 | 57.1 | 54.8 | 73.9 | 42.5 |
| GEN-2 [2] | 49.7 | 99.5 | 49.7 | 0.0 | 39.1 | 81.6 | 35.6 | 0.0 |
| OpenSora [22] | 71.5 | 95.5 | 95.3 | 23.7 | 49.8 | 62.8 | 64.2 | 22.5 |
| Pika [4] | 58.0 | 99.5 | 74.5 | 0.0 | 39.8 | 69.4 | 50.1 | 0.0 |
| StreamingT2V [18] | 71.2 | 70.9 | 95.0 | 47.8 | 42.2 | 44.2 | 55.4 | 27.0 |
| VideoCrafter2 [13] | 48.9 | 97.8 | 48.8 | 0.0 | 31.4 | 70.1 | 24.2 | 0.0 |
| HotShot-XL [3] | 47.0 | 83.7 | 57.3 | 0.0 | 54.4 | 95.7 | 67.4 | 0.0 |
| ModelScope [38] | 50.4 | 77.1 | 61.6 | 12.5 | 67.9 | 95.7 | 87.6 | 20.4 |
| Show-1 [47] | 47.4 | 81.6 | 60.6 | 0.0 | 62.0 | 95.5 | 90.5 | 0.0 |
| ZeroScope [5] | 38.3 | 75.0 | 40.0 | 0.0 | 46.5 | 95.3 | 44.2 | 0.0 |

| T2V Model | Subject Consistency (Overall) | Subject Consistency (Low) | Subject Consistency (Mid) | Subject Consistency (High) | Background Consistency (Overall) | Background Consistency (Low) | Background Consistency (Mid) | Background Consistency (High) |
|---|---|---|---|---|---|---|---|---|
| FreeNoise [30] | 66.0 | 66.3 | 87.3 | 44.3 | 70.6 | 70.7 | 94.0 | 45.5 |
| GEN-2 [2] | 47.7 | 95.2 | 47.8 | 0.0 | 48.6 | 97.3 | 48.5 | 0.0 |
| OpenSora [22] | 62.7 | 84.5 | 84.2 | 19.3 | 70.7 | 94.6 | 94.4 | 22.2 |
| Pika [4] | 54.6 | 94.6 | 69.1 | 0.0 | 56.1 | 96.5 | 71.8 | 0.0 |
| StreamingT2V [18] | 61.2 | 60.8 | 81.6 | 41.3 | 68.6 | 68.3 | 91.2 | 40.6 |
| VideoCrafter2 [13] | 45.5 | 90.8 | 45.6 | 0.0 | 48.7 | 97.5 | 48.4 | 0.0 |
| HotShot-XL [3] | 55.6 | 97.1 | 69.7 | 0.0 | 52.0 | 94.8 | 61.1 | 0.0 |
| ModelScope [38] | 70.1 | 97.1 | 91.6 | 21.6 | 62.0 | 94.8 | 75.5 | 17.5 |
| Show-1 [47] | 62.8 | 98.2 | 90.1 | 0.0 | 58.5 | 95.3 | 80.2 | 0.0 |
| ZeroScope [5] | 49.0 | 98.4 | 48.5 | 0.0 | 45.5 | 94.8 | 41.7 | 0.0 |

Appendix D Assigning Dynamics Grades to Text Prompts

As described in Section 3.3, we collect approximately 50,000 text prompts from existing benchmarks, covering 19 object categories and 4 scene categories. Using GPT-4 coarse classification and human refinement, we construct the DEVIL prompt benchmark. The process of categorizing dynamics grades using GPT-4 is illustrated in Figure 8. Specifically, we instruct GPT-4 to classify each prompt by the rate of content change. To enhance GPT-4's classification accuracy, we further provide detailed criteria and examples for each dynamics grade. In the post-processing step, we recruit six human annotators to refine the dynamics grades over three months. Finally, we sample about 800 text prompts at different dynamics grades to ensure a uniform distribution across the grades.
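The final sampling step can be sketched as a stratified draw that takes the same number of prompts from each dynamics grade. This is only an illustration of the idea: the function name, the grade labels, the toy prompt pool, and the fixed seed are all assumptions for this example, and the actual selection procedure may differ.

```python
import random
from collections import defaultdict


def sample_uniform_by_grade(prompts, per_grade, seed=0):
    """Draw up to `per_grade` prompts from every dynamics grade.

    `prompts` is an iterable of (text, grade) pairs. A grade with fewer
    than `per_grade` prompts contributes all of its prompts.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_grade = defaultdict(list)
    for text, grade in prompts:
        by_grade[grade].append(text)
    sampled = {}
    for grade in sorted(by_grade):
        texts = by_grade[grade]
        sampled[grade] = rng.sample(texts, min(per_grade, len(texts)))
    return sampled


# Hypothetical, deliberately skewed pool: many low-dynamics prompts, few high.
pool = (
    [(f"static scene {i}", 1) for i in range(50)]
    + [(f"moderate motion {i}", 3) for i in range(20)]
    + [(f"rapid action {i}", 5) for i in range(8)]
)

subset = sample_uniform_by_grade(pool, per_grade=5)
for grade, texts in subset.items():
    print(grade, len(texts))
```

With a target of about 800 prompts, `per_grade` would be set to roughly 800 divided by the number of grades, so that the skew of the original pool does not carry over to the benchmark.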

Appendix E Details of Naturalness

We employ the advanced multi-modal large model Gemini-1.5 Pro [1], which is equipped with video understanding capabilities, to assess and classify the naturalness of video content. Figure 9 shows the process through which the model analyzes videos and assigns naturalness ratings. The figure details the five levels used to evaluate video naturalness, ranging from "Completely Fantastical" to "Almost Realistic". Each level is defined based on how closely the video content aligns with the real world. The figure also includes two examples of video evaluations: the first video is rated as "Almost Realistic" due to its high conformity with reality, while the second video is rated as "Slightly Unrealistic" due to minor distortions, such as the unrealistic number of legs on a dog. These examples validate the plausibility of the proposed naturalness metric.

Appendix F Human Annotation

To align human evaluations with automated metrics, we annotated a series of videos generated by SOTA T2V models. We first generated videos using prompts from the DEVIL benchmark with six advanced T2V models: GEN-2, Pika, VideoCrafter2, OpenSora, StreamingT2V, and FreeNoise-Lavie. We then developed a video annotation toolbox for evaluating the dynamics and naturalness of videos. As shown in Figure 10, the toolbox allows annotators to assess the dynamics of the videos across five grades, from almost static to very high dynamics, and the naturalness from almost real to completely unreal. To guarantee high-quality and consistent evaluations, we recruited six annotators with undergraduate degrees and provided them with detailed training.

Figure 8: Illustration of prompt coarse categorization using GPT-4 [29].
Figure 9: Illustration of naturalness calculation for generated videos using Gemini-1.5 Pro [1].
Figure 10: Toolbox for dynamics and naturalness annotation.

Appendix G Visual Comparison

In Section 3, we use text prompts with different dynamics grades to generate videos with T2V models. Here, we provide visual results of the generated videos.

[Five uncaptioned images of generated videos]