Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

Mingxiang Liao¹ Hannan Lu²^∗ Xinyu Zhang^3,4^∗ Fang Wan¹ Tianyu Wang¹
Yuzhong Zhao¹ Wangmeng Zuo² Qixiang Ye¹^† **gdong Wang⁴ ¹University of Chinese Academy of Sciences ²Harbin Institute of Technology
³The University of Adelaide ⁴Baidu Inc
∗Equal contribution. †Corresponding author.

Abstract

Comprehensive and constructive evaluation protocols play an important role in the development of sophisticated text-to-video (T2V) generation models. Existing evaluation protocols primarily focus on temporal consistency and content continuity, yet largely ignore the dynamics of video content. Dynamics are an essential dimension for measuring the visual vividness of video content and its honesty to text prompts. In this study, we propose an effective evaluation protocol, termed DEVIL, which centers on the dynamics dimension to evaluate T2V models. For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video. Based on the new benchmark and the dynamics scores, we assess T2V models with three metrics: dynamics range, dynamics controllability, and dynamics-based quality. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential to advance T2V generation models. Code is available at github.com/MingXiangL/DEVIL.

1 Introduction

With the rapid progress of video generation technology, the demand for comprehensive evaluation of model performance continues to grow. Recent benchmarks [26, 23] have included various metrics, $e.g.$, generation quality, video-text alignment degree, and video content continuity, to evaluate text-to-video (T2V) generation models. Despite these efforts, a fundamental characteristic of video, namely its dynamics, remains overlooked.

Dynamics refers to the degree of visual change and interaction in the content of videos over time, encompassing object motion, action diversity, scene transitions, $etc.$ It is a crucial index for evaluating video generation models for the following two reasons: $(i)$ Dynamics of generated video content should be honest to text prompts in practical applications. For example, it is expected that dramatic text prompts result in videos with high dynamics. $(ii)$ Generated videos usually show negative correlations between dynamics and quality scores [23, 26], $i.e.,$ videos with higher dynamics tend to receive lower quality scores. This allows T2V models to “cheat” to achieve high-quality scores by generating low-dynamic video content in many cases.

Figure 1: Flowchart to calculate dynamics metrics based on dynamics scores and text prompts.

To fully reveal the dynamics of generated videos, in this paper, we introduce a new evaluation protocol, named DEVIL. DEVIL treats dynamics as the primary dimension for evaluating the performance of T2V models. Here, we consider three types of metrics to represent dynamics: (i) Dynamics Range, which measures the extent of variations in video content that the model can generate; (ii) Dynamics Controllability, which assesses the model’s ability to manipulate video dynamics in response to text prompts; and (iii) Dynamics-based Quality, which evaluates the visual quality of videos with varying dynamics generated by the model.

To produce the evaluation, we first establish a benchmark comprising text prompts categorized by multiple dynamics grades. These text prompts are collected from commonly used datasets [7, 6, 45, 39] and categorized according to their dynamics using a Large Language Model (LLM), GPT-4 [29], followed by further manual refinement. Based on the constructed text prompt benchmark, we calculate an overall dynamic score for each generated video, which is defined as a weighted sum of a series of dynamics scores at different temporal granularities.

The prompt benchmark and the overall dynamics scores of all generated videos are then utilized to evaluate T2V models with three dynamics metrics. This evaluation goes beyond simply maximizing dynamics scores for each video; it emphasizes the model’s ability to produce high-quality videos across various dynamics following the instructions from text prompts. ( $i$ ) Dynamics Range is calculated as the range of dynamics scores for all generated videos, indicating the ability of T2V models to generate videos with both subtle and dramatic temporal variations. ( $ii$ ) For Dynamics Controllability, we adopt a ranking consistency-based methodology to check whether the dynamics scores of generated videos align with the dynamics of their corresponding text prompts. ( $iii$ ) Dynamics-based Quality is defined by integrating several quality metrics with dynamics scores. It avoids biases caused by negative correlations between video dynamics and video quality [23, 26], resulting in a more comprehensive evaluation of video quality. Finally, noting that video naturalness decreases with increasing dynamics, we also propose a naturalness metric based on a multimodal large language model, $i.e.$ Gemini-1.5 Pro [1].

Using DEVIL, we evaluate and revisit state-of-the-art T2V models and find: (i) Existing datasets have a biased dynamics distribution, so current generation models (especially top-ranking models like GEN-2 [2]) typically generate slow-motion videos to obtain high quality scores. (ii) Existing training datasets contain text prompts that are biased with respect to dynamics. Training on these prompts inevitably limits the dynamics controllability of T2V models. (iii) Statistical analyses of the dynamics scores, especially the naturalness metric, show that existing methods have limited real-world simulation ability. Based on these findings, we believe that more carefully constructed training data and better methods will improve T2V performance on both quality and dynamics scores.

In summary, our contributions are:

1. We propose a novel evaluation protocol, termed DEVIL, which benchmarks T2V generation models by integrating dynamics metrics. Together with existing evaluation metrics, DEVIL builds a more comprehensive evaluation protocol.

2. We establish a new text prompt benchmark $w.r.t.$ dynamics grades as well as a set of metrics to evaluate video dynamics across temporal granularities, facilitating the assessment of dynamics range, dynamics controllability, and dynamics-based quality.

3. Extensive evaluation of existing T2V generation models allows us to thoroughly analyze the capabilities of T2V models through the proposed protocol and benchmarks. The results would inspire sophisticated T2V generation methods.

2 Related Work

2.1 Text-to-Video Generation Model

As a recent breakthrough in artificial intelligence, diffusion models have pushed video generation technology to a new height. Earlier studies [21, 20] explored the 3D U-Net and cascaded models for diffusion within pixel space. Recent solutions [12, 32] employed latent diffusion models to efficiently manage the diffusion process within a compressed latent space. Following these studies, a variety of approaches [38, 9, 25, 41, 15, 40, 46, 28, 24] updated and improved this paradigm. Building on these advancements, subsequent methods further explored generating videos of higher quality and extended duration. The Videocrafter approach [13] pursued high-quality video generation through disentangling spatial and temporal learning and tuning spatial modules using high-quality images. In a similar way, commercial models such as Pika [4] and GEN-2 [2] demonstrated substantial improvements, showcasing videos with exceptional visual clarity. For longer video generation, Gen-L-Video [37] aggregated short clips generated by base T2V models using temporal co-denoising to enhance continuity. Freenoise [30] extended pre-trained T2V models through rescheduling noise for longer-duration video inference. StreamingT2V [18] enhanced long-term content consistency by integrating short-term and long-term memory blocks.

The rapid development of T2V models poses a growing demand for quality evaluation protocols. Unfortunately, existing protocols primarily focus on temporal consistency and content continuity, yet largely ignore temporal dynamics. This hinders the exploitation of video content vividness and the honesty of video content to text prompts.

2.2 Evaluation Protocol

Early evaluation protocols [34] primarily relied on class labels to evaluate the performance of T2V generation models. For example, they commonly used video clips from the UCF-101 dataset and human-annotated video captions from the MSR-VTT [45] dataset as the evaluation data. For a more specific assessment, FETV [27] assigned fine-grained category labels to prompts and calculated the CLIP-SIM score for each category.

However, conventional quality assessment metrics such as Inception Score (IS) [33], Fréchet Inception Distance (FID) [19], Fréchet Video Distance (FVD) [35], and CLIP-SIM typically operate on a single dimension and cannot provide a comprehensive evaluation. To address this limitation, EvalCrafter [31] expanded both the prompt scale and the number of evaluation metrics so that the text-video alignment degree and the quality of generated videos can be better evaluated. Additionally, VBench [23] proposed a multi-dimensional, multi-category evaluation suite that not only considered the diversity of prompts but also encompassed a variety of assessment metrics.

Despite the evolution of evaluation metrics, we argue that an essential characteristic of video, $i.e.$, dynamics, remains ignored. In this study, we introduce the dynamics dimension to evaluate T2V generation models and to enhance the completeness of existing metrics.

3 Dynamics Evaluation Protocol

Table 1: Symbol Definitions.

Symbol	Definition
$\mathbf{D}_{<name>}$	< $name$ >-type dynamic metric of T2V models.
$\mathcal{T}$	Text prompt benchmark of our DEVIL.
$M$	The number of text prompts in $\mathcal{T}$ .
$T^{i}$	$i$ -th text prompt in $\mathcal{T}$ , where $i\in\{1,\cdots,M\}$ .
$G^{i}$	Dynamic grade of text prompt $T^{i}$ .
$S^{i}$	Dynamics score of video generated by $T^{i}$ .
$f_{j}$	The $j$ -th video frame.
$F_{j}$	Feature of $j$ -th video frame.
$N$	The number of video frames.

In this section, we first provide an overview of the proposed DEVIL protocol in Section 3.1 and then introduce the dynamics metrics proposed within DEVIL in Section 3.2. Finally, we detail the prompt benchmark (Section 3.3) and the dynamics scores (Sections 3.4 and 3.5) constructed to evaluate the dynamics metrics of T2V generation models.

3.1 Overview

Fig. 1 shows the evaluation workflow of the DEVIL protocol. We aim to calculate the three dynamics metrics, dynamics range ( $\mathbf{D}_{range}$ ), dynamics controllability ( $\mathbf{D}_{control}$ ), and dynamics-based quality ( $\mathbf{D}_{quality}$ ) for each T2V model. To achieve this, we establish a text prompt benchmark $\mathcal{T}=\{(T^{i},G^{i})\}_{i=1}^{M}$ , where each prompt $T^{i}$ has a dynamic grade $G^{i}$ , classified by GPT-4 [29], followed by further manual refinement. $M$ is the number of prompts; our benchmark contains around 800 text prompts. Subsequently, we generate videos using $\mathcal{T}$ , and assess the dynamics of each generated video using an overall dynamics score $S$ . To calculate $S$ , we define a series of dynamics scores at different temporal granularities, including the inter-frame, inter-segment, and video levels, to reveal the video characteristics at multiple temporal levels, as shown in Table 3. These scores are combined to obtain $S$ using weights derived from fitting human ratings. Finally, the dynamics scores of all generated videos are used to calculate the three dynamics metrics, which represent the overall performance of T2V models. For convenience, the symbol definitions are provided in Table 1.

3.2 Dynamics Metrics

We introduce three key metrics, dynamics range ( $\mathbf{D}_{range}$ ), dynamics controllability ( $\mathbf{D}_{control}$ ), and dynamics-based quality ( $\mathbf{D}_{quality}$ ), to evaluate T2V models from the perspective of dynamics. Each of these metrics is computed over the whole benchmark (described in Section 3.3) from the per-video dynamics scores (detailed in Sections 3.4 and 3.5).

(i) Dynamics Range demonstrates the model’s versatility in handling both subtle and dramatic changes. An ideal T2V generation model is expected to display a large dynamics range, reflecting various temporal variations described in text prompts.

In detail, we determine the dynamics range metric $\mathbf{D}_{range}$ by identifying the extremes of the dynamic scores over the benchmark, while excluding the top and bottom 1% scores to mitigate the influence of outliers. This is formulated as

\mathbf{D}_{range}=\mathbf{Q}_{0.99}-\mathbf{Q}_{0.01},

(1)

where $\mathbf{Q}_{0.99}$ and $\mathbf{Q}_{0.01}$ denote the $99$ -th and $1$ -st percentile values of the dynamics scores for videos generated with our proposed text prompt benchmark, respectively. This metric reflects a realistic spread of dynamics, excluding atypical extremes.
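As a minimal sketch of Eq. 1, the snippet below computes the percentile range of a list of per-video dynamics scores. The linear-interpolation percentile and the function names are assumptions made for illustration; the text does not specify how the percentiles are estimated.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0 <= q <= 100) with linear interpolation.

    The interpolation rule is an assumption; the paper only states that the
    1st and 99th percentiles are used.
    """
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def dynamics_range(scores):
    """D_range = Q_0.99 - Q_0.01 over the per-video dynamics scores (Eq. 1)."""
    return percentile(scores, 99) - percentile(scores, 1)


# Toy scores 0.00, 0.01, ..., 1.00 (not real data).
scores = [i / 100.0 for i in range(101)]
d_range = dynamics_range(scores)
```

Because the extremes are trimmed, a single outlier video moves the result far less than it would move a plain max-minus-min range.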

(ii) Dynamics Controllability assesses the ability of T2V models to manipulate video dynamics with text prompts. Objectively, it is challenging to obtain an exact correspondence between text prompts and videos. Therefore, we adopt a ranking consistency-based methodology to derive a Dynamics Controllability metric $\mathbf{D}_{control}$ .

Specifically, for two text prompts $T^{i}$ and $T^{j}$ in benchmark $\mathcal{T}=\{(T^{i},G^{i})\}$ , their corresponding generated videos have dynamics scores $S^{i}$ and $S^{j}$ (the dynamics scores are detailed in Section 3.5). Provided that the dynamics grades are ranked $G^{i}>G^{j}$ , the dynamics scores should consequently be consistently ranked $S^{i}>S^{j}$ . Accordingly, we calculate $\mathbf{D}_{control}$ as follows:

\mathbf{D}_{control}=\frac{1}{M}\sum_{i=1}^{M}\frac{1}{M-M^{i}}\sum_{j:G^{j}\neq G^{i}}\mathbb{I}\big((S^{i}-S^{j})(G^{i}-G^{j})\big),

(2)

Table 2: Correlation between the overall dynamic score and the existing quality metrics, including Naturalness (Nat), Visual Quality [44] (VQ), Motion Smoothness (MS) [23], Subject Consistency (SC) [23] and Background Consistency (BC) [23]. “PC” denotes Pearson’s correlation, and “KC” denotes Kendall’s correlation.

Evaluation Metrics	PC	KC
Naturalness (Nat)	-51.8	-44.2
Visual Quality (VQ)	-24.8	-18.6
Motion Smoothness (MS)	-64.0	-54.6
Subject Consistency (SC)	-88.9	-74.9
Background Consistency (BC)	-79.4	-61.4

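where $M$ is the number of all text prompts and $M^{i}$ denotes the number of prompts whose dynamics grade equals $G^{i}$ , so that $M-M^{i}$ counts the prompts compared against $T^{i}$ . $\mathbb{I}(\cdot)$ denotes the indicator function, which we take to equal 1 when its argument is positive, $i.e.$, when the score ranking agrees with the grade ranking, and 0 otherwise.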
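The ranking-consistency computation of Eq. 2 can be sketched as follows. This is an illustrative reading rather than the reference implementation: it assumes the indicator is 1 only for a strictly positive product (so tied scores count as disagreement), and it skips a prompt when no other prompt has a different grade.

```python
def dynamics_controllability(scores, grades):
    """Ranking-consistency score following Eq. 2.

    scores[i] is the dynamics score S^i of the video generated from prompt i,
    grades[i] is the dynamics grade G^i assigned to that prompt.
    """
    if len(scores) != len(grades):
        raise ValueError("scores and grades must have the same length")
    m = len(scores)
    total = 0.0
    for i in range(m):
        # Prompts with a different grade; there are M - M^i of them.
        others = [j for j in range(m) if grades[j] != grades[i]]
        if not others:
            continue  # assumption: nothing to compare against, contributes 0
        agree = sum(
            1
            for j in others
            if (scores[i] - scores[j]) * (grades[i] - grades[j]) > 0
        )
        total += agree / len(others)
    return total / m


# Toy example: one pair of videos is ranked against the prompt grades.
toy_grades = [1, 1, 2, 2]
toy_scores = [0.1, 0.6, 0.5, 0.9]
d_control = dynamics_controllability(toy_scores, toy_grades)
```

In the toy example the second video (grade 1) scores higher than the third (grade 2), so two of the four prompts reach only half agreement and the overall value is 0.75.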

(iii) Dynamics-based Quality. Existing evaluations of generated visual quality do not account for the dynamics of the videos. Previous studies [23, 26] have shown that videos with higher dynamics tend to receive lower quality scores. In Table 2, we calculate the correlation between the overall dynamics score of each generated video (as detailed in Section 3.5) and its quality metrics. In detail, quality metrics such as Naturalness (Nat, elaborated later in this section), Motion Smoothness (MS) [23], Subject Consistency (SC) [23], and Background Consistency (BC) [23] exhibit a strong negative correlation with dynamics. This indicates that T2V models can “cheat” to achieve higher scores on these metrics by generating low-dynamics videos for most text prompts, as shown in Fig. 2.

To address this, we propose the Dynamics-based Quality metric $\mathbf{D}_{quality}$ , which assesses generated visual quality while taking dynamics into account. For each video, we synthesize a composite quality score by averaging the scores of the identified quality metrics correlated with dynamics ( $i.e.$ , Nat, MS, SC, and BC). We then divide the entire range of the dynamics score into $L=12$ equal intervals and assign videos to their corresponding intervals based on their dynamics scores. Within each interval $l$ , we calculate the average of the composite quality scores, denoted as $C_{l}$ . Ultimately, the dynamics-based quality is defined as the overall average of these interval averages:

\mathbf{D}_{quality}=\frac{1}{L}\sum_{l=1}^{L}C_{l}

(3)

Besides dynamics-based quality over the entire range of dynamics scores, we also evaluate dynamics-based quality at high, medium, and low dynamics levels by restricting the range of intervals, for a more comprehensive evaluation (refer to Section 4.3). To obtain a high dynamics-based quality score, the generated videos must be spread over all dynamics intervals, which implies a large dynamics range. Additionally, for detailed results that integrate the dynamics score with individual metrics, please refer to Appendix C.
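The interval averaging of Eq. 3 can be illustrated with the sketch below. Two points are assumptions: dynamics scores are taken to be normalized to a known range (here 0 to 1), and an interval containing no videos contributes $C_{l}=0$, which is one reading consistent with the remark that videos must cover all intervals to score highly.

```python
def dynamics_based_quality(dyn_scores, quality_scores, num_bins=12, lo=0.0, hi=1.0):
    """Average of the per-interval mean composite quality scores (Eq. 3).

    dyn_scores[i]     overall dynamics score of video i, assumed in [lo, hi]
    quality_scores[i] composite quality score of video i
    """
    if len(dyn_scores) != len(quality_scores):
        raise ValueError("inputs must have the same length")
    width = (hi - lo) / num_bins
    bins = [[] for _ in range(num_bins)]
    for s, q in zip(dyn_scores, quality_scores):
        if s < lo or s > hi:
            continue  # outside the evaluated dynamics range
        idx = min(int((s - lo) / width), num_bins - 1)  # s == hi goes to last bin
        bins[idx].append(q)
    # Assumption: an empty interval contributes C_l = 0.
    interval_means = [sum(b) / len(b) if b else 0.0 for b in bins]
    return sum(interval_means) / num_bins


# Toy example with 4 intervals instead of 12 (values are illustrative only).
dyn = [0.1, 0.1, 0.3, 0.9]
qual = [1.0, 0.5, 0.8, 0.4]
d_quality = dynamics_based_quality(dyn, qual, num_bins=4)
```

Restricting `lo` and `hi` to a sub-range gives the low, medium, or high variants; a model whose videos all fall into one interval is capped at `1 / num_bins` of the maximum even with perfect quality.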

Naturalness. We propose a Naturalness metric to evaluate the ability of T2V models to generate realistic videos. In video generation, increased video dynamics often lead to unnatural phenomena, like a cat with an extra leg or water flowing uphill. Existing metrics focus on visual effects, ignoring video naturalness. However, a model’s ability to generate natural videos reflects its real-world simulating ability. To assess this, we use the multimodal model Gemini 1.5 Pro [1] to grade each video’s naturalness into five levels (see Appendix E for details): “Almost Real”, “Slightly Unrealistic”, “Moderately Unrealistic”, “Noticeably Unrealistic”, and “Completely Fictitious”. The overall naturalness is the average score of all videos. Experiments (see Table 4) show a high correlation between our scores and human ratings, validating the metric’s effectiveness.

3.3 Text Prompt Benchmark

To evaluate the proposed dynamics metrics, we need a benchmark consisting of text prompts that fully represent multiple dynamic grades. Existing benchmarks [23, 26] cannot explicitly reflect various dynamics. To this end, we establish a new benchmark. Let $\mathcal{T}=\{(T^{i},G^{i})\}_{i=1}^{M}$ denote the benchmark, where each text prompt $T^{i}$ is assigned a dynamic grade $G^{i}$ . Here, $G^{i}\in\{1,2,3,4,5\}$ is a coarse-grained grade. The dynamic grades are defined based on the level of dynamics described in the text prompts: " $1$ " represents Static video, where the video content is nearly stationary; " $2$ " represents Low dynamics, indicating slow and slight changes in the video content; " $3$ " represents Medium dynamics, characterized by noticeable activity and changes but relatively smooth overall; " $4$ " represents High dynamics, with fast actions and changes; and " $5$ " represents Very high dynamics, indicating extremely rapid and frequent changes in the video content.

In the coarse categorization step, we collect about 50,000 text prompts from existing benchmarks, including VidProm [39], WebVid [8], MSR-VTT [45], and Didemo [17]. The initial dynamic grade of each text prompt $T^{i}$ is assigned by GPT-4. In the post-processing step, we recruit six human annotators to refine these grades. Finally, we sample 800 text prompts evenly across the dynamic grades to ensure a uniform distribution.
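The final sampling step can be pictured with the following sketch, which draws the same number of prompts from each grade of an already graded pool. The function name, the fixed random seed, and the toy pool are hypothetical; with five grades and 800 prompts, an even split would correspond to 160 prompts per grade.

```python
import random
from collections import defaultdict


def sample_balanced(prompts, per_grade, seed=0):
    """Draw `per_grade` prompts for every dynamics grade.

    prompts is a list of (text, grade) pairs whose grades were already
    assigned (by GPT-4 and human refinement in the paper's pipeline).
    """
    by_grade = defaultdict(list)
    for text, grade in prompts:
        by_grade[grade].append((text, grade))
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sampled = []
    for grade in sorted(by_grade):
        pool = by_grade[grade]
        if len(pool) < per_grade:
            raise ValueError(f"grade {grade} has only {len(pool)} prompts")
        sampled.extend(rng.sample(pool, per_grade))
    return sampled


# Toy pool skewed towards low grades: 50, 40, 30, 20, 10 prompts for grades 1..5.
pool = [(f"prompt {g}-{k}", g) for g in range(1, 6) for k in range(10 * (6 - g))]
benchmark = sample_balanced(pool, per_grade=10)
```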

Fig. 3(b) shows the statistics of the DEVIL benchmark, which contains approximately 800 text prompts, and each dynamics grade includes 19 object categories and 4 scene categories. For comparison, we further assign dynamic grades to the text prompts from existing benchmarks [23, 26] following the same procedure. As shown in Fig. 3(a), these benchmarks are heavily skewed towards lower dynamic content, while our benchmark demonstrates a more balanced distribution across all dynamic grades. Unless otherwise specified, all experiments in this paper are conducted on the DEVIL benchmark.

Table 3: Formulations of dynamics scores at different temporal granularities.

Granularity	Dynamics scores	Formulation
Inter-frame	Optical Flow Strength	$S_{ofs}=\frac{1}{N-1}\sum_{i=1}^{N-1}\text{FLOW}(f_{i})$
	Structural Dynamics Score	$S_{sd}=1-\frac{1}{N-1}\sum_{i=1}^{N-1}\text{SSIM}(f_{i},f_{i+1})$
	Perceptual Dynamics Score	$S_{pd}=\frac{1}{N-1}\sum_{i=1}^{N-1}\text{PHASHD}(f_{i},f_{i+1})$
Inter-segment	Patch-level Aperiodicity	$S_{pa}=1-\frac{1}{HW}\sum_{h,w}\mathbf{ACF}(\{F_{i,h,w}\}_{i=1}^{N})$
	Global Aperiodicity	$S_{ga}=1-\frac{1}{\lfloor rN\rfloor}\sum_{i=1}^{\lfloor rN\rfloor}\sum_{j\neq i}\mathbf{SIM}(F_{i}^{r},F_{j}^{r})$
Video	Temporal Entropy	$S_{te}=\mathbf{H}(f_{1},f_{2},\cdots,f_{N}|f_{1})$
	Temporal Semantic Diversity	$S_{tsd}=\frac{1}{N}\sum_{i=1}^{N}\|F_{i}-\bar{F}\|^{2}$

3.4 Dynamics Scores for Generated Videos

To evaluate the proposed dynamics metrics, we generate videos using the text prompts from $\mathcal{T}=\{(T^{i},G^{i})\}_{i=1}^{M}$ and assess the dynamics of each generated video using a set of dynamics scores designed at different temporal granularities. Specifically, we evaluate dynamics at three levels: inter-frame, inter-segment, and the entire video. By combining these evaluations, we derive an overall dynamics score. For simplicity, we omit the superscripts from the dynamics scores in this section.

(i) Inter-frame Dynamics Scores. These scores describe variations between successive frames and are further divided into: optical flow strength, structural dynamics, and perceptual dynamics.

Optical flow strength. We first employ RAFT [48] to estimate the optical flow for each video frame. The optical flow magnitudes within a frame are averaged to obtain the optical flow strength of that frame. Averaging the optical flow strength values over all video frames gives the optical flow strength $S_{ofs}$ of the video, as

S_{ofs}=\frac{1}{N-1}\sum_{i=1}^{N-1}\text{FLOW}(f_{i}),

(4)

where $\text{FLOW}(f_{i})$ denotes the mean optical flow magnitude of frame $f_{i}$ .
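A minimal sketch of Eq. 4 follows, assuming the $N-1$ flow fields have already been estimated (for example by RAFT) and are given as grids of $(u,v)$ displacement pairs. The data layout and function names are illustrative assumptions; no flow estimator is called here.

```python
import math


def mean_flow_magnitude(flow):
    """FLOW(f_i): mean magnitude of one flow field.

    flow is a 2-D grid (list of rows) of (u, v) displacement pairs.
    """
    mags = [math.hypot(u, v) for row in flow for (u, v) in row]
    if not mags:
        raise ValueError("flow field must be non-empty")
    return sum(mags) / len(mags)


def optical_flow_strength(flows):
    """S_ofs: average of FLOW(f_i) over the N-1 flow fields of a video (Eq. 4)."""
    if not flows:
        raise ValueError("at least one flow field is required")
    return sum(mean_flow_magnitude(f) for f in flows) / len(flows)


# Toy video with three frames, hence two 2x2 flow fields.
moving = [[(3.0, 4.0), (3.0, 4.0)], [(3.0, 4.0), (3.0, 4.0)]]  # magnitude 5 everywhere
still = [[(0.0, 0.0), (0.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)]]
s_ofs = optical_flow_strength([moving, still])
```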

Structural dynamics score. While optical flow excels in capturing motion, it is less effective at detecting structural dynamics such as changes in lighting conditions. To capture such information, we average the structural similarity index measure (SSIM) [43] over all pairs of consecutive frames and subtract the result from one to quantify the inter-frame structural variations of the video, as

S_{sd}=1-\frac{1}{N-1}\sum_{i=1}^{N-1}\text{SSIM}(f_{i},f_{i+1}).

(5)

Perceptual dynamics. The human visual system is sensitive to changes in low-frequency regions of video frames. To reflect this characteristic, we introduce a perceptual dynamics score that measures the difference between the perceptual hashes [36] of consecutive frames. The perceptual dynamics score $S_{pd}$ is defined as the mean perceptual hash distance over all pairs of consecutive frames, as

S_{pd}=\frac{1}{N-1}\sum_{i=1}^{N-1}\text{PHASHD}(f_{i},f_{i+1}),

(6)

where $\text{PHASHD}(f_{i},f_{i+1})$ denotes the Hamming distance [16] between the perceptual hash of $f_{i}$ and $f_{i+1}$ .
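Eq. 6 reduces to counting differing bits between consecutive hashes. The sketch below assumes each frame's perceptual hash has already been computed and is stored as an integer bit string; computing the hash itself is outside this example, and the text does not state whether the distance is normalized by the hash length, so the raw bit count is used.

```python
def hamming_distance(hash_a, hash_b):
    """Number of bit positions in which two integer-encoded hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def perceptual_dynamics(frame_hashes):
    """S_pd: mean Hamming distance between hashes of consecutive frames (Eq. 6)."""
    n = len(frame_hashes)
    if n < 2:
        raise ValueError("at least two frames are required")
    dists = [hamming_distance(a, b) for a, b in zip(frame_hashes, frame_hashes[1:])]
    return sum(dists) / (n - 1)


# Toy 4-bit hashes for three frames: distances are 2 and 1.
toy_hashes = [0b0000, 0b0011, 0b0111]
s_pd = perceptual_dynamics(toy_hashes)
```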

(ii) Inter-segment Dynamics Scores. These scores refer to the changes between video segments, each containing multiple frames. They capture the patterns of video content changes and are further categorized into patch-level aperiodicity and global aperiodicity, which measure the dynamics between video segments.

Patch-level aperiodicity. We first calculate inter-segment dynamics at the patch level using the auto-correlation factor [10]( $\mathbf{ACF}$ ), to evaluate the scene and temporal pattern dynamics. The auto-correlation factor measures the feature similarity of a time series, revealing periodicity and changing trends of features. Given features at position $(h,w)$ across $N$ frames, $\{F_{i,h,w}\}_{i=1}^{N}$ , the auto-correlation factor of the features is defined as

\mathbf{ACF}(\{F_{i,h,w}\}_{i=1}^{N})=\frac{1}{N-K_{0}}\sum_{k=K_{0}}^{N}\sum_{i=1}^{k}\frac{1}{k}\mathbf{SIM}(F_{i,h,w},F_{N-k+i,h,w}),

(7)

where $K_{0}$ is the minimal segment length, empirically set to $\lfloor N/8\rfloor$ because most generated videos have more than 8 frames, and $\mathbf{SIM}$ represents the cosine similarity between two feature vectors. $H$ and $W$ are the height and width of the feature map, respectively. With the auto-correlation factors of all patches, we define the patch-level aperiodicity of the video, as

S_{pa}=1-\frac{1}{HW}\sum_{h,w}\mathbf{ACF}(\{F_{i,h,w}\}_{i=1}^{N}).

(8)
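The following sketch implements Eqs. 7 and 8 literally for features stored as nested Python lists. The feature layout `feature_map[i][h][w]`, the guard that keeps $K_{0}$ at least 1, and the handling of zero vectors are assumptions. Note that the sum as written includes the $k=N$ term, in which every feature is compared with itself.

```python
import math


def cosine_similarity(u, v):
    """SIM: cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def acf(features, k0=None):
    """Auto-correlation factor of one patch's feature sequence (Eq. 7)."""
    n = len(features)
    if k0 is None:
        k0 = max(1, n // 8)  # floor(N/8); the lower bound of 1 is an assumption
    if n <= k0:
        raise ValueError("sequence too short for the chosen k0")
    total = 0.0
    for k in range(k0, n + 1):
        # Compare the first k features with the last k features.
        sims = [cosine_similarity(features[i], features[n - k + i]) for i in range(k)]
        total += sum(sims) / k
    return total / (n - k0)


def patch_aperiodicity(feature_map):
    """S_pa = 1 - mean ACF over all H x W patch positions (Eq. 8)."""
    n = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    total = 0.0
    for h in range(height):
        for w in range(width):
            total += acf([feature_map[i][h][w] for i in range(n)])
    return 1.0 - total / (height * width)


# Toy sequences for a single patch over 8 frames.
static_seq = [[1.0, 0.0] for _ in range(8)]
alternating_seq = [[1.0, 0.0] if i % 2 == 0 else [0.0, 1.0] for i in range(8)]
acf_static = acf(static_seq)
acf_alternating = acf(alternating_seq)
```

A static patch correlates perfectly with itself at every shift and therefore receives the largest auto-correlation factor, while a patch whose features keep changing receives a smaller one and hence a higher aperiodicity.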

Global aperiodicity. In addition to patch-level dynamics, we employ a global aperiodicity score to measure the diversity of patterns between video segments. Specifically, we divide the video into segments. Each segment has a length $rN$ , where $r$ is a proportion factor, empirically set to 0.25. We use ViCLIP [42] to extract the spatial-temporal features for each segment. The features are denoted as $\{F_{i}^{r}\}_{i=1}^{\lfloor rN\rfloor}$ . We then calculate the similarity of these features to assess the variation in spatial-temporal patterns across segments, as

S_{ga}=1-\frac{1}{\lfloor rN\rfloor}\sum_{i=1}^{\lfloor rN\rfloor}\sum_{j\neq i}\mathbf{SIM}(F_{i}^{r},F_{j}^{r}).

(9)

(iii) Video-level Dynamics Scores. These scores encompass the overall content diversity and the frequency of changes throughout the video. The dynamics scores at the video level are defined by temporal entropy and temporal semantic diversity.

Temporal entropy. To evaluate the dynamics at the video level, we first measure the temporal information of each video. The temporal information $\mathbf{H}$ is defined as the conditional entropy of the entire video sequence given the first frame, as

S_{te}=\mathbf{H}(f_{1},f_{2},\cdots,f_{N}|f_{1}).

(10)

To estimate the conditional entropy $S_{te}$ , we employ the video encoding toolbox FFmpeg [14].

Temporal semantic diversity. Beyond low-level dynamics, we further introduce a semantic diversity score to assess high-level dynamics across the whole video. The semantic diversity score $S_{tsd}$ reflects semantic-level dynamics and is defined as the variance of the per-frame DINO [11] features $\{F_{i}\}_{i=1}^{N}$ , as

S_{tsd}=\frac{1}{N}\sum_{i=1}^{N}\|F_{i}-\bar{F}\|^{2},

(11)

where $\bar{F}=\frac{1}{N}\sum_{i=1}^{N}{F_{i}}$ denotes the mean feature vector of all frames.

3.5 Overall Dynamics Score

To establish a reliable and robust assessment, we integrate the dynamics scores into a single score through a human alignment procedure (Fig. 1) that refines the empirically defined dynamics scores. The procedure uses human ratings as ground truth, based on which we fit a linear regression model at each temporal granularity, as

S_{f}=\mathbf{Linear}_{\theta_{f}}(S_{ofs},S_{sd},S_{pd}),

(12)

S_{s}=\mathbf{Linear}_{\theta_{s}}(S_{pa},S_{ga}),

(13)

S_{v}=\mathbf{Linear}_{\theta_{v}}(S_{te},S_{tsd}),

(14)

where $\theta_{f},\theta_{s},\theta_{v}$ respectively denote the model parameters of linear regression at each granularity. The overall dynamics score of the video is then defined as the average of aligned dynamics scores from all three levels, as

S=\frac{1}{3}(S_{f}+S_{s}+S_{v}).

(15)

Through this learnable human alignment procedure, the empirically defined dynamics scores are more consistent with human perception, as validated in Sec. 4.1.
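To make the aggregation concrete, the sketch below applies one linear model per granularity and averages the three aligned scores, following Eqs. 12 to 15. The weights and zero biases are hypothetical placeholders, not fitted values from the paper, and the presence of a bias term is itself an assumption.

```python
def linear(weights, bias, inputs):
    """Evaluate a linear model: bias + sum_k weights[k] * inputs[k]."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return bias + sum(w * x for w, x in zip(weights, inputs))


# Hypothetical parameters (weights, bias); real values would come from
# regressing the raw scores onto human ratings.
HYPOTHETICAL_PARAMS = {
    "frame": ((0.2, 0.3, 0.5), 0.0),
    "segment": ((0.5, 0.5), 0.0),
    "video": ((0.5, 0.5), 0.0),
}


def overall_dynamics_score(raw, params=HYPOTHETICAL_PARAMS):
    """S = (S_f + S_s + S_v) / 3 from the seven raw dynamics scores."""
    w_f, b_f = params["frame"]
    w_s, b_s = params["segment"]
    w_v, b_v = params["video"]
    s_f = linear(w_f, b_f, (raw["ofs"], raw["sd"], raw["pd"]))  # Eq. 12
    s_s = linear(w_s, b_s, (raw["pa"], raw["ga"]))              # Eq. 13
    s_v = linear(w_v, b_v, (raw["te"], raw["tsd"]))             # Eq. 14
    return (s_f + s_s + s_v) / 3.0                              # Eq. 15


raw_scores = {k: 0.6 for k in ("ofs", "sd", "pd", "pa", "ga", "te", "tsd")}
overall = overall_dynamics_score(raw_scores)
```

Since each aligned score is linear in the raw scores, their average is also a weighted sum of the seven raw scores.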

Table 4: Human alignment by correlation between dynamics scores and human ratings on the proposed DEVIL benchmark. Video generation is based on text prompts in DEVIL. “PC” denotes Pearson’s correlation, “KC” Kendall’s correlation, and “WR” the win ratio.

Scores		PC $\uparrow$	KC $\uparrow$	WR $\uparrow$
Inter-frame	$S_{ofs}$	93.1	89.9	79.2
	$S_{sd}$	91.7	88.0	78.1
	$S_{pd}$	96.4	93.2	86.1
	$S_{f}$	96.5	93.5	86.5
Inter-segment	$S_{pa}$	95.1	94.3	87.0
	$S_{ga}$	94.6	93.0	85.6
	$S_{s}$	95.8	94.8	87.7
Video level	$S_{te}$	96.4	93.7	83.5
	$S_{tsd}$	97.7	96.4	90.5
	$S_{v}$	98.0	97.2	91.4
Naturalness		79.0	75.5	52.4

4 Experiments

4.1 Human Alignment Assessment

To evaluate the plausibility of the proposed dynamics metrics and the naturalness metric, we conduct the following human alignment experiments.

Ground-truth Annotation. We first generate videos from the DEVIL text prompts using six state-of-the-art (SOTA) T2V models: GEN-2 [2], Pika [4], VideoCrafter2 [13], Open-Sora [22], StreamingT2V [18], and FreeNoise-Lavie [30]. For the generated videos, we collect human-evaluated dynamics and naturalness as the ground truth. Six annotators are recruited to assess each video’s grade of dynamics at three temporal levels (frame, segment, and video). For each dynamics metric, evaluators are required to rate the grade of dynamics from “static” to “very high dynamics”, as defined in Section 3.3. To guide the annotation process, we provide specific prompts for each temporal level (see Appendix F for details). The evaluation of the naturalness metric follows the same process, where a higher human-assigned grade indicates a greater degree of naturalness.

Evaluation of Scores. We calculate dynamics grades and naturalness for generated videos on the proposed DEVIL benchmark. For dynamics metrics at multiple temporal levels, we integrate them using the linear regression models defined by Eqs. 12-14. Each linear regression model takes the human evaluation results as ground truth, is trained on a randomly selected 75% of the videos, and is tested on the remaining 25%. During testing, the human alignment performance is measured by Pearson’s and Kendall’s correlation coefficients and by the win ratio between predicted and human-evaluated dynamics grades. The win ratio involves comparing each video against others with different grades of dynamics. For instance, a video rated as “high dynamics” by evaluators should score lower in dynamics than any video rated as “very high dynamics” but higher than those rated as “static”.
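Two of the alignment measures can be sketched in a few lines. The win ratio below is one plausible reading of the description above (the fraction of video pairs with different human grades whose predicted scores are ordered the same way); the exact definition, the treatment of ties, and the function names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def win_ratio(predicted, human_grades):
    """Fraction of pairs with different human grades that are ordered correctly."""
    wins = 0
    pairs = 0
    n = len(predicted)
    for i in range(n):
        for j in range(i + 1, n):
            if human_grades[i] == human_grades[j]:
                continue  # only videos with different grades are compared
            pairs += 1
            if (predicted[i] - predicted[j]) * (human_grades[i] - human_grades[j]) > 0:
                wins += 1
    if pairs == 0:
        raise ValueError("no pair of videos with different grades")
    return wins / pairs


# Toy held-out set: one of the five comparable pairs is ordered incorrectly.
toy_human = [1, 2, 2, 3]
toy_pred = [0.1, 0.95, 0.4, 0.9]
wr = win_ratio(toy_pred, toy_human)
pc = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```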

Table 4 shows the assessment results of the six T2V generation models. It can be seen that the dynamics metrics and the naturalness metric exhibit a strong alignment with human evaluation. The improved metrics ( $S_{f}$ , $S_{s}$ , $S_{v}$ defined in Sec. 3.5) further enhance the alignment with human evaluations.

Table 5: Evaluation of T2V models on dynamics range ( $\mathbf{D}_{range}$ ), dynamics controllability ( $\mathbf{D}_{control}$ ), and dynamics quality ( $\mathbf{D}_{quality}$ ) using our text prompt benchmark. All metrics are normalized with maximum values of 100% and minimum values of 0%; higher scores indicate better performance. Dynamics quality is also assessed at low ( $\mathbf{D}_{quality}^{L}$ ), medium ( $\mathbf{D}_{quality}^{M}$ ), and high ( $\mathbf{D}_{quality}^{H}$ ) levels.

T2V models	$\mathbf{D}_{range}$	$\mathbf{D}_{control}$	$\mathbf{D}_{quality}$	$\mathbf{D}_{quality}^{L}$	$\mathbf{D}_{quality}^{M}$	$\mathbf{D}_{quality}^{H}$
GEN-2 [2]	30.8	82.5	43.6	93.4	45.4	0.0
Pika [4]	43.2	72.0	52.1	90.0	66.4	0.0
VideoCrafter2 [13]	34.1	57.0	43.6	89.1	41.7	0.0
OpenSora [22]	61.2	62.4	63.7	84.4	84.5	22.2
StreamingT2V [18]	65.9	62.8	60.8	61.1	80.8	40.6
FreeNoise-Lavie [30]	66.9	58.7	66.3	65.9	87.7	45.5
Hotshot-XL [3]	34.7	58.9	52.2	92.8	63.9	0.0
Show-1 [47]	45.1	73.9	57.7	92.6	80.3	0.0
ModelScope [38]	52.9	63.6	62.6	91.2	79.1	17.5
ZeroScope [5]	26.4	66.4	44.8	90.9	43.6	0.0

4.2 Dynamic-Quality Bi-variate Analysis

To investigate the relationship between video dynamics and quality, we calculate the correlation coefficients between various quality metrics and the overall dynamics score ( $S$ ), as well as the distribution of video quality scores along $S$ . As shown in Table 6, Naturalness, Motion Smoothness, Subject Consistency, and Background Consistency all have Pearson correlation coefficients with $S$ whose magnitudes exceed 50%, indicating the significant impact of dynamics on these metrics. Fig. 2 shows the distribution of video quantity and quality scores along $S$ . Most models, especially high-ranking ones like GEN-2 [2], Pika [4], and VideoCrafter2 [13], generate videos concentrated in low-dynamics regions. As dynamics increase, quality metrics decline significantly. This suggests that models can improve benchmark quality scores by generating low-dynamics videos. In conclusion, video dynamics significantly impact quality evaluation, and the design of quality metrics should account for dynamics.

4.3 Evaluation of Dynamics Metrics

We evaluate the dynamics range $\mathbf{D}_{range}$, dynamics controllability $\mathbf{D}_{control}$, and dynamics quality $\mathbf{D}_{quality}$ of T2V models on our text prompt benchmark. All metrics are normalized with maximum values of 100% and minimum values of 0%, and higher scores indicate better performance. To assess dynamics quality, we consider low, medium, and high dynamics levels, obtaining $\mathbf{D}_{quality}^{L}$, $\mathbf{D}_{quality}^{M}$, and $\mathbf{D}_{quality}^{H}$. The score ranges for these levels are [0, 33.3%], [33.4%, 66.7%], and [66.8%, 100%], respectively. The results are shown in Table 5. In addition to the six annotated models, we also evaluate another five SOTA T2V models to provide a comprehensive comparison of the latest models. GEN-2 [2] and Pika [4] achieve high dynamics controllability scores but low dynamics range scores, because they mostly generate videos with low dynamics. In contrast, FreeNoise-Lavie [30] and StreamingT2V [18] achieve a high dynamics range but low dynamics controllability scores, indicating that the dynamics of their generated videos are misaligned with the text prompts. (Please refer to Appendix A for details.)
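The three dynamics levels can be illustrated with a small helper. This is a minimal sketch, assuming the ranges above apply to a dynamics score normalized to [0, 100]; the function name `dynamics_level` and the treatment of boundary values (splitting exactly at one third and two thirds) are illustrative choices rather than part of the released DEVIL code.

```python
def dynamics_level(score_pct):
    """Map a normalized dynamics score in [0, 100] to 'low', 'medium', or 'high'.

    Assumption: the levels split the range into equal thirds, matching the
    stated ranges [0, 33.3%], [33.4%, 66.7%], and [66.8%, 100%].
    """
    if not 0.0 <= score_pct <= 100.0:
        raise ValueError("score must lie in [0, 100]")
    if score_pct <= 100.0 / 3.0:
        return "low"
    if score_pct <= 200.0 / 3.0:
        return "medium"
    return "high"


# Hypothetical scores, for illustration only.
levels = [dynamics_level(s) for s in (10.0, 50.0, 90.0)]
print(levels)
```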

Original Quality Metric vs. Dynamics-based Quality Metric. Fig. 5 compares the original quality metric with the dynamics-based quality metrics, including the overall dynamics-based quality metric ($\mathbf{D}_{quality}$) and the metrics at low ($\mathbf{D}_{quality}^{L}$), medium ($\mathbf{D}_{quality}^{M}$), and high ($\mathbf{D}_{quality}^{H}$) dynamics levels. The original quality metric aligns closely with $\mathbf{D}_{quality}^{L}$, indicating that it primarily reflects quality in low-dynamics scenarios. Moreover, T2V models typically lack the ability to generate high-dynamics videos, resulting in lower scores for $\mathbf{D}_{quality}^{H}$.

4.4 Insights from Video Dynamics Analysis

Existing datasets have biased dynamics distribution. The distribution of dynamics in existing video datasets (such as WebVid2M [8]) is biased. The statistical result is shown in Fig. 6. Most of the videos have a small dynamics score ($\leq 0.4$). The limited number of videos with high dynamics scores restricts a model's ability to generate dynamics-rich videos, which are common in practical applications. Therefore, existing datasets should be expanded in terms of dynamics, and the proposed metrics can provide guidance for this expansion.

Existing datasets have biased text prompts on dynamics for training. We use the dynamics controllability metric to evaluate two popular datasets, $i.e.$, WebVid2M [8] and MSR-VTT [45], using their ground-truth text prompts and videos. They achieve dynamics controllability scores of only 36.31% and 52.98%, respectively. These low scores indicate that the two datasets cannot provide sufficient guidance on dynamics when training video generation models. To train better video generation models, the text prompts of these datasets need to be elaborated with respect to dynamics.

Existing T2V methods have limited real-world simulation ability. As shown in Fig. 2, we analyze the video quantity distribution together with the visual quality, motion smoothness, and naturalness scores of SOTA methods as functions of the dynamics score. When the dynamics score is small, the videos generated by these models score highly on these metrics. As the dynamics score increases, the scores (especially naturalness) decrease significantly. A likely cause is that these models primarily optimize the generation of simple, slow-motion content, while existing evaluation metrics ignore dynamics entirely. Therefore, T2V models should be optimized over a wide range of dynamics to faithfully simulate the real world.

5 Conclusion

We proposed DEVIL, a comprehensive and constructive evaluation protocol for T2V generation models. In the protocol, we defined a set of dynamics metrics corresponding to multiple temporal granularities, and a new benchmark of text prompts covering multiple levels of dynamics. Based on the distribution of dynamics scores over the benchmark, we assessed the generation capacity of T2V models, characterized by their dynamics range and their degree of text-to-video alignment in dynamics. Experiments show that DEVIL achieves a Pearson correlation exceeding 90% with human ratings, demonstrating its potential as a tool for advancing T2V generation models.

Limitations. At present, the dynamics grades remain coarse and should be refined into more fine-grained grades. Furthermore, only a limited number of T2V models are evaluated using the proposed protocol. A more comprehensive evaluation of T2V models is left to future work.

Social impacts. On the positive side, the proposed evaluation protocol may promote the development of T2V models. On the negative side, advanced T2V models could be misused to create realistic but misleading video content, such as deepfakes.

References

gem [2024] Gemini. https://gemini.google.com/, 2024. Accessed: 2024-05-21.
gen [2024] Gen-2. https://research.runwayml.com/gen2, 2024. Accessed: 2024-05-21.
hot [2024] Hotshot-xl. https://huggingface.co/hotshotco/Hotshot-XL, 2024. Accessed: 2024-05-21.
pik [2024] Pika labs. https://pika.art, 2024. Accessed: 2024-05-21.
zer [2024] Zeroscope. https://huggingface.co/cerspense/zeroscope_v2_576w, 2024. Accessed: 2024-05-21.
Anne Hendricks et al. [2017] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pages 5803–5812, 2017.
Bain et al. [2021a] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021a.
Bain et al. [2021b] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738, 2021b.
Blattmann et al. [2023] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
Box et al. [2015] George EP Box, Gwilym M Jenkins, Gregory C Reinsel, and Greta M Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
Caron et al. [2021] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In IEEE ICCV, pages 9630–9640, 2021.
Chen et al. [2023] Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, et al. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023.
Chen et al. [2024] Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024.
Developers [2024] FFmpeg Developers. Ffmpeg: A complete, cross-platform solution to record, convert and stream audio and video, 2024. URL https://ffmpeg.org/. Accessed: 2024-05-21.
Girdhar et al. [2023] Rohit Girdhar, Mannat Singh, Andrew Brown, Quentin Duval, Samaneh Azadi, Sai Saketh Rambhatla, Akbar Shah, Xi Yin, Devi Parikh, and Ishan Misra. Emu video: Factorizing text-to-video generation by explicit image conditioning. arXiv preprint arXiv:2311.10709, 2023.
Hamming [1950] Richard W Hamming. Error detecting and error correcting codes. The Bell system technical journal, 29(2):147–160, 1950.
Hendricks et al. [2018] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with temporal language. In Empirical Methods in Natural Language Processing (EMNLP), 2018.
Henschel et al. [2024] Roberto Henschel, Levon Khachatryan, Daniil Hayrapetyan, Hayk Poghosyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. arXiv preprint arXiv:2403.14773, 2024.
Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
Ho et al. [2022a] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022a.
Ho et al. [2022b] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022b.
HPC-AI Technology Inc. [2023] HPC-AI Technology Inc. Open-sora: Democratizing efficient video production for all. https://github.com/hpcaitech/Open-Sora, 2023.
Huang et al. [2023] Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. arXiv preprint arXiv:2311.17982, 2023.
Li et al. [2021] Yuntao Li, Bei Chen, Qian Liu, Yan Gao, Jian-Guang Lou, Yan Zhang, and Dongmei Zhang. Keep the structure: A latent shift-reduce parser for semantic parsing. In IJCAI, pages 3864–3870, 2021.
Lin et al. [2024] Han Lin, Jaemin Cho, Abhay Zala, and Mohit Bansal. Ctrl-adapter: An efficient and versatile framework for adapting diverse controls to any diffusion model. arXiv preprint arXiv:2404.09967, 2024.
Liu et al. [2023] Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, and Ying Shan. Evalcrafter: Benchmarking and evaluating large video generation models. arXiv preprint arXiv:2310.11440, 2023.
Liu et al. [2024] Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, and Lu Hou. Fetv: A benchmark for fine-grained evaluation of open-domain text-to-video generation. Advances in Neural Information Processing Systems, 36, 2024.
Mei and Patel [2023] Kangfu Mei and Vishal Patel. Vidm: Video implicit diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 9117–9125, 2023.
OpenAI [2023] OpenAI. Chatgpt: A large language model. https://www.openai.com/chatgpt, 2023. Accessed: 2024-05-21.
Qiu et al. [2023] Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. Freenoise: Tuning-free longer video diffusion via noise rescheduling. arXiv preprint arXiv:2310.15169, 2023.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
Su et al. [2009] Danying Su, Zhiqiang Su, Jiaye Wang, Shanshan Yang, and Jing Ma. Ucf-101, a novel omi/htra2 inhibitor, protects against cerebral ischemia/reperfusion injury in rats. The Anatomical Record: Advances in Integrative Anatomy and Evolutionary Biology, 292(6):854–861, 2009.
Unterthiner et al. [2019] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. 2019.
Venkatesan et al. [2000] Ramarathnam Venkatesan, S-M Koon, Mariusz H Jakubowski, and Pierre Moulin. Robust image hashing. In Proceedings 2000 International Conference on Image Processing (Cat. No. 00CH37101), volume 3, pages 664–666. IEEE, 2000.
Wang et al. [2023a] Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising. arXiv preprint arXiv:2305.18264, 2023a.
Wang et al. [2023b] Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571, 2023b.
Wang and Yang [2024] Wenhao Wang and Yi Yang. Vidprom: A million-scale real prompt-gallery dataset for text-to-video diffusion models. arXiv preprint arXiv:2403.06098, 2024.
Wang et al. [2023c] Wenjing Wang, Huan Yang, Zixi Tuo, Huiguo He, Junchen Zhu, Jianlong Fu, and Jiaying Liu. Videofactory: Swap attention in spatiotemporal diffusions for text-to-video generation. arXiv preprint arXiv:2305.10874, 2023c.
Wang et al. [2023d] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023d.
Wang et al. [2023e] Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023e.
Wang et al. [2004] Zhou Wang, Alan Conrad Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13:600–612, 2004. URL https://api.semanticscholar.org/CorpusID:207761262.
Wu et al. [2023] Haoning Wu, Erli Zhang, Liang Liao, Chaofeng Chen, Jingwen Hou, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin. Exploring video quality assessment on user generated contents from aesthetic and technical perspectives. In International Conference on Computer Vision (ICCV), 2023.
Xu et al. [2016] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296, 2016.
Yu et al. [2023] Sihyun Yu, Kihyuk Sohn, Subin Kim, and Jinwoo Shin. Video probabilistic diffusion models in projected latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18456–18466, 2023.
Zhang et al. [2023] David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. arXiv preprint arXiv:2309.15818, 2023.
Zhang et al. [2024] Tianjun Zhang, Shishir G Patil, Naman Jain, Sheng Shen, Matei Zaharia, Ion Stoica, and Joseph E Gonzalez. Raft: Adapting language model to domain specific rag. arXiv preprint arXiv:2403.10131, 2024.

Appendix

Appendix A Dynamics Scores

For the dynamics scores proposed in the main text, we present the detailed results of T2V models in Figure 7. ModelScope [38] excels in generating rapid inter-frame motions, while StreamingT2V [18] performs exceptionally well across most dynamics score metrics. StreamingT2V achieves high inter-segment dynamics scores at the video level, which indicates that it has significant advantages in generating complex dynamic content. In contrast, GEN-2 [2] and VideoCrafter2 [13] perform poorly on several metrics, highlighting their deficiencies in dynamics.

Appendix B Correlation Between Existing Metrics and Dynamics

In Section 3.2, we introduce a bi-variate analysis strategy to identify the relevance between existing metrics and the dynamics metrics. Based on this analysis, we provide detailed correlation results for each model. Table 6 lists the Pearson correlation coefficients between the dynamics scores and existing metrics, including aesthetic score, technical score, visual quality, motion smoothness, subject consistency, background consistency, and naturalness.

The results indicate a clear trade-off between video dynamics and various existing metrics in T2V models. As dynamic complexity increases, there tends to be a decline in motion smoothness, subject consistency, background consistency, and naturalness. The aesthetic, technical, and visual quality metrics show relatively low correlation, which can be attributed to the fact that these metrics evaluate video frames independently, ignoring temporal relationships between frames.

Table 6: Pearson correlation coefficients between the dynamics metrics and existing metrics, including aesthetic score [44], technical score [44], visual quality [44], motion smoothness [23], subject consistency [23], background consistency [23], and our naturalness.

T2V models	Aesthetic Score	Technical Score	Visual Quality	Motion Smoothness	Subject Consistency	Background Consistency	Naturalness
GEN-2 [2]	-0.19	-0.09	-0.12	-0.54	-0.88	-0.73	-0.50
Pika [4]	-0.40	-0.20	-0.28	-0.65	-0.88	-0.78	-0.47
VideoCrafter2 [13]	-0.25	-0.20	-0.24	-0.59	-0.87	-0.76	-0.36
OpenSora [22]	-0.20	-0.27	-0.26	-0.70	-0.90	-0.83	-0.43
StreamingT2V [18]	-0.15	-0.21	-0.23	-0.57	-0.89	-0.81	-0.36
FreeNoise-Lavie [30]	-0.37	-0.31	-0.35	-0.75	-0.91	-0.86	-0.48
Average	-0.26	-0.21	-0.25	-0.63	-0.81	-0.79	-0.43

Appendix C Detail of Dynamics-based Quality

Let $S^{(i)}$ denote the score of generated video $i$ under an existing metric. Existing metrics simply average the scores of all videos to obtain the metric score $S$ of the T2V model:

$$S=\frac{1}{|T|}\sum_{i=1}^{|T|}S^{(i)}, \tag{16}$$

where $|T|$ is the total number of generated videos. Because some existing metrics show a considerable negative correlation with the video’s dynamics score, this plain average rewards low-dynamic videos and therefore fails to discourage models from generating them.

To address this issue, we enhance existing metrics by integrating human-aligned dynamics scores, preventing models from attaining high scores by producing low-dynamic videos. Specifically, we first equally divide the human-aligned dynamics score into $L=12$ intervals. We then calculate the mean scores $S_{l}$ at each interval $l$ . The improved metric $S^{*}$ is defined as the average of $S_{l}$ across all intervals:

$$S^{*}=\frac{1}{L}\sum_{l=1}^{L}S_{l}. \tag{17}$$

Table 7 presents the scores of various models across four quality metrics: Motion Smoothness, Naturalness, Subject Consistency, and Background Consistency. FreeNoise and StreamingT2V achieve high overall scores due to their strong performance across a wide dynamic range. In contrast, Gen-2 and Pika excel in the low dynamic range, but their inability to generate high dynamic videos results in lower overall scores.
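The difference between the plain average in Eq. (16) and the interval-averaged metric in Eq. (17) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released DEVIL implementation: dynamics scores are assumed to lie in [0, 1], intervals that contain no videos are assumed to contribute a mean of 0 (suggested by the 0.0 entries in Table 7 for models that produce no high-dynamics videos), and the names `plain_mean` and `dynamics_binned_mean` are hypothetical.

```python
def plain_mean(scores):
    """Eq. (16): average the per-video scores over all generated videos."""
    return sum(scores) / len(scores)


def dynamics_binned_mean(dynamics, scores, num_bins=12, lo=0.0, hi=1.0):
    """Eq. (17): average the per-interval mean scores over L equal intervals.

    Assumptions (not confirmed by the source): dynamics scores lie in
    [lo, hi], and an interval with no videos contributes a mean of 0.
    Returns the metric S* and the list of per-interval means S_l.
    """
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for d, s in zip(dynamics, scores):
        # Clamp so that d == hi falls into the last interval.
        idx = max(0, min(int((d - lo) / width), num_bins - 1))
        sums[idx] += s
        counts[idx] += 1
    bin_means = [sums[l] / counts[l] if counts[l] else 0.0
                 for l in range(num_bins)]
    return sum(bin_means) / num_bins, bin_means


# Hypothetical model that only produces low-dynamics, high-quality videos.
dynamics = [0.02, 0.05, 0.10, 0.12]
scores = [0.9, 0.9, 0.9, 0.9]

plain = plain_mean(scores)
binned, bin_means = dynamics_binned_mean(dynamics, scores)
print(f"plain mean S = {plain:.2f}, interval-averaged S* = {binned:.2f}")
```

In this toy case only two of the twelve intervals are populated, so the interval-averaged score is far below the plain mean. Under the same assumption, an overall score equals the mean of its Low, Mid, and High parts; for example, the GEN-2 Motion Smoothness entries in Table 7 satisfy (99.5 + 49.7 + 0.0) / 3 ≈ 49.7.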

Table 7: Integrating dynamics scores with quality metrics, including Motion Smoothness, Naturalness, Subject Consistency, and Background Consistency. The table details scores across multiple models, with metrics divided into Overall, Low, Mid, and High categories based on modified dynamic intervals to achieve a comprehensive evaluation.

T2V Model	Motion Smoothness				Naturalness
T2V Model	Overall	Low	Mid	High	Overall	Low	Mid	High
FreeNoise [30]	71.7	71.7	95.4	47.9	57.1	54.8	73.9	42.5
GEN-2 [2]	49.7	99.5	49.7	0.0	39.1	81.6	35.6	0.0
OpenSora [22]	71.5	95.5	95.3	23.7	49.8	62.8	64.2	22.5
Pika [4]	58.0	99.5	74.5	0.0	39.8	69.4	50.1	0.0
StreamingT2V [18]	71.2	70.9	95.0	47.8	42.2	44.2	55.4	27.0
VideoCrafter2 [13]	48.9	97.8	48.8	0.0	31.4	70.1	24.2	0.0
HotShot-XL [3]	47.0	83.7	57.3	0.0	54.4	95.7	67.4	0.0
ModelScope [38]	50.4	77.1	61.6	12.5	67.9	95.7	87.6	20.4
Show-1 [47]	47.4	81.6	60.6	0.0	62.0	95.5	90.5	0.0
ZeroScope [5]	38.3	75.0	40.0	0.0	46.5	95.3	44.2	0.0
T2V Model	Subject Consistency				Background Consistency
T2V Model	Overall	Low	Mid	High	Overall	Low	Mid	High
FreeNoise [30]	66.0	66.3	87.3	44.3	70.6	70.7	94.0	45.5
GEN-2 [2]	47.7	95.2	47.8	0.0	48.6	97.3	48.5	0.0
OpenSora [22]	62.7	84.5	84.2	19.3	70.7	94.6	94.4	22.2
Pika [4]	54.6	94.6	69.1	0.0	56.1	96.5	71.8	0.0
StreamingT2V [18]	61.2	60.8	81.6	41.3	68.6	68.3	91.2	40.6
VideoCrafter2 [13]	45.5	90.8	45.6	0.0	48.7	97.5	48.4	0.0
HotShot-XL [3]	55.6	97.1	69.7	0.0	52.0	94.8	61.1	0.0
ModelScope [38]	70.1	97.1	91.6	21.6	62.0	94.8	75.5	17.5
Show-1 [47]	62.8	98.2	90.1	0.0	58.5	95.3	80.2	0.0
ZeroScope [5]	49.0	98.4	48.5	0.0	45.5	94.8	41.7	0.0

Appendix D Assigning Dynamics Grades to Text Prompts

As described in Section 3.3, we collect approximately 50,000 text prompts from existing benchmarks, covering 19 object categories and 4 scene categories. Using GPT-4 coarse classification followed by human refinement, we construct the DEVIL prompt benchmark. The process of categorizing dynamics grades using GPT-4 is illustrated in Figure 8. Specifically, we instruct GPT-4 to classify each prompt by its rate of content change. To enhance GPT-4’s classification accuracy, we further provide detailed criteria and examples for each dynamics grade. In the post-processing step, we recruit six human annotators to refine the dynamics grades over three months. Finally, we sample about 800 text prompts across the dynamics grades to ensure a uniform distribution over the grades.
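The final sampling step can be sketched as stratified sampling over dynamics grades. This is a minimal illustration, assuming each prompt already carries a refined grade label; the function name `sample_uniform_by_grade`, the integer grade labels, and the pool sizes are hypothetical and do not come from the paper.

```python
import random
from collections import Counter, defaultdict


def sample_uniform_by_grade(prompts, total, seed=0):
    """Draw about `total` prompts with an equal number from each grade.

    `prompts` is a list of (text, grade) pairs. If a grade has fewer
    prompts than its share, all of its prompts are kept.
    """
    by_grade = defaultdict(list)
    for text, grade in prompts:
        by_grade[grade].append(text)
    per_grade = total // len(by_grade)
    rng = random.Random(seed)  # fixed seed for reproducibility
    sampled = []
    for grade in sorted(by_grade):
        pool = by_grade[grade]
        k = min(per_grade, len(pool))
        sampled.extend((text, grade) for text in rng.sample(pool, k))
    return sampled


# Hypothetical, imbalanced pool: many low-dynamics prompts, few high ones.
pool_sizes = {0: 3000, 1: 1500, 2: 900, 3: 400, 4: 200}
prompts = [(f"prompt {g}-{i}", g)
           for g, n in pool_sizes.items() for i in range(n)]

subset = sample_uniform_by_grade(prompts, total=800)
grade_counts = Counter(grade for _, grade in subset)
print(len(subset), dict(grade_counts))
```

Sampling an equal share per grade, rather than sampling from the pooled prompts, keeps rare high-dynamics prompts from being underrepresented in the benchmark.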

Appendix E Details of Naturalness

We employ the advanced multi-modal large model Gemini-1.5 Pro [1], which has video understanding capabilities, to assess and classify the naturalness of video content. Fig. 9 illustrates the process through which the model analyzes videos and assigns naturalness ratings. The figure details the five levels used to evaluate video naturalness, ranging from “Completely Fantastical” to “Almost Realistic”. Each level is defined by how closely the video content aligns with the real world. The figure also includes two examples of video evaluations: the first video is rated as “Almost Realistic” due to its high conformity with reality, while the second video is rated as “Slightly Unrealistic” due to minor distortions, such as the unrealistic number of legs on a dog. These examples illustrate the plausibility of the proposed naturalness metric.

Appendix F Human Annotation

To align human evaluations with automated metrics, we annotated a series of videos generated by SOTA T2V models. We first generated videos using prompts from the DEVIL benchmark with six advanced T2V models: GEN-2, Pika, VideoCrafter2, OpenSora, StreamingT2V, and FreeNoise-Lavie. Subsequently, we developed a video annotation toolbox for evaluating the dynamics and naturalness of videos. As shown in Figure 10, the toolbox allows annotators to assess the dynamics of the videos across five grades, from almost static to very high dynamics, and the naturalness from almost real to completely unreal. To guarantee high-quality and consistent evaluations, we recruited six annotators who have undergraduate degrees and provided them with detailed training.

Appendix G Visual comparison

In Section 3, we use text prompts with different dynamics grades to generate videos with T2V models. Here, we provide visual results of the generated videos.