GenAI Arena: An Open Evaluation Platform for Generative Models

Dongfu Jiang^∗ Max Ku^∗ Tianle Li^∗ Yuansheng Ni Shizhuo Sun Rongqi Fan Wenhu Chen
University of Waterloo
{dongfu.jiang, m3ku, t29li, wenhuchen}@uwaterloo.ca
https://hf.co/spaces/TIGER-Lab/GenAI-Arena

Abstract

Generative AI has made remarkable strides in fields such as image and video generation. These advancements are driven by innovative algorithms, architectures, and data. However, the rapid proliferation of generative models has highlighted a critical gap: the absence of trustworthy evaluation metrics. Current automatic assessments such as FID, CLIP, and FVD often fail to capture the nuanced quality and user satisfaction associated with generative outputs. This paper proposes GenAI-Arena, an open platform for evaluating image and video generative models in which users actively participate in the evaluation. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It covers three arenas: text-to-image generation, text-to-video generation, and image editing. Currently, we cover a total of 27 open-source generative models. GenAI-Arena has been operating for four months, amassing over 6000 votes from the community. We describe our platform, analyze the data, and explain the statistical methods for ranking the models. To further promote research on model-based evaluation metrics, we release a cleaned version of our preference data for the three tasks, namely GenAI-Bench. We prompt existing multimodal models such as Gemini and GPT-4o to mimic human voting, and compute the correlation between model voting and human voting to understand their judging abilities. Our results show that existing multimodal models still lag in assessing generated visual content: even the best model, GPT-4o, achieves a Pearson correlation of only 0.22 on the quality subscore and behaves like random guessing on the others.

Figure 1: GenAI-Arena contains three components: (1) the text-to-image, text-to-video, and image editing arenas, which accept community votes to obtain preference pairs; (2) the leaderboard, which uses the preference pairs to calculate Elo rankings for all evaluated models; (3) GenAI-Bench, which we release to evaluate different multimodal LLM judges.

1 Introduction

Image generation and manipulation technologies have seen rapid advancements, leading to their widespread application across various domains such as creating stunning artwork [41, 53, 66, 18], enhancing visual content [4, 35], and aiding in medical imaging [64, 9]. Despite these advancements, navigating through the multitude of available models and assessing their performance remains a challenging task [51]. Traditional evaluation metrics like PSNR, SSIM [60], LPIPS [67], and FID [17], while valuable, offer very specific insights into precise aspects of visual content generation. However, these metrics often fall short in providing a comprehensive assessment of overall model performance, especially when considering subjective qualities like aesthetics and user satisfaction [45].

To address these challenges, we introduce GenAI-Arena, a novel platform designed to enable fair evaluation. Inspired by successful implementations in other domains [69, 40], GenAI-Arena offers a dynamic and interactive platform where users can generate images, compare them side-by-side, and vote for their preferred models. Such a platform not only simplifies the process of comparing different models but also provides a ranking system that reflects human preferences, thereby offering a more holistic evaluation of model capabilities. To our knowledge, GenAI-Arena is the first evaluation platform with comprehensive evaluation capabilities across multiple properties. Unlike other platforms, it supports a wide range of tasks across text-to-image generation, text-guided image editing, and text-to-video generation, along with a public voting process to ensure labeling transparency. The votes are also used to assess the evaluation ability of Multimodal Large Language Model (MLLM) evaluators. Table 1 shows that our platform excels in versatility and transparency.

Since February 11th, 2024, we have collected over 6000 votes for three multimodal generative tasks. We constructed leaderboards for each task with these votes, identifying the state-of-the-art models as Playground V2.5, MagicBrush, and T2V-Turbo, respectively (as of June 4th, 2024). Detailed analyses based on the votes are presented. For example, our plotted winning fraction heatmaps reveal that while the Elo rating system is generally effective, it can be biased by imbalances between "easy games" and "hard games". We also performed several case studies for qualitative analysis, demonstrating that users can provide preference votes from multiple evaluation aspects, which helps distinguish subtle differences between the outputs and yields high-quality votes for Elo rating computation.

Automatically assessing the quality of generated visual content is a challenging problem for several reasons: (1) images and videos have many different aspects, such as visual quality, consistency, alignment, and artifacts, and this multi-faceted nature makes evaluation intrinsically difficult; (2) supervised data is relatively scarce on the web. In our work, we release the user voting data as GenAI-Bench to enable further development in this field. Specifically, we calculate the correlation between different image/video auto-raters (i.e., MLLM judges like GPT-4o and Gemini) and user preference to understand their judging abilities. Our results show that even the best MLLM, GPT-4o, achieves at most 0.22 Pearson correlation with human preference.

Table 1: Comparison with different evaluation platforms on different properties.

Platform	Text-To-Image Generation	Text-Guided Image Editing	Text-To-Video Generation	Human Label Transparency	Open/Public Voting Process	Judging MLLM judge

T2I-CompBench [20]

✓

✗

HEIM [31]

✓

✗

✓

✗

ImagenHub [28]

✓

✗

✓

✗

VBench [21]

✗

✓

✗

EvalCrafter [37]

✗

✓

✗

GenAI-Arena

✓

To summarize, our work’s contributions include:

• GenAI-Arena, the first open platform to rank multi-modal generative AI based on user preferences.
• Discussion and case studies of collected user votes, showing the reliability of GenAI-Arena.
• GenAI-Bench, a public benchmark for judging MLLMs' evaluation ability for generative tasks.

2 Related Work

2.1 Generative AI Evaluation Metrics

Numerous methods have been proposed to evaluate the performance of multi-modal generative models in various aspects. In the context of image generation, CLIPScore [16] measures the alignment between an image and a text by computing the cosine similarity of their CLIP [50] embeddings. IS [54] and FID [17] measure image fidelity by computing a distance function between real and synthesized data distributions. PSNR and SSIM [60] assess image similarity. LPIPS [67] and follow-up works [12, 13] measure the perceptual similarity of images. More recent works use a Multimodal Large Language Model (MLLM) as a judge. T2I-CompBench [20] proposed the use of miniGPT4 [70] to evaluate the compositional text-to-image generation task. TIFA [19] further adapted visual question answering to compute scores for the text-to-image generation task. VIEScore [27] leveraged MLLMs as a unified metric across image generation and editing tasks, reporting that MLLMs have great potential to replace human judges.

Similar metrics have also been proposed for the video domain. For example, FVD [57] measures coherence shifts and quality across frames. CLIPSIM [50] uses an image-text similarity model to assess the similarity between video frames and text. VBench [21] and EvalCrafter [37] also proposed metrics for evaluating different aspects of the video generation task. However, these automatic metrics still lag behind human preferences, achieving low correlation and thus casting doubt on their reliability.

2.2 Generative AI Evaluation Platforms

While automatic metrics focus on evaluating a single model's performance, evaluation platforms aim to systematically rank a group of models. Recently, several benchmark suites have been developed to comprehensively assess generative AI models. For image generation, T2I-CompBench [20] evaluates compositional text-to-image generation tasks, while HEIM [31] offers a holistic evaluation framework that measures text-to-image tasks across multiple dimensions, including safety and toxicity. Similarly, ImagenHub [28] evaluates text-to-image, image editing, and other prevalent image generation tasks in a unified benchmark suite. For video generation, VBench [21] and EvalCrafter [37] provide structured evaluation approaches for rigorous assessment. Despite their functionality, these benchmarks rely on model-based evaluation metrics, which are less reliable than human evaluation.

To address this issue, various model arenas have been developed to collect direct human preferences for ranking models. Chatbot Arena by LMSYS [11] is the pioneering platform in this regard, setting the standard for evaluation. Subsequent efforts have led to the creation of arenas for vision-language models [62], TTS models [40], and tokenizers [24]. However, there is no existing arena for generative AI models. To fill this gap, we propose GenAI-Arena as a complementary solution in this field.

3 GenAI-Arena: Design and Implementation

3.1 Design

GenAI-Arena is designed to offer an intuitive and comprehensive evaluation platform for generative models, facilitating user interaction and participation. The platform is structured around three primary tasks: text-to-image generation, image editing, and text-to-video generation. Each task is supported by a set of features that include an anonymous side-by-side voting system, a battle playground, a direct generation tab, and a leaderboard, as shown in Figure 2. These features are designed to cater to both casual users and researchers, ensuring a democratic and accurate assessment of model performance.

Standardized Inference

To ensure a fair comparison between different models, we ported the highly dispersed codebases of existing works and standardized them into a unified format. During inference, we fixed the hyper-parameters and the prompt format to prevent per-instance prompt or hyper-parameter tuning, which makes the inference of different models fair and reproducible. Following ImagenHub [28], we built a new library, VideoGenHub, which aims to standardize the inference procedure for different text-to-video and image-to-video models. We search for the best hyper-parameters of these models to ensure their highest performance.

Voting Rules

The anonymous battle section is designed to ensure unbiased voting and accurate evaluation of generative models. The rules for this section are as follows:

1. Users input a prompt, which is then used to generate outputs from two anonymous models within the same task category.
2. The generated outputs from the two anonymous models are presented side-by-side for comparison.
3. Users vote based on their preference using one of four options: 1) left is better; 2) right is better; 3) tie; 4) both are bad. These four options are used to calculate the Elo ranking.
4. Once users have made their decision, they click the Vote button to submit their vote. The identity of the models must remain anonymous throughout the process; votes are not counted if a model's identity is revealed during the interaction.

3.2 Model Integration

In GenAI-Arena, we incorporate a diverse array of state-of-the-art generative models, covering a broad range of generative tasks including text-to-image generation, image editing, and text-to-video generation. To ensure comprehensive evaluations, the platform includes models that employ diverse underlying technologies, such as different architectures, training paradigms, training data, and acceleration techniques. These variations help us understand the effect of each factor.

Text-to-Image Generation

In Table 2, we list all the included text-to-image generation models. For example, SDXL, SDXL-Turbo, and SDXL-Lightning are all based on SDXL [49], while SDXL-Turbo [55] and SDXL-Lightning [36] adopt different distillation methods. We also include diffusion transformer models [47] like PixArt-$\alpha$ and PixArt-$\sigma$. Playground V2 and Playground V2.5 are based on the SDXL architecture but were trained from scratch by Playground.ai on an internal dataset.

Table 2: The overview of all text-to-image generation models.

Model	Size	Method	Resolution	#Steps
OpenJourney [44]	1B	SD-2.1 + MidJourney Dataset	512x512	50
LCM [38]	1B	SD-2.1 + Consistency Distillation	512x512	4
SDXL [49]	3.5B	Latent Diffusion Model	1K $\times$ 1K	50
SDXL-Turbo [55]	3.5B	Latent Diffusion Model + Distillation	1K $\times$ 1K	1
SDXL-Lightning [36]	3.5B	Latent Diffusion Model + Distillation	1K $\times$ 1K	4
PixArt- $\alpha$ [7]	0.6B	Diffusion Transformer	1K $\times$ 1K	50
PixArt- $\sigma$ [8]	0.6B	Diffusion Transformer + Weak-to-Strong	4K $\times$ 4K	50
StableCascade [48]	1.5B + 3.6B	Würstchen Architecture	1K $\times$ 1K	20+10
Playground V2 [33]	3.5B	Latent Diffusion Model	1K $\times$ 1K	50
Playground V2.5 [32]	3.5B	Latent Diffusion Model	1K $\times$ 1K	50

Table 3: Overview of all the image editing models.

Model	Trained?	Method	Runtime
Pix2PixZero [46]	Zero-shot	Editing Direction Discovery + Attention Control	21s
SDEdit [39]	Zero-shot	Iteratively Denoising through SDE	13s
CycleDiffusion [61]	Zero-shot	Reconstructable Encoder for Stochastic DPMs	9s
Prompt2Prompt [15]	Zero-shot	Prompt-based Cross-attention Control	120s
PnP [56]	Zero-shot	Feature and Self-attention Injection	120s
InfEdit [63]	Zero-shot	Consistent Model + Uni-Attention Control	5s
InstructPix2Pix [4]	Trained	Instruction-based Fine-tuning with Synthetic Data	12s
MagicBrush [65]	Trained	Instruction-based Fine-tuning with Annotated Data	12s
CosXLEdit [1]	Trained	Cosine-Continuous EDM VPred schedule	50s

Table 4: Overview of all text-to-video generation models.

Model	Base	Len	FPS	Dataset	Resolution	#Steps
AnimateDiff [14]	SD-1.5	2s	8	WebVid10M	512 x 512	25
AnimateDiff-Turbo [14]	SD-1.5	2s	8	WebVid10M	512 x 512	4
ModelScope [58]	SD-1.5	2s	8	WebVid10M	256 x 256	50
LaVie [59]	SD-1.5	2s	8	Vimeo25M	320 x 512	50
StableVideoDiffusion [2]	SD-2.1	2.5s	10	LVD-500M	576 x 1024	20
VideoCrafter2 [6]	SD-2.1	2s	16	WebVid10M	320 x 512	50
T2V-Turbo [34]	VideoCrafter2	2s	8	WebVid10M	320 x 512	4
OpenSora [42]	Pixart- $\alpha$	2s	16	WebVid10M	320 x 512	50

Text-guided Image Editing

In Table 3, we list all the image editing models and approaches. Some of them are plug-and-play approaches that require no training, like Pix2PixZero [46], InfEdit [63], and SDEdit [39]. These methods can be applied to a broad range of diffusion models. Some of the models, like PnP [56] and Prompt2Prompt [15], require DDIM inversion, which takes much longer than the other approaches. We also include specially trained image editing models like InstructPix2Pix [4], MagicBrush [65], and CosXLEdit [1].

Text-to-Video Generation

In Table 4, we list all the text-to-video generation models. We include different types of models. For example, AnimateDiff [14], ModelScope [58], and LaVie [59] are initialized from SD-1.5 and further trained with an injected motion layer that captures the temporal relation between frames. In contrast, StableVideoDiffusion [2] and VideoCrafter2 [5] are initialized from SD-2.1. Besides these models, we also include OpenSora [42], which uses a Sora-like diffusion transformer [47] architecture for joint space-time attention.

3.3 Elo Rating System

Online Elo Rating

The Elo rating system models the probability of player $i$ winning against player $j$ , based on their current ratings, $R_{i}$ and $R_{j}$ respectively, where $i,j\in N$ . We define a binary outcome $Y_{ij}$ for each comparison between player $i$ and player $j$ , where $Y_{ij}=1$ if player $i$ wins and $Y_{ij}=0$ otherwise. The logistic probability is formulated as:

$$P(Y_{ij}=1)=\frac{1}{1+10^{(R_{j}-R_{i})/\alpha}} \tag{1}$$

where $\alpha=400$ for Elo rating computation. After each match, a player’s rating is updated using the formula:

$$R^{\prime}_{i}=R_{i}+K\times(S(i,j)-E(i,j)) \tag{2}$$

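where $K$ is a constant that controls the update step size, $S(i,j)$ is the actual match outcome, with $S(i,j)=1$ for a win, $S(i,j)=0.5$ for a tie, and $S(i,j)=0$ for a loss, and $E(i,j)=P(Y_{ij}=1)$ is the expected outcome from Equation (1).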

For example, if one model has an Elo rating of 1200 and another has an Elo rating of 1100, the estimated probability of the first model winning is $\frac{1}{1+10^{(1100-1200)/400}}\approx 0.64$. This mapping from absolute ratings to the pairwise winning rate of two models gives a direct, intuitive understanding of what an Elo score means.
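As a minimal sketch of Equations (1) and (2), the following Python snippet computes the expected score and a single rating update. The function names and the value $K=32$ are illustrative assumptions; the text above does not state which $K$ is used.

```python
def expected_score(r_i: float, r_j: float, alpha: float = 400.0) -> float:
    """Eq. (1): probability that player i beats player j."""
    return 1.0 / (1.0 + 10.0 ** ((r_j - r_i) / alpha))


def update_rating(r_i: float, r_j: float, outcome: float, k: float = 32.0) -> float:
    """Eq. (2): new rating of player i.

    outcome is 1.0 for a win, 0.5 for a tie, 0.0 for a loss.
    k = 32 is an assumed step size, not a value taken from the paper.
    """
    return r_i + k * (outcome - expected_score(r_i, r_j))


print(round(expected_score(1200, 1100), 2))  # 0.64, the worked example above
gain_favourite = update_rating(1200, 1100, 1.0) - 1200
gain_underdog = update_rating(1100, 1200, 1.0) - 1100
print(gain_favourite < gain_underdog)  # True: the underdog gains more for a win
```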

Another design principle behind the Elo rating is that a higher-rated player should gain fewer points for beating a lower-rated player but lose more points for losing to one, whereas the lower-rated player experiences the opposite. As a result, the order of a specific set of matches significantly affects the final computed Elo rating, because both the player's Elo rating and the rating gain of each match change dynamically. This online Elo rating system may suit real-world competitions, where players usually have fewer than 100 matches a year. However, an arena for AI models usually comes with thousands of votes (matches), and the quality of the votes is not guaranteed. Thus, it is necessary to obtain an order-consistent and more stable Elo rating. To do this, we follow Chatbot Arena [10] and adopt the Bradley–Terry model [3] for a statistically estimated Elo rating.
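The order dependence of online Elo can be illustrated with a small sketch. It replays the same ten toy matches in two different orders; the match data, $K=32$, and the initial rating of 1000 are assumptions made only for illustration.

```python
def expected_score(r_i, r_j, alpha=400.0):
    """Eq. (1): probability that player i beats player j."""
    return 1.0 / (1.0 + 10.0 ** ((r_j - r_i) / alpha))


def online_elo(matches, k=32.0, init=1000.0):
    """Replay (winner, loser) pairs in order and return the final ratings."""
    ratings = {}
    for winner, loser in matches:
        r_w = ratings.setdefault(winner, init)
        r_l = ratings.setdefault(loser, init)
        e_w = expected_score(r_w, r_l)          # expected score of the winner
        ratings[winner] = r_w + k * (1.0 - e_w)
        ratings[loser] = r_l + k * (0.0 - (1.0 - e_w))
    return ratings


# Toy data: A beats B six times, B beats A four times.
matches = [("A", "B")] * 6 + [("B", "A")] * 4
in_order = online_elo(matches)
reversed_order = online_elo(list(reversed(matches)))

# Same set of votes, different final ratings: later matches weigh more.
print(round(in_order["A"], 1), round(reversed_order["A"], 1))
```

Because the two runs see exactly the same multiset of outcomes, any difference between them comes only from the ordering, which is the instability that a batch estimate such as Bradley–Terry avoids.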

Bradley–Terry Model Estimation

The Bradley–Terry (BT) model [3] estimates Elo ratings using logistic regression and maximum likelihood estimation (MLE). Suppose there are $N$ players and we have a series of pairwise comparisons, where $W_{ij}$ is the number of times player $i$ has won against player $j$ . The log-likelihood function for all pairwise comparisons is written as:

$$\mathcal{L}(\mathbf{R})=\sum_{i,j\in N,\,i\neq j}\left(W_{ij}Y_{ij}\log P(Y_{ij}=1)\right) \tag{3}$$

where $\mathbf{R}=\{R_{1},\ldots,R_{N}\}$ represents the Elo ratings of each player. The Bradley–Terry model provides a stable statistical estimation of the players’ ratings by consistently incorporating all pairwise comparisons, thus overcoming the limitations of direct Elo computation in online settings.

Since the BT model does not account for ties, in practice we first duplicate all the votes, then allocate half of the "tie" votes to the scenario where model $i$ wins ($Y_{ij}=1$) and the other half to the scenario where model $j$ wins ($Y_{ij}=0$). We cast the estimation as a logistic regression problem and solve it with the LogisticRegression model from sklearn.
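The tie handling and maximum-likelihood fit described above can be sketched as follows. To stay dependency-free, this sketch swaps in plain gradient ascent on the Bradley–Terry log-likelihood instead of sklearn's LogisticRegression; the function name, the vote format, the learning rate, and the centering of ratings at 1000 are all assumptions.

```python
import math
from collections import defaultdict


def fit_bradley_terry(votes, scale=400.0, init=1000.0, lr=1.0, steps=5000):
    """votes: list of (model_a, model_b, result), result in {"a", "b", "tie"}.

    Returns {model: rating} on the Elo scale of Eq. (1).
    """
    # Duplicate every vote; a tie contributes one win to each side.
    wins = defaultdict(float)  # wins[(i, j)] = weighted count of i beating j
    for a, b, result in votes:
        if result == "a":
            wins[(a, b)] += 2.0
        elif result == "b":
            wins[(b, a)] += 2.0
        else:
            wins[(a, b)] += 1.0
            wins[(b, a)] += 1.0

    models = sorted({m for pair in wins for m in pair})
    theta = {m: 0.0 for m in models}  # natural-log strengths
    total = sum(wins.values())
    for _ in range(steps):
        grad = {m: 0.0 for m in models}
        for (i, j), w in wins.items():
            p_ij = 1.0 / (1.0 + math.exp(theta[j] - theta[i]))
            grad[i] += w * (1.0 - p_ij)  # d/d theta_i of w * log p_ij
            grad[j] -= w * (1.0 - p_ij)  # d/d theta_j of w * log p_ij
        for m in models:
            theta[m] += lr * grad[m] / total
    # 10^(R / scale) = exp(theta)  =>  R = scale * theta / ln(10)
    return {m: init + scale * theta[m] / math.log(10.0) for m in models}


votes = [("A", "B", "a")] * 3 + [("A", "B", "b")]
ratings = fit_bradley_terry(votes)
# With a 3:1 record the fitted gap approaches 400 * log10(3), about 190.8.
print(round(ratings["A"] - ratings["B"], 1))
```

Unlike the online update, this estimate depends only on the aggregated win counts, so shuffling the votes leaves the ratings unchanged.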

Confidence Interval

To further investigate the variance of the estimated Elo rating, we use the "sandwich" standard errors described in Huber et al. [22]. That is, for each round, we record the estimated Elo rating based on the same number of battles sampled from the previous round. This process continues for 100 rounds. We select the lowest sampled Elo rating as the lower bound of the confidence interval and the highest sampled Elo rating as the upper bound.
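The resampling procedure above can be sketched as below. This is a sketch under stated assumptions: battles are resampled with replacement from the full set in every round (the text does not spell out the sampling scheme), and a simple win-rate function stands in for the Bradley–Terry estimator to keep the example short. Both function names are hypothetical.

```python
import random


def bootstrap_interval(battles, estimate, rounds=100, seed=0):
    """Return {model: (low, high)} over `rounds` resamples of the battles.

    `estimate` maps a list of battles to {model: rating}; the lowest and
    highest sampled ratings are used as the interval bounds.
    """
    rng = random.Random(seed)
    samples = {}
    for _ in range(rounds):
        resampled = rng.choices(battles, k=len(battles))  # with replacement
        for model, rating in estimate(resampled).items():
            samples.setdefault(model, []).append(rating)
    return {m: (min(v), max(v)) for m, v in samples.items()}


def win_rate(battles):
    """Stand-in estimator (assumption): fraction of battles each model won."""
    played, won = {}, {}
    for winner, loser in battles:
        for m in (winner, loser):
            played[m] = played.get(m, 0) + 1
        won[winner] = won.get(winner, 0) + 1
    return {m: won.get(m, 0) / played[m] for m in played}


battles = [("A", "B")] * 70 + [("B", "A")] * 30
low, high = bootstrap_interval(battles, win_rate)["A"]
print(low, high)  # an interval around A's observed win rate of 0.7
```

Any estimator with the same signature, such as a Bradley–Terry fit, can be passed as `estimate` in place of `win_rate`.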

3.4 GenAI-Museum

Currently, GenAI-Arena runs the models on the Hugging Face Zero GPU system [23]. As shown in Table 3, a single generative inference usually takes from 5 to 120 seconds. Unlike auto-regressive language models, for which inference acceleration techniques like vLLM [29] and SGLang [68] generate responses in less than a second, the diffusion model community does not have such powerful infrastructure. Therefore, pre-computation becomes a necessary way to mitigate computational overhead and streamline user interaction.

To achieve this, GenAI-Museum serves as a pre-computed data pool comprising various inputs from existing datasets or user collection, along with each model's output. Based on this, a "Random Sample" button, shown in Figure 2, is additionally implemented to facilitate the random generation of prompts and the immediate retrieval of the corresponding images or videos. Every time the "Random Sample" button is hit, a request is sent to our deployed GenAI-Museum, which returns an input and the pre-computed outputs of two random models. In this way, we save GPU computation time, enable users to compare and vote instantly in the UI, and balance the votes for each unique input so that we gradually collect votes for the full combination of all models. The input prompts were sampled from ImagenHub [28] and VBench [21].
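One way such a pre-computed pool could balance votes is sketched below. The `Museum` class, its methods, and the "least-voted combination first" policy are hypothetical illustrations, not the deployed GenAI-Museum service, whose API is not described here.

```python
import itertools
import random


class Museum:
    """Hypothetical in-memory stand-in for a pre-computed output pool."""

    def __init__(self, outputs, seed=0):
        # outputs[prompt][model] -> path or URL of the pre-computed output
        self.outputs = outputs
        self.votes = {}  # (prompt, model_a, model_b) -> votes collected so far
        self.rng = random.Random(seed)

    def random_sample(self):
        """Pick a prompt and two models, favouring the least-voted combinations."""
        candidates = [
            (prompt, a, b)
            for prompt, by_model in self.outputs.items()
            for a, b in itertools.combinations(sorted(by_model), 2)
        ]
        fewest = min(self.votes.get(c, 0) for c in candidates)
        prompt, a, b = self.rng.choice(
            [c for c in candidates if self.votes.get(c, 0) == fewest]
        )
        if self.rng.random() < 0.5:  # randomise left/right placement
            a, b = b, a
        return prompt, (a, self.outputs[prompt][a]), (b, self.outputs[prompt][b])

    def record_vote(self, prompt, model_a, model_b):
        """Count one vote for this prompt and (unordered) model pair."""
        key = (prompt, *sorted((model_a, model_b)))
        self.votes[key] = self.votes.get(key, 0) + 1


museum = Museum({"a cute dog": {"M1": "m1.png", "M2": "m2.png", "M3": "m3.png"}})
for _ in range(3):
    prompt, (left, _), (right, _) = museum.random_sample()
    museum.record_vote(prompt, left, right)
print(museum.votes)  # each of the three model pairs has been shown once
```

No model is run at request time in this design; serving a sample is a lookup, which is what makes instant comparison possible.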

Table 5: GenAI-Arena Leaderboards.
(Last updated on June 4th, 2024)

(a) Text-to-Image

Model	Elo	95% CI
Playground V2.5	1150	+21/-22
Playground V2	1101	+14/-20
StableCascade	1057	+20/-24
SDXL-Lightning	1053	+22/-22
PixArt- $\alpha$	1052	+15/-19
PixArt- $\sigma$	1050	+26/-23
SDXL	1001	+15/-14
SDXL-Turbo	935	+18/-16
OpenJourney	853	+13/-17
LCM	817	+20/-20

(b) Image editing

Model	Elo	95% CI
MagicBrush	1111	+28/-32
InfEdit	1079	+27/-33
CosXLEdit	1066	+31/-30
InstructPix2Pix	1033	+32/-26
PnP	998	+37/-36
Prompt2Prompt	988	+25/-25
CycleDiffusion	939	+23/-26
SDEdit	929	+25/-21
Pix2PixZero	857	+20/-24

(c) Text-to-Video

Model	Elo	95% CI
T2V-Turbo	1113	+53/-46
StableVideoDiffusion	1105	+45/-37
VideoCrafter2	1077	+18/-18
AnimateDiff	1075	+23/-26
LaVie	997	+24/-26
OpenSora	916	+19/-23
ModelScope	866	+19/-22
AnimateDiff-Turbo	851	+18/-20

4 Benchmarks and Results Discussion

4.1 Arena Leaderboard

We report our leaderboard at the time of writing in Table 5. For image generation, we collected 4443 votes in total. The top-ranked models are currently Playground V2.5 and Playground V2. Both models are released by Playground.ai; they follow the same architecture as SDXL but are trained on a private dataset. In contrast, SDXL ranks only seventh, lagging significantly behind. This finding highlights the importance of the training dataset. Following the Playground models is StableCascade, which uses a highly efficient cascade architecture to lower the training cost. According to Würstchen [48], StableCascade requires only 10% of the training cost of SD-2.1, yet it beats SDXL significantly on our leaderboard. This highlights the importance of the diffusion architecture for achieving strong performance.

For image editing, a total of 1083 votes have been collected. MagicBrush, InfEdit, CosXLEdit, and InstructPix2Pix rank higher because they can perform localized editing on images. PnP preserves the structure with feature injections, which limits the variety of edits. Older methods such as Prompt2Prompt, CycleDiffusion, SDEdit, and Pix2PixZero frequently produce completely different images during editing, despite their high image quality, which explains the lower ranking of these models.

For text-to-video, there are 1568 votes in total. T2V-Turbo leads with the highest Elo score, suggesting it is the most effective model. StableVideoDiffusion ranks a close second. VideoCrafter2 and AnimateDiff follow with very close Elo scores, showing nearly equivalent capabilities. LaVie, OpenSora, ModelScope, and AnimateDiff-Turbo follow with decreasing scores, indicating progressively lower performance.

4.2 Discussion and Insights

Winning Fraction and Elo Rating

We visualize the winning fraction heatmap in Figure 4, where each cell represents the actual winning fraction of Model A over Model B. The models are ordered by their Elo rating in the heatmap. Horizontally across each row, the winning fraction of Model A increases as the Elo rating of Model B decreases, demonstrating the effectiveness of the Elo rating system in ranking different models.

Specific cells in the heatmap reveal notable findings. For instance, although Playground V2.5 achieves the state-of-the-art (SOTA) Elo rating in the Text-to-Image task, its winning fraction over PixArt-$\sigma$ is only $0.48$, which is below 50%. Similarly, the Text-to-Video SOTA model, T2V-Turbo, has a lower winning fraction against StableVideoDiffusion. The higher Elo rating of T2V-Turbo might be due to our Arena collecting more votes from "easy games" against low-ranked models and fewer from "harder games" against high-ranked models. For example, the number of battles between T2V-Turbo and AnimateDiff-Turbo ($30$) is far larger than the number between T2V-Turbo and each of the other models (around $10$) in Figure 4. These anomalies highlight potential drawbacks of the Elo rating system: (1) a reliable and robust Elo rating requires a large amount of voting data, and (2) the estimated Elo rating may be biased by the imbalance between "easy games" and "harder games," as they carry similar weight in the estimation.
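The quantities behind such a heatmap, the pairwise winning fraction and the battle count per pair, can be computed with a short sketch like the following. The battle record format is an assumption, and ties are assumed to have been dropped beforehand.

```python
from collections import defaultdict


def winning_fractions(battles):
    """battles: list of (model_a, model_b, winner), winner being one of the two.

    Returns (fraction, count): fraction[(a, b)] is the share of a-vs-b battles
    won by a, and count[(a, b)] is the number of battles between a and b.
    """
    wins = defaultdict(int)
    count = defaultdict(int)
    for a, b, winner in battles:
        loser = b if winner == a else a
        wins[(winner, loser)] += 1
        count[(a, b)] += 1
        count[(b, a)] += 1
    fraction = {pair: wins[pair] / n for pair, n in count.items()}
    return fraction, dict(count)


# Toy data: X meets a weak model Z far more often than a strong model Y.
battles = (
    [("X", "Z", "X")] * 9 + [("X", "Z", "Z")] * 1
    + [("X", "Y", "X")] * 1 + [("X", "Y", "Y")] * 2
)
fraction, count = winning_fractions(battles)
print(fraction[("X", "Z")], count[("X", "Z")])  # 0.9 over 10 battles
print(count[("X", "Y")])                        # only 3 battles against Y
```

Inspecting `count` alongside `fraction` makes the "easy games" imbalance visible: a model's rating can be driven mostly by its most frequent, and possibly weakest, opponents.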

Case Study

We present case studies in Figure 5, showcasing the votes collected for the three generative tasks. These cases demonstrate that GenAI-Arena users can provide high-quality votes, even for the most advanced models. For instance, in the text-to-image task, the image generated by Playground V2.5 was preferred over that of SDXL-Lightning for the prompt "a cute dog is playing with a ball," as the latter depicted two dogs instead of one. Users can clearly distinguish and vote based on the quality of the outputs, even when both models complete the task. In the image editing task, the edited image from Prompt2Prompt appeared more natural than the one from InfEdit, leading users to cast a definitive vote. Similarly, votes collected for the text-to-video task were also of high quality.

5 GenAI-Bench

5.1 Dataset

We applied Llama Guard [25] as an NSFW filter to ensure that user input prompts are appropriate for a wide audience and to protect users of the benchmark from exposure to potentially harmful or offensive content. In the text-to-image generation task, we collected 4.3k anonymous votes in total, of which 1.7k remain after filtering for safe content. A large share of the prompts is filtered out due to sexual content, which accounts for 85.6% of the discarded data. In the text-guided image editing task, we collected 1.1k votes before filtering; after applying Llama Guard, 0.9k votes are released. In this task, 87.5% of the unsafe inputs involve violent crimes, and the other 12.5% are filtered out for sex-related crimes. For the text-to-video generation task, our platform collected 1.2k votes before post-processing; after cleaning with the NSFW filter, we release the remaining 1.1k votes. All of the unsafe data discarded in this task is due to sexual content. We release the current version of GenAI-Bench (https://huggingface.co/datasets/TIGER-Lab/GenAI-Bench) on the Hugging Face Datasets website under an MIT license, which allows reuse with or without modification.

5.2 Correlations

To further analyze the collected human votes, we compute their correlation with several existing metrics. Specifically, we selected CLIPScore [16], GPT-4o [43], Gemini-1.5-Pro [52], Idefics2 [30], and Mantis [26] as our judges. For the MLLMs, we used the prompt from VIEScore [27], which includes ratings of semantics, quality, and overall performance, to evaluate the image generation tasks. Since VIEScore does not cover prompts related to video evaluation, we designed a suite of prompt templates in subsection A.6 for prompting MLLMs to evaluate the output quality of text-to-video generation tasks. Videos are split into image frames and fed to the MLLMs as an image sequence. We encoded the voting results and computed their correlations with the score differences that each metric assigns to the two models. As shown in Table 6, the correlations are generally low. Most MLLM correlations with this preference-based voting approach are nearly random. Notably, CLIPScore achieved a low but significant correlation, in the range of 0.2. GPT-4o's quality measure achieves a similarly low correlation for the text-to-image and text-to-video tasks but shows a random correlation for the image editing task.

Table 6: Correlation Study on existing metrics and human votings in GenAI-Bench.

Metrics	Text-To-Image			Image Editing			Text-To-Video
Subscore	semantics	quality	overall	semantics	quality	overall	semantics	quality	overall
Random	-0.0188	-0.0188	-0.0188	-0.0293	-0.0293	-0.0293	-0.0168	-0.0168	-0.0168
CLIPScore [16]	0.1450	0.1450	0.1450	0.1434	0.1434	0.1434	0.2643	0.2643	0.2643
GPT-4o [43]	-0.0749	0.2259	-0.0233	-0.0314	0.0048	-0.0187	0.0216	0.2169	0.0393
Gemini-1.5-Pro [52]	-0.0008	0.1725	0.0114	0.0997	-0.0308	0.0916	0.0428	-0.0388	-0.0073
Idefics2 [30]	-0.0571	-0.1155	-0.0956	0.0325	-0.0363	0.0101	-0.1647	-0.0807	-0.1490
Mantis [26]	-0.0045	0.0118	-0.0078	-0.0754	-0.1006	-0.1258	-0.0301	-0.0001	0.0050
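The correlation computation described in Section 5.2 can be sketched as follows. The numeric encoding of votes (left wins = 1, right wins = -1, tie = 0) and the record format are assumptions; the text states only that votes are encoded and correlated with the score differences between the two models.

```python
import math

# Assumed encoding of human votes; the paper does not specify the exact codes.
VOTE_CODE = {"left": 1.0, "right": -1.0, "tie": 0.0}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def judge_correlation(records):
    """records: list of (vote, judge_score_left, judge_score_right)."""
    encoded = [VOTE_CODE[vote] for vote, _, _ in records]
    diffs = [left - right for _, left, right in records]
    return pearson(encoded, diffs)


# Toy records: the judge's score difference always agrees with the human vote.
records = [("left", 8.0, 5.0), ("right", 2.0, 5.0), ("tie", 4.0, 4.0)]
print(round(judge_correlation(records), 3))  # 1.0
```

A judge whose score differences are unrelated to the votes would yield a value near zero, which is the regime most MLLM entries in Table 6 fall into.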

6 Conclusion

In this paper, we introduced GenAI-Arena, an open platform designed to rank generative models across text-to-image, image editing, and text-to-video tasks based on user preference. Unlike other platforms, GenAI-Arena is driven by community voting to ensure transparency and sustainable operation. We employed side-by-side human voting to evaluate the models and collected over 6000 votes starting from February 11th, 2024. We compiled an Elo leaderboard from the votes and found that Playground V2.5, MagicBrush, and T2V-Turbo are the current state-of-the-art models in the three tasks (as of June 4th, 2024). Analysis of the collected votes shows that while the Elo rating is generally functional, it can be biased by the imbalance between "easy games" and "hard games". Several case studies demonstrate the high quality of our collected votes. In addition, we released the human preference votes as GenAI-Bench. We prompted existing MLLMs to evaluate the generated images and videos in GenAI-Bench and computed their correlations with human voting. The experiments show that existing MLLMs achieve very low correlation: even the best model, GPT-4o, achieves only $0.22$ Pearson correlation with human voting on quality and is no better than random guessing on the other aspects. In the future, we will continue collecting human votes to update the leaderboard, helping the community keep track of research progress. We also plan to develop a more robust MLLM to better approximate human ratings in GenAI-Bench.

References

AI [2024] S. AI. CosXL. https://huggingface.co/stabilityai/cosxl, 2024. Accessed on: 2024-04-13.
Blattmann et al. [2023] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, and D. Lorenz. Stable video diffusion: Scaling latent video diffusion models to large datasets. ArXiv, abs/2311.15127, 2023. URL https://api.semanticscholar.org/CorpusID:265312551.
Bradley and Terry [1952] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
Brooks et al. [2023] T. Brooks, A. Holynski, and A. A. Efros. Instructpix2pix: Learning to follow image editing instructions. In CVPR, 2023.
Chen et al. [2024a] H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y. Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. arXiv preprint arXiv:2401.09047, 2024a.
Chen et al. [2024b] H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C.-L. Weng, and Y. Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. ArXiv, abs/2401.09047, 2024b. URL https://api.semanticscholar.org/CorpusID:267028095.
Chen et al. [2023] J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. T. Kwok, P. Luo, H. Lu, and Z. Li. Pixart- $\alpha$ : Fast training of diffusion transformer for photorealistic text-to-image synthesis. ArXiv, abs/2310.00426, 2023. URL https://api.semanticscholar.org/CorpusID:263334265.
Chen et al. [2024c] J. Chen, C. Ge, E. Xie, Y. Wu, L. Yao, X. Ren, Z. Wang, P. Luo, H. Lu, and Z. Li. Pixart- $\sigma$ : Weak-to-strong training of diffusion transformer for 4k text-to-image generation. ArXiv, abs/2403.04692, 2024c. URL https://api.semanticscholar.org/CorpusID:268264262.
Chen et al. [2024d] Q. Chen, X. Chen, H. Song, Z. Xiong, A. Yuille, C. Wei, and Z. Zhou. Towards generalizable tumor synthesis, 2024d.
Chiang et al. [2024a] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, and I. Stoica. Chatbot arena: An open platform for evaluating llms by human preference. ArXiv, abs/2403.04132, 2024a. URL https://api.semanticscholar.org/CorpusID:268264163.
Chiang et al. [2024b] W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132, 2024b.
Fu et al. [2024] S. Fu, N. Tamir, S. Sundaram, L. Chai, R. Zhang, T. Dekel, and P. Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. Advances in Neural Information Processing Systems, 36, 2024.
Ghazanfari et al. [2023] S. Ghazanfari, A. Araujo, P. Krishnamurthy, F. Khorrami, and S. Garg. Lipsim: A provably robust perceptual similarity metric. In The Twelfth International Conference on Learning Representations, 2023.
Guo et al. [2023] Y. Guo, C. Yang, A. Rao, Z. Liang, Y. Wang, Y. Qiao, M. Agrawala, D. Lin, and B. Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. In The Twelfth International Conference on Learning Representations, 2023.
Hertz et al. [2022] A. Hertz, R. Mokady, J. M. Tenenbaum, K. Aberman, Y. Pritch, and D. Cohen-Or. Prompt-to-prompt image editing with cross attention control. ArXiv, abs/2208.01626, 2022. URL https://api.semanticscholar.org/CorpusID:251252882.
Hessel et al. [2021] J. Hessel, A. Holtzman, M. Forbes, R. L. Bras, and Y. Choi. CLIPScore: a reference-free evaluation metric for image captioning. In EMNLP, 2021.
Heusel et al. [2017] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
Ho et al. [2022] J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
Hu et al. [2023] Y. Hu, B. Liu, J. Kasai, Y. Wang, M. Ostendorf, R. Krishna, and N. A. Smith. Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417, 2023.
Huang et al. [2023] K. Huang, K. Sun, E. Xie, Z. Li, and X. Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36:78723–78747, 2023.
Huang et al. [2024] Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit, Y. Wang, X. Chen, L. Wang, D. Lin, Y. Qiao, and Z. Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
Huber et al. [1967] P. J. Huber et al. The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 221–233. Berkeley, CA: University of California Press, 1967.
Hugging Face [2024] Hugging Face. Zerogpu. https://huggingface.co/zero-gpu-explorers, 2024. Accessed: 2024-06-02.
Hugging Face Spaces [2024] Hugging Face Spaces. Tokenizer arena. https://huggingface.co/spaces/eson/tokenizer-arena, 2024. Accessed: 2024-06-05.
Inan et al. [2023] H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y. Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, and M. Khabsa. Llama guard: Llm-based input-output safeguard for human-ai conversations. ArXiv, abs/2312.06674, 2023. URL https://api.semanticscholar.org/CorpusID:266174345.
Jiang et al. [2024] D. Jiang, X. He, H. Zeng, C. Wei, M. Ku, Q. Liu, and W. Chen. Mantis: Interleaved multi-image instruction tuning. arXiv preprint arXiv:2405.01483, 2024.
Ku et al. [2024a] M. Ku, D. Jiang, C. Wei, X. Yue, and W. Chen. Viescore: Towards explainable metrics for conditional image synthesis evaluation. In Proceedings of Annual Meeting of the Association for Computational Linguistics, 2024a.
Ku et al. [2024b] M. Ku, T. Li, K. Zhang, Y. Lu, X. Fu, W. Zhuang, and W. Chen. Imagenhub: Standardizing the evaluation of conditional image generation models. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=OuV9ZrkQlc.
Kwon et al. [2023] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
Laurençon et al. [2024] H. Laurençon, L. Tronchon, M. Cord, and V. Sanh. What matters when building vision-language models?, 2024.
Lee et al. [2024] T. Lee, M. Yasunaga, C. Meng, Y. Mai, J. S. Park, A. Gupta, Y. Zhang, D. Narayanan, H. Teufel, M. Bellagente, et al. Holistic evaluation of text-to-image models. Advances in Neural Information Processing Systems, 36, 2024.
Li et al. [2024a] D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi. Playground v2.5: Three insights towards enhancing aesthetic quality in text-to-image generation. ArXiv, abs/2402.17245, 2024a. URL https://api.semanticscholar.org/CorpusID:268033039.
Li et al. [2024b] D. Li, A. Kamko, A. Sabet, E. Akhgari, L. Xu, and S. Doshi. Playground v2, 2024b. URL https://huggingface.co/playgroundai/playground-v2-1024px-aesthetic.
Li et al. [2024c] J. Li, W. Feng, T.-J. Fu, X. Wang, S. Basu, W. Chen, and W. Y. Wang. T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback. ArXiv, 2024c. URL https://api.semanticscholar.org/CorpusID:270094742.
Li et al. [2023] T. Li, M. Ku, C. Wei, and W. Chen. Dreamedit: Subject-driven image editing. Transactions on Machine Learning Research, 2023.
Lin et al. [2024] S. Lin, A. Wang, and X. Yang. Sdxl-lightning: Progressive adversarial diffusion distillation. ArXiv, abs/2402.13929, 2024. URL https://api.semanticscholar.org/CorpusID:267770548.
Liu et al. [2023] Y. Liu, X. Cun, X. Liu, X. Wang, Y. Zhang, H. Chen, Y. Liu, T. Zeng, R. Chan, and Y. Shan. Evalcrafter: Benchmarking and evaluating large video generation models. arXiv preprint arXiv:2310.11440, 2023.
Luo et al. [2023] S. Luo, Y. Tan, L. Huang, J. Li, and H. Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. ArXiv, abs/2310.04378, 2023. URL https://api.semanticscholar.org/CorpusID:263831037.
Meng et al. [2021] C. Meng, Y. He, Y. Song, J. Song, J. Wu, J.-Y. Zhu, and S. Ermon. Sdedit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2021.
mrfakename et al. [2024] mrfakename, V. Srivastav, C. Fourrier, L. Pouget, Y. Lacombe, main, and S. Gandhi. Text to speech arena. https://huggingface.co/spaces/TTS-AGI/TTS-Arena, 2024.
Nichol et al. [2022] A. Q. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. Mcgrew, I. Sutskever, and M. Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In International Conference on Machine Learning, pages 16784–16804. PMLR, 2022.
of Singapore [2024] N. U. of Singapore. Open-Sora: Democratizing Efficient Video Production for All. https://github.com/hpcaitech/Open-Sora/blob/main/docs/report_01.md, 2024. Accessed on: 2024-05-24.
OpenAI [2023] OpenAI. Gpt-4 technical report, 2023.
openjourney.ai [2023] openjourney.ai. Openjourney is an open source stable diffusion fine tuned model on midjourney images, 2023. URL https://huggingface.co/prompthero/openjourney.
Otani et al. [2023] M. Otani, R. Togashi, Y. Sawai, R. Ishigami, Y. Nakashima, E. Rahtu, J. Heikkilä, and S. Satoh. Toward verifiable and reproducible human evaluation for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14277–14286, 2023.
Parmar et al. [2023] G. Parmar, K. Kumar Singh, R. Zhang, Y. Li, J. Lu, and J.-Y. Zhu. Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
Peebles and Xie [2023] W. Peebles and S. Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
Pernias et al. [2023] P. Pernias, D. Rampas, M. L. Richter, C. Pal, and M. Aubreville. Würstchen: An efficient architecture for large-scale text-to-image diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
Podell et al. [2023] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Muller, J. Penna, and R. Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. ArXiv, abs/2307.01952, 2023. URL https://api.semanticscholar.org/CorpusID:259341735.
Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
Ramesh et al. [2022] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with clip latents. ArXiv, abs/2204.06125, 2022. URL https://api.semanticscholar.org/CorpusID:248097655.
Reid et al. [2024] M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
Saharia et al. [2022] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. L. Denton, K. Ghasemipour, R. Gontijo Lopes, B. Karagol Ayan, T. Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
Salimans et al. [2016] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, and X. Chen. Improved techniques for training gans. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper_files/paper/2016/file/8a3363abe792db2d8761d6403605aeb7-Paper.pdf.
Sauer et al. [2023] A. Sauer, D. Lorenz, A. Blattmann, and R. Rombach. Adversarial diffusion distillation. ArXiv, abs/2311.17042, 2023. URL https://api.semanticscholar.org/CorpusID:265466173.
Tumanyan et al. [2023] N. Tumanyan, M. Geyer, S. Bagon, and T. Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.
Unterthiner et al. [2018] T. Unterthiner, S. van Steenkiste, K. Kurach, R. Marinier, M. Michalski, and S. Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
Wang et al. [2023a] J. Wang, H. Yuan, D. Chen, Y. Zhang, X. Wang, and S. Zhang. Modelscope text-to-video technical report. ArXiv, abs/2308.06571, 2023a. URL https://api.semanticscholar.org/CorpusID:260887737.
Wang et al. [2023b] Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang, et al. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103, 2023b.
Wang et al. [2004] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
Wu and la Torre [2023] C. H. Wu and F. D. la Torre. A latent space of stochastic diffusion models for zero-shot image editing and guidance. In ICCV, 2023.
Xu et al. [2023] P. Xu, W. Shao, K. Zhang, P. Gao, S. Liu, M. Lei, F. Meng, S. Huang, Y. Qiao, and P. Luo. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023.
Xu et al. [2024] S. Xu, Y. Huang, J. Pan, Z. Ma, and J. Chai. Inversion-free image editing with natural language. In Conference on Computer Vision and Pattern Recognition 2024, 2024.
Zhang et al. [2024] H. Zhang, J. Yang, S. Wan, and P. Fua. Lefusion: Synthesizing myocardial pathology on cardiac mri via lesion-focus diffusion models, 2024.
Zhang et al. [2023a] K. Zhang, L. Mo, W. Chen, H. Sun, and Y. Su. Magicbrush: A manually annotated dataset for instruction-guided image editing. NeurIPS dataset and benchmark track, 2023a.
Zhang et al. [2023b] L. Zhang, A. Rao, and M. Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023b.
Zhang et al. [2018] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
Zheng et al. [2023] L. Zheng, L. Yin, Z. Xie, J. Huang, C. Sun, C. H. Yu, S. Cao, C. Kozyrakis, I. Stoica, J. E. Gonzalez, C. Barrett, and Y. Sheng. Efficiently programming large language models using sglang, 2023.
Zheng et al. [2024] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024.
Zhu et al. [2023] D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. In The Twelfth International Conference on Learning Representations, 2023.

Appendix A Appendix

A.1 Broader Society Impacts

The establishment of GenAI-Arena and the release of GenAI-Bench have broader societal implications. By democratizing the evaluation of generative models, GenAI-Arena encourages transparency and community engagement in AI development. This can lead to more trust in AI technologies, as the public can gain insights into how models perform according to peer evaluations. Moreover, involving the community in such evaluations can accelerate the identification of potentially harmful biases or unethical uses of AI technologies. However, there are potential risks associated with the widespread use of the generative AI technologies that GenAI-Arena evaluates. For instance, advancements in text-to-image and text-to-video generation can be misused to create misleading or harmful content, such as the content that the NSFW filter is designed to block.

A.2 Limitation

While the release of GenAI-Arena can enable a more reasonable evaluation of generative models, its development has several limitations. First, the user base participating in GenAI-Arena may not be diverse or representative enough to reflect the broader population’s preferences, which could bias the evaluation results. Despite efforts to attract voters with diverse backgrounds, there is an inherent challenge in ensuring a balanced representation across different cultures or professional backgrounds. In addition, the reliance on user feedback and votes introduces subjectivity into the evaluation process. While this is partially mitigated by the volume of data collected, individual biases and varying levels of expertise among users can skew the results.

A.3 Data Collection

We state in the GenAI-Arena UI that inputs and votes are collected for research purposes only. By using GenAI-Arena, users agree to the collection of their inputs and votes for research purposes. Users are informed that their data will be anonymized and will not be used for commercial purposes.

A.4 Extra Visualization on GenAI-Arena

We include additional analysis in Figures 6 and 7 to show the reliability of GenAI-Arena. Specifically, Figure 6 shows the error bars of the Elo ratings as an indication of their reliability, and Figure 7 shows the predicted average win rate of each model when it is played against the other models.
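As a rough illustration of how an average win rate can be predicted from Elo ratings, the sketch below uses the standard Elo expected-score formula. The scale of 400, the base of 10, and the example ratings are assumptions made for illustration; they are not values taken from the GenAI-Arena leaderboard, and the function names are hypothetical.

```python
def expected_win_prob(r_a: float, r_b: float,
                      scale: float = 400.0, base: float = 10.0) -> float:
    """Standard Elo expected score of a player rated r_a against one rated r_b.

    scale=400 and base=10 are the conventional Elo constants (assumed here).
    """
    return 1.0 / (1.0 + base ** ((r_b - r_a) / scale))


def average_win_rates(ratings: dict[str, float]) -> dict[str, float]:
    """Predicted win rate of each model, averaged uniformly over all other models."""
    rates = {}
    for name, rating in ratings.items():
        opponents = [r for other, r in ratings.items() if other != name]
        rates[name] = sum(expected_win_prob(rating, r) for r in opponents) / len(opponents)
    return rates


# Hypothetical ratings, for illustration only.
example_ratings = {"model_a": 1100.0, "model_b": 1000.0, "model_c": 900.0}
example_rates = average_win_rates(example_ratings)
```

A model rated higher than every opponent gets a predicted average win rate above 0.5, and two equally rated models are each predicted to win half of their games.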

A.5 VideoGenHub

VideoGenHub is an open-source library that standardizes the inference and evaluation of conditional video generation models, similar to ImagenHub [28] in the image domain. In the library, all models are implemented following the literature standard, and the random seed is set to 42 for a fair comparison, the same convention used in the ImagenHub [28] implementation.
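A minimal sketch of fixing the seed before inference is shown below. It is not VideoGenHub's actual code: the helper name `seed_everything` is hypothetical, and NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Seed the common random number generators (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)  # legacy global NumPy generator
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # PyTorch default generators
    except ImportError:
        pass


# Re-seeding with the same value reproduces the same random draws.
seed_everything(42)
first_draw = random.random()
seed_everything(42)
second_draw = random.random()
```

Calling the helper once before each generation run means every model starts from the same random state, which is the point of using a shared seed.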

A.6 Prompt Templates

In the following table, we present the prompts used for the text-to-video evaluation experiments in subsection 5.2. For text-to-image and image editing, we directly use the prompts from VIEScore [27]. The overall score is computed as the geometric mean of the semantic consistency and perceptual quality scores.
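The geometric-mean aggregation described above can be sketched as follows. The function name and the example values are illustrative assumptions, and the sketch only requires both sub-scores to be non-negative and on the same scale.

```python
import math


def overall_score(semantic_consistency: float, perceptual_quality: float) -> float:
    """Geometric mean of the two sub-scores (both assumed non-negative)."""
    if semantic_consistency < 0 or perceptual_quality < 0:
        raise ValueError("sub-scores must be non-negative")
    return math.sqrt(semantic_consistency * perceptual_quality)


# Hypothetical sub-scores, for illustration only.
example_overall = overall_score(4.0, 9.0)
```

Unlike an arithmetic mean, the geometric mean drops to zero whenever either sub-score is zero, so an output cannot score well overall if it fails on one of the two aspects.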

For Semantic consistency in Text-To-Video task:

For Perceptual Quality in Text-To-Video task: