-
VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
Authors:
Xuan He,
Dongfu Jiang,
Ge Zhang,
Max Ku,
Achint Soni,
Sherman Siu,
Haonan Chen,
Abhranil Chandra,
Ziyan Jiang,
Aaran Arulraj,
Kai Wang,
Quy Duc Do,
Yuansheng Ni,
Bohan Lyu,
Yaswanth Narsupalli,
Rongqi Fan,
Zhiheng Lyu,
Yuchen Lin,
Wenhu Chen
Abstract:
The recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metric is able to provide reliable scores over generated videos. The main barrier is the lack of large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-prov…
▽ More
The recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metric is able to provide reliable scores over generated videos. The main barrier is the lack of large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect score over 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) based on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further result on other held-out EvalCrafter, GenAI-Bench, and VBench show that VideoScore has consistently much higher correlation with human judges than other metrics. Due to these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.
△ Less
Submitted 24 June, 2024; v1 submitted 21 June, 2024;
originally announced June 2024.
-
GenAI Arena: An Open Evaluation Platform for Generative Models
Authors:
Dongfu Jiang,
Max Ku,
Tianle Li,
Yuansheng Ni,
Shizhuo Sun,
Rongqi Fan,
Wenhu Chen
Abstract:
Generative AI has made remarkable strides to revolutionize fields such as image and video generation. These advancements are driven by innovative algorithms, architecture, and data. However, the rapid proliferation of generative models has highlighted a critical gap: the absence of trustworthy evaluation metrics. Current automatic assessments such as FID, CLIP, FVD, etc often fail to capture the n…
▽ More
Generative AI has made remarkable strides to revolutionize fields such as image and video generation. These advancements are driven by innovative algorithms, architecture, and data. However, the rapid proliferation of generative models has highlighted a critical gap: the absence of trustworthy evaluation metrics. Current automatic assessments such as FID, CLIP, FVD, etc often fail to capture the nuanced quality and user satisfaction associated with generative outputs. This paper proposes an open platform GenAI-Arena to evaluate different image and video generative models, where users can actively participate in evaluating these models. By leveraging collective user feedback and votes, GenAI-Arena aims to provide a more democratic and accurate measure of model performance. It covers three arenas for text-to-image generation, text-to-video generation, and image editing respectively. Currently, we cover a total of 27 open-source generative models. GenAI-Arena has been operating for four months, amassing over 6000 votes from the community. We describe our platform, analyze the data, and explain the statistical methods for ranking the models. To further promote the research in building model-based evaluation metrics, we release a cleaned version of our preference data for the three tasks, namely GenAI-Bench. We prompt the existing multi-modal models like Gemini, GPT-4o to mimic human voting. We compute the correlation between model voting with human voting to understand their judging abilities. Our results show existing multimodal models are still lagging in assessing the generated visual content, even the best model GPT-4o only achieves a Pearson correlation of 0.22 in the quality subscore, and behaves like random guessing in others.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
Authors:
Yubo Wang,
Xueguang Ma,
Ge Zhang,
Yuansheng Ni,
Abhranil Chandra,
Shiguang Guo,
Weiming Ren,
Aaran Arulraj,
Xuan He,
Ziyan Jiang,
Tianle Li,
Max Ku,
Kai Wang,
Alex Zhuang,
Rongqi Fan,
Xiang Yue,
Wenhu Chen
Abstract:
In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in…
▽ More
In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.
△ Less
Submitted 23 June, 2024; v1 submitted 3 June, 2024;
originally announced June 2024.
-
MANTIS: Interleaved Multi-Image Instruction Tuning
Authors:
Dongfu Jiang,
Xuan He,
Huaye Zeng,
Cong Wei,
Max Ku,
Qian Liu,
Wenhu Chen
Abstract:
Large multimodal models (LMMs) have shown great results in single-image vision language tasks. However, their abilities to solve multi-image visual language tasks is yet to be improved. The existing LMMs like OpenFlamingo, Emu2, Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from the web, which is neither efficient nor effec…
▽ More
Large multimodal models (LMMs) have shown great results in single-image vision language tasks. However, their abilities to solve multi-image visual language tasks is yet to be improved. The existing LMMs like OpenFlamingo, Emu2, Idefics gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from the web, which is neither efficient nor effective. In this paper, we aim to build strong multi-image LMMs via instruction tuning with academic-level resources. Therefore, we meticulously construct Mantis-Instruct containing 721K multi-image instruction data to train a family of models Mantis. The instruction tuning empowers Mantis with different multi-image skills like co-reference, comparison, reasoning, and temporal understanding. We evaluate Mantis on five multi-image benchmarks and seven single-image benchmarks. Mantis-SigLIP can achieve SoTA results on all the multi-image benchmarks and beat the strongest multi-image baseline, Idefics2-8B by an average of 11 absolute points. Notably, Idefics2-8B was pre-trained on 140M interleaved multi-image data, which is 200x larger than Mantis-Instruct. We observe that Mantis performs equivalently well on the held-in and held-out benchmarks, which shows its generalization ability. Notably, we found that Mantis can even match the performance of GPT-4V on multi-image benchmarks. We further evaluate Mantis on single-image benchmarks and demonstrate that Mantis also maintains a strong single-image performance on par with CogVLM and Emu2. Our results show that multi-image abilities are not necessarily gained through massive pre-training, instead, it can be gained by the low-cost instruction tuning. Our work provides new perspectives on how to improve LMMs' multi-image abilities.
△ Less
Submitted 23 May, 2024; v1 submitted 2 May, 2024;
originally announced May 2024.
-
AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks
Authors:
Max Ku,
Cong Wei,
Weiming Ren,
Harry Yang,
Wenhu Chen
Abstract:
In the dynamic field of digital content creation using generative models, state-of-the-art video editing models still do not offer the level of quality and control that users desire. Previous works on video editing either extended from image-based generative models in a zero-shot manner or necessitated extensive fine-tuning, which can hinder the production of fluid video edits. Furthermore, these…
▽ More
In the dynamic field of digital content creation using generative models, state-of-the-art video editing models still do not offer the level of quality and control that users desire. Previous works on video editing either extended from image-based generative models in a zero-shot manner or necessitated extensive fine-tuning, which can hinder the production of fluid video edits. Furthermore, these methods frequently rely on textual input as the editing guidance, leading to ambiguities and limiting the types of edits they can perform. Recognizing these challenges, we introduce AnyV2V, a novel tuning-free paradigm designed to simplify video editing into two primary steps: (1) employing an off-the-shelf image editing model to modify the first frame, (2) utilizing an existing image-to-video generation model to generate the edited video through temporal feature injection. AnyV2V can leverage any existing image editing tools to support an extensive array of video editing tasks, including prompt-based editing, reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods. AnyV2V can also support any video length. Our evaluation indicates that AnyV2V significantly outperforms other baseline methods in automatic and human evaluations by significant margin, maintaining visual consistency with the source video while achieving high-quality edits across all the editing tasks.
△ Less
Submitted 10 June, 2024; v1 submitted 21 March, 2024;
originally announced March 2024.
-
VIEScore: Towards Explainable Metrics for Conditional Image Synthesis Evaluation
Authors:
Max Ku,
Dongfu Jiang,
Cong Wei,
Xiang Yue,
Wenhu Chen
Abstract:
In the rapidly advancing field of conditional image generation research, challenges such as limited explainability lie in effectively evaluating the performance and capabilities of various models. This paper introduces VIEScore, a Visual Instruction-guided Explainable metric for evaluating any conditional image generation tasks. VIEScore leverages general knowledge from Multimodal Large Language M…
▽ More
In the rapidly advancing field of conditional image generation research, challenges such as limited explainability lie in effectively evaluating the performance and capabilities of various models. This paper introduces VIEScore, a Visual Instruction-guided Explainable metric for evaluating any conditional image generation tasks. VIEScore leverages general knowledge from Multimodal Large Language Models (MLLMs) as the backbone and does not require training or fine-tuning. We evaluate VIEScore on seven prominent tasks in conditional image tasks and found: (1) VIEScore (GPT4-o) achieves a high Spearman correlation of 0.4 with human evaluations, while the human-to-human correlation is 0.45. (2) VIEScore (with open-source MLLM) is significantly weaker than GPT-4o and GPT-4v in evaluating synthetic images. (3) VIEScore achieves a correlation on par with human ratings in the generation tasks but struggles in editing tasks. With these results, we believe VIEScore shows its great potential to replace human judges in evaluating image synthesis tasks.
△ Less
Submitted 3 June, 2024; v1 submitted 22 December, 2023;
originally announced December 2023.
-
ImagenHub: Standardizing the evaluation of conditional image generation models
Authors:
Max Ku,
Tianle Li,
Kai Zhang,
Yujie Lu,
Xingyu Fu,
Wenwen Zhuang,
Wenhu Chen
Abstract:
Recently, a myriad of conditional image generation and editing models have been developed to serve different downstream tasks, including text-to-image generation, text-guided image editing, subject-driven image generation, control-guided image generation, etc. However, we observe huge inconsistencies in experimental conditions: datasets, inference, and evaluation metrics - render fair comparisons…
▽ More
Recently, a myriad of conditional image generation and editing models have been developed to serve different downstream tasks, including text-to-image generation, text-guided image editing, subject-driven image generation, control-guided image generation, etc. However, we observe huge inconsistencies in experimental conditions: datasets, inference, and evaluation metrics - render fair comparisons difficult. This paper proposes ImagenHub, which is a one-stop library to standardize the inference and evaluation of all the conditional image generation models. Firstly, we define seven prominent tasks and curate high-quality evaluation datasets for them. Secondly, we built a unified inference pipeline to ensure fair comparison. Thirdly, we design two human evaluation scores, i.e. Semantic Consistency and Perceptual Quality, along with comprehensive guidelines to evaluate generated images. We train expert raters to evaluate the model outputs based on the proposed metrics. Our human evaluation achieves a high inter-worker agreement of Krippendorff's alpha on 76% models with a value higher than 0.4. We comprehensively evaluated a total of around 30 models and observed three key takeaways: (1) the existing models' performance is generally unsatisfying except for Text-guided Image Generation and Subject-driven Image Generation, with 74% models achieving an overall score lower than 0.5. (2) we examined the claims from published papers and found 83% of them hold with a few exceptions. (3) None of the existing automatic metrics has a Spearman's correlation higher than 0.2 except subject-driven image generation. Moving forward, we will continue our efforts to evaluate newly published models and update our leaderboard to keep track of the progress in conditional image generation.
△ Less
Submitted 10 March, 2024; v1 submitted 2 October, 2023;
originally announced October 2023.
-
DreamEdit: Subject-driven Image Editing
Authors:
Tianle Li,
Max Ku,
Cong Wei,
Wenhu Chen
Abstract:
Subject-driven image generation aims at generating images containing customized subjects, which has recently drawn enormous attention from the research community. However, the previous works cannot precisely control the background and position of the target subject. In this work, we aspire to fill the void and propose two novel subject-driven sub-tasks, i.e., Subject Replacement and Subject Additi…
▽ More
Subject-driven image generation aims at generating images containing customized subjects, which has recently drawn enormous attention from the research community. However, the previous works cannot precisely control the background and position of the target subject. In this work, we aspire to fill the void and propose two novel subject-driven sub-tasks, i.e., Subject Replacement and Subject Addition. The new tasks are challenging in multiple aspects: replacing a subject with a customized one can change its shape, texture, and color, while adding a target subject to a designated position in a provided scene necessitates a context-aware posture. To conquer these two novel tasks, we first manually curate a new dataset DreamEditBench containing 22 different types of subjects, and 440 source images with different difficulty levels. We plan to host DreamEditBench as a platform and hire trained evaluators for standard human evaluation. We also devise an innovative method DreamEditor to resolve these tasks by performing iterative generation, which enables a smooth adaptation to the customized subject. In this project, we conduct automatic and human evaluations to understand the performance of DreamEditor and baselines on DreamEditBench. For Subject Replacement, we found that the existing models are sensitive to the shape and color of the original subject. The model failure rate will dramatically increase when the source and target subjects are highly different. For Subject Addition, we found that the existing models cannot easily blend the customized subjects into the background smoothly, leading to noticeable artifacts in the generated image. We hope DreamEditBench can become a standard platform to enable future investigations toward building more controllable subject-driven image editing. Our project homepage is https://dreameditbenchteam.github.io/.
△ Less
Submitted 16 August, 2023; v1 submitted 21 June, 2023;
originally announced June 2023.
-
Vec2Gloss: definition modeling leveraging contextualized vectors with Wordnet gloss
Authors:
Yu-Hsiang Tseng,
Mao-Chang Ku,
Wei-Ling Chen,
Yu-Lin Chang,
Shu-Kai Hsieh
Abstract:
Contextualized embeddings are proven to be powerful tools in multiple NLP tasks. Nonetheless, challenges regarding their interpretability and capability to represent lexical semantics still remain. In this paper, we propose that the task of definition modeling, which aims to generate the human-readable definition of the word, provides a route to evaluate or understand the high dimensional semantic…
▽ More
Contextualized embeddings are proven to be powerful tools in multiple NLP tasks. Nonetheless, challenges regarding their interpretability and capability to represent lexical semantics still remain. In this paper, we propose that the task of definition modeling, which aims to generate the human-readable definition of the word, provides a route to evaluate or understand the high dimensional semantic vectors. We propose a `Vec2Gloss' model, which produces the gloss from the target word's contextualized embeddings. The generated glosses of this study are made possible by the systematic gloss patterns provided by Chinese Wordnet. We devise two dependency indices to measure the semantic and contextual dependency, which are used to analyze the generated texts in gloss and token levels. Our results indicate that the proposed `Vec2Gloss' model opens a new perspective to the lexical-semantic applications of contextualized embeddings.
△ Less
Submitted 28 May, 2023;
originally announced May 2023.
-
TheoremQA: A Theorem-driven Question Answering dataset
Authors:
Wenhu Chen,
Ming Yin,
Max Ku,
Pan Lu,
Yixin Wan,
Xueguang Ma,
Jianyu Xu,
Xinyi Wang,
Tony Xia
Abstract:
The recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in solving fundamental math problems like GSM8K by achieving over 90% accuracy. However, their capabilities to solve more challenging math problems which require domain-specific knowledge (i.e. theorem) have yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed…
▽ More
The recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in solving fundamental math problems like GSM8K by achieving over 90% accuracy. However, their capabilities to solve more challenging math problems which require domain-specific knowledge (i.e. theorem) have yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' capabilities to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts containing 800 high-quality questions covering 350 theorems (e.g. Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem, etc) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies like Chain-of-Thoughts and Program-of-Thoughts. We found that GPT-4's capabilities to solve these problems are unparalleled, achieving an accuracy of 51% with Program-of-Thoughts Prompting. All the existing open-sourced models are below 15%, barely surpassing the random-guess baseline. Given the diversity and broad coverage of TheoremQA, we believe it can be used as a better benchmark to evaluate LLMs' capabilities to solve challenging science problems. The data and code are released in https://github.com/wenhuchen/TheoremQA.
△ Less
Submitted 5 December, 2023; v1 submitted 21 May, 2023;
originally announced May 2023.
-
UAV Trajectory, User Association and Power Control for Multi-UAV Enabled Energy Harvesting Communications: Offline Design and Online Reinforcement Learning
Authors:
Chien-Wei Fu,
Meng-Lin Ku,
Yu-Jia Chen,
Tony Q. S. Quek
Abstract:
In this paper, we consider multiple solar-powered wireless nodes which utilize the harvested solar energy to transmit collected data to multiple unmanned aerial vehicles (UAVs) in the uplink. In this context, we jointly design UAV flight trajectories, UAV-node communication associations, and uplink power control to effectively utilize the harvested energy and manage co-channel interference within…
▽ More
In this paper, we consider multiple solar-powered wireless nodes which utilize the harvested solar energy to transmit collected data to multiple unmanned aerial vehicles (UAVs) in the uplink. In this context, we jointly design UAV flight trajectories, UAV-node communication associations, and uplink power control to effectively utilize the harvested energy and manage co-channel interference within a finite time horizon. To ensure the fairness of wireless nodes, the design goal is to maximize the worst user rate. The joint design problem is highly non-convex and requires causal (future) knowledge of the instantaneous energy state information (ESI) and channel state information (CSI), which are difficult to predict in reality. To overcome these challenges, we propose an offline method based on convex optimization that only utilizes the average ESI and CSI. The problem is solved by three convex subproblems with successive convex approximation (SCA) and alternative optimization. We further design an online convex-assisted reinforcement learning (CARL) method to improve the system performance based on real-time environmental information. An idea of multi-UAV regulated flight corridors, based on the optimal offline UAV trajectories, is proposed to avoid unnecessary flight exploration by UAVs and enables us to improve the learning efficiency and system performance, as compared with the conventional reinforcement learning (RL) method. Computer simulations are used to verify the effectiveness of the proposed methods. The proposed CARL method provides 25% and 12% improvement on the worst user rate over the offline and conventional RL methods.
△ Less
Submitted 21 July, 2022;
originally announced July 2022.
-
Data-Driven Stochastic Models and Policies for Energy Harvesting Sensor Communications
Authors:
Meng-Lin Ku,
Yan Chen,
K. J. Ray Liu
Abstract:
Energy harvesting from the surroundings is a promising solution to perpetually power-up wireless sensor communications. This paper presents a data-driven approach of finding optimal transmission policies for a solar-powered sensor node that attempts to maximize net bit rates by adapting its transmission parameters, power levels and modulation types, to the changes of channel fading and battery rec…
▽ More
Energy harvesting from the surroundings is a promising solution to perpetually power-up wireless sensor communications. This paper presents a data-driven approach of finding optimal transmission policies for a solar-powered sensor node that attempts to maximize net bit rates by adapting its transmission parameters, power levels and modulation types, to the changes of channel fading and battery recharge. We formulate this problem as a discounted Markov decision process (MDP) framework, whereby the energy harvesting process is stochastically quantized into several representative solar states with distinct energy arrivals and is totally driven by historical data records at a sensor node. With the observed solar irradiance at each time epoch, a mixed strategy is developed to compute the belief information of the underlying solar states for the choice of transmission parameters. In addition, a theoretical analysis is conducted for a simple on-off policy, in which a predetermined transmission parameter is utilized whenever a sensor node is active. We prove that such an optimal policy has a threshold structure with respect to battery states and evaluate the performance of an energy harvesting node by analyzing the expected net bit rate. The design framework is exemplified with real solar data records, and the results are useful in characterizing the interplay that occurs between energy harvesting and expenditure under various system configurations. Computer simulations show that the proposed policies significantly outperform other schemes with or without the knowledge of short-term energy harvesting and channel fading patterns.
△ Less
Submitted 28 May, 2014;
originally announced May 2014.
-
Detection of copy-move forgery in digital images based on DCT
Authors:
Nathalie Diane Wandji,
Sun Xingming,
Moise Fah Kue
Abstract:
With rapid advances in digital information processing systems, and more specifically in digital image processing software, there is a widespread development of advanced tools and techniques for digital image forgery. One of the techniques most commonly used is the Copy-move forgery which proceeds by copying a part of an image and pasting it into the same image, in order to maliciously hide an obje…
▽ More
With rapid advances in digital information processing systems, and more specifically in digital image processing software, there is a widespread development of advanced tools and techniques for digital image forgery. One of the techniques most commonly used is the Copy-move forgery which proceeds by copying a part of an image and pasting it into the same image, in order to maliciously hide an object or a region. In this paper, we propose a method to detect this specific kind of counterfeit. Firstly, the color image is converted from RGB color space to YCbCr color space and then the R, G, B and Y-component are splitted into fixed-size overlap** blocks and, features are extracted from the R, G and B-components image blocks on one hand and on the other, from the DCT representation of the R, G, B and Ycomponent image block. The feature vectors obtained are then lexicographically sorted to make similar image blocks neighbors and duplicated image blocks are identified using Euclidean distance as similarity criterion. Experimental results showed that the proposed method can detect the duplicated regions when there is more than one copy move forged area in the image and even in case of slight rotations, JPEG compression, shift, scale, blur and noise addition.
△ Less
Submitted 26 August, 2013;
originally announced August 2013.