Large Multi-modality Model Assisted
AI-Generated Image Quality Assessment

Puyi Wang, Wei Sun, Zicheng Zhang, Jun Jia,
Yanwei Jiang, Zhichao Zhang, Xiongkuo Min, Guangtao Zhai
Shanghai Jiao Tong University, Shanghai, China
Abstract.

Traditional deep neural network (DNN)-based image quality assessment (IQA) models leverage convolutional neural networks (CNNs) or Transformers to learn quality-aware feature representations, achieving commendable performance on natural scene images. However, when applied to AI-Generated images (AGIs), these DNN-based IQA models exhibit subpar performance. This situation is largely due to the semantic inaccuracies inherent in certain AGIs, caused by the uncontrollable nature of the generation process. Thus, the capability to discern semantic content becomes crucial for assessing the quality of AGIs. Traditional DNN-based IQA models, constrained by limited parameter complexity and training data, struggle to capture complex fine-grained semantic features, making it challenging to grasp the existence and coherence of the semantic content of the entire image. To address the shortfall in semantic content perception of current IQA models, we introduce a large Multi-modality model Assisted AI-Generated Image Quality Assessment (MA-AGIQA) model, which utilizes semantically informed guidance to sense semantic information and extract semantic vectors through carefully designed text prompts. Moreover, it employs a mixture of experts (MoE) structure to dynamically integrate the semantic information with the quality-aware features extracted by traditional DNN-based IQA models. Comprehensive experiments conducted on two AI-generated content datasets, AIGCQA-20k and AGIQA-3k, show that MA-AGIQA achieves state-of-the-art performance and demonstrates superior generalization capabilities in assessing the quality of AGIs. The code will be available.

Keywords: Image Quality Assessment, AI-Generated Image, Large Multi-modality Model, Mixture of Experts
CCS Concepts: Computing methodologies → Computer vision tasks
Figure 1. The quality of AI-generated images is greatly influenced by the semantic content, i.e., the coherence and existence of semantic content. The coherence of a picture’s semantic content is crucial for providing a logically sound visual experience for human viewers, and images lacking semantic content fail to effectively communicate their intended design, reducing viewer engagement and satisfaction. This article primarily focuses on “Contrary to common sense” (CCS), “Noise image” (NI) and “No semantic meanings” (NSM). We categorize “CCS” and “NI” under the coherence of semantics, while “NSM” falls under the existence of semantics.

1. Introduction

The rapid advancement of artificial intelligence (AI) has led to a proliferation of AI-generated images (AGIs) on the Internet. However, current AI-driven image generation systems often produce multiple images, necessitating manual selection by users to identify the best ones. This labor-intensive process is not only time-consuming but also a significant barrier to fully automating image processing pipelines. Visual quality, as an important factor in selecting attractive AGIs, has gained considerable attention in recent years (Li et al., 2023b, 2024a). In this paper, we focus on how to evaluate the visual quality of AGIs, which on the one hand can be used to filter high-quality images from generation systems and on the other hand can serve as a reward function to optimize image generation models (Black et al., 2023), propelling progress in the field of AI-based image generation techniques.

While a substantial number of deep neural network (DNN)-based image quality assessment (IQA) models, such as HyperIQA (Su et al., 2020), MANIQA (Yang et al., 2022), DBCNN (Zhang et al., 2018), etc., have been developed, these models were specifically designed for and trained on natural scene images. When applied directly to AGIs, these models often exhibit poor performance. This is due to the fact that quality assessment of natural images primarily targets issues such as blur, noise, and other forms of degradation caused by photography equipment or techniques, which are not applicable to AGIs as they do not undergo such degradation during the generation process. Therefore, overemphasizing factors like blur or noise during the evaluation of AGIs is inappropriate.

As shown in Figure 1, AI-generated images, derived from advanced image generative models such as generative adversarial networks (GANs) (Ma et al., 2023), diffusion models (Ho et al., 2020) and related variants (Rombach et al., 2022b; Xu et al., 2018; Ramesh et al., 2021, 2022; dreamlike art, 2023; DeepFloyd, 2023; Holz, 2023; PlaygroundAI, 2023), often exhibit issues not commonly found in naturally captured images. The visual quality of AGIs depends not only on basic visual features such as noise and blur (Su et al., 2021; Li et al., 2022; Zhang et al., 2024), but also on more intricate semantic perception (Li et al., 2024a), such as the existence of reasonable semantic content, scene plausibility, and the coherence among objects (Wu et al., 2023b; Zhang et al., 2023; Li et al., 2024b; Wu et al., 2023c, 2024b). Although re-training existing IQA models on AGI datasets leads to improved outcomes, it fails to achieve optimal performance. One reason is that traditional DNN models, especially early convolutional neural networks (CNNs), despite their notable achievements in tasks like image recognition and classification (Simonyan and Zisserman, 2015; He et al., 2015; Szegedy et al., 2014), still struggle to grasp the fine-grained semantic content of images (Zhang et al., 2021). Moreover, traditional DNN-based IQA models fail to capture the intrinsic characteristics essential for assessing image quality and thus exhibit poor generalization abilities. Hence, we argue that quality assessment models for AGIs are still in their infancy and need further exploration.

Figure 2. For the subset of grainy images (extracted from prompts containing “digital” and generated by LCM_Pixart in AIGCQA-20k) that include semantic content, MANIQA achieves an SRCC of 0.2545, which is 70.0% lower than the overall SRCC of 0.8507. In contrast, our MA-AGIQA model achieves an SRCC of 0.8364. It demonstrates that our model possesses a significantly enhanced understanding of AGIs, particularly those whose quality is deeply intertwined with semantic elements.

To address the issue of semantic awareness, we resort to large multi-modality models (LMMs). Because LMMs are typically pre-trained on large-scale datasets and have already learned a rich set of joint visual and language knowledge, they can effectively capture the fine-grained semantic features relevant to input prompts. However, although LMMs perform excellently in high-level visual understanding tasks (Achiam et al., 2023; Li et al., 2023a), they do not perform well on tasks that are relatively simple for humans, such as identifying structural and textural distortions, color differences, and geometric transformations (Wu et al., 2024a). In contrast, traditional deep learning networks excel at perceiving low-dimensional features and can fit better to the data distribution of specific tasks (Hornik et al., 1989). Therefore, the idea of combining LMMs with traditional deep learning networks is a natural progression.

In this paper, we introduce a large Multi-modality model Assisted AI-Generated Image Quality Assessment (MA-AGIQA) framework, which enhances the capacity of traditional DNN-based IQA models to understand semantic content by incorporating an LMM. Our approach first repurposes a DNN, MANIQA (Yang et al., 2022), as an extractor for quality-aware features and establishes it as the training backbone for the MA-AGIQA framework. Subsequently, we guide an LMM, mPLUG-Owl2 (Ye et al., 2023), to focus on fine-grained semantic information through meticulously crafted prompts. We then extract and store the last-layer hidden vector from mPLUG-Owl2, merging it with features extracted by MANIQA to infuse the model with rich semantic insights. Finally, we employ an MoE to dynamically integrate quality-aware features with fine-grained semantic features, catering to the unique focal points of different images. As demonstrated in Figure 2, our approach surpasses MANIQA in terms of SRCC, particularly within subsets comprising semantically rich but heavily grainy images, indicating that our methodology aligns closely with the perceptual capabilities of the human visual system (HVS). MA-AGIQA achieves SRCC values of 0.8939 and 0.8644 on the AGIQA-3k and AIGCQA-20k datasets, respectively, exceeding the state-of-the-art models by 2.03% and 1.37%, and also demonstrates superior cross-dataset performance.

Our contributions are three-fold:

  • We systematically analyze the issue of traditional DNN-based IQA lacking the ability to understand the semantic content of AGIs, emphasizing the importance of incorporating semantic information into traditional DNN-based IQA models.

  • We introduce the MA-AGIQA model, which incorporates LMM to extract fine-grained semantic features and dynamically integrates these features with traditional DNN-based IQA models.

  • We evaluate the MA-AGIQA model on two AI-generated IQA datasets. Experimental results demonstrate that our model surpasses current state-of-the-art methods without extra training data and also showcases superior cross-dataset performance. Extensive ablation studies further validate the effectiveness of each component.

2. Related Work

Traditional IQA models. In the field of No-Reference Image Quality Assessment (NR-IQA) (Zhai and Min, 2020), traditional models primarily fall into two categories: handcrafted feature-based and DNN-based.

Models based on handcrafted features, such as BRISQUE (Mittal et al., 2012a), ILNIQE (Zhang et al., 2015), and NIQE (Mittal et al., 2012b), primarily utilize natural scene statistics (NSS) (Mittal et al., 2012a, b) derived from natural images. These models are adept at detecting domain variations introduced by synthetic distortions, including spatial (Mittal et al., 2012a, b; Zhang et al., 2015), gradient (Mittal et al., 2012a), discrete cosine transform (DCT) (Saad et al., 2012), and wavelet-based distortions (Moorthy and Bovik, 2011). However, despite their effectiveness on datasets with type-specific distortions, these handcrafted feature-based approaches exhibit limited capabilities in modeling real-world distortions.

With the advent of deep learning, CNNs have revolutionized many tasks in computer vision. Kang et al. (2014) pioneered the application of deep convolutional neural networks to NR-IQA. Their methodology employs CNNs to directly learn representations of image quality from raw image patches, bypassing the need for handcrafted features or a reference image. Following this, DBCNN (Zhang et al., 2018) introduces a deep bilinear CNN for blind image quality assessment (BIQA) (Zhai and Min, 2020), merging two CNN streams to address synthetic and authentic image distortions separately. Furthermore, HyperIQA (Su et al., 2020), a self-adaptive hyper network, evaluates the quality of authentically distorted images through a three-stage process: content understanding, perception rule learning, and quality prediction.

The success of Vision Transformers (ViT) (Dosovitskiy et al., 2020) in various computer vision tasks has led to significant advancements. In the realm of IQA, IQT (You and Korhonen, 2021) leverages the combination of reference and distorted image features, extracted by CNNs, as inputs for a Transformer-based quality prediction task. MUSIQ (Ke et al., 2021) utilizes a Transformer to encode distorted image features across three scales, addressing the challenge of varying input image sizes during training and testing. TReS introduces relative ranking and self-consistency loss to capitalize on the abundant self-supervisory information available, aiming to decrease the network’s sensitivity. Moreover, MANIQA (Yang et al., 2022) explores multi-dimensional feature interaction, utilizing spatial and channel structural information to calculate a non-local representation of the image, enhancing the model’s ability to assess image quality comprehensively.

LMMs for IQA. Recent methodologies employing LMMs for IQA either utilize LMMs in isolation or combine them with DNNs as feature extractors to enhance performance. Qu et al. (2024) introduce an image-prompt fusion module, along with a specially designed quality assessment token, aiming to learn comprehensive representations for AGIs and providing insights from image-prompt alignment. However, the evaluation of AGIs in practical scenarios often does not involve prompts, and image-prompt alignment is more significant for assessing the capabilities of generative models than for assessing image quality. CLIPIQA (Wang et al., 2023) assesses image quality and perception by harnessing the strengths of CLIP (Radford et al., 2021) models. This method bridges the divide between measurable image quality attributes and subjective perceptions of quality without necessitating extensive labeling efforts. Nonetheless, the dependence of these methods (Qu et al., 2024; Wang et al., 2023) on visual-text similarity for quality score prediction often constrains their performance, rendering them marginally less effective than methods that focus exclusively on visual analysis. In addition, Q-Bench (Wu et al., 2023b) introduces a softmax strategy that allows LMMs to deduce quantifiable quality scores. This is achieved by extracting results from softmax pooling on logits corresponding to five quality-related tokens. Q-Align (Wu et al., 2023a) employs strategic alignment techniques to improve accuracy. Expanding further, Wu et al. (2024a) enhance the assessment of AGIs by optimizing individual text prompts to leverage the intrinsic capabilities of LMMs, aiming to provide a more nuanced understanding and evaluation of the image quality of AGIs. However, these methods, while notable, fall short of achieving satisfactory efficacy, leaving considerable room for improvement.

Figure 3. Overview of our proposed MA-AGIQA framework. First, MANIQA is repurposed as the foundational training backbone, whose structure is modified to generate quality-aware features. Second, a parameter-fixed LMM, mPLUG-Owl2, serves as a fine-grained semantic feature extractor. This module utilizes carefully crafted prompts to capture the desired semantic information. Finally, the AFM module acts as an organic feature integrator, dynamically combining these features for enhanced performance.

3. Method

As depicted in Figure 3, the framework of MA-AGIQA is structured into three sections. Section 3.1 introduces our adoption of a DNN, specifically MANIQA (Yang et al., 2022), tailored for the AGI quality assessment task, serving as our primary training backbone. In Section 3.2, we incorporate the LMM mPLUG-Owl2 (Ye et al., 2023) as a feature extractor. This component is crucial for acquiring fine-grained semantic features via carefully crafted text prompts. Lastly, Section 3.3 addresses the variability in focal points across different images. To adaptively integrate the feature vectors during training, we utilize an MoE structure for feature fusion. This approach ensures that the most salient features are emphasized. Further details are elaborated below.

Figure 4. Four types of images with a strong correlation between image quality and semantics. The ground truth and the model prediction for each image are presented below it. The significant difference between the model prediction and the ground truth indicates that the model’s understanding of semantics is not sufficient.

3.1. Quality-aware Feature Extraction

To leverage the capability of DNNs to adapt to the data distribution of specific tasks, we employ MANIQA (Yang et al., 2022) as a quality-aware feature extractor. MANIQA enhances the evaluation of image quality by applying attention mechanisms across both the channel and spatial dimensions, thereby increasing the interaction among various regions of the image, both globally and locally. This approach generates the projections weight ($W$) and score ($S$) for a given image, and the final rating of the whole image is determined through the sum of the multiplication of $S$ by $W$, which can be illustrated as Equation (1):

$$
\begin{aligned}
(S, W) &= \mathcal{T}([\mathit{image}]), \\
\mathit{rating} &= \frac{\sum S \times W}{\sum W},
\end{aligned}
\tag{1}
$$

where $S$ and $W$ are one-dimensional vectors.

However, directly applying MANIQA to the quality assessment of AGIs presents challenges, as illustrated in Figure 4. Image (a) displays a complex, symmetrical pattern, devoid of meaningful semantic content. Image (b) features incoherent areas, such as two grey holes in the sky that are inconsistent with common sense. In image (c), the blurriness and fuzziness along the edges of the man’s face significantly impair human perception. Conversely, image (d), despite its severe graininess, retains its semantic integrity, representing an appealing artistic form. Traditional DNN-based models like MANIQA, lacking the capacity to comprehend semantic content, tend to overestimate the quality of images (a), (b), and (c), resulting in scores much higher than the ground truth. However, these images should be rated as low quality due to the poor viewing experience they offer. For image (d), traditional DNN-based models focus excessively on the graininess, mistaking it for a flaw, and assign a score significantly lower than the ground truth. This highlights the critical need for incorporating semantic information into the quality assessment of AGIs by traditional DNN-based models.

To address this issue, modifications were made so that the generated $S$ and $W$ no longer produce a rating. Instead, they yield a quality-aware feature $f_1$, setting the stage for the subsequent fusion with features extracted by the LMM. $f_1$ is generated as:

$$
f_1 = S \times W. \tag{2}
$$

During the training phase, the parameters of modified MANIQA are continuously updated. This refinement process ensures that MANIQA can extract features more relevant to the quality of AGIs. Furthermore, the training process facilitates a more seamless integration between MANIQA and LMM, leading to superior outcomes.
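The relationship between Equations (1) and (2) can be illustrated with a minimal Python sketch. It assumes that the product $S \times W$ is taken element-wise over the two one-dimensional vectors; the function names and the toy values are hypothetical and are not taken from the MANIQA code base.

```python
def maniqa_rating(scores, weights):
    """Equation (1): weighted average of the scores S using the weights W."""
    weighted_sum = sum(s * w for s, w in zip(scores, weights))
    return weighted_sum / sum(weights)


def quality_aware_feature(scores, weights):
    """Equation (2): keep the element-wise product S x W as a vector f_1."""
    return [s * w for s, w in zip(scores, weights)]


# Toy values for illustration only (not from the paper).
S = [0.2, 0.8, 0.5]
W = [1.0, 3.0, 1.0]

rating = maniqa_rating(S, W)      # scalar produced by the original head
f1 = quality_aware_feature(S, W)  # vector handed to the later fusion stage
```

Under this reading, the original rating is simply the sum of $f_1$ divided by the sum of $W$, so the modified backbone keeps the information of Equation (1) while deferring the reduction to a scalar.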

3.2. Fine-grained Semantic Feature Extraction

LMMs are capable of understanding and analyzing the semantic content of images and their relationship with human cognition. They assess whether different parts of an image form a cohesive whole and evaluate whether the elements within the picture are semantically coherent (Gao et al., 2023; OpenAI, 2023; Liu et al., 2023). mPLUG-Owl2 (Ye et al., 2023) employs a modality-adaptive language decoder to handle different modalities within distinct modules, which mitigates the issue of modality interference. Given the importance of effectively guiding the model through textual prompts to elicit the desired output, we have selected mPLUG-Owl2 as our feature extractor.

We consider the application of mPLUG-Owl2 in the following aspects of semantic content:

  • Existence of Semantic Content. The importance of semantic content in an image lies in its ability to convey a clear and meaningful message to the viewer. An image lacking semantic content may be difficult to understand and may fail to convey its intended message effectively, reducing audience engagement and satisfaction.

  • Coherence of Semantic Content. The coherence of semantic content in an image relates to whether the generated image can provide a coherent, logically sound visual experience for human viewers. When the various parts of an image are semantically consistent, it is better able to convey a clear story, emotion, or message. In contrast, any inconsistency in the primary focus of images will greatly detract from their quality and convey a significantly negative impression.

Consequently, we design prompts that guide LMMs to capture these two aspects of image semantic content. mPLUG-Owl2 possesses the ability to understand fine-grained semantic content, but it needs carefully designed input prompts. For example, the prompt “Please evaluate if the image quality is compromised due to violations of common human sense or logic?”, although it expresses the desire for the model to assess whether the semantic content of the image contradicts human perception, leads to unsatisfactory results. To better utilize mPLUG-Owl2 for the task of evaluating AGIs, we meticulously designed prompts to guide the LMM. Specifically, we designed two prompts, denoted as $prompt_a$ and $prompt_b$ respectively,

  • “Evaluate the input image to determine if its quality is compromised due to a lack of meaningful semantic content.”

  • “Evaluate if the image quality is compromised due to violations of coherence.”

corresponding to the existence of semantic content and the coherence of semantic content in images, respectively. Test results, as shown in Figure 5 using the mPLUG-Owl2 official demo (https://modelscope.cn/studios/iic/mPLUG-Owl2/summary), have proven these questions to be effective.

Figure 5. Presentation of mPLUG-Owl2’s answers to two prompts.

However, the textual output from mPLUG-Owl2 is not immediately conducive to being utilized by MANIQA to impart semantic insights. To bridge this gap, it is essential to convert the information provided by mPLUG-Owl2 into a format that MANIQA can easily leverage. We therefore extract features from the final hidden layer of mPLUG-Owl2, obtaining an accessible embedded representation of the LMM’s output. This output is a tensor with dimensions of [token_length, hidden_size], where “token_length” represents the number of output tokens, and “hidden_size” denotes the dimensionality of the hidden layer representations associated with each token. For mPLUG-Owl2, the hidden_size is set to 4096. Subsequently, we conduct an averaging operation across the token dimension, yielding a vector with dimensions 1×4096. This vector then serves as the basis for further feature fusion procedures. The process can be represented as Equation (3):

$$
\begin{aligned}
(\mathbf{m}^{1}_{i}, \mathbf{m}^{2}_{i}, \cdots, \mathbf{m}^{n}_{i}) &= \mathcal{M}([\mathit{image}], [\mathit{prompt}_{i}])[-1], \\
f_{i} &= \mathit{Average}(\mathbf{m}^{1}_{i}, \mathbf{m}^{2}_{i}, \cdots, \mathbf{m}^{n}_{i}), \quad \text{where}~i \in \{a, b\},
\end{aligned}
\tag{3}
$$

where $\mathbf{m}^{k}_{i}$ represents the hidden vector of token $k$ corresponding to $prompt_{i}$, and $\mathcal{M}$ denotes mPLUG-Owl2.
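The averaging step of Equation (3) can be sketched in a few lines of plain Python. The example assumes the last-layer hidden states are already available as a [token_length, hidden_size] nested list; the helper name `mean_pool_tokens` and the toy numbers are hypothetical, and the hidden size is reduced from 4096 to 4 for readability.

```python
def mean_pool_tokens(hidden_states):
    """Average a [token_length, hidden_size] nested list over the token axis."""
    token_length = len(hidden_states)
    hidden_size = len(hidden_states[0])
    return [
        sum(token[d] for token in hidden_states) / token_length
        for d in range(hidden_size)
    ]


# Toy last-layer hidden states: 3 output tokens, hidden_size 4
# (the text states hidden_size is 4096 for mPLUG-Owl2).
last_layer = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [2.0, 2.0, 2.0, 2.0],
]

f_a = mean_pool_tokens(last_layer)  # one vector per prompt, here for prompt a
```

Because the pooled vector has a fixed length regardless of how many tokens the LMM emits, answers of different lengths map to features of the same shape.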

It is important to note that throughout the entire training and testing process, the parameters of mPLUG-Owl2 are fixed. Because mPLUG-Owl2 is pre-trained on large-scale datasets and has already learned a rich set of joint visual and language knowledge, it can effectively capture the fine-grained semantic information relevant to input prompts, even with fixed parameters. Additionally, fine-tuning the LMM in every training iteration would significantly increase training time. Using it solely as a feature extractor significantly reduces computational costs, making the training process more efficient. We therefore extract and save the semantic content features of each image in advance.
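One possible way to organize this precomputation is sketched below. The storage format (a JSON file), the function names, and the stand-in extractor are all assumptions made for illustration; the source does not describe how the features are stored, and `extract_semantic_feature` is a hypothetical placeholder that returns a deterministic toy vector instead of calling the frozen LMM.

```python
import json
import os
import tempfile


def extract_semantic_feature(image_path, prompt):
    """Hypothetical stand-in for the frozen LMM plus token averaging."""
    # Deterministic toy vector so the sketch runs without any model.
    seed = sum(ord(c) for c in image_path + prompt)
    return [((seed * (i + 1)) % 97) / 97.0 for i in range(4)]


def build_feature_cache(image_paths, prompts, cache_file):
    """Compute each (image, prompt) feature once and write all of them to disk."""
    cache = {
        path: {name: extract_semantic_feature(path, text)
               for name, text in prompts.items()}
        for path in image_paths
    }
    with open(cache_file, "w") as fh:
        json.dump(cache, fh)
    return cache


def load_feature_cache(cache_file):
    """Read the precomputed features back, e.g. at the start of training."""
    with open(cache_file) as fh:
        return json.load(fh)


prompts = {
    "a": "Evaluate the input image to determine if its quality is compromised "
         "due to a lack of meaningful semantic content.",
    "b": "Evaluate if the image quality is compromised due to violations of coherence.",
}
cache_path = os.path.join(tempfile.mkdtemp(), "semantic_features.json")
saved = build_feature_cache(["img_0.png", "img_1.png"], prompts, cache_path)
loaded = load_feature_cache(cache_path)
```

With such a cache, each training iteration only looks up two stored vectors per image, so the cost of the LMM is paid once per image rather than once per epoch.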

3.3. Adaptive Fusion Module

Given the complex influence of color, composition, details, semantic content, and other factors on image quality, simply concatenating the extracted features may not always yield the best results. To dynamically fuse a variety of complementary features, we propose the adaptive fusion module (AFM) for organic feature integration. This process can be divided into two main parts. The first part involves transforming the extracted features into a unified vector space of the same dimension, allowing for vector fusion operations. Specifically, for features extracted by MANIQA, this transformation block applies a fully connected (Fc) layer, transforming them to the same dimension as the original features (1×784) to provide a richer combination. For features derived from mPLUG-Owl2, it uses an Fc layer to project them onto a 1×784 dimension, followed by a ReLU activation layer and a dropout layer to enhance the network’s expressive power and generalization. The second part employs an MoE to dynamically fuse the three features. The MoE’s gating network takes the three transformed features as input and outputs dynamic weights $\boldsymbol{\alpha}$, corresponding to the three features’ contributions to image quality. Structurally, this gating network comprises an Fc layer and a sigmoid layer. The final image quality representation vector $g$ is obtained through a weighted sum of the three feature vectors. Denoting the three features as $f_1$, $f_a$, and $f_b$, this process can be represented as:

$$
\begin{aligned}
f'_{i} &= \mathcal{F}^{trans}_{i}(f_{i}), \quad \text{where}~f'_{i} \in \mathbb{R}^{d}, \\
\boldsymbol{\alpha} &= \mathcal{F}^{gate}(\text{Concat}(f'_{1}, f'_{a}, f'_{b})), \quad \text{where}~\boldsymbol{\alpha} \in \mathbb{R}^{3}, \\
g &= \sum\nolimits_{i=1}^{3} f'_{i} \cdot \alpha_{i}, \quad \text{where}~g \in \mathbb{R}^{d},~i \in \{1, a, b\},
\end{aligned}
\tag{4}
$$

where $\mathcal{F}_{i}^{trans}$ is the transformation block of feature $i$, $\mathcal{F}^{gate}$ is the gating network’s mapping function, $f_{i}$ is the original extracted feature, and $f_{i}'$ is the transformed feature. $\mathbb{R}^{d}$ is the vector space of $f_{i}'$. Finally, we obtain the final image quality score output through a simple regression layer, consisting of an Fc layer.
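To make the data flow of Equation (4) concrete, the following is a minimal forward-pass sketch in plain Python. It is not the authors' implementation: the weights are random, the dimensions are toy values (8 and 16 in place of 784 and 4096), dropout is omitted because it is inactive at inference, and all function and parameter names are hypothetical.

```python
import math
import random


def linear(x, weight, bias):
    """Fully connected layer; weight has shape [out_dim][in_dim]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Random toy parameters for one Fc layer."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(x):
    return [1.0 / (1.0 + math.exp(-v)) for v in x]


def adaptive_fusion(f1, fa, fb, params):
    # Transformation blocks: Fc for the DNN feature, Fc + ReLU for LMM features.
    t1 = linear(f1, *params["trans_1"])
    ta = relu(linear(fa, *params["trans_a"]))
    tb = relu(linear(fb, *params["trans_b"]))
    # Gating network: Fc + sigmoid over the concatenation, giving 3 weights.
    alpha = sigmoid(linear(t1 + ta + tb, *params["gate"]))
    # Weighted sum of the three transformed features.
    g = [alpha[0] * x + alpha[1] * y + alpha[2] * z for x, y, z in zip(t1, ta, tb)]
    # Regression head: a single Fc layer that outputs the quality score.
    score = linear(g, *params["head"])[0]
    return alpha, g, score


rng = random.Random(0)
d, lmm_dim = 8, 16  # toy sizes; the text uses 784 and 4096
params = {
    "trans_1": init_linear(d, d, rng),
    "trans_a": init_linear(lmm_dim, d, rng),
    "trans_b": init_linear(lmm_dim, d, rng),
    "gate": init_linear(3 * d, 3, rng),
    "head": init_linear(d, 1, rng),
}
f1 = [rng.random() for _ in range(d)]
fa = [rng.random() for _ in range(lmm_dim)]
fb = [rng.random() for _ in range(lmm_dim)]
alpha, g, score = adaptive_fusion(f1, fa, fb, params)
```

Note that with a sigmoid gate, as described in the text, each weight lies in (0, 1) independently, so the three weights need not sum to one as they would with a softmax gate.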

Table 1. Comparisons with SOTA (State-Of-The-Art) methods on AGIQA-3k and AIGCQA-20K-Image datasets. The up arrow “↑” means that a larger value indicates better performance. The best and second best performances are bolded and underlined, respectively. MA-AGIQA outperforms existing SOTA methods on both datasets by large margins. Note: to ensure fair comparisons, we trained and tested all deep learning based models and ours with the same dataset splitting method.

| Type | Method | AGIQA-3k SRCC↑ | AGIQA-3k PLCC↑ | AGIQA-3k KRCC↑ | AGIQA-3k RMSE↓ | AIGCQA-20K-Image SRCC↑ | AIGCQA-20K-Image PLCC↑ | AIGCQA-20K-Image KRCC↑ | AIGCQA-20K-Image RMSE↓ |
|---|---|---|---|---|---|---|---|---|---|
| Handcrafted feature-based | BRISQUE (Mittal et al., 2012a) | 0.4726 | 0.5612 | 0.3227 | 0.8299 | 0.1663 | 0.3580 | 0.1112 | 0.6813 |
| Handcrafted feature-based | NIQE (Mittal et al., 2012b) | 0.5236 | 0.5668 | 0.3637 | 0.8260 | 0.2085 | 0.3378 | 0.1394 | 0.6868 |
| Handcrafted feature-based | ILNIQE (Zhang et al., 2015) | 0.6097 | 0.6551 | 0.4318 | 0.7576 | 0.3359 | 0.4551 | 0.2290 | 0.6497 |
| LMM-based | CLIPIQA (Wang et al., 2023) | 0.6524 | 0.6968 | 0.4632 | 0.7191 | 0.4147 | 0.6459 | 0.2861 | 0.5570 |
| LMM-based | CLIPIQA+ (Wang et al., 2023) | 0.6933 | 0.7493 | 0.4957 | 0.664 | 0.4553 | 0.6682 | 0.3169 | 0.5428 |
| LMM-based | Q-Align (Wu et al., 2023a) | 0.6728 | 0.6910 | 0.4728 | 0.7204 | 0.6743 | 0.6815 | 0.4808 | 0.5199 |
| Traditional DNN-based | HyperIQA (Su et al., 2020) | 0.8509 | 0.9049 | 0.6685 | 0.4134 | 0.8162 | 0.8329 | 0.6207 | 0.3902 |
| Traditional DNN-based | MANIQA (Yang et al., 2022) | 0.8618 | 0.9115 | 0.6839 | 0.4111 | 0.8507 | 0.8870 | 0.6612 | 0.3273 |
| Traditional DNN-based | DBCNN (Zhang et al., 2018) | 0.8263 | 0.8900 | 0.6393 | 0.4533 | 0.8054 | 0.8483 | 0.6121 | 0.3726 |
| Traditional DNN-based | StairIQA (Sun et al., 2023) | 0.8343 | 0.8933 | 0.6485 | 0.4510 | 0.7899 | 0.8428 | 0.6053 | 0.3927 |
| Traditional DNN-based | BAID (Yi et al., 2023) | 0.1304 | 0.2030 | 0.0854 | 0.9487 | 0.1652 | 0.1483 | 0.1279 | 0.7297 |
| Traditional DNN-based | MUSIQ (Ke et al., 2021) | 0.8261 | 0.8657 | 0.6400 | 0.4907 | 0.8329 | 0.8646 | 0.6403 | 0.3634 |
| DL with LMM | MA-AGIQA | 0.8939 | 0.9273 | 0.7211 | 0.3756 | 0.8644 | 0.9050 | 0.6804 | 0.3104 |

4. Experiments

4.1. Dataset and Evaluation Metrics

Dataset. Our model is evaluated on two AI-generated image datasets: AIGCQA-20k (Li et al., 2024a) and AGIQA-3k (Li et al., 2023b). AIGCQA-20k contains 20k images, but at the time of writing only 14k images have been published, so our experiments are conducted on these 14k images. The MOS values of AIGCQA-20k images range from 0 to 5, with higher scores indicating better image quality. Images in AIGCQA-20k are generated by 15 models, including DALLE2 (Ramesh et al., 2022), DALLE3 (Ramesh et al., 2022), Dream (dreamlike art, 2023), IF (DeepFloyd, 2023), LCM Pixart (Luo et al., 2023), LCM SD1.5 (Luo et al., 2023), LCM SDXL (Luo et al., 2023), Midjourney (Holz, 2023), Pixart-α (Chen et al., 2023), Playground (PlaygroundAI, 2023), SD1.4 (Rombach et al., 2022b), SD1.5 (Rombach et al., 2022b), SDXL (Rombach et al., 2022a) and SSD1B (Gupta et al., 2024). AGIQA-3k includes 2982 images, with MOS values also ranging from 0 to 5, where higher values represent better quality. Images in AGIQA-3k are derived from six models: GLIDE (Nichol et al., 2022), Stable Diffusion V-1.5 (Rombach et al., 2022b), Stable Diffusion XL-2.2 (Rombach et al., 2022a), Midjourney (Holz, 2023), AttnGAN (Xu et al., 2018), and DALLE2 (Ramesh et al., 2022). We split each dataset into 70% for training, 10% for validation, and 20% for testing. To ensure that every model is trained and tested on the same subsets, we fix the random seed used for the split, which controls this variable and ensures reproducibility.
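A seeded 70/10/20 split of this kind can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `split_indices`, the seed value 20, and the choice to give the remainder after flooring to the test subset are all assumptions.

```python
import random


def split_indices(n_images, seed=20, ratios=(0.7, 0.1, 0.2)):
    """Shuffle image indices with a fixed seed and cut them into train/val/test subsets."""
    rng = random.Random(seed)  # a private generator, so the split does not depend on global state
    indices = list(range(n_images))
    rng.shuffle(indices)
    n_train = int(ratios[0] * n_images)
    n_val = int(ratios[1] * n_images)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test subset
    return train, val, test


# AGIQA-3k contains 2982 images.
train, val, test = split_indices(2982)
```

Calling `split_indices` again with the same seed returns identical subsets, so every compared model sees the same training, validation, and test images.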

Evaluation Metric. Spearman’s Rank-Order Correlation Coefficient (SRCC), Pearson’s Linear Correlation Coefficient (PLCC), Kendall’s Rank Correlation Coefficient (KRCC), and the Root Mean Square Error (RMSE) are selected as metrics to measure prediction monotonicity and accuracy. SRCC, PLCC, and KRCC range from -1.0 to 1.0, with larger values indicating better results, whereas a smaller RMSE is better. In our experiments, we use the sum of SRCC and PLCC as the criterion for selecting the best validation checkpoint, and emphasize SRCC when comparing model performance.
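For reference, the four metrics can be computed with the short standard-library sketch below. It assumes raw predictions are compared with MOS directly, without any nonlinear mapping before PLCC and RMSE, and it uses the tau-b variant for KRCC; the source specifies neither choice, so both are assumptions, as are the function names.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def plcc(x, y):
    """Pearson's linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def srcc(x, y):
    """Spearman's coefficient: Pearson correlation of the ranks."""
    return plcc(average_ranks(x), average_ranks(y))


def krcc(x, y):
    """Kendall's tau-b, computed over all pairs in O(n^2)."""
    conc = disc = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # pairs tied in both variables are ignored
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    return (conc - disc) / math.sqrt((conc + disc + ties_x) * (conc + disc + ties_y))


def rmse(x, y):
    """Root mean square error between predictions and targets."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical predictions and MOS values for five images.
predicted = [1.2, 2.9, 2.1, 4.0, 3.3]
mos = [1.0, 3.0, 2.5, 4.5, 3.1]
scores = {"SRCC": srcc(predicted, mos), "PLCC": plcc(predicted, mos),
          "KRCC": krcc(predicted, mos), "RMSE": rmse(predicted, mos)}
```

In this toy example the predictions preserve the MOS ordering exactly, so SRCC and KRCC both equal 1 even though the RMSE is non-zero, which shows why rank-based and error-based metrics are reported together.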

4.2. Implementation Details

Our method is implemented in PyTorch, and all experiments are conducted on 4 NVIDIA 3090 GPUs. On all datasets, we compare against the handcrafted feature-based BRISQUE (Mittal et al., 2012a), NIQE (Mittal et al., 2012b) and ILNIQE (Zhang et al., 2015), the deep learning (DL)-based HyperIQA (Su et al., 2020), MANIQA (Yang et al., 2022), MUSIQ (Ke et al., 2021), DBCNN (Zhang et al., 2018), StairIQA (Sun et al., 2023) and BAID (Yi et al., 2023), and the LMM-based CLIPIQA (Wang et al., 2023), CLIPIQA+ (Wang et al., 2023) and Q-Align (Wu et al., 2023a). When training deep learning models, we use the Adam optimizer (Kingma and Ba, 2014) with a weight decay of 1e-5 and an initial learning rate of 1e-5. The batch size is 8 during training, validation, and testing. All DL-based models are trained for 30 epochs using MSE loss and validated after each training epoch. The checkpoint with the highest sum of SRCC and PLCC on the validation set is used for testing. Handcrafted feature-based and LMM-based models are used directly without training.
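The checkpoint selection rule can be sketched in a few lines. The hyperparameters in `CONFIG` are the ones stated above; the validation log, its values, and the function name `select_best_epoch` are hypothetical and serve only to illustrate the rule.

```python
# Hyperparameters as stated in the text.
CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 1e-5,
    "weight_decay": 1e-5,
    "batch_size": 8,
    "epochs": 30,
    "loss": "MSE",
}


def select_best_epoch(validation_log):
    """Return the epoch whose validation SRCC + PLCC is highest."""
    best = max(validation_log, key=lambda record: record["srcc"] + record["plcc"])
    return best["epoch"]


# Hypothetical validation results for three epochs (not measured values).
validation_log = [
    {"epoch": 1, "srcc": 0.81, "plcc": 0.86},
    {"epoch": 2, "srcc": 0.85, "plcc": 0.89},
    {"epoch": 3, "srcc": 0.86, "plcc": 0.87},
]
best_epoch = select_best_epoch(validation_log)
```

Here epoch 3 has the highest SRCC but epoch 2 has the highest sum, so epoch 2 would be the checkpoint carried forward to testing.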

4.3. Comparison with SOTA methods

Table 1 lists the results of MA-AGIQA and 12 other models on the AGIQA-3k and AIGCQA-20k datasets. LMM-based models outperform those that rely on handcrafted features. We attribute this to LMMs being trained on extensive datasets, which gives them a robust understanding of images and enhances their generalizability. However, trained DL-based models generally perform far better than the LMM-based models, because DL-based models fit the data distribution of the specific task. Among these twelve models, the ViT-based MANIQA outperforms the other eleven, and our method surpasses it on the same training and testing split by clear relative margins (+3.72% SRCC, +1.73% PLCC and +5.43% KRCC on AGIQA-3k; +1.61% SRCC, +2.02% PLCC and +2.90% KRCC on AIGCQA-20k). This demonstrates the benefit of integrating features extracted by an LMM into a traditional DNN, which improves the accuracy and consistency of the predictions.
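The reported margins are relative gains over MANIQA, which can be checked from the Table 1 values for AGIQA-3k with the short calculation below. The helper name `relative_gain` is an illustrative choice.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# AGIQA-3k values from Table 1.
maniqa = {"SRCC": 0.8618, "PLCC": 0.9115, "KRCC": 0.6839}
ma_agiqa = {"SRCC": 0.8939, "PLCC": 0.9273, "KRCC": 0.7211}

gains = {name: relative_gain(ma_agiqa[name], maniqa[name]) for name in maniqa}
```

The resulting gains agree with the figures quoted in the text up to rounding in the last digit.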

Table 2. Cross-dataset performance comparison for MA-AGIQA, HyperIQA, and StairIQA. “Direction” from A to B means training with the train subset of dataset A and testing on the test subset of dataset B. The best result is bolded.

| Method | Direction | SRCC↑ | PLCC↑ | KRCC↑ | RMSE↓ |
|---|---|---|---|---|---|
| MA-AGIQA | 20k→3k | 0.8053 | 0.8430 | 0.6083 | 0.5399 |
| MA-AGIQA | 3k→20k | 0.7722 | 0.8314 | 0.5777 | 0.4055 |
| HyperIQA | 20k→3k | 0.6820 | 0.6806 | 0.4806 | 0.7352 |
| HyperIQA | 3k→20k | 0.6374 | 0.6547 | 0.4577 | 0.5414 |
| StairIQA | 20k→3k | 0.4335 | 0.5234 | 0.3294 | 0.8549 |
| StairIQA | 3k→20k | 0.6495 | 0.6895 | 0.4644 | 0.5285 |

To evaluate the generalization capability of our MA-AGIQA model, we conducted cross-dataset evaluations. Table 2 shows that MA-AGIQA outperforms the other two models, HyperIQA and StairIQA, by large margins in both directions. This can largely be attributed to the robust generalization capability of the LMM and to the MoE architecture, which fuses features dynamically.

4.4. Ablation Study

Table 3. Ablation studies of different component combinations in the MA-AGIQA model on AGIQA-3k. SRCC, PLCC and KRCC are reported. The best result is bolded. Note: “semantic feature” and “coherence feature” denote features extracted by mPLUG-Owl2 through $prompt_{a}$ and $prompt_{b}$, respectively.
MANIQA Semantic Feature Coherence Feature SRCC\uparrow PLCC\uparrow KRCC\uparrow
0.8800 0.9196 0.7031
0.8662 0.9082 0.6823
0.8661 0.9084 0.6821
0.8685 0.9108 0.6853
0.8820 0.9197 0.7090
0.8699 0.9102 0.6867
0.8939 0.9273 0.7211

Necessity of Fine-grained Semantic Features. To assess the benefits of integrating features extracted by mPLUG-Owl2 (Ye et al., 2023) into MANIQA (Yang et al., 2022), we carried out ablation studies on each component and their combinations, as detailed in Tables 3 and 4. Our findings indicate that neither the LMM features alone nor the traditional network alone yields the best outcome. Integrating one fine-grained semantic feature with the original MANIQA network can enhance the network’s performance, and the best results are achieved by combining both LMM features with MANIQA. Relative to MANIQA alone, this improves SRCC, PLCC, and KRCC by 1.57%, 0.83%, and 2.56% on the AGIQA-3k dataset and by 2.72%, 1.94%, and 4.35% on the AIGCQA-20k dataset.

The marked enhancements achieved by incorporating two fine-grained semantic features suggest that the LMM is adept at capturing nuanced, complex features that traditional models might overlook, fostering a more thorough understanding and assessment of AGI quality. These ablation results highlight the significant contribution of fine-grained semantic features.

Contribution of MoE. Table 5 shows that incorporating the MoE structure, rather than simply concatenating the three vectors, improves network performance, albeit marginally. On the AGIQA-3k dataset, SRCC, PLCC, and KRCC increase by 0.20%, 0.17%, and 0.16%, respectively. On the AIGCQA-20k dataset, the improvements are 0.67%, 0.95%, and 1.37%. Although modest, these gains point to the potential of the MoE structure in complex systems where integrating diverse expertise can yield better predictions.

Table 4. Ablation studies of different component combinations in the MA-AGIQA model on AIGCQA-20k. SRCC, PLCC and KRCC are reported. The best result is bolded. Note: “semantic feature” and “coherence feature” denote features extracted by mPLUG-Owl2 through $prompt_{a}$ and $prompt_{b}$, respectively.
MANIQA Semantic Feature Coherence Feature SRCC\uparrow PLCC\uparrow KRCC\uparrow
0.8415 0.8877 0.6520
0.8184 0.8345 0.6323
0.8181 0.8343 0.6320
0.8540 0.8975 0.6671
0.8596 0.9016 0.6738
0.8180 0.8323 0.6312
0.8644 0.9050 0.6804
Table 5. Ablation studies on the MoE structure in the AFM demonstrate that compositions integrating MoE yield superior results on both AGIQA-3k and AIGCQA-20k datasets. The better result is bolded.

| Dataset | MoE | SRCC↑ | PLCC↑ | KRCC↑ | RMSE↓ |
|---|---|---|---|---|---|
| 3k | ✗ | 0.8921 | 0.9257 | 0.7199 | 0.3797 |
| 3k | ✓ | 0.8939 | 0.9273 | 0.7211 | 0.3756 |
| 20k | ✗ | 0.8586 | 0.8964 | 0.6712 | 0.3234 |
| 20k | ✓ | 0.8644 | 0.9050 | 0.6804 | 0.3104 |

Figure 6. Comparative Density Distributions of Absolute Differences for MANIQA and MA-AGIQA on AGIQA-3k and AIGCQA-20k Datasets

4.5. Visualization

Figure 7. Comparative Analysis of Image Quality Assessment Models: Evaluating MANIQA versus MA-AGIQA Against Ground Truth Scores

To demonstrate the efficacy of the MA-AGIQA framework, we selected the 300 images from the AIGCQA-20k and AGIQA-3k datasets on which MANIQA performed worst. These images primarily exhibit issues in semantic content. We computed the absolute differences between the model scores and the ground truth scores and plotted their distribution in Figure 6, using a bin size of 0.1. The results show that the scores of our MA-AGIQA model are more closely aligned with human perception, with a noticeable shift in the difference distribution toward zero and a marked reduction in peak values.
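The binning behind such a plot can be sketched as follows. This is an illustrative reconstruction, not the authors' plotting code: the function name `abs_error_histogram` and the rounding guard against floating-point edge cases are assumptions. The two example scores are the MANIQA and MA-AGIQA predictions for the first image discussed with Figure 7, whose ground truth is 1.50.

```python
import math
from collections import Counter


def abs_error_histogram(predictions, ground_truth, bin_size=0.1):
    """Map each bin's left edge to the fraction of images whose |prediction - MOS| falls in it."""
    if not predictions or len(predictions) != len(ground_truth):
        raise ValueError("inputs must be non-empty and of equal length")
    counts = Counter()
    for p, t in zip(predictions, ground_truth):
        err = abs(p - t)
        # Round before flooring so that values such as 0.3 / 0.1 land in the intended bin.
        counts[math.floor(round(err / bin_size, 9))] += 1
    n = len(predictions)
    return {round(b * bin_size, 9): c / n for b, c in sorted(counts.items())}


# MANIQA (3.50) and MA-AGIQA (2.98) scores against a ground truth of 1.50.
hist = abs_error_histogram([3.50, 2.98], [1.50, 1.50])
```

The MANIQA error of 2.00 falls in the bin starting at 2.0 and the MA-AGIQA error of 1.48 in the bin starting at 1.4; a distribution that moves toward the low bins corresponds to the shift toward zero described above.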

Figure 7 presents a collection of images for which the MANIQA assessments were mostly off the mark. The scores assigned by MANIQA are listed alongside those given by the proposed MA-AGIQA model and the ground truth, and they show that MA-AGIQA aligns markedly better with the ground truth than MANIQA. For instance, for the first image in the top row, MANIQA’s score is 3.50, which diverges substantially from the ground truth score of 1.50, whereas MA-AGIQA’s score of 2.98 is much closer to the ground truth. This pattern holds across the images shown, with MA-AGIQA consistently producing scores closer to the ground truth, reflecting a more accurate assessment of image quality.

5. Conclusion

To mitigate the shortcomings of traditional DNNs in capturing semantic content in AGIs, this study explored the integration of LMMs with traditional DNNs and introduced the MA-AGIQA network. Leveraging mPLUG-Owl2 (Ye et al., 2023), our network efficiently extracts semantic features to enhance MANIQA (Yang et al., 2022) for quality assessment. The MA-AGIQA network’s ability to dynamically integrate fine-grained semantic features with quality-aware features enables it to handle the varied quality aspects of AGIs effectively. Experimental results on two prominent AGI datasets confirm our model’s superior performance, and ablation studies validate the contribution of each component of our framework. We hope this research will catalyze further exploration of LMMs in AI-generated content quality assessment and open up broader applications for this methodology.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
  • Black et al. (2023) Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. 2023. Training Diffusion Models with Reinforcement Learning. In The Twelfth International Conference on Learning Representations.
  • Chen et al. (2023) Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. 2023. PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis. 2310.00426.
  • DeepFloyd (2023) DeepFloyd. 2023. IF-I-XL-v1.0. https://www.deepfloyd.ai.
  • Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
  • dreamlike art (2023) dreamlike art. 2023. dreamlike-photoreal-2.0. https://dreamlike.art.
  • Gao et al. (2023) Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, and Yu Qiao. 2023. LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model. arXiv preprint arXiv:2304.15010 (2023).
  • Gupta et al. (2024) Yatharth Gupta, Vishnu V. Jaddipal, Harish Prabhala, Sayak Paul, and Patrick Von Platen. 2024. Progressive Knowledge Distillation Of Stable Diffusion XL Using Layer Level Loss. 2401.02677.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs.CV]
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. arXiv:2006.11239 [cs.LG]
  • Holz (2023) David Holz. 2023. Midjourney. https://www.midjourney.com.
  • Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. 1989. Multilayer feedforward networks are universal approximators. Neural networks 2, 5 (1989), 359–366.
  • Kang et al. (2014) Le Kang, Peng Ye, Yi Li, and David Doermann. 2014. Convolutional neural networks for no-reference image quality assessment. In CVPR. 1733–1740.
  • Ke et al. (2021) Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. 2021. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision. 5148–5157.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980
  • Li et al. (2023a) Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, and Jianfeng Gao. 2023a. Multimodal Foundation Models: From Specialists to General-Purpose Assistants. arXiv:2309.10020 [cs.CV]
  • Li et al. (2024a) Chunyi Li, Tengchuan Kou, Yixuan Gao, Yuqin Cao, Wei Sun, Zicheng Zhang, Yingjie Zhou, Zhichao Zhang, Weixia Zhang, Haoning Wu, Xiaohong Liu, Xiongkuo Min, and Guangtao Zhai. 2024a. AIGIQA-20K: A Large Database for AI-Generated Image Quality Assessment. arXiv:2404.03407 [cs.CV]
  • Li et al. (2022) Chunyi Li, Haoyang Li, Ning Yang, and Dazhi He. 2022. A PBCH Reception Algorithm in 5G Broadcasting. In IEEE International Symposium on Broadband Multimedia Systems and Broadcasting.
  • Li et al. (2024b) Chunyi Li, Haoning Wu, Zicheng Zhang, Hongkun Hao, Kaiwei Zhang, Lei Bai, Xiaohong Liu, Xiongkuo Min, Weisi Lin, and Guangtao Zhai. 2024b. Q-Refine: A Perceptual Quality Refiner for AI-Generated Image. arXiv:2401.01117
  • Li et al. (2023b) Chunyi Li, Zicheng Zhang, Haoning Wu, Wei Sun, Xiongkuo Min, Xiaohong Liu, Guangtao Zhai, and Weisi Lin. 2023b. AGIQA-3K: An Open Database for AI-Generated Image Quality Assessment. arXiv:2306.04717 [cs.CV]
  • Liu et al. (2023) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. Visual Instruction Tuning.
  • Luo et al. (2023) Simian Luo, Yiqin Tan, Suraj Patil, Daniel Gu, Patrick von Platen, Apolinário Passos, Longbo Huang, Jian Li, and Hang Zhao. 2023. LCM-LoRA: A Universal Stable-Diffusion Acceleration Module. arXiv:2311.05556 [cs.CV]
  • Ma et al. (2023) Haichuan Ma, Dong Liu, and Feng Wu. 2023. Rectified Wasserstein Generative Adversarial Networks for Perceptual Image Restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 3 (2023), 3648–3663. https://doi.org/10.1109/TPAMI.2022.3185316
  • Mittal et al. (2012a) Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. 2012a. No-reference image quality assessment in the spatial domain. IEEE Transactions on image processing 21, 12 (2012), 4695–4708.
  • Mittal et al. (2012b) Anish Mittal, Rajiv Soundararajan, and Alan C Bovik. 2012b. Making a “completely blind” image quality analyzer. IEEE Signal processing letters 20, 3 (2012), 209–212.
  • Moorthy and Bovik (2011) Anush Krishna Moorthy and Alan Conrad Bovik. 2011. Blind image quality assessment: From natural scene statistics to perceptual quality. IEEE transactions on Image Processing 20, 12 (2011), 3350–3364.
  • Nichol et al. (2022) Alexander Quinn Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob Mcgrew, Ilya Sutskever, and Mark Chen. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In International Conference on Machine Learning. PMLR, 16784–16804.
  • OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL]
  • PlaygroundAI (2023) PlaygroundAI. 2023. playground-v2-1024px-aesthetic. https://playground.com.
  • Qu et al. (2024) Bowen Qu, Haohui Li, and Wei Gao. 2024. Bringing Textual Prompt to AI-Generated Image Quality Assessment. arXiv:2403.18714 [cs.CV]
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020 [cs.CV]
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical Text-Conditional Image Generation with CLIP Latents. 2204.06125.
  • Ramesh et al. (2021) Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. 2021. Zero-shot text-to-image generation. In International Conference on Machine Learning. PMLR, 8821–8831.
  • Rombach et al. (2022b) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022b. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
  • Rombach et al. (2022a) Robin Rombach, Andreas Blattmann, and Björn Ommer. 2022a. Text-Guided Synthesis of Artistic Images with Retrieval-Augmented Diffusion Models. 2207.13038.
  • Saad et al. (2012) Michele A Saad, Alan C Bovik, and Christophe Charrier. 2012. Blind image quality assessment: A natural scene statistics approach in the DCT domain. IEEE transactions on Image Processing 21, 8 (2012), 3339–3352.
  • Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556 [cs.CV]
  • Su et al. (2021) Shaolin Su, Vlad Hosu, Hanhe Lin, Yanning Zhang, and Dietmar Saupe. 2021. Koniq++: Boosting no-reference image quality assessment in the wild by jointly predicting image quality and defects. In The 32nd British Machine Vision Conference.
  • Su et al. (2020) Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, and Yanning Zhang. 2020. Blindly assess image quality in the wild guided by a self-adaptive hyper network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3667–3676.
  • Sun et al. (2023) Wei Sun, Xiongkuo Min, Danyang Tu, Siwei Ma, and Guangtao Zhai. 2023. Blind quality assessment for in-the-wild images via hierarchical feature fusion and iterative mixed database training. IEEE Journal of Selected Topics in Signal Processing (2023).
  • Szegedy et al. (2014) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2014. Going Deeper with Convolutions. arXiv:1409.4842 [cs.CV]
  • Wang et al. (2023) Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. 2023. Exploring clip for assessing the look and feel of images. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 2555–2563.
  • Wu et al. (2023b) Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, et al. 2023b. Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision. arXiv:2309.14181
  • Wu et al. (2023c) Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Kaixin Xu, Chunyi Li, Jingwen Hou, Guangtao Zhai, et al. 2023c. Q-instruct: Improving low-level visual abilities for multi-modality foundation models. arXiv:2311.06783
  • Wu et al. (2023a) Haoning Wu, Zicheng Zhang, Weixia Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Yixuan Gao, Annan Wang, Erli Zhang, Wenxiu Sun, et al. 2023a. Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels. arXiv:2312.17090
  • Wu et al. (2024b) Haoning Wu, Hanwei Zhu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Chunyi Li, Annan Wang, Wenxiu Sun, Qiong Yan, Xiaohong Liu, Guangtao Zhai, Shiqi Wang, and Weisi Lin. 2024b. Towards Open-ended Visual Quality Comparison. arXiv preprint arXiv:2402.16641.
  • Wu et al. (2024a) Tianhe Wu, Kede Ma, Jie Liang, Yujiu Yang, and Lei Zhang. 2024a. A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment. arXiv:2403.10854 [cs.CV]
  • Xu et al. (2018) Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. 2018. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1316–1324.
  • Yang et al. (2022) Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. 2022. MANIQA: Multi-dimension Attention Network for No-Reference Image Quality Assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1191–1200.
  • Ye et al. (2023) Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, and Jingren Zhou. 2023. mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration. arXiv:2311.04257 [cs.CL]
  • Yi et al. (2023) Ran Yi, Haoyuan Tian, Zhihao Gu, Yu-Kun Lai, and Paul L. Rosin. 2023. Towards Artistic Image Aesthetics Assessment: A Large-Scale Dataset and a New Method. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 22388–22397.
  • You and Korhonen (2021) Junyong You and Jari Korhonen. 2021. Transformer for image quality assessment. In 2021 IEEE International Conference on Image Processing (ICIP). IEEE, 1389–1393.
  • Zhai and Min (2020) Guangtao Zhai and Xiongkuo Min. 2020. Perceptual image quality assessment: a survey. Science China Information Sciences 63 (2020), 1–52.
  • Zhang et al. (2015) Lin Zhang, Lei Zhang, and Alan C Bovik. 2015. A feature-enriched completely blind image quality evaluator. IEEE Transactions on Image Processing 24, 8 (2015), 2579–2591.
  • Zhang et al. (2018) Weixia Zhang, Kede Ma, Jia Yan, Dexiang Deng, and Zhou Wang. 2018. Blind image quality assessment using a deep bilinear convolutional neural network. IEEE Transactions on Circuits and Systems for Video Technology 30, 1 (2018), 36–47.
  • Zhang et al. (2021) Xinfeng Zhang, Weisi Lin, and Qingming Huang. 2021. Fine-grained image quality assessment: A revisit and further thinking. IEEE Transactions on Circuits and Systems for Video Technology 32, 5 (2021), 2746–2759.
  • Zhang et al. (2023) Zicheng Zhang, Haoning Wu, Zhongpeng Ji, Chunyi Li, Erli Zhang, Wei Sun, Xiaohong Liu, Xiongkuo Min, Fengyu Sun, Shangling Jui, et al. 2023. Q-Boost: On Visual Quality Assessment Ability of Low-level Multi-Modality Foundation Models. arXiv:2312.15300
  • Zhang et al. (2024) Zicheng Zhang, Yingjie Zhou, Long Teng, Wei Sun, Chunyi Li, Xiongkuo Min, Xiao-Ping Zhang, and Guangtao Zhai. 2024. Quality-of-Experience Evaluation for Digital Twins in 6G Network Environments. IEEE Transactions on Broadcasting (2024).