¹ MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University; email: {zwx8981, zhaiguangtao, xkyang}@sjtu.edu.cn
² Peng Cheng Laboratory, Shenzhen, China; email: [email protected]
³ Department of Computer Science, City University of Hong Kong; email: [email protected]

Comparison of No-Reference Image Quality Models via MAP Estimation in Diffusion Latents

Weixia Zhang¹, Dingquan Li², Guangtao Zhai¹, Xiaokang Yang¹, Kede Ma³

Abstract

Contemporary no-reference image quality assessment (NR-IQA) models can effectively quantify the perceived image quality, with high correlations between model predictions and human perceptual scores on fixed test sets. However, little progress has been made in comparing NR-IQA models from a perceptual optimization perspective. Here, for the first time, we demonstrate that NR-IQA models can be plugged into the maximum a posteriori (MAP) estimation framework for image enhancement. This is achieved by taking the gradients in differentiable and bijective diffusion latents rather than in the raw pixel domain. Different NR-IQA models are likely to induce different enhanced images, which are ultimately subject to psychophysical testing. This leads to a new computational method for comparing NR-IQA models within the analysis-by-synthesis framework. Compared to conventional correlation-based metrics, our method provides complementary insights into the relative strengths and weaknesses of the competing NR-IQA models in the context of perceptual optimization.

Keywords: No-reference image quality assessment · Model comparison · Diffusion models

1 Introduction

Image quality assessment (IQA) models have been extensively studied [70] as surrogates of human subjects for performance evaluation of image processing and computer vision systems. Depending on the availability of pristine-quality reference images, IQA models can be broadly categorized into two types: full-reference IQA (FR-IQA) [71, 79] and no-reference IQA (NR-IQA) [41, 2]. FR-IQA models are well-suited for scenarios where undistorted references are available. However, in many real-world applications (e.g., low-light image enhancement and image-to-image translation), specifying desired reference outputs is inherently challenging [6], if not impossible. This highlights the need for accurate NR-IQA models that assess perceived image quality without reference to pristine-quality counterparts.

Apart from performance evaluation, IQA models hold great promise for perceptual optimization of image processing systems. FR-IQA models have been serving the purpose for a long time, from the age of measuring error visibility and structural similarity [71] to the data-driven era [79, 13]. Recently, an extensive comparison of FR-IQA models for optimization of various image processing algorithms has been conducted [12], turning the perceptual optimization process into an analysis-by-synthesis tool [23]. A natural question then arises:

Can we compare NR-IQA models in terms of their performance in perceptual optimization as well?

A straightforward computational effort to answer the above question is directly maximizing an NR-IQA model, $q_{\bm{w}}(\cdot):\mathbb{R}^{N}\mapsto\mathbb{R}$ , parameterized by a vector $\bm{w}$ , with respect to the input image $\bm{x}$ :

$$\bm{x}^{\star}=\operatorname*{arg\,max}_{\bm{x}}q_{\bm{w}}(\bm{x}),\tag{1}$$

where a larger $q_{\bm{w}}(\cdot)$ indicates higher predicted quality, and $\bm{x}^{\star}\in\mathbb{R}^{N}$ is the optimized image. As observed in [81] and shown in Fig. 1 (b), even if $q_{\bm{w}}$ is instantiated by a state-of-the-art NR-IQA model (LIQE [85] in this case), the optimized image exhibits local texture and color distortions, and its quality is degraded relative to the initial image.

A natural extension is to interpret $q_{\bm{w}}(\cdot)$ as a natural image prior, and plug it into the maximum a posteriori (MAP) estimation framework, leading to the following optimization problem:

$$\bm{x}^{\star}=\operatorname*{arg\,min}_{\bm{x}}E(\bm{x}|\bm{x}^{\mathrm{init}})=\operatorname*{arg\,min}_{\bm{x}}D(\bm{x},\bm{x}^{\mathrm{init}})-\lambda q_{\bm{w}}(\bm{x}),\tag{2}$$

where we express the posterior probability $p(\bm{x}|\bm{x}^{\mathrm{init}})\propto\exp(-E(\bm{x}|\bm{x}^{\mathrm{init}}))$ by its associated Gibbs energy. $D(\cdot,\cdot):\mathbb{R}^{N}\times\mathbb{R}^{N}\mapsto\mathbb{R}$ is the fidelity term (also known as the negative log-likelihood term), which can be implemented by a distance metric. $\lambda$ is a parameter that trades off the likelihood and prior terms. As shown in Fig. 1 (c), even when $\lambda$ is carefully tuned through visual inspection, the MAP-optimized image is still no better than the initial image, and manifests itself as a visually imperceptible adversarial example of $q_{\bm{w}}(\cdot)$ [81].

Here, for the first time, we successfully demonstrate the perceptual optimization capability of contemporary NR-IQA models in the MAP estimation framework (see Fig. 1 (d)). Taking inspiration from the impressive progress of diffusion models [55, 25, 57, 56, 77] in image synthesis and beyond [11, 58], we supplement an NR-IQA method with a differentiable and bijective diffusion model. This empowers the NR-IQA model with the capability of modeling highly complex distributions of natural images by operating in the diffusion latent space (see footnote 1). Specifically, we work with the method of exact diffusion inversion via coupled transformations (EDICT) [66], which turns any pre-trained diffusion model into a bijection between the input image and the latent noise vector through affine coupling [14, 15].

Footnote 1: We distinguish the two terms “latent diffusion” and “diffusion latents.” The former means that the diffusion process takes place in the latent space rather than in the raw pixel domain. The latter treats the diffusion process as a feature transform and the corresponding Gaussian noise vector as the latent representation.

The resulting paradigm, MAP estimation in diffusion latents, gives us an opportunity to compare NR-IQA models within the analysis-by-synthesis framework. Specifically, we plug eight NR-IQA models into this paradigm to enhance a set of photographic images with realistic camera distortions [19, 27, 16]. Psychophysical testing on the optimized images of identical visual content reveals the relative performance of the competing NR-IQA models in image enhancement, and provides complementary insights into future NR-IQA model development beyond fixed-set evaluation.

In summary, this paper presents two main contributions.

• We propose the method of MAP estimation in diffusion latents, which enables “imperfect” NR-IQA models to act as natural image priors for perceptual optimization.

• We apply the proposed method to compare eight contemporary NR-IQA models with a number of interesting observations.

2 Related Work

In this section, we first review the progress of NR-IQA. We then discuss model comparison methods from an analysis-by-synthesis perspective [44, 23]. Finally, we provide an overview of diffusion models and their applications to various computer vision tasks.

2.1 NR-IQA Models

In the early days, NR-IQA relied heavily on handcrafted natural scene statistics (NSS) [41, 42, 43, 20], followed by quality regression. In the deep learning era, end-to-end optimized deep neural networks (DNNs) for NR-IQA have been developed, under the constraint that human perceptual data, represented as mean opinion scores (MOSs), are scarce. Typical data-efficient training strategies include patch-wise training [2], quality-aware pre-training [37, 33, 82], learning from pseudo-labels [38, 75], meta learning [88], contrastive learning [39, 86, 53], and multitask learning [16, 59, 85]. Emerging NR-IQA research directions involve predicting local quality [78], combating subpopulation shift [83, 80, 84], exploring more powerful and unified computational architectures [30], and leveraging multi-modal information [68, 74].

2.2 Model Comparison Methods

Model comparison plays a pivotal role in model development [4]. Conventional model comparison involves preparing a fixed test set of ground-truth measurements and comparing computational models according to their goodness of fit. Instead of proving a model to be correct in such a discriminative way, we may as well test it in a generative way, known as analysis-by-synthesis in Grenander’s Pattern Theory [44, 23]. Representative generative model comparison methods include the maximum differentiation (MAD) competition [72], the eigen-distortion analysis [1], and the controversial stimuli synthesis [21]. While the core idea remains consistent, the synthesis process can be discretized into sample selection from a large-scale unlabeled set, making it more applicable to various vision tasks, including IQA [36], image recognition [67], semantic segmentation [76], and image enhancement [6]. Our work is closely related to the perceptual attack of NR-IQA models [81], as both can be cast into MAP estimation, but they differ substantially in design goals and implementation details. The perceptual attack assesses the adversarial robustness of NR-IQA models to visually imperceptible perturbations in the pixel space, while our method operates in the diffusion latent space, and compares NR-IQA models in terms of their ability to enhance the perceived quality of photographic images.

2.3 Diffusion Models in Vision

Diffusion models represent a family of generative models designed to reverse a process that progressively perturbs the training data over multiple steps [9]. Three variants of diffusion models are identified in the literature: 1) denoising diffusion probabilistic models [55, 25], inspired by non-equilibrium thermodynamics [48]; 2) noise conditioned score networks [57], trained through score matching [29] to estimate the gradient of the log noisy data density; and 3) stochastic differential equations [58], which unify the former two under a more general formulation. Diffusion models have offered remarkable results in a wide range of image synthesis tasks [11, 45, 40]. For finer control over generated results, classifier guidance [11] and classifier-free guidance [26] have been proposed at their respective costs of training noise-aware classifiers and class-conditional diffusion models. As for photographic image editing [24], diffusion inversion (i.e., identifying the noise vector that generates the photographic image to be edited through reverse diffusion) is necessary. EDICT [66] describes an elegant solution that converts any pre-trained diffusion model into a differentiable bijection through affine coupling [14, 15], achieving exact diffusion inversion without the overhead of additional training. In this paper, we supplement NR-IQA models with EDICT, leading to end-to-end MAP estimation of the diffusion latents for image enhancement.

3 Proposed Method

In this section, we first present the necessary preliminaries of diffusion models, in particular the EDICT model. We then introduce the proposed method, MAP estimation in diffusion latents. Finally, we supplement the details of the NR-IQA model comparison procedure using the proposed method.

3.1 Preliminaries

For a photographic image $\bm{x}\in\mathbb{R}^{L}$ subject to potential realistic camera distortion(s), the goal of an NR-IQA model $q_{\bm{w}}(\cdot)$ is to quantify the perceived quality of $\bm{x}$ to closely approximate the underlying MOS $q(\bm{x})\in\mathbb{R}$ .

We also work with a diffusion model [77]. For the forward diffusion process, a sequence of coefficients $\{\alpha_{t}\}_{t=0}^{T}$, indexed by the timestep $t$ with $\alpha_{0}=1$ and $\alpha_{T}=0$, defines a monotonically increasing noise schedule, according to which $\bm{x}_{0}$ is perturbed at the $t$-th step by $\bm{x}_{t}=\sqrt{\alpha_{t}}\bm{x}_{0}+\sqrt{1-\alpha_{t}}\bm{\epsilon}$, where $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\mathbf{I})$. For the reverse diffusion process (i.e., the data generation process), the diffusion model trains a denoising network $\bm{\epsilon}_{\bm{\theta}}(\cdot;t):\mathbb{R}^{L}\mapsto\mathbb{R}^{L}$, parameterized by a vector $\bm{\theta}$ and conditioned on a time index $t$, to progressively remove the noise, starting from a pure Gaussian noise vector [56]:

$$\begin{aligned}\hat{\bm{x}}_{t-1}&=\sqrt{\alpha_{t-1}}\frac{\hat{\bm{x}}_{t}-\sqrt{1-\alpha_{t}}\bm{\epsilon}_{\bm{\theta}}(\hat{\bm{x}}_{t};t)}{\sqrt{\alpha_{t}}}+\sqrt{1-\alpha_{t-1}}\bm{\epsilon}_{\bm{\theta}}(\hat{\bm{x}}_{t};t)\\&=a_{t}\hat{\bm{x}}_{t}+b_{t}\bm{\epsilon}_{\bm{\theta}}(\hat{\bm{x}}_{t};t),\end{aligned}\tag{3}$$

where $a_{t}=\sqrt{\frac{\alpha_{t-1}}{\alpha_{t}}}$, $b_{t}=-\sqrt{\frac{\alpha_{t-1}(1-\alpha_{t})}{\alpha_{t}}}+\sqrt{1-\alpha_{t-1}}$, and $\hat{\bm{x}}_{T}=\bm{\epsilon}$. Key to the success of the diffusion model is the denoising network $\bm{\epsilon}_{\bm{\theta}}(\cdot;t)$, which can be trained by minimizing

$$\mathbb{E}_{\bm{x}_{0},\bm{\epsilon}\sim\mathcal{N}(\bm{0},\mathbf{I}),t}\left[\lambda(t)\lVert\bm{\epsilon}-\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t)\rVert_{2}^{2}\right],\tag{4}$$

where $\lambda(t)$ is a positive weighting for the $t$-th step, and $\bm{x}_{t}$ can be computed from $\bm{x}_{0}$. The above denoising step is not exactly invertible because the linearization assumption $\bm{\epsilon}_{\bm{\theta}}(\hat{\bm{x}}_{t},t)=\bm{\epsilon}_{\bm{\theta}}(\hat{\bm{x}}_{t-1},t)$ does not strictly hold. Noticing the resemblance between the denoising step in diffusion models and the affine coupling operation in normalizing flows [14, 15], Wallace et al. [66] made Eq. (3) invertible by maintaining two coupled noise vectors (and intermediate states):

$$\bm{x}_{t-1}=a_{t}\bm{x}_{t}+b_{t}\bm{\epsilon}_{\bm{\theta}}(\bm{y}_{t},t),\tag{5}$$
$$\bm{y}_{t-1}=a_{t}\bm{y}_{t}+b_{t}\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t-1},t),\quad t\in\{1,\ldots,T\},\tag{6}$$

where $\bm{x}_{T}=\bm{y}_{T}=\bm{\epsilon}$. Here we strip off the “hat” symbol in the math notation to emphasize the bijectivity of Eqs. (5) and (6) in comparison to Eq. (3). To improve the stability and faithfulness to the original diffusion dynamics, a mixing operation is added, leading to the following denoising process:

$$\bm{x}_{t}^{\prime}=a_{t}\bm{x}_{t}+b_{t}\bm{\epsilon}_{\bm{\theta}}(\bm{y}_{t},t),\tag{7}$$
$$\bm{y}_{t}^{\prime}=a_{t}\bm{y}_{t}+b_{t}\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t}^{\prime},t),\tag{8}$$
$$\bm{x}_{t-1}=p\bm{x}_{t}^{\prime}+(1-p)\bm{y}_{t}^{\prime},\tag{9}$$
$$\bm{y}_{t-1}=p\bm{y}_{t}^{\prime}+(1-p)\bm{x}_{t-1},\quad t\in\{1,\ldots,T\},\tag{10}$$

where $p\in[0,1]$ is a mixing parameter to alleviate the violation of the linearization assumption.

Eqs. (7) to (10) can be linearly inverted, yielding a deterministic noising process (i.e., feature transform):

$$\bm{y}_{t}^{\prime}=(\bm{y}_{t-1}-(1-p)\bm{x}_{t-1})/p,\tag{11}$$
$$\bm{x}_{t}^{\prime}=(\bm{x}_{t-1}-(1-p)\bm{y}_{t}^{\prime})/p,\tag{12}$$
$$\bm{y}_{t}=(\bm{y}_{t}^{\prime}-b_{t}\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t}^{\prime},t))/a_{t},\tag{13}$$
$$\bm{x}_{t}=(\bm{x}_{t}^{\prime}-b_{t}\bm{\epsilon}_{\bm{\theta}}(\bm{y}_{t},t))/a_{t},\quad t\in\{1,\ldots,T\}.\tag{14}$$

We collectively denote the differentiable and bijective feature transform from the coupled input images, $(\bm{x}_{0},\bm{y}_{0})$ where $\bm{y}_{0}=\bm{x}_{0}$ , to the noise representation $(\bm{x}_{T},\bm{y}_{T})$ as $\bm{f}(\cdot)$ $:\mathbb{R}^{2L}\mapsto\mathbb{R}^{2L}$ , and denote its inverse $\bm{f}^{-1}$ as $\bm{h}(\cdot)$ $:\mathbb{R}^{2L}\mapsto\mathbb{R}^{2L}$ .
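The bijectivity of Eqs. (7) to (14) can be checked numerically. The following is a minimal Python sketch under stated assumptions, not the EDICT implementation: `toy_eps` is a hypothetical stand-in for the denoising network $\bm{\epsilon}_{\bm{\theta}}$, the noise schedule is arbitrary (with $\alpha_{T}$ kept slightly above zero so that $a_{T}$ stays finite), and images are short lists of floats.

```python
import math
import random

def coeffs(alphas, t):
    """a_t and b_t as defined after Eq. (3)."""
    a = math.sqrt(alphas[t - 1] / alphas[t])
    b = (-math.sqrt(alphas[t - 1] * (1 - alphas[t]) / alphas[t])
         + math.sqrt(1 - alphas[t - 1]))
    return a, b

def toy_eps(z, t):
    """Hypothetical stand-in for the denoising network (assumption)."""
    return [math.tanh(v) * (0.1 + 0.01 * t) for v in z]

def axpby(a, x, b, y):
    """Elementwise a*x + b*y on lists of floats."""
    return [a * xi + b * yi for xi, yi in zip(x, y)]

def edict_denoise(x, y, alphas, p=0.93):
    """h: map latents (x_T, y_T) to images (x_0, y_0) via Eqs. (7)-(10)."""
    for t in range(len(alphas) - 1, 0, -1):
        a, b = coeffs(alphas, t)
        x_p = axpby(a, x, b, toy_eps(y, t))      # Eq. (7)
        y_p = axpby(a, y, b, toy_eps(x_p, t))    # Eq. (8)
        x = axpby(p, x_p, 1 - p, y_p)            # Eq. (9)
        y = axpby(p, y_p, 1 - p, x)              # Eq. (10)
    return x, y

def edict_noise(x, y, alphas, p=0.93):
    """f: map images (x_0, y_0) to latents (x_T, y_T) via Eqs. (11)-(14)."""
    for t in range(1, len(alphas)):
        a, b = coeffs(alphas, t)
        y_p = axpby(1 / p, y, -(1 - p) / p, x)            # Eq. (11)
        x_p = axpby(1 / p, x, -(1 - p) / p, y_p)          # Eq. (12)
        y = axpby(1 / a, y_p, -b / a, toy_eps(x_p, t))    # Eq. (13)
        x = axpby(1 / a, x_p, -b / a, toy_eps(y, t))      # Eq. (14)
    return x, y

# Toy schedule (assumption): alpha_0 = 1, decreasing to alpha_T = 0.05.
T = 10
alphas = [1.0 - 0.095 * t for t in range(T + 1)]
random.seed(0)
x0 = [random.gauss(0, 1) for _ in range(8)]

xT, yT = edict_noise(x0, x0, alphas)             # images -> latents
x_rec, y_rec = edict_denoise(xT, yT, alphas)     # latents -> images
max_err = max(abs(u - v) for u, v in zip(x_rec + y_rec, x0 + x0))
```

In this toy setting, `max_err` should sit at the level of floating-point round-off, since each denoising step undoes the corresponding noising step exactly regardless of how `toy_eps` is defined.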

3.2 MAP Estimation in Diffusion Latents

In Bayesian statistics, the MAP estimate of an unknown quantity, as a point estimate, is the mode of its posterior distribution. In the context of image enhancement, MAP estimation corresponds to

$$\begin{aligned}\bm{x}^{\star}&=\operatorname*{arg\,max}_{\bm{x}}p(\bm{x}|\bm{x}^{\mathrm{init}})=\operatorname*{arg\,max}_{\bm{x}}p(\bm{x}^{\mathrm{init}}|\bm{x})p(\bm{x})\\&=\operatorname*{arg\,min}_{\bm{x}}-\log p(\bm{x}^{\mathrm{init}}|\bm{x})-\log p(\bm{x}).\end{aligned}\tag{15}$$

The first term measures the image fidelity with respect to the initial $\bm{x}^{\mathrm{init}}$ , and is traditionally implemented by the mean squared error (MSE) to admit efficient (iterative) solvers, such as half-quadratic splitting [18]. Other FR-IQA models as distance functions are also applicable. The second term quantifies image naturalness. Typical options include the total variation regularization [52], the Gaussian scale mixture prior [49], the (non-local) patch-based priors [3], the sparsity regularization [73], the low-rank prior [5], the deep image prior [63], and the adversarial loss [22].

Algorithm 1 Gradient-based Solver for MAP Estimation in Diffusion Latents

Require: An NR-IQA model, $q_{\bm{w}}(\cdot)$; an FR-IQA model as the image fidelity measure, $D(\cdot,\cdot)$; the EDICT model to map a test image into diffusion latents, $\bm{f}(\cdot)$, and its inverse, $\bm{h}(\cdot)$; the number of optimization steps, $\mathrm{MaxIter}$; the number of diffusion steps, $T$; the trade-off parameter, $\lambda$; the momentum parameter, $\rho$; and the learning rate, $\gamma$
Input: An initial image $\bm{x}^{\mathrm{init}}$
Output: A pair of optimized images $(\bm{x}^{\star},\bm{y}^{\star})$ with nearly identical content and enhanced perceptual quality

$\bm{m}_{x}=\bm{0},\ \bm{m}_{y}=\bm{0}$
$(\bm{x}_{T},\bm{y}_{T})\leftarrow\bm{f}(\bm{x}^{\mathrm{init}},\bm{x}^{\mathrm{init}})$
for $i=1\to\mathrm{MaxIter}$ do
    $\ell(\bm{x}_{T})\leftarrow D(\bm{h}(\bm{x}_{T}),\bm{x}^{\mathrm{init}})-\lambda q_{\bm{w}}(\bm{h}(\bm{x}_{T}))$    $\triangleright$ $\bm{y}_{T}$ is omitted in $\bm{h}(\cdot)$ for notation simplicity
    $\ell(\bm{y}_{T})\leftarrow D(\bm{h}(\bm{y}_{T}),\bm{x}^{\mathrm{init}})-\lambda q_{\bm{w}}(\bm{h}(\bm{y}_{T}))$    $\triangleright$ $\bm{x}_{T}$ is omitted in $\bm{h}(\cdot)$ for notation simplicity
    $\ell(\bm{x}_{T},\bm{y}_{T})\leftarrow(\ell(\bm{x}_{T})+\ell(\bm{y}_{T}))/2$
    $\Delta\ell(\bm{x}_{T})\leftarrow\partial\ell(\bm{x}_{T},\bm{y}_{T})/\partial\bm{x}_{T}$
    $\Delta\ell(\bm{y}_{T})\leftarrow\partial\ell(\bm{x}_{T},\bm{y}_{T})/\partial\bm{y}_{T}$
    $(\bm{m}_{x},\bm{m}_{y})\leftarrow\rho(\bm{m}_{x},\bm{m}_{y})-\gamma(\Delta\ell(\bm{x}_{T}),\Delta\ell(\bm{y}_{T}))$
    $(\bm{x}_{T},\bm{y}_{T})\leftarrow(\bm{x}_{T},\bm{y}_{T})+(\bm{m}_{x},\bm{m}_{y})$
end for
$(\bm{x}^{\star},\bm{y}^{\star})\leftarrow\bm{h}(\bm{x}_{T},\bm{y}_{T})$

As discussed in the Introduction, it is enticing to plug NR-IQA models as priors into MAP estimation, guiding the optimization to find the best-quality image in the vicinity of the initial $\bm{x}^{\mathrm{init}}$ . However, existing NR-IQA models are trained in a discriminative way on finite sets of limited distortion appearances, and seem to be of little use as naturalness priors (as shown in Fig. 1). Inspired by [65], we propose to supplement existing NR-IQA models with a differentiable and bijective diffusion model $\bm{h}(\cdot)$ described in Sec. 3.1, and thus equip them with generative capabilities. This leads to MAP estimation in diffusion latents:

$$\bm{x}_{T}^{\star}=\operatorname*{arg\,min}_{\bm{x}_{T}}D\left(\bm{h}(\bm{x}_{T}),\bm{x}^{\mathrm{init}}\right)-\lambda q_{\bm{w}}(\bm{h}(\bm{x}_{T})),\tag{16}$$

where we have omitted the coupled noise vector $\bm{y}_{T}$ for notation simplicity. Eq. (16) can be directly solved by gradient-based optimizers, as summarized in Algorithm 1. Empirically, the optimized pair of images $(\bm{x}^{\star},\bm{y}^{\star})$ are nearly identical in content and appearance, consistent with the observation in [65].

As $\bm{h}(\cdot)$ is bijective, it is convenient to instantiate $D(\cdot,\cdot)$ as the MSE in the diffusion latents, i.e., $\|\bm{x}_{T}-\bm{x}_{T}^{\mathrm{init}}\|_{2}^{2}/L$ . We have tried other FR-IQA models such as the structural similarity (SSIM) index [71] in the raw pixel domain, the deep image structure and texture similarity (DISTS) metric [13] in the VGG feature domain, and the MSE in the CLIP feature domain [50], and find that the optimization is fairly robust to the selection of FR-IQA models. In Fig. 2, we emphasize the important role of the image fidelity term in maintaining the semantic faithfulness of the optimized image $\bm{x}^{\star}$ to the initial $\bm{x}^{\mathrm{init}}$ , in contrast to the direct diffusion latent optimization [65] (i.e., $\operatorname*{arg\,max}_{\bm{x}_{T}}q_{\bm{w}}(\bm{h}(\bm{x}_{T}))$ by setting $\lambda\rightarrow\infty$ ).
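The momentum-based solver for Eq. (16) can be illustrated on a toy problem. The sketch below is a simplified Python illustration under several assumptions, not the authors' implementation: the bijection `h` is `sinh` applied elementwise instead of EDICT, `quality` is a hypothetical differentiable score that prefers pixel values near 1, a single latent vector is optimized rather than a coupled pair, gradients come from central finite differences instead of backpropagation, and the step size and iteration count are chosen for this toy problem rather than taken from the experiments.

```python
import math

def h(z):
    """Toy bijection from latents to pixels (assumption; EDICT in the paper)."""
    return [math.sinh(v) for v in z]

def f(x):
    """Inverse of h: pixels to latents."""
    return [math.asinh(v) for v in x]

def quality(x):
    """Hypothetical differentiable quality score that peaks at pixel value 1."""
    return -sum((v - 1.0) ** 2 for v in x) / len(x)

def loss(z, z_init, lam):
    """Objective of Eq. (16) with D instantiated as the MSE in the latents."""
    fidelity = sum((a - b) ** 2 for a, b in zip(z, z_init)) / len(z)
    return fidelity - lam * quality(h(z))

def grad(z, z_init, lam, step=1e-6):
    """Central finite differences, standing in for backpropagation."""
    g = []
    for i in range(len(z)):
        zp, zm = list(z), list(z)
        zp[i] += step
        zm[i] -= step
        g.append((loss(zp, z_init, lam) - loss(zm, z_init, lam)) / (2 * step))
    return g

def map_in_latents(x_init, lam=1.0, max_iter=100, rho=0.9, gamma=0.05):
    """Gradient descent with momentum on the latent vector."""
    z_init = f(x_init)
    z, m = list(z_init), [0.0] * len(z_init)
    for _ in range(max_iter):
        g = grad(z, z_init, lam)
        m = [rho * mi - gamma * gi for mi, gi in zip(m, g)]  # momentum update
        z = [zi + mi for zi, mi in zip(z, m)]                # latent update
    return h(z), z, z_init

x_init = [0.0, 0.2, -0.5, 0.4]
x_star, z_star, z_init = map_in_latents(x_init)
```

Because the fidelity term is zero at the starting point, any latent that lowers the objective must also raise the quality score, which is the trade-off that $\lambda$ controls.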

3.3 Model Comparison Procedure

By plugging different NR-IQA models into the paradigm of MAP estimation in diffusion latents (in Eq. (16)), we are likely to obtain different enhanced versions of the same test image in terms of color and detail reproduction. Fig. 3 shows a visual demonstration, where we adopt eight NR-IQA models to enhance the same image with realistic out-of-focus blur. Visual inspection of the eight images reveals their relative capabilities in guiding image enhancement.

We formalize the idea of NR-IQA model comparison using MAP estimation in diffusion latents. We first gather a set of $M$ photographic images with realistic camera distortions, $\mathcal{U}=\{\bm{x}^{(i)}\}_{i=1}^{M}$ , and employ $N$ NR-IQA models $\{q^{(j)}_{\bm{w}}\}_{j=1}^{N}$ . Due to the high dimensionality, nonlinearity, and nonconvexity of the objectives defined in Eq. (16), the optimal $\lambda$ may vary across test images and across NR-IQA models. To cope with this subtlety, we try $K$ different $\lambda$ ’s for each combination of test images and NR-IQA models. This gives rise to $K$ candidate enhancement results, with the best one identified using the staircase method in psychophysics [8] by three human subjects.

Consequently, from a total of $M\times N$ enhanced images, we form $M\times\binom{N}{2}$ image pairs, which are subject to formal psychophysical testing using two-alternative forced choice (2AFC) to differentiate fine-grained quality variations. In each trial, subjects are presented with a pair of enhanced images of the same content, corresponding to two different NR-IQA models. They are given unlimited viewing time to select the image with higher perceived quality. More details regarding the psychophysical experiment are given in Sec. 4.1.
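As a quick illustration of the pairing scheme, the sketch below enumerates the 2AFC pairs in Python. The helper `build_pairs` is hypothetical; the values $M=20$ and $N=8$ and the model names are those used in the experiments of Sec. 4.1.

```python
from itertools import combinations
from math import comb

def build_pairs(num_images, model_names):
    """Enumerate all 2AFC pairs: every image crossed with every model pair."""
    return [(i, a, b)
            for i in range(num_images)
            for a, b in combinations(model_names, 2)]

models = ["NIQE", "DBCNN", "HyperIQA", "PaQ-2-PiQ",
          "UNIQUE", "MUSIQ", "CLIPIQA+", "LIQE"]
pairs = build_pairs(20, models)
print(len(pairs))  # 20 * C(8, 2) = 560
```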

After subjective testing, we arrange the raw subjective data into an $M\times N\times N$ tensor $\mathbf{C}$, where $C_{ijk}$ records the empirical probability that $q_{\bm{w}}^{(j)}$ is preferred over $q_{\bm{w}}^{(k)}$ for the $i$-th test image $\bm{x}^{(i)}$. Under the Thurstone Case V assumption [61], the underlying quality score $Q^{\star}_{ij}$ of the $i$-th enhanced image associated with the $j$-th NR-IQA model can be inferred via maximum likelihood estimation [46]:

$$\mathbf{Q}^{\star}_{i\bullet}=\operatorname*{arg\,max}_{\mathbf{Q}_{i\bullet}}\sum_{j,k}C_{ijk}\log\Phi\left(\frac{Q_{ij}-Q_{ik}}{\sqrt{2}\sigma}\right),\quad\textrm{s.t.}\ \sum_{j=1}^{N}Q_{ij}=0,\tag{17}$$

where we add a linear constraint to resolve the translation ambiguity in the optimization objective. $\mathbf{Q}^{\star}_{i\bullet}$ denotes the $i$-th row of the quality matrix $\mathbf{Q}^{\star}$ and $\Phi(\cdot)$ is the standard Normal cumulative distribution function. $\sigma$ is the standard deviation of the observer model, which is set to $1.0484$, so that a difference of one unit corresponds to $75\%$ of subjects preferring one image over the other. In other words, two such images are one unit apart on the just-objectionable-difference [47] scale. Last, we average $\mathbf{Q}^{\star}$ along the row dimension (i.e., over $M$ test images), resulting in the global ranking performance of $N$ NR-IQA models through MAP estimation in diffusion latents.
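The score inference for a single test image can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it maximizes the log-likelihood $\sum_{j,k}C_{jk}\log\Phi(\cdot)$, which is the standard Thurstone Case V formulation, by plain gradient ascent with a zero-mean projection; the preference matrix `C`, the step size, and the iteration count are hypothetical.

```python
import math

SIGMA = 1.0484  # observer-model standard deviation from the text

def phi_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi_pdf(z):
    """Standard Normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def thurstone_scores(C, sigma=SIGMA, lr=0.5, iters=2000):
    """Gradient ascent on sum_{j,k} C[j][k] * log Phi((Q_j - Q_k) / (sqrt(2) * sigma))."""
    n = len(C)
    s = math.sqrt(2.0) * sigma
    Q = [0.0] * n
    for _ in range(iters):
        g = [0.0] * n
        for j in range(n):
            for k in range(n):
                if j == k:
                    continue
                z = (Q[j] - Q[k]) / s
                w = C[j][k] * phi_pdf(z) / (s * max(phi_cdf(z), 1e-12))
                g[j] += w   # derivative of the (j, k) term w.r.t. Q_j
                g[k] -= w   # derivative of the (j, k) term w.r.t. Q_k
        Q = [q + lr * gi for q, gi in zip(Q, g)]
        mean = sum(Q) / n
        Q = [q - mean for q in Q]  # enforce the zero-sum constraint
    return Q

# Hypothetical preference probabilities for three models on one image:
# C[j][k] is the fraction of subjects preferring model j over model k.
C = [[0.0, 0.8, 0.9],
     [0.2, 0.0, 0.7],
     [0.1, 0.3, 0.0]]
Q = thurstone_scores(C)
```

With this `C`, the first model is preferred in both of its comparisons and the third in neither, so the inferred scores should be ordered accordingly. The choice of $\sigma$ can also be checked directly: `phi_cdf(1 / (math.sqrt(2) * SIGMA))` evaluates to approximately 0.75.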

4 Experiments

In this section, we first describe the selection of the test images to be enhanced and the competing NR-IQA models, followed by other implementation details. We proceed to show and analyze the model comparison results. Additionally, we qualitatively analyze the failure cases of different NR-IQA models identified by MAP estimation in diffusion latents.

4.1 Experimental Setups

Image Selection. We select $20$ photographic images with visible realistic camera distortions from three IQA datasets: LIVE Challenge [19], KonIQ-10k [27], and SPAQ [16]. We use bicubic interpolation and center cropping to resize all images to $512\times 512$ pixels (see Fig. 4). Care is taken to exclude images with extremely low quality or with no main subjects in the foreground. This is because the enhanced images in those cases are less quality-discriminable due to the challenge of the enhancement task or limited semantics to be enhanced.

Model Selection. We select eight state-of-the-art NR-IQA models with different design philosophies: NIQE [42], DBCNN [82], HyperIQA [60], PaQ-2-PiQ [78], UNIQUE [83], MUSIQ [30], CLIPIQA+ [68], and LIQE [85], among which NIQE is a knowledge-driven training-free method, while the rest are data-driven. All implementations are obtained from the respective authors, and are tested with the default settings.

Optimization Details. Following the suggestions by the Video Quality Experts Group [64], we adopt a four-parameter logistic function to compensate for the prediction nonlinearity of different NR-IQA models:

$$q_{\bm{\xi}}\circ q_{\bm{w}}(\bm{x})=\frac{\xi_{1}-\xi_{2}}{1+\exp\left(-\frac{q_{\bm{w}}(\bm{x})-\xi_{3}}{|\xi_{4}|}\right)}+\xi_{2},\tag{18}$$

where $\circ$ indicates function composition. We manually enforce $\xi_{1}=100$ and $\xi_{2}=0$, which determine the maximum and minimum mapping values, respectively, and learn $\xi_{3}$ and $\xi_{4}$ from data. $q_{\bm{\xi}}(\cdot)$ can be seen as part of the NR-IQA model, making the selection of the trade-off parameter $\lambda$ under MAP estimation in diffusion latents much easier.
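The mapping in Eq. (18) is simple enough to sketch directly. In the Python snippet below, the function name `logistic_map` and the values of $\xi_{3}$ and $\xi_{4}$ are hypothetical and serve only as an illustration; in practice, both parameters are fitted to data for each NR-IQA model.

```python
import math

def logistic_map(q, xi3, xi4, xi1=100.0, xi2=0.0):
    """Four-parameter logistic of Eq. (18); xi1 and xi2 are the fixed max and min."""
    return (xi1 - xi2) / (1.0 + math.exp(-(q - xi3) / abs(xi4))) + xi2

# Hypothetical fitted values for illustration only (assumption, not from the paper).
xi3, xi4 = 0.5, 0.1
scores = [logistic_map(q, xi3, xi4) for q in (0.2, 0.5, 0.8)]
```

The mapping is monotonically increasing, returns the midpoint of the range when the raw prediction equals $\xi_{3}$, and places all models on a common $(0, 100)$ scale, which is why a shared set of candidate $\lambda$ values becomes workable across models.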

We use Stable Diffusion v1.4 [51] within EDICT. Despite being text-conditional, Stable Diffusion also allows unconditional image synthesis with a null text prompt. We leverage this property in our experiments to avoid any potential risk of semantic drift, which is undesirable in image enhancement. Following [66], we configure EDICT with the mixing parameter $p=0.93$ (in Eqs. (9) and (10)) and $T=50$ diffusion steps. We optimize each image for $\mathrm{MaxIter}=15$ iterations with the learning rate $\gamma=1$ and the momentum parameter $\rho=0.9$. Three candidate trade-off parameter values, $\lambda\in\{0.01,0.1,1\}$, are tested, and the final value is selected using the staircase method.

Psychophysical Testing Details. The psychophysical experiment is carried out in an indoor office environment with normal lighting and no reflective walls or floor. All image pairs are displayed on a calibrated 24-inch LCD monitor at a resolution of $1{,}920\times 1{,}080$ pixels, with randomized spatial order. The 2AFC method is used, where the subject is forced to select the image of higher perceived quality. We recruit $25$ human subjects ($15$ males and $10$ females), aged between $23$ and $36$, to participate in our subjective study. A training session is included to familiarize each subject with the study. To avoid the fatigue effect, subjects are allowed to take a break at any time.

4.2 Model Comparison Results

After subjective testing, we obtain the quality scores of all optimized images as stated in Sec. 3.3. We then average the scores over the $20$ test images, yielding the global ranking of the eight competing NR-IQA models as shown in Fig. 5. We accompany the ranking result with a two-tailed statistical $t$ -test [10], where the null hypothesis states that the quality scores $\mathbf{Q}^{\star}_{\bullet j}$ for the $j$ -th NR-IQA model and $\mathbf{Q}^{\star}_{\bullet k}$ for the $k$ -th NR-IQA model come from the same Normal distribution. When the test fails to reject the null hypothesis at the $\alpha=5\%$ significance level, the performance of the two NR-IQA models is statistically indistinguishable.

We draw several interesting observations from Fig. 5. First, LIQE [85] and UNIQUE [83] outperform the other methods by a statistically significant margin, demonstrating the efficacy of learning to rank image quality on multiple datasets. The performance gap between LIQE and UNIQUE can be ascribed to the differences in the volume of data used for pre-training and the choice of backbone networks (see Table 1). Second, although DBCNN [82], MUSIQ [30], and HyperIQA [60] employ different backbone networks, they achieve close optimization performance, as evidenced by the statistical significance test. It is noteworthy that all three models are trained on KonIQ-10k [27], which suggests that the training data play a more decisive role than the choice of backbone network. Third, CLIPIQA+ [68] delivers inferior performance, which may appear surprising at first given that it shares the CLIP backbone with the top-performing LIQE. Upon closer examination, we find that CLIPIQA+ employs the prompt learning strategy [87], with the pre-trained weights fixed. This constraint is likely to hinder CLIPIQA+ from learning or eliciting quality-aware computation. Fourth, although the knowledge-driven model NIQE [42] aims ambitiously to handle arbitrary distortions, it does not perform well under our paradigm of MAP estimation in diffusion latents. This underlines the difficulty of manually crafting features to model the intricate interactions between diverse image content and realistic camera distortions.

Table 1: Ranking results of eight NR-IQA models. A smaller rank indicates better performance. “Multiple” in the second column indicates that UNIQUE and LIQE are trained on the combined datasets of LIVE, CSIQ, BID, CLIVE, KonIQ-10k, and KADID-10k. The number of parameters in the third column is in millions. The fourth column gives the SRCC rank, with the SRCC value in brackets. $\Delta$ Rank is the SRCC rank minus the MAP rank.

NR-IQA Model	Training Set	# of Params (M)	SRCC Rank (SRCC)	MAP Rank	$\Delta$ Rank
NIQE [42]	–	0.001	8 (0.706)	8	0
CLIPIQA+ [68]	KonIQ-10k	101.349	2 (0.855)	7	-5
PaQ-2-PiQ [78]	FLIVE	11.704	5 (0.827)	6	-1
HyperIQA [60]	KonIQ-10k	27.375	7 (0.776)	5	2
MUSIQ [30]	KonIQ-10k	27.125	3 (0.853)	4	-1
DBCNN [82]	KonIQ-10k	15.311	6 (0.789)	3	3
UNIQUE [83]	Multiple	22.322	4 (0.838)	2	2
LIQE [85]	Multiple	150.976	1 (0.881)	1	0

We also compare the ranking results by the proposed MAP estimation and by the Spearman’s rank correlation coefficient (SRCC) on SPAQ [16] in Table 1. None of the NR-IQA models has been exposed to SPAQ during training, leading to a cross-dataset evaluation setup. The primary observation from Table 1 is that higher SRCC performance does not necessarily transfer to MAP estimation in diffusion latents. In particular, CLIPIQA+ [68] attains the second-highest SRCC result, but it ranks second to last in MAP estimation. This is in stark contrast to UNIQUE, DBCNN, and HyperIQA, which rank higher in MAP estimation than in SRCC. In summary, our method discriminates the relative performance of NR-IQA models through analysis-by-synthesis, complementing conventional correlation-based performance measures.
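The rank shifts in Table 1 can be recomputed with a few lines of Python. The sketch below copies the two rank columns from the table; the helper names are hypothetical, and the Spearman correlation between the two rankings is not a figure reported in the experiments but simple arithmetic on the tabulated ranks, included only to quantify how loosely the two orderings agree.

```python
# (SRCC rank, MAP rank) per model, copied from Table 1.
ranks = {
    "NIQE": (8, 8), "CLIPIQA+": (2, 7), "PaQ-2-PiQ": (5, 6),
    "HyperIQA": (7, 5), "MUSIQ": (3, 4), "DBCNN": (6, 3),
    "UNIQUE": (4, 2), "LIQE": (1, 1),
}

def delta_rank(srcc_rank, map_rank):
    """Positive values mean the model ranks better under MAP estimation."""
    return srcc_rank - map_rank

def spearman_from_ranks(pairs):
    """Spearman correlation for two tie-free rankings."""
    n = len(pairs)
    d2 = sum((a - b) ** 2 for a, b in pairs)
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

deltas = {name: delta_rank(*r) for name, r in ranks.items()}
rho = spearman_from_ranks(list(ranks.values()))
print(deltas["CLIPIQA+"], round(rho, 3))  # -5 0.476
```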

4.3 Failure Case Analysis

We showcase the use of the proposed MAP estimation in diffusion latents to spot the weaknesses of NR-IQA models. Visual inspection of the MAP-enhanced images shown in Fig. 6 exposes various failure cases of different NR-IQA models with distinctive visual characteristics. Although Figs. 6 (a), (b), and (e) show improved visual quality compared to their initial images, they exhibit noticeable local visual errors, such as false details and detail drift, indicating that these NR-IQA models need a better grasp of high-level semantics. Figs. 6 (c) and (f) are free from structural distortions but contain unrealistic cartoon-like textures or colors. Fig. 6 (d) has been overly smoothed. These findings point to a deficiency of these methods in handling low-level visual cues. Consistent with the observations in Fig. 5, the two least effective NR-IQA models, CLIPIQA+ and NIQE, either fail to induce perceptually meaningful enhancement, or produce an undesirable grainy appearance that is not present in the original image.

5 Conclusion and Discussion

We have developed MAP estimation in diffusion latents, a computational method that, for the first time, enables existing “imperfect” NR-IQA models to perform challenging (unconstrained) image enhancement. The key trick is to augment NR-IQA models with a differentiable and bijective diffusion model, empowering them with generative capabilities. Due to the differences in design principles and implementation details, different NR-IQA models are highly likely to yield distinct enhancement results for the same input image. This allows us to compare existing NR-IQA models in terms of their image enhancement ability, which falls within the general analysis-by-synthesis framework. We have systematically compared eight NR-IQA models using MAP estimation in diffusion latents, and complemented the conventional correlation-based model comparison paradigm in the context of perceptual optimization.

Despite the promise of the proposed method for NR-IQA model comparison, it is far from being a flawless and general solution to real-world image enhancement, as exemplified in the failure cases. In the future, we plan to further pursue this direction, towards efficient and reliable real-world image enhancement. From the data perspective, we will finetune the best-performing NR-IQA model, LIQE, on the combination of the datasets it was previously trained on and newly MAP-enhanced images, and plug the finetuned LIQE back in for MAP estimation. Our preliminary psychophysical experiments show that the finetuned LIQE can deliver MAP-optimized images of improved visual quality, which human subjects prefer over those produced by the original LIQE (more details in the Appendix). As in [69], the procedure of MAP estimation in diffusion latents, psychophysical testing, and model finetuning can be iterated, leading to a closed loop between model evaluation and model development. From the model perspective, encouraged by the promising enhancement results, it is appealing to construct generative NR-IQA models (i.e., the conditional probability of the test image given the quality score, $p(\bm{x}|y)$) rather than existing discriminative counterparts (i.e., $p(q|\bm{x})$). Another pragmatic avenue for research is to substantially reduce the computational complexity of taking the gradient with respect to the entire (reverse) diffusion process and to reliably determine the optimization hyper-parameters (e.g., the number of optimization steps, $\mathrm{MaxIter}$, and the trade-off parameter, $\lambda$).

References

[1] Berardino, A., Ballé, J., Laparra, V., Simoncelli, E.P.: Eigen-distortions of hierarchical representations. In: Adv. Neural Inform. Process. Syst. pp. 3531–3540 (2017)
[2] Bosse, S., Maniry, D., Müller, K., Wiegand, T., Samek, W.: Deep neural networks for no-reference and full-reference image quality assessment. IEEE Trans. Image Process. 27(1), 206–219 (Jan 2018)
[3] Buades, A., Coll, B., Morel, J.M.: A non-local algorithm for image denoising. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 60–65 (2005)
[4] Burnham, K.P., Anderson, D.R.: Model Selection and Multi-model Inference: A Practical Information-Theoretic Approach. Springer New York (2004)
[5] Cai, J.F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (Jan 2010)
[6] Cao, P., Wang, Z., Ma, K.: Debiased subjective assessment of real-world image enhancement. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 711–721 (2021)
[7] Ciancio, A., Targino da Costa, A.L.N.T., da Silva, E.A.B., Said, A., Samadani, R., Obrador, P.: No-reference blur assessment of digital pictures based on multifeature classifiers. IEEE Trans. Image Process. 20(1), 64–75 (Jan 2011)
[8] Cornsweet, T.N.: The staircase-method in psychophysics. Am. J. Psychol. 75(3), 485–491 (Sep 1962)
[9] Croitoru, F.A., Hondru, V., Ionescu, R.T., Shah, M.: Diffusion models in vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 45(9), 10850–10869 (Sep 2023)
[10] David, H.A.: The Method of Paired Comparisons. Hafner Publishing Company (1963)
[11] Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In: Adv. Neural Inform. Process. Syst. pp. 8780–8794 (2021)
[12] Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Comparison of full-reference image quality models for optimization of image processing systems. Int. J. Comput. Vis. 129(4), 1258–1281 (Apr 2021)
[13] Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: Unifying structure and texture similarity. IEEE Trans. Pattern Anal. Mach. Intell. 44(5), 2567–2581 (May 2022)
[14] Dinh, L., Krueger, D., Bengio, Y.: NICE: Non-linear independent components estimation. In: Int. Conf. Learn. Represent. Worksh. (2015)
[15] Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using real NVP. In: Int. Conf. Learn. Represent. (2017)
[16] Fang, Y., Zhu, H., Zeng, Y., Ma, K., Wang, Z.: Perceptual quality assessment of smartphone photography. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 3677–3686 (2020)
[17] Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: CLIP-Adapter: Better vision-language models with feature adapters. Int. J. Comput. Vis. pp. 1–15 (Sep 2023)
[18] Geman, D., Yang, C.: Nonlinear image recovery with half-quadratic regularization. IEEE Trans. Image Process. 4(7), 932–946 (Jul 1995)
[19] Ghadiyaram, D., Bovik, A.C.: Massive online crowdsourced study of subjective and objective picture quality. IEEE Trans. Image Process. 25(1), 372–387 (Jan 2016)
[20] Ghadiyaram, D., Bovik, A.C.: Perceptual quality prediction on authentically distorted images using a bag of features approach. J. Vis. 17(1), 32–32 (Jan 2017)
[21] Golan, T., Raju, P.C., Kriegeskorte, N.: Controversial stimuli: Pitting neural networks against each other as models of human cognition. Proc. Nat. Acad. Sci. 117(47), 29330–29337 (Nov 2020)
[22] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Adv. Neural Inform. Process. Syst. pp. 2672–2680 (2014)
[23] Grenander, U., Miller, M.I.: Pattern Theory: From Representation to Inference. Oxford University Press (2007)
[24] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross-attention control. In: Int. Conf. Learn. Represent. (2023)
[25] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Adv. Neural Inform. Process. Syst. pp. 6840–6851 (2020)
[26] Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: Adv. Neural Inform. Process. Syst. Worksh. (2021)
[27] Hosu, V., Lin, H., Sziranyi, T., Saupe, D.: KonIQ-10k: An ecologically valid database for deep learning of blind image quality assessment. IEEE Trans. Image Process. 29, 4041–4056 (Jan 2020)
[28] Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: Int. Conf. Learn. Represent. (2021)
[29] Hyvärinen, A., Dayan, P.: Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6(4) (Apr 2005)
[30] Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: Multi-scale image quality transformer. In: Int. Conf. Comput. Vis. pp. 5148–5157 (2021)
[31] Larson, E.C., Chandler, D.M.: Most apparent distortion: Full-reference image quality assessment and the role of strategy. J. Electron. Imaging 19(1), 1–21 (Jan 2010)
[32] Lin, H., Hosu, V., Saupe, D.: KADID-10k: A large-scale artificially distorted IQA database. In: Int. Conf. Multimedia and Expo. pp. 1–3 (2019)
[33] Liu, X., Weijer, J.v.d., Bagdanov, A.D.: RankIQA: Learning from rankings for no-reference image quality assessment. In: Int. Conf. Comput. Vis. pp. 1040–1049 (2017)
[34] Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. In: Int. Conf. Learn. Represent. (2017)
[35] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: Int. Conf. Learn. Represent. (2019)
[36] Ma, K., Duanmu, Z., Wang, Z., Wu, Q., Liu, W., Yong, H., Li, H., Zhang, L.: Group maximum differentiation competition: Model comparison with few samples. IEEE Trans. Pattern Anal. Mach. Intell. 42(4), 851–864 (Apr 2020)
[37] Ma, K., Liu, W., Zhang, K., Duanmu, Z., Wang, Z., Zuo, W.: End-to-end blind image quality assessment using deep neural networks. IEEE Trans. Image Process. 27(3), 1202–1213 (Mar 2018)
[38] Ma, K., Liu, X., Fang, Y., Simoncelli, E.P.: Blind image quality assessment by learning from multiple annotators. In: IEEE Int. Conf. Image Process. pp. 2344–2348 (2019)
[39] Madhusudana, P.C., Birkbeck, N., Wang, Y., Adsumilli, B., Bovik, A.C.: Image quality assessment using contrastive learning. IEEE Trans. Image Process. 31, 4149–4161 (Jun 2022)
[40] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: SDEdit: Guided image synthesis and editing with stochastic differential equations. In: Int. Conf. Learn. Represent. (2021)
[41] Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 21(12), 4695–4708 (Dec 2012)
[42] Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Sign. Process. Letters 20(3), 209–212 (Mar 2013)
[43] Moorthy, A.K., Bovik, A.C.: Blind image quality assessment: From natural scene statistics to perceptual quality. IEEE Trans. Image Process. 20(12), 3350–3364 (Dec 2011)
[44] Mumford, D.: Pattern theory: A unifying perspective. In: Eur. Cong. Math. pp. 187–224 (1994)
[45] Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic models. In: Int. Conf. Mach. Learn. pp. 8162–8171 (2021)
[46] Perez-Ortiz, M., Mantiuk, R.K.: A practical guide and software for analysing pairwise comparison experiments. arXiv preprint arXiv:1712.03686 (2017)
[47] Perez-Ortiz, M., Mikhailiuk, A., Zerman, E., Hulusic, V., Valenzise, G., Mantiuk, R.K.: From pairwise comparisons and rating to a unified quality scale. IEEE Trans. Image Process. 29, 1139–1151 (2020)
[48] Pokrovskii, V.N.: Thermodynamics of Complex Systems: Principles and Applications. IOP Publishing (2020)
[49] Portilla, J., Strela, V., Wainwright, M.J., Simoncelli, E.P.: Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Trans. Image Process. 12(11), 1338–1351 (Nov 2003)
[50] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Int. Conf. Mach. Learn. pp. 8748–8763 (2021)
[51] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 10684–10695 (2022)
[52] Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Phys. D: Nonlinear Phenom. 60(1-4), 259–268 (Nov 1992)
[53] Saha, A., Mishra, S., Bovik, A.C.: Re-IQA: Unsupervised learning for image quality assessment in the wild. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 5846–5855 (2023)
[54] Sheikh, H.R., Sabir, M.F., Bovik, A.C.: A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans. Image Process. 15(11), 3440–3451 (Nov 2006)
[55] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: Int. Conf. Mach. Learn. pp. 2256–2265 (2015)
[56] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: Int. Conf. Learn. Represent. (2021)
[57] Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. In: Adv. Neural Inform. Process. Syst. (2019)
[58] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: Int. Conf. Learn. Represent. (2021)
[59] Su, S., Hosu, V., Lin, H., Zhang, Y., Saupe, D.: KonIQ++: Boosting no-reference image quality assessment in the wild by jointly predicting image quality and defects. In: Brit. Mach. Vis. Conf. pp. 1–12 (2021)
[60] Su, S., Yan, Q., Zhu, Y., Zhang, C., Ge, X., Sun, J., Zhang, Y.: Blindly assess image quality in the wild guided by a self-adaptive hyper network. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 3667–3676 (2020)
[61] Thurstone, L.L.: A law of comparative judgment. Psychol. Rev. 34, 273–286 (Jul 1927)
[62] Tsai, M.F., Liu, T.Y., Qin, T., Chen, H.H., Ma, W.Y.: FRank: A ranking method with fidelity loss. In: ACM SIGIR Conf. Res. Develop. Inf. Retrieval. pp. 383–390 (2007)
[63] Ulyanov, D., Vedaldi, A., Lempitsky, V.: Deep image prior. Int. J. Comput. Vis. 128(7), 1867–1888 (Jul 2020)
[64] VQEG: Final report from the video quality experts group on the validation of objective models of video quality assessment (2003), http://www.vqeg.org
[65] Wallace, B., Gokul, A., Ermon, S., Naik, N.: End-to-end diffusion latent optimization improves classifier guidance. In: Int. Conf. Comput. Vis. pp. 7280–7290 (2023)
[66] Wallace, B., Gokul, A., Naik, N.: EDICT: Exact diffusion inversion via coupled transformations. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 22532–22541 (2023)
[67] Wang, H., Chen, T., Wang, Z., Ma, K.: I am going MAD: Maximum discrepancy competition for comparing classifiers adaptively. In: Int. Conf. Learn. Represent. (2020)
[68] Wang, J., Chan, K.C., Loy, C.C.: Exploring CLIP for assessing the look and feel of images. In: AAAI Conf. Artif. Intell. pp. 2555–2563 (2023)
[69] Wang, Z., Ma, K.: Active fine-tuning from gMAD examples improves blind image quality assessment. IEEE Trans. Pattern Anal. Mach. Intell. 44(9), 4577–4590 (Sep 2022)
[70] Wang, Z., Bovik, A.C.: Modern Image Quality Assessment. Morgan & Claypool (2006)
[71] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (Apr 2004)
[72] Wang, Z., Simoncelli, E.P.: Maximum differentiation (MAD) competition: A methodology for comparing computational models of perceptual quantities. J. Vis. 8(12), 8.1–8.13 (Sep 2008)
[73] Wright, J., Ma, Y., Mairal, J., Sapiro, G., Huang, T.S., Yan, S.: Sparse representation for computer vision and pattern recognition. Proc. IEEE 98(6), 1031–1044 (Jun 2010)
[74] Wu, H., Zhang, Z., Zhang, W., Chen, C., Liao, L., Li, C., Gao, Y., Wang, A., Zhang, E., Sun, W., Yan, Q., Min, X., Zhai, G., Lin, W.: Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2023)
[75] Wu, J., Ma, J., Liang, F., Dong, W., Shi, G., Lin, W.: End-to-end blind image quality prediction with cascaded deep neural network. IEEE Trans. Image Process. 29, 7414–7426 (Jun 2020)
[76] Yan, J., Zhong, Y., Fang, Y., Wang, Z., Ma, K.: Exposing semantic segmentation failures via maximum discrepancy competition. Int. J. Comput. Vis. 129(5), 1768–1786 (Mar 2021)
[77] Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., Yang, M.H.: Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surveys 56(4), 1–39 (Sep 2023)
[78] Ying, Z., Niu, H., Gupta, P., Mahajan, D., Ghadiyaram, D., Bovik, A.C.: From patches to pictures (PaQ-2-PiQ): Mapping the perceptual space of picture quality. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 3572–3582 (2020)
[79] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 586–595 (2018)
[80] Zhang, W., Li, D., Ma, C., Zhai, G., Yang, X., Ma, K.: Continual learning for blind image quality assessment. IEEE Trans. Pattern Anal. Mach. Intell. 45(3), 2864–2878 (Mar 2023)
[81] Zhang, W., Li, D., Min, X., Zhai, G., Guo, G., Yang, X., Ma, K.: Perceptual attacks of no-reference image quality models with human-in-the-loop. In: Adv. Neural Inform. Process. Syst. pp. 2916–2929 (2022)
[82] Zhang, W., Ma, K., Yan, J., Deng, D., Wang, Z.: Blind image quality assessment using a deep bilinear convolutional neural network. IEEE Trans. Circuit Syst. Video Technol. 30(1), 36–47 (Jan 2020)
[83] Zhang, W., Ma, K., Zhai, G., Yang, X.: Uncertainty-aware blind image quality assessment in the laboratory and wild. IEEE Trans. Image Process. 30, 3474–3486 (Mar 2021)
[84] Zhang, W., Ma, K., Zhai, G., Yang, X.: Task-specific normalization for continual learning of blind image quality models. IEEE Trans. Image Process. (2024), to appear
[85] Zhang, W., Zhai, G., Wei, Y., Yang, X., Ma, K.: Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 14071–14081 (2023)
[86] Zhao, K., Yuan, K., Sun, M., Li, M., Wen, X.: Quality-aware pre-trained models for blind image quality assessment. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 22302–22313 (2023)
[87] Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. Int. J. Comput. Vis. 130(9), 2337–2348 (Jul 2022)
[88] Zhu, H., Li, L., Wu, J., Dong, W., Shi, G.: MetaIQA: Deep meta-learning for no-reference image quality assessment. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 14131–14140 (2020)

0.A Appendix

0.A.1 Model Rectification

The annotated MAP-optimized images in Sec. 3.3 give us an opportunity to rectify NR-IQA models. Drawing inspiration from recent work on data-efficient model finetuning [28, 17], we introduce a lightweight quality rectifier $\bm{r}_{\bm{\phi}}(\cdot):\mathbb{R}^{L}\mapsto\mathbb{R}^{2}$, parameterized by a vector $\bm{\phi}$. Given an input image $\bm{x}$, $\bm{r}_{\bm{\phi}}$ shares the image feature transform of the NR-IQA model $q^{\star}_{\bm{w}}$, and produces a multiplicative scalar $s_{\bm{\phi}}(\bm{x})$ and an additive scalar $b_{\bm{\phi}}(\bm{x})$ to rectify the quality prediction: $q^{\star}(\bm{x})=s_{\bm{\phi}}(\bm{x})\times q^{\star}_{\bm{w}}(\bm{x})+b_{\bm{\phi}}(\bm{x})$. We fix the pre-trained weights $\bm{w}$ and optimize the rectifier parameters $\bm{\phi}$ on the combination of the previous training images and the newly MAP-optimized images. Specifically, we choose the best-performing model LIQE [85] for demonstration. LIQE matches an image $\bm{x}$ against all candidate textual descriptions, yielding a joint probability over three tasks: quality prediction, scene classification, and distortion type identification. We marginalize over the two auxiliary tasks to obtain the marginal probability of image quality. We use the image representation produced by LIQE as the input to the rectifier $\bm{r}_{\bm{\phi}}$. Following [85], we use AdamW [35] to train $\bm{r}_{\bm{\phi}}$ by minimizing the fidelity loss [62] for $15$ epochs with an initial learning rate of $10^{-3}$, which is scheduled by a cosine annealing rule [34]. The mini-batch size is set to $4$ for the LIVE [54], CSIQ [31], BID [7], and LIVE Challenge [19] datasets, $16$ for KonIQ-10k [27] and KADID-10k [32], and $8$ for the set of MAP-enhanced images.
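To make the rectification step concrete, the following is a minimal sketch in plain Python, not the actual implementation. It assumes a single linear layer as the rectifier head (the text only specifies a map from $\mathbb{R}^{L}$ to $\mathbb{R}^{2}$), the standard form of the fidelity loss between a target probability $p$ and a predicted probability $\hat{p}$, and a cosine annealing schedule that decays to zero. The names `rectifier_head`, `rectify`, `fidelity_loss`, and `cosine_lr` are hypothetical.

```python
import math


def rectifier_head(feature, weights, bias):
    """Hypothetical linear head R^L -> R^2 returning (s, b).

    `weights` holds two rows of length L; `bias` holds two scalars.
    A single linear layer is an assumption made for illustration.
    """
    s = sum(w * f for w, f in zip(weights[0], feature)) + bias[0]
    b = sum(w * f for w, f in zip(weights[1], feature)) + bias[1]
    return s, b


def rectify(q_w, s, b):
    """Rectified quality prediction: s * q_w + b."""
    return s * q_w + b


def fidelity_loss(p, p_hat):
    """Fidelity loss between two Bernoulli probabilities (zero when p == p_hat)."""
    return 1.0 - math.sqrt(p * p_hat) - math.sqrt((1.0 - p) * (1.0 - p_hat))


def cosine_lr(epoch, total_epochs=15, base_lr=1e-3):
    """Cosine annealing from base_lr to 0 (minimum rate of 0 is an assumption)."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))


# With zero weights and bias (1, 0), the rectifier leaves the score unchanged.
feature = [0.2, -0.5, 1.3]                 # toy image representation, L = 3
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
bias = [1.0, 0.0]
s, b = rectifier_head(feature, weights, bias)
rectified = rectify(3.2, s, b)             # 3.2 is a made-up raw prediction
```

Initializing the head so that $s=1$ and $b=0$ is one possible design choice: training then starts from the unmodified pre-trained predictor and only departs from it where the new annotations demand.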

We compare the MAP-enhanced images produced by LIQE$^{\mathrm{rect}}$ and the original LIQE in a subjective experiment, similar to that described in Sec. 3.3. We query human preferences for the MAP-enhanced images of the same visual content (shown in Fig. 4), from which we find $69\%$ of the subjects favor the enhanced results produced by LIQE$^{\mathrm{rect}}$. Moreover, we follow the MAD competition methodology [72] to select $20$ images that best differentiate between LIQE$^{\mathrm{rect}}$ and LIQE from a pool of $200$ candidate images. We employ DISTS [13] to quantify the perceptual similarity in MAD. The subjective testing on the MAD-selected images reveals that $71\%$ of the participants prefer the results produced by LIQE$^{\mathrm{rect}}$. These findings verify the efficacy of MAP-enhanced images in rectifying NR-IQA models. Figs. A1 and A2 show some visual examples, where we find that images generated by LIQE$^{\mathrm{rect}}$ are less structurally distorted and more semantically consistent, leading to more natural visual appearances.

0.A.2 More Qualitative Results

We show more MAP-enhanced images corresponding to different NR-IQA models in Figs. A3–A10. Visual inspection of these images yields some additional interesting observations.

• Given blurry initial images (see Figs. A3, A4, A7, and A8), NIQE [42] often guides the optimization process to deblur excessively, resulting in the emergence of unnatural textures, particularly in the background.

• While PaQ-2-PiQ [78] performs well in structure restoration, it falls short in producing natural color appearances (see Figs. A3, A7, and A9), thereby compromising perceptual quality.

• Although LIQE [85] excels in distortion removal, it tends to synthesize cartoonish images (see Fig. A7), suggesting room for further improvement.

• All NR-IQA models fail to enhance Fig. A5 (a) in terms of semantic correctness, introducing textual artifacts in the bottom-left area. This not only highlights the semantic understanding deficiency of existing NR-IQA models but also aligns with the hallucination issue inherent in diffusion models.

• For initial images with moderate perceptual quality, such as Fig. A6 (a), the competing NR-IQA models generate comparable visual results.