
License: CC BY 4.0
arXiv:2403.06406v1 [cs.CV] 11 Mar 2024

1 MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University. Email: {zwx8981, zhaiguangtao, xkyang}@sjtu.edu.cn
2 Peng Cheng Laboratory, Shenzhen, China. Email: [email protected]
3 Department of Computer Science, City University of Hong Kong. Email: [email protected]

Comparison of No-Reference Image Quality Models via MAP Estimation in Diffusion Latents

Weixia Zhang 1, Dingquan Li 2, Guangtao Zhai 1, Xiaokang Yang 1, Kede Ma 3
Abstract

Contemporary no-reference image quality assessment (NR-IQA) models can effectively quantify the perceived image quality, with high correlations between model predictions and human perceptual scores on fixed test sets. However, little progress has been made in comparing NR-IQA models from a perceptual optimization perspective. Here, for the first time, we demonstrate that NR-IQA models can be plugged into the maximum a posteriori (MAP) estimation framework for image enhancement. This is achieved by taking the gradients in differentiable and bijective diffusion latents rather than in the raw pixel domain. Different NR-IQA models are likely to induce different enhanced images, which are ultimately subject to psychophysical testing. This leads to a new computational method for comparing NR-IQA models within the analysis-by-synthesis framework. Compared to conventional correlation-based metrics, our method provides complementary insights into the relative strengths and weaknesses of the competing NR-IQA models in the context of perceptual optimization.

Keywords:
No-reference image quality assessment · Model comparison · Diffusion models

1 Introduction

Image quality assessment (IQA) models have been extensively studied [70] as surrogates of human subjects for performance evaluation of image processing and computer vision systems. Depending on the accessibility of pristine-quality reference images, IQA models can be broadly categorized into two types: full-reference IQA (FR-IQA) [71, 79] and no-reference IQA (NR-IQA) [41, 2]. FR-IQA models are well-suited for scenarios where undistorted references are available. However, in many real-world applications (e.g., low-light image enhancement and image-to-image translation), specifying desired reference outputs is an inherently challenging task [6], if not impossible. This highlights the necessity for accurate NR-IQA models to assess the perceived image quality without reference to the pristine-quality counterparts.

Apart from performance evaluation, IQA models hold great promise for perceptual optimization of image processing systems. FR-IQA models have been serving the purpose for a long time, from the age of measuring error visibility and structural similarity [71] to the data-driven era [79, 13]. Recently, an extensive comparison of FR-IQA models for optimization of various image processing algorithms has been conducted [12], turning the perceptual optimization process into an analysis-by-synthesis tool [23]. A natural question then arises:

Can we compare NR-IQA models in terms of their performance in perceptual optimization as well?

A straightforward computational effort to answer the above question is directly maximizing an NR-IQA model, $q_{\bm{w}}(\cdot):\mathbb{R}^{N}\mapsto\mathbb{R}$, parameterized by a vector $\bm{w}$, with respect to the input image $\bm{x}$:

$$\bm{x}^{\star}=\operatorname*{arg\,max}_{\bm{x}}\, q_{\bm{w}}(\bm{x}), \tag{1}$$

where a larger $q_{\bm{w}}(\cdot)$ indicates higher predicted quality, and $\bm{x}^{\star}\in\mathbb{R}^{N}$ is the optimized image. As observed in [81] and shown in Fig. 1 (b), even if $q_{\bm{w}}$ is instantiated by a state-of-the-art NR-IQA model (LIQE [85] in this case), the optimized image exhibits local texture and color distortions, with degraded quality compared to the initial image.

Figure 1: (a) Distorted image as the initial point. (b) Optimized image by directly maximizing a state-of-the-art NR-IQA model - LIQE [85] (see Eq. (1)). (c) Optimized image via MAP estimation (see Eq. (2)), where the likelihood and prior terms are implemented by the mean squared error (MSE) [71] and LIQE [85], respectively. (d) Optimized image via MAP estimation in diffusion latents (see Eq. (16)). The predicted quality score (with a maximum value of five by LIQE) is shown below each image; the larger, the better.

A natural extension is to interpret $q_{\bm{w}}(\cdot)$ as a natural image prior, and plug it into the maximum a posteriori (MAP) estimation framework, leading to the following optimization problem:

$$\bm{x}^{\star}=\operatorname*{arg\,min}_{\bm{x}}\, E(\bm{x}|\bm{x}^{\mathrm{init}})=\operatorname*{arg\,min}_{\bm{x}}\, D(\bm{x},\bm{x}^{\mathrm{init}})-\lambda q_{\bm{w}}(\bm{x}), \tag{2}$$

where we express the posterior probability $p(\bm{x}|\bm{x}^{\mathrm{init}})\propto\exp(-E(\bm{x}|\bm{x}^{\mathrm{init}}))$ by its associated Gibbs energy. $D(\cdot,\cdot):\mathbb{R}^{N}\times\mathbb{R}^{N}\mapsto\mathbb{R}$ is the fidelity term (also known as the negative log-likelihood term), which can be implemented by a distance metric. $\lambda$ is a parameter that trades off the likelihood and prior terms. As shown in Fig. 1 (c), even when $\lambda$ is optimally set through careful visual inspection, the MAP-optimized image is still no better than the initial image; it is in effect a visually imperceptible adversarial example of $q_{\bm{w}}(\cdot)$ [81].
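To make the structure of Eq. (2) concrete, the following is a minimal sketch of pixel-domain MAP estimation by plain gradient descent. It rests on loud assumptions: `toy_quality` is a hypothetical quadratic stand-in for $q_{\bm{w}}(\cdot)$ (it is not LIQE or any real NR-IQA model), the fidelity term $D$ is taken to be the MSE, and the step size and iteration count are illustrative. Because the toy prior is benign, the sketch shows only the trade-off controlled by $\lambda$, not the adversarial behavior reported in Fig. 1 (c).

```python
import numpy as np

def toy_quality(x, target):
    """Hypothetical stand-in for q_w(x): larger when x is closer to `target`."""
    return -np.mean((x - target) ** 2)

def toy_quality_grad(x, target):
    """Analytic gradient of toy_quality with respect to x."""
    return -2.0 * (x - target) / x.size

def mse(x, x_init):
    """Fidelity term D(x, x_init), assumed here to be the mean squared error."""
    return np.mean((x - x_init) ** 2)

def mse_grad(x, x_init):
    """Analytic gradient of the MSE with respect to x."""
    return 2.0 * (x - x_init) / x.size

def energy(x, x_init, target, lam):
    """Gibbs energy E(x | x_init) = D(x, x_init) - lam * q(x), as in Eq. (2)."""
    return mse(x, x_init) - lam * toy_quality(x, target)

def map_pixel_domain(x_init, target, lam, lr=0.1, steps=200):
    """Minimize the energy of Eq. (2) by gradient descent in the pixel domain."""
    x = x_init.copy()
    for _ in range(steps):
        grad = mse_grad(x, x_init) - lam * toy_quality_grad(x, target)
        # Multiply by x.size so the effective step does not shrink with image size.
        x = x - lr * x.size * grad
    return x

rng = np.random.default_rng(0)
x_init = rng.uniform(0.0, 1.0, size=(8, 8))   # toy "distorted" image
target = np.full((8, 8), 0.5)                 # toy notion of a high-quality image
x_star = map_pixel_domain(x_init, target, lam=0.1)
```

For this quadratic toy problem the minimizer has the closed form $(\bm{x}^{\mathrm{init}}+\lambda\,\mathrm{target})/(1+\lambda)$, so a small $\lambda$ keeps the result close to the initial image and a large $\lambda$ pulls it toward the prior. Dropping the fidelity term recovers the direct maximization of Eq. (1).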

Here, for the first time, we successfully demonstrate the perceptual optimization capability of contemporary NR-IQA models in the MAP estimation framework (see Fig. 1 (d)). Taking inspiration from the impressive progress of diffusion models [55, 25, 57, 56, 77] in image synthesis and beyond [11, 58], we supplement an NR-IQA method with a differentiable and bijective diffusion model. This empowers the NR-IQA model with the capability of modeling highly complex distributions of natural images by operating in the diffusion latent space¹. Specifically, we work with the method of exact diffusion inversion via coupled transformations (EDICT) [66], which turns any pre-trained diffusion model into a bijection between the input image and the latent noise vector through affine coupling [14, 15].

¹ We distinguish the two terms “latent diffusion” and “diffusion latents.” The former means that the diffusion process takes place in the latent space rather than in the raw pixel domain. The latter treats the diffusion process as a feature transform and the corresponding Gaussian noise vector as the latent representation.

The resulting paradigm, MAP estimation in diffusion latents, gives us an opportunity to compare NR-IQA models within the analysis-by-synthesis framework. Specifically, we plug eight NR-IQA models into this paradigm to enhance a set of photographic images with realistic camera distortions [19, 27, 16]. Psychophysical testing on the optimized images of identical visual content reveals the relative performance of the competing NR-IQA models in image enhancement, and provides complementary insights into future NR-IQA model development beyond fixed-set evaluation.

In summary, this paper presents two main contributions.

  • We propose the method of MAP estimation in diffusion latents, which enables “imperfect” NR-IQA models to act as natural image priors for perceptual optimization.

  • We apply the proposed method to compare eight contemporary NR-IQA models with a number of interesting observations.

2 Related Work

In this section, we first review the progress of NR-IQA. We then discuss model comparison methods from an analysis-by-synthesis perspective [44, 23]. Finally, we provide an overview of diffusion models and their applications to various computer vision tasks.

2.1 NR-IQA Models

In the early days, NR-IQA relied heavily on handcrafted natural scene statistics (NSS) [41, 42, 43, 20], followed by quality regression. In the deep learning era, end-to-end optimized deep neural networks (DNNs) for NR-IQA have been developed, under the constraint that human perceptual data, represented as mean opinion scores (MOSs), are scarce. Typical data-efficient training strategies include patch-wise training [2], quality-aware pre-training [37, 33, 82], learning from pseudo-labels [38, 75], meta learning [88], contrastive learning [39, 86, 53], and multitask learning [16, 59, 85]. Emerging NR-IQA research directions involve predicting local quality [78], combating subpopulation shift [83, 80, 84], exploring more powerful and unified computational [30], and leveraging multi-modal information [68, 74].

2.2 Model Comparison Methods

Model comparison plays a pivotal role in model development [4]. Conventional model comparison involves preparing a fixed test set of ground-truth measurements and comparing computational models according to their goodness of fit. Instead of proving a model to be correct in such a discriminative way, we may as well test it in a generative way, known as analysis-by-synthesis in Grenander’s Pattern Theory [44, 23]. Representative generative model comparison methods include the maximum differentiation (MAD) competition [72], the eigen-distortion analysis [1], and the controversial stimuli synthesis [21]. While the core idea remains consistent, the synthesis process can be discretized into sample selection from a large-scale unlabeled set, making it more applicable to various vision tasks, including IQA [36], image recognition [67], semantic segmentation [76], and image enhancement [6]. Our work is closely related to the perceptual attack of NR-IQA models [81], as both can be cast into MAP estimation, but they differ substantially in design goals and implementation details. The perceptual attack assesses the adversarial robustness of NR-IQA models to visually imperceptible perturbations in the pixel space, while our method operates in the diffusion latent space, and compares NR-IQA models in terms of their ability to enhance the perceived quality of photographic images.

2.3 Diffusion Models in Vision

Diffusion models represent a family of generative models designed to reverse a process that progressively perturbs the training data over multiple steps [9]. Three variants of diffusion models are identified in the literature: 1) the denoising diffusion probabilistic models [55, 25] inspired by the non-equilibrium thermodynamics theory [48], 2) noise conditioned score networks [57] trained through score matching [29] to estimate the gradient of the log noisy data density, and 3) stochastic differential equations [58], unifying the former two under a more general formulation. Diffusion models have offered remarkable results in a wide range of image synthesis tasks [11, 45, 40]. For finer control over generated results, classifier guidance [11] and classifier-free guidance [26] have been proposed at their respective costs of training noise-aware classifiers and class-conditional diffusion models. As for photographic image editing [24], diffusion inversion (i.e., identifying the noise vector that generates the photographic image to be edited through reverse diffusion) is necessary. EDICT [66] describes an elegant solution for converting any pre-trained diffusion model into a differentiable bijection through affine coupling [14, 15], achieving exact diffusion inversion without the overhead of additional training. In this paper, we will supplement NR-IQA models with EDICT, leading to end-to-end MAP estimation of the diffusion latents for image enhancement.

3 Proposed Method

In this section, we first present the necessary preliminaries of diffusion models, in particular the EDICT model. We then introduce the proposed method, MAP estimation in diffusion latents. Finally, we supplement the details of the NR-IQA model comparison procedure using the proposed method.

3.1 Preliminaries

For a photographic image $\bm{x}\in\mathbb{R}^{L}$ subject to potential realistic camera distortion(s), the goal of an NR-IQA model $q_{\bm{w}}(\cdot)$ is to quantify the perceived quality of $\bm{x}$ to closely approximate the underlying MOS $q(\bm{x})\in\mathbb{R}$.

We also work with a diffusion model [77]. For the forward diffusion process, a set of timesteps, $\{\alpha_{t}\}_{t=0}^{T}$ with $\alpha_{0}=1$ and $\alpha_{T}=0$, index a monotonically increasing noise schedule, according to which $\bm{x}_{0}$ is perturbed at the $t$-th step by $\bm{x}_{t}=\sqrt{\alpha_{t}}\bm{x}_{0}+\sqrt{1-\alpha_{t}}\bm{\epsilon}$, where $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\mathbf{I})$. For the reverse diffusion process (i.e., the data generation process), the diffusion model trains a denoising network $\bm{\epsilon}_{\bm{\theta}}(\cdot;t):\mathbb{R}^{N}\mapsto\mathbb{R}^{N}$, parameterized by a vector $\bm{\theta}$ and a time index $t$, to progressively remove the noise starting from the pure Gaussian noise vector [56]:

$$\begin{aligned}
\hat{\bm{x}}_{t-1} &= \sqrt{\alpha_{t-1}}\,\frac{\hat{\bm{x}}_{t}-\sqrt{1-\alpha_{t}}\,\bm{\epsilon}_{\bm{\theta}}(\hat{\bm{x}}_{t};t)}{\sqrt{\alpha_{t}}}+\sqrt{1-\alpha_{t-1}}\,\bm{\epsilon}_{\bm{\theta}}(\hat{\bm{x}}_{t};t)\\
&= a_{t}\hat{\bm{x}}_{t}+b_{t}\bm{\epsilon}_{\bm{\theta}}(\hat{\bm{x}}_{t};t),
\end{aligned} \tag{3}$$

where $a_{t}=\sqrt{\frac{\alpha_{t-1}}{\alpha_{t}}}$, $b_{t}=-\sqrt{\frac{\alpha_{t-1}(1-\alpha_{t})}{\alpha_{t}}}+\sqrt{1-\alpha_{t-1}}$, and $\hat{\bm{x}}_{T}=\bm{\epsilon}$. Key to the success of the diffusion model is the denoising network $\bm{\epsilon}_{\bm{\theta}}(\cdot;t)$, which can be trained by minimizing

$$\mathbb{E}_{\bm{x}_{0},\,\bm{\epsilon}\sim\mathcal{N}(\bm{0},\mathbf{I}),\,t}\left[\lambda(t)\lVert\bm{\epsilon}-\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t},t)\rVert_{2}^{2}\right], \tag{4}$$

where $\lambda(t)$ is the $t$-th positive weighting, and $\bm{x}_{t}$ can be computed from $\bm{x}_{0}$. The above denoising step is not exactly invertible because the linearization assumption $\bm{\epsilon}_{\bm{\theta}}(\hat{\bm{x}}_{t},t)=\bm{\epsilon}_{\bm{\theta}}(\hat{\bm{x}}_{t-1},t)$ does not strictly hold. Noticing the resemblance between the denoising step in diffusion models and the affine coupling operation in normalizing flows [14, 15], Wallace et al. [66] made Eq. (3) invertible by maintaining two coupled noise vectors (and intermediate states):

$$\bm{x}_{t-1}=a_{t}\bm{x}_{t}+b_{t}\bm{\epsilon}_{\bm{\theta}}(\bm{y}_{t},t), \tag{5}$$
$$\bm{y}_{t-1}=a_{t}\bm{y}_{t}+b_{t}\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t-1},t),\quad t\in\{1,\ldots,T\}, \tag{6}$$

where $\bm{x}_{T}=\bm{y}_{T}=\bm{\epsilon}$. Here we strip off the “hat” symbol in the math notations to emphasize the bijectivity of Eqs. (5) and (6) in comparison to Eq. (3). To improve the stability and faithfulness to the original diffusion dynamics, a mixing operation is added, leading to the following denoising process:

$$\bm{x}_{t}^{\prime}=a_{t}\bm{x}_{t}+b_{t}\bm{\epsilon}_{\bm{\theta}}(\bm{y}_{t},t), \tag{7}$$
$$\bm{y}_{t}^{\prime}=a_{t}\bm{y}_{t}+b_{t}\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t}^{\prime},t), \tag{8}$$
$$\bm{x}_{t-1}=p\,\bm{x}_{t}^{\prime}+(1-p)\,\bm{y}_{t}^{\prime}, \tag{9}$$
$$\bm{y}_{t-1}=p\,\bm{y}_{t}^{\prime}+(1-p)\,\bm{x}_{t-1},\quad t\in\{1,\ldots,T\}, \tag{10}$$

where $p\in[0,1]$ is a mixing parameter to alleviate the violation of the linearization assumption.
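The coefficients $a_{t}$ and $b_{t}$ are shared by Eq. (3) and by the coupled updates in Eqs. (5) to (10), so it can help to check them numerically. The sketch below computes them from a noise schedule and compares the long form of Eq. (3) with its affine form. The schedule, the vector size, and the random stand-in for the network output are assumptions made for illustration; in particular, the toy schedule ends at a small positive value instead of exactly $\alpha_{T}=0$ so that $a_{T}$ stays finite.

```python
import numpy as np

def ddim_coefficients(alpha_t, alpha_prev):
    """Return (a_t, b_t) as defined below Eq. (3); alpha_prev denotes alpha_{t-1}."""
    a_t = np.sqrt(alpha_prev / alpha_t)
    b_t = -np.sqrt(alpha_prev * (1.0 - alpha_t) / alpha_t) + np.sqrt(1.0 - alpha_prev)
    return a_t, b_t

def ddim_step_long(x_t, eps, alpha_t, alpha_prev):
    """First line of Eq. (3): predict x_0, then re-noise it to level t-1."""
    x0_pred = (x_t - np.sqrt(1.0 - alpha_t) * eps) / np.sqrt(alpha_t)
    return np.sqrt(alpha_prev) * x0_pred + np.sqrt(1.0 - alpha_prev) * eps

def ddim_step_affine(x_t, eps, alpha_t, alpha_prev):
    """Second line of Eq. (3): the same update written as a_t * x_t + b_t * eps."""
    a_t, b_t = ddim_coefficients(alpha_t, alpha_prev)
    return a_t * x_t + b_t * eps

rng = np.random.default_rng(0)
alphas = np.linspace(1.0, 0.1, 11)   # toy schedule: alpha_0 = 1, alpha_T = 0.1, T = 10
x_t = rng.normal(size=16)            # toy noisy state
eps = rng.normal(size=16)            # random stand-in for the network output
t = 7
long_form = ddim_step_long(x_t, eps, alphas[t], alphas[t - 1])
affine_form = ddim_step_affine(x_t, eps, alphas[t], alphas[t - 1])
```

The two forms agree up to floating-point error, which is the algebraic identity stated in Eq. (3). The affine form is the one that makes the resemblance to affine coupling visible.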

Eqs. (7) to (10) can be linearly inverted, yielding a deterministic noising process (i.e., feature transform):

$$\bm{y}_{t}^{\prime}=(\bm{y}_{t-1}-(1-p)\,\bm{x}_{t-1})/p, \tag{11}$$
$$\bm{x}_{t}^{\prime}=(\bm{x}_{t-1}-(1-p)\,\bm{y}_{t}^{\prime})/p, \tag{12}$$
$$\bm{y}_{t}=(\bm{y}_{t}^{\prime}-b_{t}\bm{\epsilon}_{\bm{\theta}}(\bm{x}_{t}^{\prime},t))/a_{t}, \tag{13}$$
$$\bm{x}_{t}=(\bm{x}_{t}^{\prime}-b_{t}\bm{\epsilon}_{\bm{\theta}}(\bm{y}_{t},t))/a_{t},\quad t\in\{1,\ldots,T\}. \tag{14}$$

We collectively denote the differentiable and bijective feature transform from the coupled input images, $(\bm{x}_{0},\bm{y}_{0})$ where $\bm{y}_{0}=\bm{x}_{0}$, to the noise representation $(\bm{x}_{T},\bm{y}_{T})$ as $\bm{f}(\cdot):\mathbb{R}^{2L}\mapsto\mathbb{R}^{2L}$, and denote its inverse $\bm{f}^{-1}$ as $\bm{h}(\cdot):\mathbb{R}^{2L}\mapsto\mathbb{R}^{2L}$.
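The exactness of the inversion can be illustrated with a small round-trip check. The sketch below implements the noising transform $\bm{f}$ (Eqs. (11) to (14)) and the denoising transform $\bm{h}$ (Eqs. (7) to (10)) on flat vectors. Everything outside those equations is an assumption for illustration: `make_toy_eps` builds a small random nonlinear map standing in for the pre-trained denoising network, and the schedule, the number of steps, and the value of $p$ are arbitrary choices, not the settings used in the paper.

```python
import numpy as np

def make_toy_eps(dim, seed=0):
    """Hypothetical stand-in for eps_theta(., t): a fixed random nonlinear map."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.3, size=(dim, dim))
    def eps(z, t):
        return np.tanh(z @ W + 0.01 * t)
    return eps

def coefficients(alphas, t):
    """a_t and b_t as defined below Eq. (3)."""
    a_t = np.sqrt(alphas[t - 1] / alphas[t])
    b_t = (-np.sqrt(alphas[t - 1] * (1.0 - alphas[t]) / alphas[t])
           + np.sqrt(1.0 - alphas[t - 1]))
    return a_t, b_t

def h_denoising(x_T, y_T, eps, alphas, p):
    """h: map coupled latents to coupled images via Eqs. (7)-(10), t = T, ..., 1."""
    x, y = x_T, y_T
    for t in range(len(alphas) - 1, 0, -1):
        a, b = coefficients(alphas, t)
        x_mid = a * x + b * eps(y, t)          # Eq. (7)
        y_mid = a * y + b * eps(x_mid, t)      # Eq. (8)
        x = p * x_mid + (1.0 - p) * y_mid      # Eq. (9)
        y = p * y_mid + (1.0 - p) * x          # Eq. (10), uses the updated x_{t-1}
    return x, y

def f_noising(x_0, y_0, eps, alphas, p):
    """f: map coupled images to coupled latents via Eqs. (11)-(14), t = 1, ..., T."""
    x, y = x_0, y_0
    for t in range(1, len(alphas)):
        a, b = coefficients(alphas, t)
        y_mid = (y - (1.0 - p) * x) / p        # Eq. (11)
        x_mid = (x - (1.0 - p) * y_mid) / p    # Eq. (12)
        y = (y_mid - b * eps(x_mid, t)) / a    # Eq. (13)
        x = (x_mid - b * eps(y, t)) / a        # Eq. (14), uses the updated y_t
    return x, y

dim = 16
alphas = np.linspace(1.0, 0.1, 11)             # toy schedule with T = 10
eps = make_toy_eps(dim)
x_0 = np.random.default_rng(1).normal(size=dim)

x_T, y_T = f_noising(x_0, x_0.copy(), eps, alphas, p=0.93)
x_rec, y_rec = h_denoising(x_T, y_T, eps, alphas, p=0.93)
```

Each inverse step evaluates the network at exactly the arguments used by the corresponding forward step, which is why the round trip recovers the input up to floating-point error without any assumption on the network itself. Note that every noising step divides by $p$ twice, so values of $p$ far below one would amplify numerical error over many steps.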

Figure 2: (a) Initial image with slight over-exposure degradation. (b) MAP-enhanced image of (a) by Algorithm 1, in which the NR-IQA model LIQE [85] is adopted with $\lambda=0.1$. (c) MAP-enhanced image with LIQE and $\lambda=1.0$, whose semantics undesirably deviate from (a).

3.2 MAP Estimation in Diffusion Latents

In Bayesian statistics, the MAP estimate of an unknown quantity, as a point estimate, is the mode of its posterior distribution. In the context of image enhancement, MAP estimation corresponds to

$$\begin{aligned}
\bm{x}^{\star} &= \mathop{\arg\max}_{\bm{x}}\, p(\bm{x}|\bm{x}^{\mathrm{init}})=\mathop{\arg\max}_{\bm{x}}\, p(\bm{x}^{\mathrm{init}}|\bm{x})\,p(\bm{x})\\
&= \mathop{\arg\min}_{\bm{x}}\, -\log p(\bm{x}^{\mathrm{init}}|\bm{x})-\log p(\bm{x}).
\end{aligned} \tag{15}$$

The first term measures the image fidelity with respect to the initial $\bm{x}^{\mathrm{init}}$, and is traditionally implemented by the mean squared error (MSE) to admit efficient (iterative) solvers, such as half-quadratic splitting [18]. Other FR-IQA models as distance functions are also applicable. The second term quantifies image naturalness. Typical options include the total variation regularization [52], the Gaussian scale mixture prior [49], the (non-local) patch-based priors [3], the sparsity regularization [73], the low-rank prior [5], the deep image prior [63], and the adversarial loss [22].
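As a worked illustration of why the MSE arises as the fidelity term, assume that the initial image is the desired image corrupted by additive white Gaussian noise, $\bm{x}^{\mathrm{init}}=\bm{x}+\bm{n}$ with $\bm{n}\sim\mathcal{N}(\bm{0},\sigma_n^2\mathbf{I})$ and $\bm{x}\in\mathbb{R}^{L}$. This degradation model and the symbol $\sigma_n$ are illustrative assumptions that are not stated above. The negative log-likelihood in Eq. (15) then reduces to a scaled squared error plus a constant:

```latex
\begin{aligned}
-\log p(\bm{x}^{\mathrm{init}}|\bm{x})
  &= \frac{1}{2\sigma_n^2}\|\bm{x}^{\mathrm{init}}-\bm{x}\|_2^2 + \frac{L}{2}\log(2\pi\sigma_n^2), \\
\bm{x}^{\star}
  &= \mathop{\arg\min}_{\bm{x}} \frac{1}{2\sigma_n^2}\|\bm{x}-\bm{x}^{\mathrm{init}}\|_2^2 - \log p(\bm{x}).
\end{aligned}
```

Under this assumption, the noise variance $\sigma_n^2$ sets the relative weight of the fidelity and naturalness terms, which is the role played by an explicit trade-off parameter when the prior is replaced by a quality model.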

Algorithm 1 Gradient-based Solver for MAP Estimation in Diffusion Latents

Require: An NR-IQA model, $q_{\bm{w}}(\cdot)$; an FR-IQA model as the image fidelity measure, $D(\cdot,\cdot)$; the EDICT model to map a test image into diffusion latents, $\bm{f}(\cdot)$, and its inverse, $\bm{h}(\cdot)$; the number of optimization steps, $\mathrm{MaxIter}$; the number of diffusion steps, $T$; the trade-off parameter, $\lambda$; the momentum parameter, $\rho$; and the learning rate, $\gamma$
Input: An initial image $\bm{x}^{\mathrm{init}}$
Output: A pair of optimized images $(\bm{x}^{\star},\bm{y}^{\star})$ with nearly identical content and enhanced perceptual quality

1: $\bm{m}_x=\bm{0},\ \bm{m}_y=\bm{0}$
2: $(\bm{x}_T,\bm{y}_T)\leftarrow\bm{f}(\bm{x}^{\mathrm{init}},\bm{x}^{\mathrm{init}})$
3: for $i=1\to\mathrm{MaxIter}$ do
4:     $\ell(\bm{x}_T)\leftarrow D(\bm{h}(\bm{x}_T),\bm{x}^{\mathrm{init}})-\lambda q_{\bm{w}}(\bm{h}(\bm{x}_T))$
5:     $\triangleright$ $\bm{y}_T$ is omitted in $\bm{h}(\cdot)$ for notation simplicity
6:     $\ell(\bm{y}_T)\leftarrow D(\bm{h}(\bm{y}_T),\bm{x}^{\mathrm{init}})-\lambda q_{\bm{w}}(\bm{h}(\bm{y}_T))$
7:     $\triangleright$ $\bm{x}_T$ is omitted in $\bm{h}(\cdot)$ for notation simplicity
8:     $\ell(\bm{x}_T,\bm{y}_T)\leftarrow(\ell(\bm{x}_T)+\ell(\bm{y}_T))/2$
9:     $\Delta\ell(\bm{x}_T)\leftarrow\partial\ell(\bm{x}_T,\bm{y}_T)/\partial\bm{x}_T$
10:     $\Delta\ell(\bm{y}_T)\leftarrow\partial\ell(\bm{x}_T,\bm{y}_T)/\partial\bm{y}_T$
11:     $(\bm{m}_x,\bm{m}_y)\leftarrow\rho(\bm{m}_x,\bm{m}_y)-\gamma(\Delta\ell(\bm{x}_T),\Delta\ell(\bm{y}_T))$
12:     $(\bm{x}_T,\bm{y}_T)\leftarrow(\bm{x}_T,\bm{y}_T)+(\bm{m}_x,\bm{m}_y)$
13: $(\bm{x}^{\star},\bm{y}^{\star})\leftarrow\bm{h}(\bm{x}_T,\bm{y}_T)$

As discussed in the Introduction, it is enticing to plug NR-IQA models as priors into MAP estimation, guiding the optimization to find the best-quality image in the vicinity of the initial $\bm{x}^{\mathrm{init}}$. However, existing NR-IQA models are trained in a discriminative way on finite sets of limited distortion appearances, and seem to be of little use as naturalness priors (as shown in Fig. 1). Inspired by [65], we propose to supplement existing NR-IQA models with a differentiable and bijective diffusion model $\bm{h}(\cdot)$ described in Sec. 3.1, and thus equip them with generative capabilities. This leads to MAP estimation in diffusion latents:

$$
\bm{x}_T^{\star}=\mathop{\arg\min}_{\bm{x}_T} D\left(\bm{h}(\bm{x}_T),\bm{x}^{\mathrm{init}}\right)-\lambda q_{\bm{w}}(\bm{h}(\bm{x}_T)), \tag{16}
$$

where we have omitted the coupled noise vector $\bm{y}_T$ for notation simplicity. Eq. (16) can be directly solved by gradient-based optimizers, as summarized in Algorithm 1. Empirically, the optimized pair of images $(\bm{x}^{\star},\bm{y}^{\star})$ are nearly identical in content and appearance, consistent with the observation in [65].
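The control flow of Algorithm 1 can be sketched in a few lines of plain Python. This is a toy illustration only: `f` and `h` are an elementwise affine map and its exact inverse standing in for EDICT, `quality` is a hypothetical differentiable score standing in for an NR-IQA model, $D$ is the pixel-domain MSE, and the gradient is written by hand instead of being obtained by backpropagation through the diffusion process. Because the toy `h` treats the two latents independently, the coupled pair stays identical, whereas the real EDICT transform mixes them.

```python
def f(x):
    """Toy stand-in for the image-to-latent transform (elementwise affine)."""
    return [(v - 1.0) / 2.0 for v in x]

def h(z):
    """Exact inverse of f, standing in for the latent-to-image transform."""
    return [2.0 * v + 1.0 for v in z]

def quality(x):
    """Hypothetical differentiable quality score: highest when every pixel is 0.5."""
    return -sum((v - 0.5) ** 2 for v in x) / len(x)

def loss_grad(z, x_init, lam):
    """Gradient of MSE(h(z), x_init) - lam * quality(h(z)) with respect to z."""
    n = len(z)
    x = h(z)
    # d/dx of the MSE term plus d/dx of (-lam * quality), times dh/dz = 2.
    return [2.0 * (2.0 * (a - b) + lam * 2.0 * (a - 0.5)) / n
            for a, b in zip(x, x_init)]

def map_enhance(x_init, lam=0.1, max_iter=15, gamma=1.0, rho=0.9):
    m_x, m_y = [0.0] * len(x_init), [0.0] * len(x_init)   # step 1
    x_t, y_t = f(x_init), f(x_init)                       # step 2
    for _ in range(max_iter):                             # step 3
        # Steps 4-10: averaging the two losses halves each gradient.
        g_x = [0.5 * g for g in loss_grad(x_t, x_init, lam)]
        g_y = [0.5 * g for g in loss_grad(y_t, x_init, lam)]
        # Step 11: momentum update. Step 12: latent update.
        m_x = [rho * m - gamma * g for m, g in zip(m_x, g_x)]
        m_y = [rho * m - gamma * g for m, g in zip(m_y, g_y)]
        x_t = [v + m for v, m in zip(x_t, m_x)]
        y_t = [v + m for v, m in zip(y_t, m_y)]
    return h(x_t), h(y_t)                                 # step 13

x_init = [0.9, 0.8, 1.0, 0.7]
x_star, y_star = map_enhance(x_init, lam=0.1, max_iter=300)
```

For this quadratic toy the minimizer is available in closed form, $(x^{\mathrm{init}}+0.5\lambda)/(1+\lambda)$ per pixel, which gives a convenient check on the loop. The step size that keeps the momentum iteration stable depends on the toy's dimension and on $\lambda$, so it may need retuning for other inputs.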

As $\bm{h}(\cdot)$ is bijective, it is convenient to instantiate $D(\cdot,\cdot)$ as the MSE in the diffusion latents, i.e., $\|\bm{x}_T-\bm{x}_T^{\mathrm{init}}\|_2^2/L$. We have tried other FR-IQA models such as the structural similarity (SSIM) index [71] in the raw pixel domain, the deep image structure and texture similarity (DISTS) metric [13] in the VGG feature domain, and the MSE in the CLIP feature domain [50], and find that the optimization is fairly robust to the selection of FR-IQA models. In Fig. 2, we emphasize the important role of the image fidelity term in maintaining the semantic faithfulness of the optimized image $\bm{x}^{\star}$ to the initial $\bm{x}^{\mathrm{init}}$, in contrast to the direct diffusion latent optimization [65] (i.e., $\operatorname*{arg\,max}_{\bm{x}_T}q_{\bm{w}}(\bm{h}(\bm{x}_T))$ by setting $\lambda\rightarrow\infty$).
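To make the interplay of the two terms concrete, the gradient that drives the optimization can be written out for the latent-MSE choice of $D$. This is a sketch that ignores the coupled vector $\bm{y}_T$ and introduces $\mathbf{J}_{\bm{h}}$, the Jacobian of $\bm{h}$, as notation that is not used elsewhere in the paper:

```latex
\nabla_{\bm{x}_T}\left[\frac{1}{L}\|\bm{x}_T-\bm{x}_T^{\mathrm{init}}\|_2^2
  -\lambda q_{\bm{w}}(\bm{h}(\bm{x}_T))\right]
= \frac{2}{L}\left(\bm{x}_T-\bm{x}_T^{\mathrm{init}}\right)
  -\lambda\,\mathbf{J}_{\bm{h}}(\bm{x}_T)^{\top}\,
  \nabla_{\bm{x}} q_{\bm{w}}(\bm{x})\big|_{\bm{x}=\bm{h}(\bm{x}_T)}
```

At initialization $\bm{x}_T=\bm{x}_T^{\mathrm{init}}$, so the fidelity gradient vanishes and the first update is driven entirely by the quality term. The fidelity term only pulls the latent back once it has moved away from its starting point, and its relative pull weakens as $\lambda$ grows.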

Figure 3: (a) Realistically distorted image. (b) to (i) MAP-enhanced images corresponding to different NR-IQA models. Zoom in for improved visibility.

3.3 Model Comparison Procedure

By plugging different NR-IQA models into the paradigm of MAP estimation in diffusion latents (in Eq. (16)), we are likely to obtain different enhanced versions of the same test image in terms of color and detail reproduction. Fig. 3 shows a visual demonstration, where we adopt eight NR-IQA models to enhance the same image with realistic out-of-focus blur. Visual inspection of the eight images reveals their relative capabilities in guiding image enhancement.

We formalize the idea of NR-IQA model comparison using MAP estimation in diffusion latents. We first gather a set of $M$ photographic images with realistic camera distortions, $\mathcal{U}=\{\bm{x}^{(i)}\}_{i=1}^{M}$, and employ $N$ NR-IQA models $\{q^{(j)}_{\bm{w}}\}_{j=1}^{N}$. Due to the high dimensionality, nonlinearity, and nonconvexity of the objectives defined in Eq. (16), the optimal $\lambda$ may vary across test images and across NR-IQA models. To cope with this subtlety, we try $K$ different values of $\lambda$ for each combination of test image and NR-IQA model. This gives rise to $K$ candidate enhancement results, with the best one identified by three human subjects using the staircase method in psychophysics [8].

Consequently, from a total of $M\times N$ enhanced images, we form $M\times\binom{N}{2}$ image pairs, which are subject to formal psychophysical testing using two-alternative forced choice (2AFC) to differentiate fine-grained quality variations. In each trial, subjects are presented with a pair of enhanced images of the same content, corresponding to two different NR-IQA models. They are given unlimited viewing time to select the image with higher perceived quality. More details regarding the psychophysical experiment are given in Sec. 4.1.
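The pair construction above can be sketched as follows, using the values reported in Sec. 4.1 (20 test images and eight NR-IQA models). The function name and the tuple layout are hypothetical and only serve to illustrate the count $M\times\binom{N}{2}$.

```python
from itertools import combinations
from math import comb

MODELS = ["NIQE", "DBCNN", "HyperIQA", "PaQ-2-PiQ",
          "UNIQUE", "MUSIQ", "CLIPIQA+", "LIQE"]

def build_2afc_pairs(num_images, models):
    """Return (image_index, model_a, model_b) for every unordered model pair."""
    return [(i, a, b)
            for i in range(num_images)
            for a, b in combinations(models, 2)]

pairs = build_2afc_pairs(20, MODELS)
print(len(pairs), 20 * comb(len(MODELS), 2))  # prints "560 560"
```

Each subject therefore sees 28 pairs per image content, one for every unordered pair of models.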

After subjective testing, we arrange the raw subjective data into an $M\times N\times N$ tensor $\mathbf{C}$, where $C_{ijk}$ records the empirical probability that $q_{\bm{w}}^{(j)}$ is preferred over $q_{\bm{w}}^{(k)}$ for the $i$-th test image $\bm{x}^{(i)}$. Under the Thurstone Case V assumption [61], the underlying quality score $Q^{\star}_{ij}$ of the $i$-th enhanced image associated with the $j$-th NR-IQA model can be inferred using maximum likelihood estimation [46]:

$$
\mathbf{Q}^{\star}_{i\bullet}=\operatorname*{arg\,max}_{\mathbf{Q}_{i\bullet}}\sum_{j,k}C_{ijk}\Phi\left(\frac{Q_{ij}-Q_{ik}}{\sqrt{2}\sigma}\right),\quad\textrm{s.t.}\ \sum_{j=1}^{N}Q_{ij}=0, \tag{17}
$$

where we add a linear constraint to remove the ambiguity in the optimization objective, which depends only on score differences. $\mathbf{Q}^{\star}_{i\bullet}$ denotes the $i$-th row of the quality matrix $\mathbf{Q}^{\star}$ and $\Phi(\cdot)$ is the standard Normal cumulative distribution function. $\sigma$ is the standard deviation of the observer model, which is set to $1.0484$ so that a score difference of one unit corresponds to $75\%$ of subjects voting for one image over the other, i.e., one unit on the just-objectionable-difference scale [47]. Last, we average $\mathbf{Q}^{\star}$ over its rows (i.e., over the $M$ test images), which yields the global ranking of the $N$ NR-IQA models under MAP estimation in diffusion latents.

4 Experiments

In this section, we first describe the selection of the test images to be enhanced and the competing NR-IQA models, followed by other implementation details. We proceed to show and analyze the model comparison results. Additionally, we qualitatively analyze the failure cases of different NR-IQA models identified by MAP estimation in diffusion latents.

4.1 Experimental Setups

Image Selection. We select $20$ photographic images with visible realistic camera distortions from three IQA datasets: LIVE Challenge [19], KonIQ-10k [27], and SPAQ [16]. We use bicubic interpolation and center cropping to resize all images to $512\times 512$ pixels (see Fig. 4). Care is taken to exclude images with extremely low quality or with no main subjects in the foreground. This is because the enhanced images in those cases are less quality-discriminable due to the challenge of the enhancement task or limited semantics to be enhanced.

Model Selection. We select eight state-of-the-art NR-IQA models with different design philosophies: NIQE [42], DBCNN [82], HyperIQA [60], PaQ-2-PiQ [78], UNIQUE [83], MUSIQ [30], CLIPIQA+ [68], and LIQE [85], among which NIQE is a knowledge-driven training-free method, while the rest are data-driven. All implementations are obtained from the respective authors, and are tested with the default settings.

Figure 4: Test photographic images to be enhanced under MAP estimation in diffusion latents.
Figure 5: Subjective ranking of NR-IQA models using MAP estimation in diffusion latents. Below each model is the averaged quality score over $20$ test images (larger is better). Models within the same colored box have statistically indistinguishable performance.

Optimization Details. Following the suggestions by the Video Quality Experts Group [64], we adopt a four-parameter logistic function to compensate for the prediction nonlinearity of different NR-IQA models:

$$
q_{\bm{\xi}}\circ q_{\bm{w}}(\bm{x})=\frac{\xi_1-\xi_2}{1+\exp\left(-\frac{q_{\bm{w}}(\bm{x})-\xi_3}{|\xi_4|}\right)}+\xi_2, \tag{18}
$$

where $\circ$ indicates function composition. We manually enforce $\xi_1=100$ and $\xi_2=0$, which determine the maximum and minimum mapping values, respectively, and learn $\xi_3$ and $\xi_4$ from data. $q_{\bm{\xi}}(\cdot)$ can be seen as part of the NR-IQA model, making the selection of the trade-off parameter $\lambda$ under MAP estimation in diffusion latents much easier.
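The mapping in Eq. (18) can be sketched as a small function. The values passed for `xi3` and `xi4` below are hypothetical placeholders, since the fitted values and the fitting procedure are not given in this section.

```python
import math

def logistic_map(q, xi3, xi4, xi1=100.0, xi2=0.0):
    """Four-parameter logistic of Eq. (18).

    xi1 and xi2 are the upper and lower asymptotes (fixed to 100 and 0 in the
    text); xi3 is the midpoint and |xi4| controls the slope.
    """
    z = (q - xi3) / abs(xi4)
    # Note: math.exp(-z) can overflow for very negative z; acceptable for a sketch.
    return (xi1 - xi2) / (1.0 + math.exp(-z)) + xi2

# A raw score at the midpoint xi3 maps to the middle of the [0, 100] range.
print(logistic_map(0.3, xi3=0.3, xi4=0.1))  # prints 50.0
```

Because every model's output is squashed onto the same $[0,100]$ range, a given $\lambda$ weighs the quality term comparably across models, which is the convenience the text refers to.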

We use Stable Diffusion v1.4 [51] inside EDICT. Despite being text-conditional, Stable Diffusion also allows unconditional image synthesis with a null text prompt. We leverage this property in our experiments to avoid any potential risk of semantic drift, which is undesirable in image enhancement. Following [66], we adopt a configuration of EDICT with the mixing parameter, $p=0.93$ (in Eqs. (9) and (10)), and the number of diffusion steps, $T=50$. We optimize each image for $\mathrm{MaxIter}=15$ iterations with the learning rate, $\gamma=1$, and the momentum parameter, $\rho=0.9$. Three candidate trade-off parameter values, $\lambda\in\{0.01,0.1,1\}$, are tested and finalized using the staircase method.

Psychophysical Testing Details. The psychophysical experiment is carried out in an indoor office environment, illuminated by normal lighting sources with no reflecting wall or floor. All image pairs are displayed on a calibrated $24''$ LCD monitor at a resolution of $1,920\times 1,080$, with randomized spatial order. The 2AFC method is used, where the subject is forced to select the image of higher perceived quality. We recruit $25$ human subjects ($15$ males and $10$ females), aged between $23$ and $36$, to participate in our subjective study. A training session is included to familiarize each subject with this study. To avoid the fatigue effect, subjects are allowed to take a break at any time.
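The optimization settings reported in this subsection can be collected in one place, for instance as a small configuration object. The class and field names are hypothetical; only the values come from the text. The run count assumes one optimization per combination of test image, NR-IQA model, and candidate $\lambda$.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MapConfig:
    mixing_p: float = 0.93               # EDICT mixing parameter p
    diffusion_steps: int = 50            # T
    max_iter: int = 15                   # MaxIter
    learning_rate: float = 1.0           # gamma
    momentum: float = 0.9                # rho
    lambdas: tuple = (0.01, 0.1, 1.0)    # candidate trade-off values
    prompt: str = ""                     # null text prompt (unconditional)

def num_optimization_runs(cfg, num_images=20, num_models=8):
    """One run per (image, model, lambda) combination."""
    return num_images * num_models * len(cfg.lambdas)

cfg = MapConfig()
print(num_optimization_runs(cfg))  # prints 480
```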

4.2 Model Comparison Results

After subjective testing, we obtain the quality scores of all optimized images as stated in Sec. 3.3. We then average the scores over the $20$ test images, yielding the global ranking of the eight competing NR-IQA models as shown in Fig. 5. We accompany the ranking result with a two-tailed statistical $t$-test [10], where the null hypothesis states that the quality scores $\mathbf{Q}^{\star}_{\bullet j}$ for the $j$-th NR-IQA model and $\mathbf{Q}^{\star}_{\bullet k}$ for the $k$-th NR-IQA model come from the same Normal distribution. When the test fails to reject the null hypothesis at the $\alpha=5\%$ significance level, the performance of the two NR-IQA models is statistically indistinguishable.

We draw several interesting observations from Fig. 5. First, LIQE [85] and UNIQUE [83] outperform the other methods with statistical significance, demonstrating the efficacy of learning to rank image quality on multiple datasets. The performance gap between LIQE and UNIQUE can be ascribed to the differences in the volume of data used for pre-training and the choice of backbone networks (see Table 1). Second, although DBCNN [82], MUSIQ [30], and HyperIQA [60] employ different backbone networks, they achieve close optimization performance, as evidenced by the statistical significance test. It is noteworthy that all three models are trained on KonIQ-10k [27], which suggests that the training data matter more than the choice of backbone network. Third, CLIPIQA+ [68] delivers inferior performance, which may appear surprising initially given that it shares the CLIP backbone with the top-performing LIQE. Upon closer examination, we find that CLIPIQA+ employs the prompt learning strategy [87], with the pre-trained weights fixed. This constraint is likely to hinder CLIPIQA+ from learning/eliciting quality-aware computation. Fourth, although the knowledge-driven model NIQE [42] aims ambitiously to handle arbitrary distortions, it does not perform well under our paradigm of MAP estimation in diffusion latents. This reinforces the challenge of manually crafting features to model the intricate interactions between diverse image content and realistic camera distortions.

Table 1: Ranking results of eight NR-IQA models. A smaller rank indicates better performance. “Multiple” in the second column indicates that UNIQUE and LIQE are trained on the combined datasets of LIVE, CSIQ, BID, CLIVE, KonIQ-10k, and KADID-10k. The results in the third column are in millions. SRCC results are also shown in the fourth column (inside the brackets).

| NR-IQA Model | Training Set | # of Params | SRCC Rank | MAP Rank | $\triangle$ Rank |
|---|---|---|---|---|---|
| NIQE [42] | | 0.001 | 8 (0.706) | 8 | 0 |
| CLIPIQA+ [68] | KonIQ-10k | 101.349 | 2 (0.855) | 7 | -5 |
| PaQ-2-PiQ [78] | FLIVE | 11.704 | 5 (0.827) | 6 | -1 |
| HyperIQA [60] | KonIQ-10k | 27.375 | 7 (0.776) | 5 | 2 |
| MUSIQ [30] | KonIQ-10k | 27.125 | 3 (0.853) | 4 | -1 |
| DBCNN [82] | KonIQ-10k | 15.311 | 6 (0.789) | 3 | 3 |
| UNIQUE [83] | Multiple | 22.322 | 4 (0.838) | 2 | 2 |
| LIQE [85] | Multiple | 150.976 | 1 (0.881) | 1 | 0 |

We also compare the ranking results by the proposed MAP estimation and by Spearman’s rank correlation coefficient (SRCC) on SPAQ [16] in Table 1. None of the NR-IQA models has been exposed to SPAQ, leading to a cross-dataset evaluation setup. The primary observation from Table 1 is that higher SRCC performance does not necessarily transfer to MAP estimation in diffusion latents. In particular, CLIPIQA+ [68] attains the second-highest SRCC result, but ranks second to last in MAP estimation. This is in stark contrast to UNIQUE, DBCNN, and HyperIQA, which rank higher in MAP estimation than in SRCC. In summary, our method discriminates the relative performance of NR-IQA models through analysis-by-synthesis, complementing conventional correlation-based performance measures.
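The last column of Table 1 appears to be the SRCC rank minus the MAP rank, so that a negative value marks a model that drops under MAP estimation. This reading is an inference from the tabulated numbers, not a definition given in the text; the short sketch below recomputes the column from the two rank columns under that assumption.

```python
# (SRCC rank, MAP rank) per model, copied from Table 1.
RANKS = {
    "NIQE": (8, 8), "CLIPIQA+": (2, 7), "PaQ-2-PiQ": (5, 6), "HyperIQA": (7, 5),
    "MUSIQ": (3, 4), "DBCNN": (6, 3), "UNIQUE": (4, 2), "LIQE": (1, 1),
}

def rank_shift(ranks):
    """Assumed definition: SRCC rank minus MAP rank (positive = gains under MAP)."""
    return {name: srcc - map_rank for name, (srcc, map_rank) in ranks.items()}

shifts = rank_shift(RANKS)
gainers = sorted(name for name, d in shifts.items() if d > 0)
print(gainers)  # prints ['DBCNN', 'HyperIQA', 'UNIQUE']
```

The three models with a positive shift are exactly those singled out in the text as ranking better under MAP estimation than under SRCC.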

Figure 6: Failure cases corresponding to subcaptioned NR-IQA models. (a) Incomplete object. (b) False details. (c) Unnatural/cartoon texture. (d) Over-smoothness. (e) Detail drift. (f) Over-saturated color. (g) Failed enhancement. (h) Grainy appearance. Local artifacts are highlighted in red blocks, and are better contrasted to the initials in Fig. 4.

4.3 Failure Case Analysis

We showcase the use of the proposed MAP estimation in diffusion latents to spot the weaknesses of NR-IQA models. Visual inspection of the MAP-enhanced images in Fig. 6 exposes various failure cases of different NR-IQA models with distinctive visual characteristics. Although Figs. 6 (a), (b), and (e) show improved visual quality compared to their initial images, they exhibit noticeable local visual errors, such as false details and detail drift, indicating the need for these NR-IQA models to better comprehend high-level semantics. Figs. 6 (c) and (f) are free from structural distortions but contain unrealistic cartoon-like textures or colors, and Fig. 6 (d) has been overly smoothed. These findings point to a deficiency of these methods in handling low-level visual cues. Consistent with observations in Fig. 5, the two least effective NR-IQA models, CLIPIQA+ and NIQE, either fail to induce perceptually meaningful enhancement, or produce an undesirable grainy appearance that is not present in the original image.

5 Conclusion and Discussion

We have developed MAP estimation in diffusion latents, a computational method that, for the first time, enables existing “imperfect” NR-IQA models to perform challenging (unconstrained) image enhancement. The key trick is to augment NR-IQA models with a differentiable and bijective diffusion model, empowering them with generative capabilities. Due to the differences in design principles and implementation details, different NR-IQA models are very likely to yield distinct enhancement results for the same input image. This allows us to compare existing NR-IQA models in terms of their image enhancement ability, which falls in the general analysis-by-synthesis framework. We have systematically compared eight NR-IQA models using MAP estimation in diffusion latents, and complemented the conventional correlation-based model comparison paradigm in the context of perceptual optimization.

Despite the promise of the proposed method for NR-IQA model comparison, it is far from being a flawless and general solution to real-world image enhancement, as exemplified in the failure cases. In the future, we plan to further pursue this direction, towards efficient and reliable real-world image enhancement. From the data perspective, we will finetune the best-performing NR-IQA model, LIQE, on the combination of previously trained datasets and newly MAP-enhanced images, and plug the finetuned LIQE back for MAP estimation. Our preliminary psychophysical experiments show that the finetuned LIQE can deliver MAP-optimized images of improved visual quality, as preferred by human subjects over those by the original LIQE (more details in the Appendix). Like [69], the procedure of MAP estimation in diffusion latents, psychophysical testing, and model finetuning can be iterated, leading to a closed loop between model evaluation and model development. From the model perspective, encouraged by the promising enhancement results, it is appealing to construct generative NR-IQA models (i.e., the conditional probability of the test image given the quality score, $p(\bm{x}|q)$) rather than existing discriminative counterparts (i.e., $p(q|\bm{x})$). Another pragmatic avenue for research is to substantially reduce computational complexity when taking the gradient with respect to the entire (reverse) diffusion process and to reliably determine the optimization hyper-parameters (e.g., the number of optimization steps, $\mathrm{MaxIter}$, and the trade-off parameter, $\lambda$).

References

  • [1] Berardino, A., Ballé, J., Laparra, V., Simoncelli, E.P.: Eigen-distortions of hierarchical representations. In: Adv. Neural Inform. Process. Syst. pp. 3531–3540 (2017)
  • [2] Bosse, S., Maniry, D., Müller, K., Wiegand, T., Samek, W.: Deep neural networks for no-reference and full-reference image quality assessment. IEEE Trans. Image Process. 27(1), 206–219 (Jan 2018)
  • [3] Buades, A., Coll, B., Morel, J.M.: A non-local algorithm for image denoising. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 60–65 (2005)
  • [4] Burnham, K.P., Anderson, D.R.: Model Selection and Multi-model Inference: A Practical Information-Theoretic Approach. Springer New York (2004)
  • [5] Cai, J.F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (Jan 2010)
  • [6] Cao, P., Wang, Z., Ma, K.: Debiased subjective assessment of real-world image enhancement. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 711–721 (2021)
  • [7] Ciancio, A., Targino da Costa, A.L.N.T., da Silva, E.A.B., Said, A., Samadani, R., Obrador, P.: No-reference blur assessment of digital pictures based on multifeature classifiers. IEEE Trans. Image Process. 20(1), 64–75 (Jan 2011)
  • [8] Cornsweet, T.N.: The staircase-method in psychophysics. Am. J. Psychol. 75(3), 485–491 (Sep 1962)
  • [9] Croitoru, F.A., Hondru, V., Ionescu, R.T., Shah, M.: Diffusion models in vision: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 45(9), 10850–10869 (Sep 2023)
  • [10] David, H.A.: The Method of Paired Comparisons. Hafner Publishing Company (1963)
  • [11] Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. In: Adv. Neural Inform. Process. Syst. pp. 8780–8794 (2021)
  • [12] Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Comparison of full-reference image quality models for optimization of image processing systems. Int. J. Comput. Vis. 129(4), 1258–1281 (Apr 2021)
  • [13] Ding, K., Ma, K., Wang, S., Simoncelli, E.P.: Image quality assessment: Unifying structure and texture similarity. IEEE Trans. Pattern Anal. Mach. Intell. 44(5), 2567–2581 (May 2022)
  • [14] Dinh, L., Krueger, D., Bengio, Y.: NICE: Non-linear independent components estimation. In: Int. Conf. Learn. Represent. Worksh. (2015)
  • [15] Dinh, L., Sohl-Dickstein, J., Bengio, S.: Density estimation using real NVP. In: Int. Conf. Learn. Represent. (2017)
  • [16] Fang, Y., Zhu, H., Zeng, Y., Ma, K., Wang, Z.: Perceptual quality assessment of smartphone photography. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 3677–3686 (2020)
  • [17] Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: CLIP-Adapter: Better vision-language models with feature adapters. Int. J. Comput. Vis. pp. 1–15 (Sep 2023)
  • [18] Geman, D., Yang, C.: Nonlinear image recovery with half-quadratic regularization. IEEE Trans. Image Process. 4(7), 932–946 (Jul 1995)
  • [19] Ghadiyaram, D., Bovik, A.C.: Massive online crowdsourced study of subjective and objective picture quality. IEEE Trans. Image Process. 25(1), 372–387 (Jan 2016)
  • [20] Ghadiyaram, D., Bovik, A.C.: Perceptual quality prediction on authentically distorted images using a bag of features approach. J. Vis. 17(1), 32–32 (Jan 2017)
  • [21] Golan, T., Raju, P.C., Kriegeskorte, N.: Controversial stimuli: Pitting neural networks against each other as models of human cognition. Proc. Nat. Acad. Sci. 117(47), 29330–29337 (Nov 2020)
  • [22] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Adv. Neural Inform. Process. Syst. pp. 2672–2680 (2014)
  • [23] Grenander, U., Miller, M.I.: Pattern Theory: From Representation to Inference. Oxford University Press (2007)
  • [24] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross-attention control. In: Int. Conf. Learn. Represent. (2023)
  • [25] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. In: Adv. Neural Inform. Process. Syst. pp. 6840–6851 (2020)
  • [26] Ho, J., Salimans, T.: Classifier-free diffusion guidance. In: Adv. Neural Inform. Process. Syst. Worksh. (2021)
  • [27] Hosu, V., Lin, H., Sziranyi, T., Saupe, D.: KonIQ-10k: An ecologically valid database for deep learning of blind image quality assessment. IEEE Trans. Image Process. 29, 4041–4056 (Jan 2020)
  • [28] Hu, E.J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: Int. Conf. Learn. Represent. (2021)
  • [29] Hyvärinen, A., Dayan, P.: Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. 6(4) (Apr 2005)
  • [30] Ke, J., Wang, Q., Wang, Y., Milanfar, P., Yang, F.: MUSIQ: Multi-scale image quality transformer. In: Int. Conf. Comput. Vis. pp. 5148–5157 (2021)
  • [31] Larson, E.C., Chandler, D.M.: Most apparent distortion: Full-reference image quality assessment and the role of strategy. J. Electron. Imaging 19(1), 1–21 (Jan 2010)
  • [32] Lin, H., Hosu, V., Saupe, D.: KADID-10k: A large-scale artificially distorted IQA database. In: Int. Conf. Multimedia and Expo. pp. 1–3 (2019)
  • [33] Liu, X., Weijer, J.v.d., Bagdanov, A.D.: RankIQA: Learning from rankings for no-reference image quality assessment. In: Int. Conf. Comput. Vis. pp. 1040–1049 (2017)
  • [34] Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. In: Int. Conf. Learn. Represent. (2017)
  • [35] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: Int. Conf. Learn. Represent. (2019)
  • [36] Ma, K., Duanmu, Z., Wang, Z., Wu, Q., Liu, W., Yong, H., Li, H., Zhang, L.: Group maximum differentiation competition: Model comparison with few samples. IEEE Trans. Pattern Anal. Mach. Intell. 42(4), 851–864 (Apr 2020)
  • [37] Ma, K., Liu, W., Zhang, K., Duanmu, Z., Wang, Z., Zuo, W.: End-to-end blind image quality assessment using deep neural networks. IEEE Trans. Image Process. 27(3), 1202–1213 (Mar 2018)
  • [38] Ma, K., Liu, X., Fang, Y., Simoncelli, E.P.: Blind image quality assessment by learning from multiple annotators. In: IEEE Int. Conf. Image Process. pp. 2344–2348 (2019)
  • [39] Madhusudana, P.C., Birkbeck, N., Wang, Y., Adsumilli, B., Bovik, A.C.: Image quality assessment using contrastive learning. IEEE Trans. Image Process. 31, 4149–4161 (Jun 2022)
  • [40] Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.Y., Ermon, S.: SDEdit: Guided image synthesis and editing with stochastic differential equations. In: Int. Conf. Learn. Represent. (2021)
  • [41] Mittal, A., Moorthy, A.K., Bovik, A.C.: No-reference image quality assessment in the spatial domain. IEEE Trans. Image Process. 21(12), 4695–4708 (Dec 2012)
  • [42] Mittal, A., Soundararajan, R., Bovik, A.C.: Making a “completely blind” image quality analyzer. IEEE Sign. Process. Letters 20(3), 209–212 (Mar 2013)
  • [43] Moorthy, A.K., Bovik, A.C.: Blind image quality assessment: From natural scene statistics to perceptual quality. IEEE Trans. Image Process. 20(12), 3350–3364 (Dec 2011)
  • [44] Mumford, D.: Pattern theory: A unifying perspective. In: Eur. Cong. Math. pp. 187–224 (1994)
  • [45] Nichol, A.Q., Dhariwal, P.: Improved denoising diffusion probabilistic models. In: Int. Conf. Mach. Learn. pp. 8162–8171 (2021)
  • [46] Perez-Ortiz, M., Mantiuk, R.K.: A practical guide and software for analysing pairwise comparison experiments. arXiv preprint arXiv:1712.03686 (2017)
  • [47] Perez-Ortiz, M., Mikhailiuk, A., Zerman, E., Hulusic, V., Valenzise, G., Mantiuk, R.K.: From pairwise comparisons and rating to a unified quality scale. IEEE Trans. Image Process. 29, 1139–1151 (2020)
  • [48] Pokrovskii, V.N.: Thermodynamics of Complex Systems: Principles and Applications. IOP Publishing (2020)
  • [49] Portilla, J., Strela, V., Wainwright, M.J., Simoncelli, E.P.: Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE Trans. Image Process. 12(11), 1338–1351 (Nov 2003)
  • [50] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Int. Conf. Mach. Learn. pp. 8748–8763 (2021)
  • [51] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 10684–10695 (2022)
  • [52] Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Phys. D: Nonlinear Phenom. 60(1-4), 259–268 (Nov 1992)
  • [53] Saha, A., Mishra, S., Bovik, A.C.: Re-IQA: Unsupervised learning for image quality assessment in the wild. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 5846–5855 (2023)
  • [54] Sheikh, H.R., Sabir, M.F., Bovik, A.C.: A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Trans. Image Process. 15(11), 3440–3451 (Nov 2006)
  • [55] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: Int. Conf. Mach. Learn. pp. 2256–2265 (2015)
  • [56] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. In: Int. Conf. Learn. Represent. (2021)
  • [57] Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. In: Adv. Neural Inform. Process. Syst. (2019)
  • [58] Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S., Poole, B.: Score-based generative modeling through stochastic differential equations. In: Int. Conf. Learn. Represent. (2021)
  • [59] Su, S., Hosu, V., Lin, H., Zhang, Y., Saupe, D.: KonIQ++: Boosting no-reference image quality assessment in the wild by jointly predicting image quality and defects. In: Brit. Mach. Vis. Conf. pp. 1–12 (2021)
  • [60] Su, S., Yan, Q., Zhu, Y., Zhang, C., Ge, X., Sun, J., Zhang, Y.: Blindly assess image quality in the wild guided by a self-adaptive hyper network. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 3667–3676 (2020)
  • [61] Thurstone, L.L.: A law of comparative judgment. Psychol. Rev. 34, 273–286 (Jul 1927)
  • [62] Tsai, M.F., Liu, T.Y., Qin, T., Chen, H.H., Ma, W.Y.: FRank: A ranking method with fidelity loss. In: ACM SIGIR Conf. Res. Develop. Inf. Retrieval. pp. 383–390 (2007)
  • [63] Ulyanov, D., Vedaldi, A., Lempitsky, V.: Deep image prior. Int. J. Comput. Vis. 128(7), 1867–1888 (Jul 2020)
  • [64] VQEG: Final report from the video quality experts group on the validation of objective models of video quality assessment (2003), http://www.vqeg.org
  • [65] Wallace, B., Gokul, A., Ermon, S., Naik, N.: End-to-end diffusion latent optimization improves classifier guidance. In: Int. Conf. Comput. Vis. pp. 7280–7290 (2023)
  • [66] Wallace, B., Gokul, A., Naik, N.: EDICT: Exact diffusion inversion via coupled transformations. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 22532–22541 (2023)
  • [67] Wang, H., Chen, T., Wang, Z., Ma, K.: I am going MAD: Maximum discrepancy competition for comparing classifiers adaptively. In: Int. Conf. Learn. Represent. (2020)
  • [68] Wang, J., Chan, K.C., Loy, C.C.: Exploring CLIP for assessing the look and feel of images. In: AAAI Conf. Artif. Intell. pp. 2555–2563 (2023)
  • [69] Wang, Z., Ma, K.: Active fine-tuning from gMAD examples improves blind image quality assessment. IEEE Trans. Pattern Anal. Mach. Intell. 44(9), 4577–4590 (Sep 2022)
  • [70] Wang, Z., Bovik, A.C.: Modern Image Quality Assessment. Morgan & Claypool (2006)
  • [71] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 13(4), 600–612 (Apr 2004)
  • [72] Wang, Z., Simoncelli, E.P.: Maximum differentiation (MAD) competition: A methodology for comparing computational models of perceptual quantities. J. Vis. 8(12), 8.1–8.13 (Sep 2008)
  • [73] Wright, J., Ma, Y., Mairal, J., Sapiro, G., Huang, T.S., Yan, S.: Sparse representation for computer vision and pattern recognition. Proc. IEEE 98(6), 1031–1044 (Jun 2010)
  • [74] Wu, H., Zhang, Z., Zhang, W., Chen, C., Liao, L., Li, C., Gao, Y., Wang, A., Zhang, E., Sun, W., Yan, Q., Min, X., Zhai, G., Lin, W.: Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels. arXiv preprint arXiv:2312.17090 (2023)
  • [75] Wu, J., Ma, J., Liang, F., Dong, W., Shi, G., Lin, W.: End-to-end blind image quality prediction with cascaded deep neural network. IEEE Trans. Image Process. 29, 7414–7426 (Jun 2020)
  • [76] Yan, J., Zhong, Y., Fang, Y., Wang, Z., Ma, K.: Exposing semantic segmentation failures via maximum discrepancy competition. Int. J. Comput. Vis. 129(5), 1768–1786 (Mar 2021)
  • [77] Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., Zhang, W., Cui, B., Yang, M.H.: Diffusion models: A comprehensive survey of methods and applications. ACM Comput. Surveys 56(4), 1–39 (Sep 2023)
  • [78] Ying, Z., Niu, H., Gupta, P., Mahajan, D., Ghadiyaram, D., Bovik, A.C.: From patches to pictures (PaQ-2-PiQ): Mapping the perceptual space of picture quality. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 3572–3582 (2020)
  • [79] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 586–595 (2018)
  • [80] Zhang, W., Li, D., Ma, C., Zhai, G., Yang, X., Ma, K.: Continual learning for blind image quality assessment. IEEE Trans. Pattern Anal. Mach. Intell. 45(3), 2864–2878 (Mar 2023)
  • [81] Zhang, W., Li, D., Min, X., Zhai, G., Guo, G., Yang, X., Ma, K.: Perceptual attacks of no-reference image quality models with human-in-the-loop. In: Adv. Neural Inform. Process. Syst. pp. 2916–2929 (2022)
  • [82] Zhang, W., Ma, K., Yan, J., Deng, D., Wang, Z.: Blind image quality assessment using a deep bilinear convolutional neural network. IEEE Trans. Circuit Syst. Video Technol. 30(1), 36–47 (Jan 2020)
  • [83] Zhang, W., Ma, K., Zhai, G., Yang, X.: Uncertainty-aware blind image quality assessment in the laboratory and wild. IEEE Trans. Image Process. 30, 3474–3486 (Mar 2021)
  • [84] Zhang, W., Ma, K., Zhai, G., Yang, X.: Task-specific normalization for continual learning of blind image quality models. IEEE Trans. Image Process. (2024), to appear
  • [85] Zhang, W., Zhai, G., Wei, Y., Yang, X., Ma, K.: Blind image quality assessment via vision-language correspondence: A multitask learning perspective. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 14071–14081 (2023)
  • [86] Zhao, K., Yuan, K., Sun, M., Li, M., Wen, X.: Quality-aware pre-trained models for blind image quality assessment. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 22302–22313 (2023)
  • [87] Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. Int. J. Comput. Vis. 130(9), 2337–2348 (Jul 2022)
  • [88] Zhu, H., Li, L., Wu, J., Dong, W., Shi, G.: MetaIQA: Deep meta-learning for no-reference image quality assessment. In: IEEE/CVF Conf. Comput. Vis. Pattern Recog. pp. 14131–14140 (2020)

Appendix A

A.1 Model Rectification

The annotated MAP-optimized images in Sec. 3.3 give us an opportunity to rectify NR-IQA models. Drawing inspiration from recent wisdom in data-efficient model finetuning [28, 17], we introduce a lightweight quality rectifier $\bm{r}_{\bm{\phi}}(\cdot):\mathbb{R}^{L}\mapsto\mathbb{R}^{2}$, parameterized by a vector $\bm{\phi}$. Given an input image $\bm{x}$, $\bm{r}_{\bm{\phi}}$ shares the image feature transform of the NR-IQA model $q^{\star}_{\bm{w}}$, and produces a multiplicative scalar $s_{\bm{\phi}}(\bm{x})$ and an additive scalar $b_{\bm{\phi}}(\bm{x})$ to rectify the quality prediction: $q^{\star}(\bm{x})=s_{\bm{\phi}}(\bm{x})\times q^{\star}_{\bm{w}}(\bm{x})+b_{\bm{\phi}}(\bm{x})$.
We fix the pre-trained weights $\bm{w}$ and optimize the rectifier parameters $\bm{\phi}$ on the combination of the previous training images and the newly MAP-optimized images. Specifically, we choose the best-performing model LIQE [85] for demonstration. LIQE matches an image $\bm{x}$ against all candidate textual descriptions, yielding a joint probability over three tasks: quality prediction, scene classification, and distortion type identification. We marginalize over the two auxiliary tasks to obtain the marginal probability of image quality. We use the image representation produced by LIQE as the input to the rectifier $\bm{r}_{\bm{\phi}}$. Following [85], we use AdamW [35] to train $\bm{r}_{\bm{\phi}}$ by minimizing the fidelity loss [62] for $15$ epochs with an initial learning rate of $10^{-3}$, which is scheduled by a cosine annealing rule [34]. The mini-batch size is set to $4$ for the LIVE [54], CSIQ [31], BID [7], and LIVE Challenge [19] datasets, $16$ for KonIQ-10k [27] and KADID-10k [32], and $8$ for the set of MAP-enhanced images.
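The rectification and training objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear form of the rectifier, the parameterization of the scale around 1 (so that zero parameters leave the original prediction unchanged), the Thurstone-style pairwise probability, and all function names (`rectifier`, `rectified_quality`, `pair_probability`, `fidelity_loss`, `cosine_lr`) are assumptions made for illustration. Only the affine rectification, the fidelity loss, the 15 epochs, and the initial learning rate of $10^{-3}$ come from the text.

```python
import math
import numpy as np

def rectifier(features, W, c):
    """Hypothetical linear rectifier r_phi: R^L -> R^2, returning (s, b)."""
    out = features @ W + c   # two outputs per image
    s = 1.0 + out[..., 0]    # assumption: scale parameterized around 1
    b = out[..., 1]          # additive offset
    return s, b

def rectified_quality(q_w, features, W, c):
    """Affine rectification: q(x) = s(x) * q_w(x) + b(x)."""
    s, b = rectifier(features, W, c)
    return s * q_w + b

def pair_probability(q_a, q_b):
    """Assumed Thurstone-style probability that image a beats image b,
    i.e. Phi((q_a - q_b) / sqrt(2)) with Phi the standard normal CDF."""
    return 0.5 * (1.0 + math.erf((q_a - q_b) / 2.0))

def fidelity_loss(p, p_hat):
    """Fidelity loss between a target and a predicted binary probability."""
    return 1.0 - math.sqrt(p * p_hat) - math.sqrt((1.0 - p) * (1.0 - p_hat))

def cosine_lr(epoch, total_epochs=15, lr0=1e-3, lr_min=0.0):
    """Cosine annealing of the learning rate (lr_min = 0 is an assumption)."""
    return lr_min + 0.5 * (lr0 - lr_min) * (1.0 + math.cos(math.pi * epoch / total_epochs))

# Toy usage: with zero parameters the rectifier leaves the prediction unchanged.
L = 8
features = np.random.default_rng(0).standard_normal(L)
W, c = np.zeros((L, 2)), np.zeros(2)
q_rect = rectified_quality(3.2, features, W, c)
loss = fidelity_loss(1.0, pair_probability(q_rect, 2.0))
```

Starting the scale at 1 and the offset at 0 is one common design choice for adapters of this kind, since training then begins from the behavior of the frozen model; the paper does not state how the rectifier is initialized.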

We compare the MAP-enhanced images produced by LIQE$^{\mathrm{rect}}$ and the original LIQE in a subjective experiment, similar to that described in Sec. 3.3. We query human preferences for the MAP-enhanced images of the same visual content (shown in Fig. 4), from which we find $69\%$ of the subjects favor the enhanced results produced by LIQE$^{\mathrm{rect}}$. Moreover, we follow the MAD competition methodology [72] to select $20$ images that best differentiate between LIQE$^{\mathrm{rect}}$ and LIQE from a pool of $200$ candidate images. We employ DISTS [13] to quantify the perceptual similarity in MAD. The subjective testing on the MAD-selected images reveals that $71\%$ of the participants prefer the results produced by LIQE$^{\mathrm{rect}}$. These findings verify the efficacy of MAP-enhanced images in rectifying NR-IQA models. Figs. A1 and A2 show some visual examples, where we find that images generated by LIQE$^{\mathrm{rect}}$ are less structurally distorted and more semantically consistent, leading to more natural visual appearances.
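The selection step can be sketched as a simple top-$k$ ranking. The text does not spell out the exact selection rule, so the following is only one plausible reading: each candidate is assumed to carry a single perceptual distance (for example, a DISTS-like distance between the two models' MAP-enhanced outputs), and the candidates with the largest distances are kept. The function name `select_most_discriminative` and the random stand-in distances are hypothetical.

```python
import numpy as np

def select_most_discriminative(distances, k=20):
    """Return the indices of the k candidates with the largest distance,
    ordered from most to least discriminative.

    `distances[i]` is assumed to be a perceptual distance between the
    outputs that the two competing models produce for candidate i.
    """
    distances = np.asarray(distances, dtype=float)
    order = np.argsort(-distances, kind="stable")  # descending order
    return order[:k]

# Toy usage with random stand-in distances for a pool of 200 candidates.
pool = np.random.default_rng(0).random(200)
chosen = select_most_discriminative(pool, k=20)
```

The stand-in distances here are random numbers, not measurements; in practice they would come from whichever perceptual similarity model is used to compare the two outputs.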

A.2 More Qualitative Results

We show more MAP-enhanced images corresponding to different NR-IQA models in Figs. A3–A10. Visual inspection of these images yields some additional interesting observations.

  • Given blurry initial images (see Figs. A3, A4, A7, and A8), NIQE [42] often guides the optimization process to deblur excessively, resulting in the emergence of unnatural textures, particularly in the background.

  • While PaQ-2-PiQ [78] performs well in structure restoration, it falls short in producing natural color appearances (see Figs. A3, A7, and A9), thereby compromising perceptual quality.

  • Although LIQE [85] excels in distortion removal, it tends to synthesize cartoonish images (see Fig. A7), suggesting room for further improvement.

  • All NR-IQA models fail to enhance Fig. A5 (a) in terms of semantic correctness, introducing textual artifacts in the bottom-left area. This not only highlights the semantic understanding deficiency of existing NR-IQA models but also aligns with the hallucination issue inherent in diffusion models.

  • For initial images with moderate perceptual quality, such as Fig. A6 (a), the competing NR-IQA models generate comparable visual results.

Figure A1: (a) to (d): Initial images. (e) to (h): MAP-enhanced images by LIQE. (i) to (l): MAP-enhanced images by LIQE$^{\mathrm{rect}}$.
Figure A2: (a) to (d): Initial images. (e) to (h): MAP-enhanced images by LIQE. (i) to (l): MAP-enhanced images by LIQE$^{\mathrm{rect}}$.
Figure A3: (a) Realistically distorted image. (b) to (i) MAP-enhanced images corresponding to different NR-IQA models. Zoom in for improved visibility.
Figure A4: (a) Realistically distorted image. (b) to (i) MAP-enhanced images corresponding to different NR-IQA models. Zoom in for improved visibility.
Figure A5: (a) Realistically distorted image. (b) to (i) MAP-enhanced images corresponding to different NR-IQA models. Zoom in for improved visibility.
Figure A6: (a) Realistically distorted image. (b) to (i) MAP-enhanced images corresponding to different NR-IQA models. Zoom in for improved visibility.
Figure A7: (a) Realistically distorted image. (b) to (i) MAP-enhanced images corresponding to different NR-IQA models. Zoom in for improved visibility.
Figure A8: (a) Realistically distorted image. (b) to (i) MAP-enhanced images corresponding to different NR-IQA models. Zoom in for improved visibility.
Figure A9: (a) Realistically distorted image. (b) to (i) MAP-enhanced images corresponding to different NR-IQA models. Zoom in for improved visibility.
Figure A10: (a) Realistically distorted image. (b) to (i) MAP-enhanced images corresponding to different NR-IQA models. Zoom in for improved visibility.