
Semantic-guided Adversarial Diffusion Model for Self-supervised Shadow Removal

Ziqi Zeng, Chen Zhao, Weiling Cai, Chenyu Dong
School of Artificial Intelligence, Nan**g Normal University
These authors contributed equally to this work. Corresponding author.
Abstract

Existing unsupervised methods have addressed the challenges of inconsistent paired data and the tedious acquisition of ground-truth labels in shadow removal. However, GAN-based training often suffers from mode collapse and unstable optimization. Furthermore, because the mapping between the shadow and shadow-free domains is complex, adversarial learning alone cannot capture the underlying relationship between the two domains, which results in low-quality generated images. To address these problems, we propose a semantic-guided adversarial diffusion framework for self-supervised shadow removal, which consists of two stages. In the first stage, a semantic-guided generative adversarial network (SG-GAN) produces a coarse result and constructs paired synthetic data through a cycle-consistent structure. In the second stage, the coarse result is refined by a diffusion-based restoration module (DBRM) to enhance texture details and suppress edge artifacts. In addition, we propose a multi-modal semantic prompter (MSP) that extracts accurate semantic information from real images and text, guiding the shadow removal network in SG-GAN toward better restoration. Experiments on multiple public datasets demonstrate the effectiveness of our method.

CCS Concepts: • Computing methodologies → Image processing; Image representations
volume: 43, issue: 7

1 Introduction

In human perception, the visual, auditory, and tactile senses collectively serve as conduits through which individuals interact with the world. Among these senses, vision is the most intuitive and information-rich. Visual perception occurs when light interacts with the retina, giving rise to visual experiences. Shadows, a natural consequence of obstructed light sources, play a crucial role in shaping the visual landscape. While shadows convey valuable cues about object shapes and light direction, their presence often complicates the semantic understanding of images in computer vision tasks such as image segmentation [LS18] and object detection [FY19]. Shadow regions may be misclassified as objects or as parts of objects, which can significantly impair the accuracy and performance of these tasks. Consequently, shadow detection and removal are essential for improving computer vision tasks.

Figure 1: Shadow removal results of our method and two other GAN-based methods: G2R-ShadowNet [LYW21] and Mask-ShadowGAN [HJFH19]. The results of the GAN-based methods show obvious shadow boundaries and artifacts.

Early shadow removal methods [AHO11] [FHLD06] [GDH13] [XSXM13] remove shadows by exploiting different physical properties of shadows. Because the underlying physical models are limited and insufficiently accurate, these traditional model-based algorithms cannot effectively handle shadows in complex real-world scenes.

Learning-based methods [ZLY20] [CPS20] [ZXF22] typically train networks on paired shadow and shadow-free images in a fully supervised manner. However, large-scale paired shadow removal datasets contain inconsistencies because outdoor illumination cannot be controlled. Several methods avoid paired data by generating supervisory signals through shadow generation within a cycle-consistent architecture. Nevertheless, the gap between synthetic and real images limits the applicability of these methods to real-world shadow removal. Although existing unsupervised methods have achieved notable results, GAN-based methods are susceptible to unstable optimization and mode collapse during training [XKV21] [ZCHY24]. Additionally, adversarial training alone is insufficient to fully learn the complex mapping between the shadow and shadow-free domains. As shown in Figure 1, the resulting shadow-free images often exhibit visible boundary artifacts and lack detailed texture restoration, leaving room for improvement in visual quality.

Figure 2: Visualization of different domains and domain transitions. The red arrow represents the previous adversarial generation process, and the green arrow represents the diffusion generation process.

In recent years, diffusion models, e.g., the Denoising Diffusion Probabilistic Model (DDPM) [HJA20], have achieved significant breakthroughs in visual generative tasks as a new branch of generative models [ZYY23] [LRJ23] [ZLM24] [LLK23] [LWL24] [RBL22] [SCC22]. Owing to their superior generative capabilities, many studies have explored the application of diffusion models to image restoration tasks to enhance texture recovery. ShadowDiffusion [GWY23a] proposes a diffusion sampling strategy to explicitly integrate the shadow degradation prior into the inherent iterative process of the dynamic mask sensing diffusion model. Liu et al. [LKX24] reconstruct the local illumination of the shadow region with a diffusion model. Although diffusion models provide a more stable training process than GANs and capture the pixel distribution of images more effectively, these diffusion-based methods still require paired data for training and are encumbered by the limitations inherent to the supervised paradigm. Consequently, we aim to explore a self-supervised paradigm based on diffusion models.

As shown in Figure 2, we first use a GAN-based method to carry out coarse shadow removal and, in this process, construct paired synthetic data through a cycle-consistent structure. We then use a diffusion model to process the paired data and refine the coarse results so that they are closer to the target. In this way, the diffusion model can be used for self-supervised shadow removal. In this paper, we propose a novel semantic-guided adversarial diffusion framework, which provides a self-supervised solution to shadow removal with unpaired data. Specifically, our framework comprises two major stages: a coarse processing stage and a refined restoration stage, which correspond to the semantic-guided generative adversarial network (SG-GAN) and the diffusion-based restoration module (DBRM), respectively. In the first stage, SG-GAN consists of two sub-branches. We train mask-guided generators to synthesize shadow images, which assist the shadow removal network in adversarial training. We further propose a multi-modal semantic prompter (MSP) that uses pre-trained vision-language models to extract features and semantic information from real images and text, enhancing shadow removal performance in both branches. In the second stage, the input is the paired data constructed by the cycle-consistent structure in a branch of SG-GAN. We use DBRM to further refine the coarse results obtained from SG-GAN, which may contain edge artifacts and blurred textures, making them closer to the target images. This process removes the obstacle that diffusion models rely on paired data for training, which otherwise makes their use in unsupervised methods challenging.

In summary, the main contributions of this work are as follows:

  • We propose a semantic-guided adversarial diffusion model for self-supervised shadow removal. It overcomes the difficulty diffusion models have with unpaired data by constructing paired images through a cycle-consistent structure. Our method learns to remove shadows from unpaired data and alleviates the obvious shadow boundaries and missing texture details in the results.

  • A general-purpose multi-modal semantic prompter is introduced to bridge the inherent gap between real-world shadow images and synthetic shadow images. Its effectiveness is also validated when it is integrated into other methods.

  • We conduct extensive experiments on two public datasets. The experimental results show that our proposed method achieves competitive performance and is superior to previous unsupervised shadow removal methods.

2 Related work

2.1 Shadow removal

Traditional shadow removal methods rely on image gradients [FHLD06], illumination information [XSXM13], and image intensity regions [GDH13] to remove shadows. These early methods often model the shadow-free image or transfer color and texture features from non-shadow regions to shadow regions. However, because the underlying physical models lack accuracy, these methods usually cannot handle shadows in complex real-world scenes. With the emergence of deep learning, learning-based approaches [YYB21] [YZG22] [YWZ22] [FSZY24] have shown greater advantages in dealing with more complex and varied scenes.

Qu et al. [QTH17a] proposed DeshadowNet, an end-to-end deep neural network for shadow removal that predicts its output by combining three different kinds of information through a multi-context architecture. Wang et al. [WLY18] designed ST-CGAN to detect and remove shadows and created ISTD, the first large-scale shadow benchmark dataset, which contains 1870 image triplets (shadow image, shadow mask, and shadow-free image). Because paired shadow images are difficult to obtain in practice and are often inconsistent, Hu et al. [HJFH19] eliminated the reliance on paired data with Mask-ShadowGAN, which builds on the idea of CycleGAN and treats shadow removal as an image-to-image style transfer problem. LG-ShadowNet [LYM21] proposed a shadow image enhancement method based on a simple physical lighting model and an image decomposition formula for shadow and pseudo-shadow removal. Liu et al. [LYW21] introduced G2R-ShadowNet, which performs shadow removal using a training set constructed from shadow images and their corresponding shadow masks. Most of these methods adopt adversarial learning, which results in obvious shadow boundaries and a lack of texture details.


Figure 3: Overall pipeline of our method. At the coarse processing stage, the SG-GAN, which consists of S2F, F2F, and MSP, predicts the coarse shadow removal results $\widetilde{R}^{2}_{s}$. At the refined restoration stage, DBRM takes paired data $R_{n}$ and $\widetilde{R}^{2}_{s}$ from the previous stage as input, where the coarse result $\widetilde{R}^{2}_{s}$ is refined.

2.2 Diffusion model

Diffusion models [SDWMG15] [HJA20] are generative models that learn the distribution of real images through a forward process that gradually corrupts images with Gaussian noise and a reverse denoising process. They have been successfully applied to various computer vision tasks, e.g., image super-resolution [SHC22], inpainting [LDR22], color harmonization [XHL23], and image restoration [GWY23b] [ZCDH24].

In recent years, diffusion models have also been used for shadow removal. For example, [MFL24] enhanced the diffusion process by conditioning on a latent feature space learned from shadow-free images, while integrating noise features to avoid local optima during training. Liu et al. [LKX24] used a diffusion model to reconstruct the local illumination of the shadow region, conditioned on the global illumination of the shadow image. However, these diffusion-based methods inevitably need paired data to provide supervisory information for network training.

Additionally, methods combining adversarial learning with diffusion models have emerged. [KOY22] employed DDPM with adversarial learning for unsupervised vessel segmentation and achieved promising results. However, as far as we know, no study has yet combined these approaches for shadow removal.

3 Preliminary

In this section, we briefly review the key concepts underlying SDE-based diffusion models and outline the process of generating samples with reverse-time SDEs. Let $p_{0}$ denote the initial data distribution, and $t \in [0,T]$ denote the continuous time variable. We consider a diffusion process $\{x(t)\}_{t=0}^{T}$ defined by an SDE of the form:

\[ dx = f(x,t)\,dt + g(t)\,dw, \quad x(0) \sim p_{0}(x), \tag{1} \]

where $f$ and $g$ are the drift and dispersion functions, respectively, $w$ is a standard Wiener process, and $x(0) \in \mathbb{R}^{d}$ is the initial condition. Typically, the terminal state $x(T)$ follows a Gaussian distribution with fixed mean and variance. The idea is to design such an SDE that gradually transforms the data distribution into Gaussian noise [DBMH22, LZB22].
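To make the forward process concrete, the following minimal sketch simulates an SDE of the form (1) for a scalar state with the Euler-Maruyama scheme. The specific drift $f(x,t) = -\tfrac{1}{2}\beta x$, dispersion $g(t) = \sqrt{\beta}$, the constant `BETA`, and all function names are assumptions chosen for illustration; the text above does not specify which SDE is used.

```python
import math
import random


def euler_maruyama_forward(x0, drift, dispersion, T=1.0, n_steps=200, rng=None):
    """Simulate dx = f(x, t) dt + g(t) dw from t = 0 to t = T for a scalar x."""
    rng = rng or random.Random()
    dt = T / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment with variance dt
        x += drift(x, t) * dt + dispersion(t) * dw
        t += dt
    return x


BETA = 10.0  # assumed constant noise rate, chosen only for this illustration


def vp_drift(x, t):
    """Assumed drift f(x, t) = -0.5 * beta * x, which pulls x toward zero."""
    return -0.5 * BETA * x


def vp_dispersion(t):
    """Assumed dispersion g(t) = sqrt(beta)."""
    return math.sqrt(BETA)


def terminal_statistics(x0=3.0, n_paths=2000, seed=0):
    """Empirical mean and variance of x(T) over many independent paths."""
    rng = random.Random(seed)
    xs = [euler_maruyama_forward(x0, vp_drift, vp_dispersion, rng=rng)
          for _ in range(n_paths)]
    mean = sum(xs) / n_paths
    var = sum((x - mean) ** 2 for x in xs) / n_paths
    return mean, var


mean_T, var_T = terminal_statistics()
print(f"terminal mean ~ {mean_T:.3f}, terminal variance ~ {var_T:.3f}")
```

Under these assumed coefficients, every path starts at the same data point, yet the terminal samples are approximately standard Gaussian, which is the behavior described above for $x(T)$.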

We can reverse the process to sample data from noise by simulating the SDE backward in time [SSDK20]. [And82] shows that a reverse-time representation of the SDE (1) is given by:

\[ dx = \left[ f(x,t) - g(t)^{2} \nabla_{x} \log p_{t}(x) \right] dt + g(t)\,d\hat{w}, \tag{2} \]

where $x(T) \sim p_{T}(x)$. Here, $\hat{w}$ is a reverse-time Wiener process, and $p_{t}(x)$ is the marginal probability density function of $x(t)$ at time $t$. Since the score function $\nabla_{x} \log p_{t}(x)$ is generally intractable, SDE-based diffusion models approximate it by training a time-dependent neural network $s_{\theta}(x,t)$ using a score matching objective [SSDK20].
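The reverse-time SDE (2) can be illustrated on a toy problem where the score is known in closed form, so no network is needed. The sketch below assumes a one-dimensional Gaussian data distribution and the same illustrative coefficients $f(x,t) = -\tfrac{1}{2}\beta x$, $g(t) = \sqrt{\beta}$; the analytic `score` function stands in for the learned $s_{\theta}(x,t)$, and all names and constants are hypothetical.

```python
import math
import random

BETA, T = 10.0, 1.0             # assumed constant noise rate and time horizon
DATA_MEAN, DATA_STD = 2.0, 0.5  # assumed toy data distribution p_0 = N(2, 0.5^2)


def marginal_moments(t):
    """Mean and variance of p_t when p_0 is Gaussian and f = -0.5*beta*x, g = sqrt(beta)."""
    decay = math.exp(-BETA * t)
    mean_t = DATA_MEAN * math.sqrt(decay)
    var_t = DATA_STD ** 2 * decay + (1.0 - decay)
    return mean_t, var_t


def score(x, t):
    """Analytic score of the Gaussian marginal; a stand-in for s_theta(x, t)."""
    mean_t, var_t = marginal_moments(t)
    return -(x - mean_t) / var_t


def reverse_sde_sample(n_steps=400, rng=None):
    """Integrate the reverse-time SDE from t = T down to t = 0 with Euler-Maruyama."""
    rng = rng or random.Random()
    dt = T / n_steps
    x = rng.gauss(0.0, 1.0)  # p_T is approximately a standard Gaussian here
    for i in range(n_steps):
        t = T - i * dt
        f = -0.5 * BETA * x
        g = math.sqrt(BETA)
        noise = rng.gauss(0.0, 1.0)
        # One backward step of size dt: subtract the reverse drift, add fresh noise.
        x = x - (f - g * g * score(x, t)) * dt + g * math.sqrt(dt) * noise
    return x


def sample_statistics(n_samples=1000, seed=0):
    """Empirical mean and standard deviation of samples drawn at t = 0."""
    rng = random.Random(seed)
    xs = [reverse_sde_sample(rng=rng) for _ in range(n_samples)]
    mean = sum(xs) / n_samples
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n_samples)
    return mean, std


mean_0, std_0 = sample_statistics()
print(f"sample mean ~ {mean_0:.3f}, sample std ~ {std_0:.3f}")
```

Starting from noise, the samples recover approximately the assumed data mean and standard deviation. In practice the closed-form score is unavailable, which is why it is replaced by the trained network.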

4 Proposed method

Figure 4: Real shadow images from ISTD and the corresponding synthetic shadow images obtained from our generator $G_{s}$. The histograms show the inconsistencies in the intensity distribution between them.

4.1 Overall framework

To enhance the effectiveness of shadow removal networks and address the edge artifacts and blurred textures of unsupervised methods, we propose a semantic-guided adversarial diffusion model for self-supervised shadow removal. Figure 3 illustrates the overall network architecture of our method, which consists of two stages: a coarse processing stage and a refined restoration stage. These two stages are implemented by the semantic-guided generative adversarial network (SG-GAN) and the diffusion-based restoration module (DBRM), respectively. In SG-GAN, a general-purpose multi-modal semantic prompter (MSP) uses a pre-trained CLIP model to extract semantic information that helps the network restore images better. Structurally, SG-GAN is divided into two branches, which take a set of unpaired shadow images, shadow-free images, and shadow masks as training inputs. It includes the shadow generation and shadow removal generators $G_{s}$ and $G_{r}$, as well as the shadow and shadow-free image discriminators $D_{s}$ and $D_{r}$. However, SG-GAN relies on discriminators and consistency constraints, which are insufficient to achieve optimal results, and the outcomes after shadow removal still contain undesirable noise. Therefore, in the refined restoration stage, DBRM pairs the coarse shadow removal results obtained from SG-GAN with the clean inputs to construct paired data for network training. Exploiting the powerful generative capabilities of the diffusion model, DBRM further refines the coarse results to remove artifacts and enhance texture details.

4.2 Multi-modal semantic prompter

Existing unsupervised methods [LYW21] [HJFH19] primarily rely on generated shadow images (synthetic data) to train shadow removal networks, without guidance from real prior information. However, as shown in Figure 4, even though the real and synthesized shadow images look very similar, their data distributions still differ. A shadow removal network trained on synthetic images may therefore be less effective on real-world shadow images.

Useful prompts can steer task networks toward better performance. To minimize the impact of this gap, we design a multi-modal semantic prompter (MSP). As shown in Figure 3, MSP extracts features using the image encoder $C_{image}$ and the text encoder $C_{text}$ of a pre-trained CLIP model [RKH21]. The prior information extracted by the image encoder is fused with the features extracted by the residual blocks of $G_{r}$ through semantic fusion blocks (SFB), while the text features extracted by the text encoder are used to define a clip loss (introduced in Sec. 4.3) that further constrains the image recovered by $G_{r}$.

SFB aims to perceive the prior information more reliably and to use it to assist $G_{r}$ in restoration. It also controls how the perceived prior information propagates, so that $G_{r}$ can adaptively learn more useful features for better restoration. Given a recovery tensor $X_{k-1}$ from the $(k-1)$-th residual block in $G_{r}$ and a semantic tensor $Y$ extracted by $C_{image}$, the two are first fused through a cross-attention mechanism to obtain $Z_{k-1}$:

\[ Z_{k-1} = A_{c-attn}(X_{k-1}, Y) \tag{3} \]

where $A_{c-attn}$ denotes cross attention. Then, using a $1 \times 1$ convolution and a sigmoid gating function, the input for the next residual block is obtained:

\[ \mathcal{S}(X_{k-1}, Y) = \sigma(W_{p}(Z_{k-1})) \odot X_{k-1} + X_{k-1} \tag{4} \]

where $\sigma(\cdot)$ represents the sigmoid function and $W_{p}(\cdot)$ denotes the $1 \times 1$ point-wise convolution.
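Equations (3) and (4) can be sketched in plain Python on small matrices. In this illustration the feature map is flattened to one row per spatial position, so the $1 \times 1$ convolution $W_{p}$ becomes a per-position linear map. The single-head attention with projection matrices `w_q`, `w_k`, `w_v` is an assumption, since the text does not specify the internals of $A_{c-attn}$, and all names are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(x, y, w_q, w_k, w_v):
    """Eq. (3): queries come from the recovery features x, keys/values from the semantic features y."""
    q, k, v = matmul(x, w_q), matmul(y, w_k), matmul(y, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    attn = [softmax([s * scale for s in row]) for row in matmul(q, k_t)]
    return matmul(attn, v)


def semantic_fusion_block(x, y, w_q, w_k, w_v, w_p):
    """Eq. (4): sigmoid(W_p(Z)) gates x element-wise, followed by a residual connection."""
    z = cross_attention(x, y, w_q, w_k, w_v)
    gate_logits = matmul(z, w_p)  # 1x1 convolution applied as a per-position linear map
    return [[xv / (1.0 + math.exp(-g)) + xv for g, xv in zip(g_row, x_row)]
            for g_row, x_row in zip(gate_logits, x)]


def random_matrix(rows, cols, rng):
    """Small random matrix used only to exercise the block."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
x = random_matrix(4, 3, rng)  # 4 spatial positions, 3 channels (toy sizes)
y = random_matrix(2, 5, rng)  # 2 semantic tokens, 5 channels (toy sizes)
out = semantic_fusion_block(
    x, y,
    random_matrix(3, 3, rng), random_matrix(5, 3, rng),
    random_matrix(5, 3, rng), random_matrix(3, 3, rng),
)
print(len(out), len(out[0]))
```

Because the gate lies in $(0, 1)$, each output entry equals the corresponding entry of $X_{k-1}$ scaled by a factor between 1 and 2, so the block can amplify features but never suppresses them below the residual path.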

4.3 Semantic-guided generative adversarial network

SG-GAN consists of two branches: shadow to shadow-free (S2F) and shadow-free to shadow-free (F2F), each of which takes a real-world image and a shadow mask as input. The input shadow mask is a binary map where 0 represents non-shadow (black) regions and 1 represents shadow (white) regions.

S2F and F2F sub-branches. In S2F, a shadow mask image $M^{1}$ consisting entirely of zeros and a real shadow image $R_{s}$ are used as inputs to the generator $G_{s}$ to produce an image $\widetilde{R}^{1}_{s}$ without additional shadows:

\[ \widetilde{R}^{1}_{s} = G_{s}(R_{s}, M^{1}) \tag{5} \]

The generated shadow image $\widetilde{R}^{1}_{s}$ is then transformed into a shadow-free image $\widetilde{R}^{1}_{n}$ using the generator $G_{r}$. Subsequently, the discriminator $D_{r}$ is employed to distinguish whether $\widetilde{R}^{1}_{n}$ is a genuine shadow-free image:

\[ \widetilde{R}^{1}_{n} = G_{r}(\widetilde{R}^{1}_{s}), \quad D_{r}(\widetilde{R}^{1}_{n}) = \textit{real/fake?} \tag{6} \]

In F2F, the inputs are a real shadow-free image $R_{n}$ and a shadow mask $M^{2}$ that indicates the shadow region. We use the method described in [LYW21] to generate the shadow mask $M^{2}$. With these inputs, the generator $G_{s}$ produces a shadow image $\widetilde{R}^{2}_{s}$ intended to deceive the discriminator $D_{s}$, making it difficult for $D_{s}$ to discern whether it is a genuine shadow image:

\[ \widetilde{R}^{2}_{s} = G_{s}(R_{n}, M^{2}), \quad D_{s}(\widetilde{R}^{2}_{s}) = \textit{real/fake?} \tag{7} \]

The synthesized $\widetilde{R}^{2}_{s}$ is then fed to the generator $G_{r}$ for shadow removal to obtain a coarse shadow-free image $\widetilde{R}^{2}_{n}$.

In the above process, the input to the generator $G_{r}$ is always a synthesized shadow image generated by $G_{s}$. To improve the shadow removal performance of $G_{r}$, we add MSP to $G_{r}$ to reduce the impact of synthesized data in both S2F and F2F.
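The data flow of the two branches can be summarized in a short sketch. The generators are passed in as plain callables, and the toy stand-ins `toy_g_s` and `toy_g_r` (which darken masked pixels and brighten dark pixels on a flat list of intensities) are purely hypothetical placeholders for the trained networks $G_{s}$ and $G_{r}$; discriminators and losses are omitted.

```python
def s2f_branch(real_shadow, g_s, g_r):
    """S2F: R_s with an all-zero mask M^1 -> G_s -> R~_s^1 -> G_r -> R~_n^1."""
    zero_mask = [0.0] * len(real_shadow)    # M^1: no additional shadow requested
    shadow_1 = g_s(real_shadow, zero_mask)  # R~_s^1
    shadow_free_1 = g_r(shadow_1)           # R~_n^1, judged by D_r during training
    return shadow_1, shadow_free_1


def f2f_branch(real_shadow_free, mask, g_s, g_r):
    """F2F: R_n with a shadow mask M^2 -> G_s -> R~_s^2 -> G_r -> R~_n^2."""
    shadow_2 = g_s(real_shadow_free, mask)  # R~_s^2, judged by D_s during training
    shadow_free_2 = g_r(shadow_2)           # R~_n^2, the coarse shadow-free result
    return shadow_2, shadow_free_2


def toy_g_s(image, mask):
    """Hypothetical stand-in for G_s: halve the intensity where the mask is 1."""
    return [p * 0.5 if m >= 0.5 else p for p, m in zip(image, mask)]


def toy_g_r(image):
    """Hypothetical stand-in for G_r: double pixels that look dark."""
    return [min(1.0, p * 2.0) if p < 0.4 else p for p in image]


r_s = [0.2, 0.8, 0.3, 0.9]  # toy real shadow image
r_n = [0.6, 0.8, 0.7, 0.9]  # toy real shadow-free image
m_2 = [0.0, 1.0, 1.0, 0.0]  # toy shadow mask M^2

print(s2f_branch(r_s, toy_g_s, toy_g_r))
print(f2f_branch(r_n, m_2, toy_g_s, toy_g_r))
```

The point of the sketch is structural: in F2F every intermediate image is derived from the real shadow-free input $R_{n}$, so the branch yields samples that are pixel-aligned with a real clean image even though the training set itself is unpaired.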

The architectures of $G_{s}$ and $G_{r}$ are the same and mainly follow the generator proposed by Hu et al. [HJFH19]. They consist of three convolution layers with a stride of 2, followed by nine residual blocks for feature extraction, and finally three deconvolution layers that upsample the feature map. Instance normalization is applied after each convolution operation. For the discriminators $D_{r}$ and $D_{s}$, we adopt the PatchGAN architecture [IZZE17].

Loss function. In S2F, we use an identity loss to make the generated shadow image $\widetilde{R}^{1}_{s}$ close to the input shadow image $R_{s}$:

\[ \mathcal{L}_{\textit{identity}}(G_{s}) = \mathbb{E}_{R_{s} \sim p(R_{s})}\left[ \| G_{s}(R_{s}, M^{1}) - R_{s} \|_{1} \right] \tag{8} \]
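As a small numerical illustration of Eq. (8), the sketch below evaluates an L1 identity loss on flat lists of pixel intensities. Averaging over pixels rather than summing is an assumption (a common implementation choice that only rescales the loss), and the function name is hypothetical.

```python
def identity_loss(generated, target):
    """Mean absolute difference between G_s(R_s, M^1) and R_s (L1 norm averaged over pixels)."""
    if len(generated) != len(target):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(generated, target)) / len(generated)


r_s = [0.2, 0.8, 0.3, 0.9]           # toy real shadow image
perfect = list(r_s)                  # generator output that leaves R_s untouched
perturbed = [0.3, 0.8, 0.1, 0.9]     # generator output that alters two pixels

print(identity_loss(perfect, r_s))
print(identity_loss(perturbed, r_s))
```

The loss is zero only when the generator leaves the shadow image unchanged under the all-zero mask, which is the behavior the identity constraint encourages.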

For the generator $G_{r}$ and its discriminator $D_{r}$, the objective function is optimized as follows:

\[
\begin{aligned}
\mathcal{L}_{\textit{GAN}_{r}} &= \mathbb{E}_{R_{n} \sim p(R_{n})}\left[ \log(D_{r}(R_{n})) \right] \\
&\quad + \mathbb{E}_{R_{s} \sim p(R_{s})}\left[ \log(1 - D_{r}(G_{r}(\widetilde{R}^{1}_{s}))) \right]
\end{aligned}
\tag{9}
\]
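The value of the adversarial objective in Eq. (9) can be computed from discriminator outputs alone, as in the following sketch. It assumes that $D_{r}$ outputs probabilities in $(0, 1)$ and that the expectations are estimated by batch averages; the function name and the small clamp `eps` are hypothetical additions for numerical safety.

```python
import math


def gan_objective(d_real, d_fake, eps=1e-12):
    """Batch estimate of E[log D(real)] + E[log(1 - D(fake))].

    d_real: discriminator outputs on real shadow-free images R_n
    d_fake: discriminator outputs on generated images G_r(R~_s^1)
    """
    real_term = sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return real_term + fake_term


confused = gan_objective([0.5, 0.5], [0.5, 0.5])    # discriminator cannot tell the images apart
confident = gan_objective([0.9, 0.95], [0.1, 0.05])  # discriminator separates them well

print(confused, confident)
```

In the standard GAN formulation the discriminator tries to increase this value while the generator tries to decrease it, so a discriminator that separates real from generated images scores higher than one that outputs 0.5 everywhere.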

We incorporate the MSP into $G_{r}$, which introduces a clip loss to help $G_{r}$ better remove shadows. As shown in Figure 3, the clip loss between the output $\mathcal{S}(X, C_{image}(R_{s}))$ of the last SFB and the semantic features extracted from the input text $T$ by the CLIP text encoder $C_{text}$ is defined as follows:

\begin{equation}
\mathcal{L}_{\textit{clip}_{S2F}} = \frac{e^{\cos\left(\mathcal{S}(X, C_{image}(R_{s})),\, C_{text}(T)\right)/\tau}}{e^{\cos\left(\mathcal{S}(X, C_{image}(R_{s})),\, C_{text}(T)\right)/\tau} + e^{\frac{1}{\tau}}}
\tag{10}
\end{equation}

where $X$ is the output of the penultimate residual block in $G_{r}$, and $\tau$ denotes the temperature parameter, which we set to 0.5 in our experiments. In F2F, the adversarial loss for the generator $G_{s}$ and the discriminator $D_{s}$ is formulated as:

\begin{equation}
\begin{aligned}
\mathcal{L}_{\text{GAN}_{s}} &= \mathbb{E}_{R_{s}\sim p(R_{s})}\left[\log(D_{s}(R_{s}))\right] \\
&\quad + \mathbb{E}_{R_{n}\sim p(R_{n})}\left[\log(1-D_{s}(G_{s}(R_{n}, M^{2})))\right]
\end{aligned}
\tag{11}
\end{equation}

MSP is also used in F2F, so we define a loss similar to $\mathcal{L}_{\textit{clip}_{S2F}}$ that makes the shadow removal result close to the input text $T$:

\begin{equation}
\mathcal{L}_{\textit{clip}_{F2F}} = \frac{e^{\cos\left(\mathcal{S}(X, C_{image}(R_{s})),\, C_{text}(T)\right)/\tau}}{e^{\cos\left(\mathcal{S}(X, C_{image}(R_{s})),\, C_{text}(T)\right)/\tau} + e^{\frac{1}{\tau}}}
\tag{12}
\end{equation}
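As a rough illustration of Eqs. 10 and 12, the sketch below evaluates the ratio exactly as printed for two plain feature vectors. The function names `cosine` and `clip_loss` are hypothetical, the vectors stand in for the fused feature and the text embedding, and how the value is signed or log-transformed inside the actual training objective is not specified here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_loss(fused_feat, text_feat, tau=0.5):
    """Evaluate the ratio of Eqs. (10)/(12) as written.

    fused_feat stands in for S(X, C_image(R_s)) and text_feat for C_text(T);
    both are assumed to be flat lists of floats. tau is the temperature.
    """
    pos = math.exp(cosine(fused_feat, text_feat) / tau)
    # The second denominator term uses the maximum possible cosine (1).
    return pos / (pos + math.exp(1.0 / tau))
```

With $\tau = 0.5$ the value ranges from $1/(1+e^{4})$ for opposite vectors up to $0.5$ for perfectly aligned ones.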

To ensure that $\widetilde{R}^{2}_{n}$ is similar to the original real shadow-free input image $R_{n}$, we use a cycle-consistency loss to optimize the mapping functions in $G_{s}$ and $G_{r}$:

\begin{equation}
\mathcal{L}_{\textit{cycle}}(G_{s}, G_{r}) = \mathbb{E}_{R_{n}\sim p(R_{n})}\left[\left\|G_{r}(G_{s}(R_{n}, M^{2})) - R_{n}\right\|_{1}\right]
\tag{13}
\end{equation}

To further emphasize that the shadow region in $\widetilde{R}^{2}_{n}$, guided by the shadow mask, matches the content of the input image $R_{n}$, we apply a shadow loss as follows:

\begin{equation}
\mathcal{L}_{\textit{shadow}}(G_{s}, G_{r}) = \mathbb{E}_{R_{n}\sim p(R_{n})}\left[\left\|M \odot \left(G_{r}(G_{s}(R_{n}, M^{2})) - R_{n}\right)\right\|_{1}\right]
\tag{14}
\end{equation}
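The two reconstruction terms in Eqs. 13 and 14 differ only by the mask. A minimal per-sample sketch is given below, assuming images are flattened to lists of floats and the mask is binary; the helper names are hypothetical and the expectation over $R_{n}$ is left to the caller.

```python
def l1_norm(values):
    """Sum of absolute values (the L1 norm of a flattened image)."""
    return sum(abs(v) for v in values)

def cycle_loss(reconstructed, original):
    """Eq. (13) for one sample: ||G_r(G_s(R_n, M^2)) - R_n||_1."""
    return l1_norm([r - o for r, o in zip(reconstructed, original)])

def shadow_loss(reconstructed, original, mask):
    """Eq. (14) for one sample: the same difference, kept only where mask == 1."""
    return l1_norm([m * (r - o) for r, o, m in zip(reconstructed, original, mask)])
```

Because the mask only zeroes out entries, the shadow term never exceeds the cycle term for the same pair of images.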

In summary, the total loss in SG-GAN is defined as:

\begin{equation}
\begin{aligned}
\mathcal{L} ={} & \lambda_{1}\mathcal{L}_{\text{identity}} + \lambda_{2}\left(\mathcal{L}_{\text{GAN}_{s}} + \mathcal{L}_{\text{GAN}_{r}}\right) \\
& + \lambda_{3}\left(\mathcal{L}_{\textit{clip}_{S2F}} + \mathcal{L}_{\textit{clip}_{F2F}}\right) + \lambda_{4}\left(\mathcal{L}_{\text{cycle}} + \mathcal{L}_{\text{shadow}}\right)
\end{aligned}
\tag{15}
\end{equation}

where $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, and $\lambda_{4}$ are weights balancing the different loss terms. In our experiments, we set $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, and $\lambda_{4}$ to 5, 1, 0.2, and 10, respectively.
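To make the relative weighting concrete, the following sketch combines already-computed scalar loss values according to Eq. 15 with the weights reported above. The dictionary keys and the function name are illustrative only.

```python
# Weights lambda_1..lambda_4 as reported in the text.
WEIGHTS = {"identity": 5.0, "gan": 1.0, "clip": 0.2, "cycle_shadow": 10.0}

def total_loss(identity, gan_s, gan_r, clip_s2f, clip_f2f, cycle, shadow, w=WEIGHTS):
    """Weighted sum of the SG-GAN loss terms following Eq. (15)."""
    return (w["identity"] * identity
            + w["gan"] * (gan_s + gan_r)
            + w["clip"] * (clip_s2f + clip_f2f)
            + w["cycle_shadow"] * (cycle + shadow))
```

With these weights, the reconstruction terms (cycle and shadow) are weighted 50 times more heavily than the CLIP terms.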

4.4 Diffusion-based restoration module

Figure 5: Visual comparison results on five challenging real-world samples from the ISTD dataset.

At this stage, we employ IR-SDE [QTH17b] as the diffusion framework. It allows us to better understand and control the image generation process by simulating the image degradation process. The key idea is to combine a mean-reverting SDE with a maximum likelihood objective for neural network training. The forward SDE naturally transforms a high-quality image into its degraded low-quality counterpart, regardless of the complexity of the degradation.

Diffusion framework According to [QTH17b], the forward process of a mean-reverting SDE is defined as:

\begin{equation}
dx_{t} = \theta_{t}(\mu - x_{t})\,dt + \sigma_{t}\,dw,
\tag{16}
\end{equation}

where $\theta_{t}$ and $\sigma_{t}$ are time-dependent positive parameters that characterize the speed of mean reversion and the stochastic volatility, respectively. Here, $\mu$ is the state mean, and $w$ denotes Brownian motion.
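The behaviour of Eq. 16 can be checked numerically with an Euler-Maruyama discretisation. The sketch below is a scalar toy version, assuming constant $\theta$ and $\sigma$ (the actual schedules are time-dependent) and a hypothetical function name; it is not the sampler used in IR-SDE.

```python
import math
import random

def simulate_forward(x0, mu, theta, sigma, t_end=1.0, n_steps=1000, rng=None):
    """Euler-Maruyama integration of dx = theta * (mu - x) dt + sigma dw.

    theta and sigma are held constant here for simplicity (an assumption).
    Returns the state at time t_end.
    """
    rng = rng or random.Random(0)
    dt = t_end / n_steps
    x = x0
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment over dt
        x += theta * (mu - x) * dt + sigma * dw
    return x
```

With `sigma=0` the state decays deterministically towards `mu`, approximately as $\mu + (x_{0}-\mu)e^{-\theta t}$, which is the mean-reverting behaviour the forward process relies on.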

In our model, a coarse shadow removal result $\widetilde{R}^{2}_{n}$ is obtained from the second branch of SG-GAN. We use $\widetilde{R}^{2}_{n}$ and the corresponding clean input image $R_{n}$ to construct paired data (LQ and GT) for our DBRM. Given a pair of images, we set $R_{n}$ as the initial state $x_{0}$ and $\widetilde{R}^{2}_{n}$ as $\mu$, with a fixed noise level $\lambda$.

By ensuring $\sigma_{t}^{2}/\theta_{t} = 2\lambda^{2}$ for all $t$, the forward process solution is:

\begin{equation}
x(t) = \widetilde{R}^{2}_{n} + \left(x(s) - \widetilde{R}^{2}_{n}\right)e^{-\bar{\theta}_{s:t}} + \int_{s}^{t}\sigma_{z}\,e^{-\bar{\theta}_{z:t}}\,dw(z),
\tag{17}
\end{equation}

where $\bar{\theta}_{s:t} = \int_{s}^{t}\theta_{z}\,dz$. The transition kernel is:

\begin{equation}
p(x(t) \mid x(s)) = \mathcal{N}\left(x(t) \mid m_{s:t}(x(s)),\, v_{s:t}\right),
\tag{18}
\end{equation}

where $m_{s:t}$ is the mean and $v_{s:t}$ is the variance. The forward SDE iteratively transforms the GT image $R_{n}$ into the LQ image $\widetilde{R}^{2}_{n}$ with added noise:

\begin{equation}
p(x_{t} \mid x_{t-1}) = \mathcal{N}\left(x_{t} \mid m_{t-1}(x_{t-1}),\, v_{t-1}\right).
\tag{19}
\end{equation}

A notable property of this process is that the noisy data $x_{t}$ can be sampled from $x_{0}$ in closed form:

\begin{equation}
p_{t}(x) = \mathcal{N}\left(x(t) \mid m_{t}(x),\, v_{t}\right)
\tag{20}
\end{equation}

with $m_{t} = \widetilde{R}^{2}_{n} + \left(x(0) - \widetilde{R}^{2}_{n}\right)e^{-\bar{\theta}_{t}}$ and $v_{t} = \lambda^{2}\left(1 - e^{-2\bar{\theta}_{t}}\right)$, where $m_{t}$, $v_{t}$, and $\bar{\theta}_{t}$ are shorthand for $m_{0:t}$, $v_{0:t}$, and $\bar{\theta}_{0:t}$, respectively.
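The closed form in Eq. 20 means that a noisy state at any time can be drawn in a single step, without simulating the SDE. The following scalar sketch assumes the integrated rate $\bar{\theta}_{t}$ is supplied by the caller and treats one pixel at a time; the function names are hypothetical.

```python
import math
import random

def marginal_mean_var(x0, mu, theta_bar_t, lam):
    """Return (m_t, v_t) of Eq. (20) for one pixel.

    x0 is the GT value (R_n), mu the LQ value (coarse result),
    theta_bar_t the integral of theta from 0 to t, lam the noise level.
    """
    decay = math.exp(-theta_bar_t)
    m_t = mu + (x0 - mu) * decay
    v_t = lam ** 2 * (1.0 - math.exp(-2.0 * theta_bar_t))
    return m_t, v_t

def sample_xt(x0, mu, theta_bar_t, lam, rng):
    """Draw x_t = m_t + sqrt(v_t) * eps with eps ~ N(0, 1)."""
    m_t, v_t = marginal_mean_var(x0, mu, theta_bar_t, lam)
    return m_t + math.sqrt(v_t) * rng.gauss(0.0, 1.0)
```

At $\bar{\theta}_{t}=0$ the sample equals the GT value exactly, while for large $\bar{\theta}_{t}$ it approaches the LQ value plus Gaussian noise of standard deviation $\lambda$.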

The reverse-time representation from [And82] is:

\begin{equation}
dx = \left[\theta_{t}\left(\widetilde{R}^{2}_{n} - x\right) - \sigma_{t}^{2}\nabla_{x}\log p_{t}(x)\right]dt + \sigma_{t}\,dw,
\tag{21}
\end{equation}

where $\nabla_{x}\log p_{t}(x)$ is the score of the marginal distribution at time $t$. Given the GT image $R_{n}$, we compute the score function as:

\begin{equation}
\nabla_{x}\log p_{t}(x) = -\frac{x(t) - m_{t}}{v_{t}}.
\tag{22}
\end{equation}

We then train a conditional time-dependent neural network $\tilde{\epsilon}_{\phi}(x_{t}, \widetilde{R}^{2}_{n}, t)$ to estimate the noise. We sample $x_{t}$ according to $x_{t} = m_{t}(x) + \sqrt{v_{t}}\,\epsilon_{t}$, where $\epsilon_{t}\sim\mathcal{N}(0, I)$, so that the score can be computed directly from the noise:

\begin{equation}
\nabla_{x}\log p_{t}(x) = -\frac{\epsilon_{t}}{\sqrt{v_{t}}}.
\tag{23}
\end{equation}
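The step from Eq. 22 to Eq. 23 follows by substituting the reparameterised sample $x_{t} = m_{t} + \sqrt{v_{t}}\,\epsilon_{t}$ into the score. Written out with only the symbols defined above:

```latex
\begin{aligned}
\nabla_{x}\log p_{t}(x)
  &= -\frac{x_{t}-m_{t}}{v_{t}}
   = -\frac{\bigl(m_{t}+\sqrt{v_{t}}\,\epsilon_{t}\bigr)-m_{t}}{v_{t}} \\
  &= -\frac{\sqrt{v_{t}}\,\epsilon_{t}}{v_{t}}
   = -\frac{\epsilon_{t}}{\sqrt{v_{t}}}.
\end{aligned}
```

This is why a network that predicts the noise $\epsilon_{t}$ is sufficient to recover the score needed by the reverse-time SDE.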

Figure 6: Visual comparison results on five challenging real-world samples from the AISTD dataset.

Network architecture As shown in the DBRM architecture in Figure 3, our noise prediction network is based on the nonlinear activation free block (NAF block) [CCZS22]. The NAF block replaces nonlinear activations (such as ReLU and GELU) with a SimpleGate unit. Given an input, SimpleGate splits it into two features along the channel dimension and then uses a linear gate to compute the output. The SimpleGate unit is added after the depth-wise convolution and between the two fully connected layers. In addition, we add a multi-layer perceptron to each NAF block to process the time embedding.
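The SimpleGate operation described above can be sketched in a few lines. This toy version treats each channel as a single float and takes the linear gate to be the element-wise product of the two halves, as in the NAF block of [CCZS22]; a real implementation operates on feature maps, and the function name is illustrative.

```python
def simple_gate(features):
    """Split the channel list in half and multiply the halves element-wise."""
    if len(features) % 2 != 0:
        raise ValueError("SimpleGate needs an even number of channels")
    half = len(features) // 2
    first, second = features[:half], features[half:]
    return [a * b for a, b in zip(first, second)]
```

The output has half as many channels as the input, so the surrounding layers must account for this reduction.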

Loss function An alternative maximum likelihood objective aims to find the optimal trajectory $x_{1:T}$ given the high-quality image $x_{0}$, which stabilizes training and recovers more accurate images. Following IR-SDE [QTH17b], we train our prediction network with a maximum likelihood loss that specifies the optimal reverse path $x^{*}_{t-1}$ for all times:

\begin{equation}
\begin{aligned}
x^{*}_{t-1} ={} & \frac{1-e^{-2\bar{\theta}_{t-1}}}{1-e^{-2\bar{\theta}_{t}}}\,e^{-\theta^{\prime}_{t}}\left(x_{t}-\widetilde{R}^{2}_{n}\right) \\
& + \frac{1-e^{-2\theta^{\prime}_{t}}}{1-e^{-2\bar{\theta}_{t}}}\,e^{-\bar{\theta}_{t-1}}\left(R_{n}-\widetilde{R}^{2}_{n}\right) + \widetilde{R}^{2}_{n}
\end{aligned}
\tag{24}
\end{equation}

where $\theta^{\prime}_{i} = \int_{i-1}^{i}\theta_{t}\,dt$. Then, we define the maximum likelihood loss:

\begin{equation}
\mathcal{L}_{\textit{diff}} = \sum_{t=1}^{T}\gamma_{t}\,\mathbb{E}\left[\left\|x_{t} - (dx_{t})_{\tilde{\epsilon}_{\phi}} - x^{*}_{t-1}\right\|\right]
\tag{25}
\end{equation}

where $\gamma_{1}, \dots, \gamma_{T}$ are positive weights and $\{x_{t}\}_{t=0}^{T}$ denotes the discretization of the diffusion process. $(dx_{t})_{\tilde{\epsilon}_{\phi}}$ denotes the reverse-time SDE in Eq. 21 with its score predicted by the noise network $\tilde{\epsilon}_{\phi}$, so that $x_{t} - (dx_{t})_{\tilde{\epsilon}_{\phi}}$ is the reversed state $x_{t-1}$. Once trained, the network $\tilde{\epsilon}_{\phi}$ can be used to generate high-quality images by sampling a noisy state $x_{T}$ and iteratively solving Eq. 21 with a numerical scheme.
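The regression target in Eq. 24 is straightforward to evaluate once the integrated rates are known. The scalar sketch below assumes $\bar{\theta}_{t-1}$ and $\bar{\theta}_{t}$ are given (with $\bar{\theta}_{t} > 0$) and uses $\theta^{\prime}_{t} = \bar{\theta}_{t} - \bar{\theta}_{t-1}$, which follows from the integral definitions above; the function name is hypothetical.

```python
import math

def optimal_reverse_state(x_t, gt, lq, theta_bar_prev, theta_bar_t):
    """Evaluate x*_{t-1} from Eq. (24) for one pixel.

    gt is R_n, lq is the coarse result, theta_bar_prev / theta_bar_t are the
    integrals of theta over [0, t-1] and [0, t]; theta_bar_t must be positive.
    """
    theta_step = theta_bar_t - theta_bar_prev  # theta'_t, integral over [t-1, t]
    denom = 1.0 - math.exp(-2.0 * theta_bar_t)
    a = (1.0 - math.exp(-2.0 * theta_bar_prev)) / denom * math.exp(-theta_step)
    b = (1.0 - math.exp(-2.0 * theta_step)) / denom * math.exp(-theta_bar_prev)
    return a * (x_t - lq) + b * (gt - lq) + lq
```

Two sanity checks follow from the formula: at the first step ($\bar{\theta}_{t-1}=0$) the target reduces to the GT value, and if $x_{t}$ equals the noise-free mean $m_{t}$ the target equals $m_{t-1}$.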

5 Experiment

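\begin{tabular}{llcccccccccc}
\hline
 & & \multicolumn{3}{c}{Shadow Region} & \multicolumn{3}{c}{Non-shadow Region} & \multicolumn{4}{c}{All Image} \\
Scheme & Methods & RMSE$\downarrow$ & PSNR$\uparrow$ & SSIM$\uparrow$ & RMSE$\downarrow$ & PSNR$\uparrow$ & SSIM$\uparrow$ & RMSE$\downarrow$ & PSNR$\uparrow$ & SSIM$\uparrow$ & LPIPS$\downarrow$ \\
\hline
Supervised & DHAN & 7.53 & 35.82 & 0.989 & 5.33 & 30.95 & 0.971 & 5.68 & 29.09 & 0.953 & 0.027 \\
 & Sp+M-Net & 7.13 & 35.08 & 0.984 & 3.16 & 36.38 & 0.979 & 3.92 & 31.89 & 0.953 & 0.079 \\
 & AEFNet & 7.91 & 34.71 & 0.975 & 5.51 & 28.61 & 0.880 & 5.88 & 27.19 & 0.945 & 0.045 \\
 & ShadowDiffusion & 4.10 & 40.06 & 0.996 & 4.18 & 33.00 & 0.973 & 4.16 & 32.20 & 0.967 & - \\
\hline
Unsupervised & Mask-ShadowGAN & 13.00 & 30.53 & 0.977 & 6.07 & 28.86 & 0.960 & 6.96 & 25.72 & 0.925 & 0.064 \\
 & DC-ShadowNet & 11.89 & 31.27 & 0.969 & 7.84 & 27.20 & 0.910 & 7.03 & 25.51 & 0.865 & 0.104 \\
 & LG-ShadowNet & 10.72 & 31.53 & 0.979 & 6.30 & 27.67 & 0.967 & 6.08 & 26.62 & 0.936 & 0.056 \\
 & S3R-Net & 12.16 & - & - & 6.38 & - & - & 7.12 & - & - & - \\
 & Ours & 10.53 & 32.68 & 0.966 & 5.54 & 30.96 & 0.970 & 6.29 & 27.94 & 0.930 & 0.038 \\
\hline
\multicolumn{2}{l}{Shadow image} & 32.10 & 22.40 & 0.936 & 7.09 & 27.32 & 0.976 & 10.89 & 20.56 & 0.893 & 0.116 \\
\hline
\end{tabular}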


Table 1: Quantitative comparison of our method with state-of-the-art methods on the ISTD dataset. The best and second-best results among the supervised and among the unsupervised methods are highlighted in bold and underlined, respectively. ‘-’ denotes that the result is not publicly available.
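\begin{tabular}{llcccccccccc}
\hline
 & & \multicolumn{3}{c}{Shadow Region} & \multicolumn{3}{c}{Non-shadow Region} & \multicolumn{4}{c}{All} \\
Scheme & Methods & RMSE$\downarrow$ & PSNR$\uparrow$ & SSIM$\uparrow$ & RMSE$\downarrow$ & PSNR$\uparrow$ & SSIM$\uparrow$ & RMSE$\downarrow$ & PSNR$\uparrow$ & SSIM$\uparrow$ & LPIPS$\downarrow$ \\
\hline
Supervised & DHAN & 9.57 & 32.92 & 0.987 & 7.41 & 27.15 & 0.972 & 7.77 & 25.66 & 0.954 & 0.026 \\
 & Sp+M-Net & 5.91 & 37.60 & 0.990 & 2.99 & 36.02 & 0.976 & 3.46 & 32.94 & 0.962 & 0.085 \\
 & AEFNet & 6.55 & 36.04 & 0.978 & 3.77 & 31.16 & 0.892 & 4.22 & 29.45 & 0.861 & 0.046 \\
 & ShadowDiffusion & 4.60 & 40.13 & 0.997 & 2.74 & 36.48 & 0.979 & 2.91 & 35.66 & 0.974 & - \\
\hline
Unsupervised & Mask-ShadowGAN & 11.28 & 31.50 & 0.981 & 3.90 & 32.63 & 0.967 & 4.97 & 28.11 & 0.936 & 0.063 \\
 & DC-ShadowNet & 10.81 & 32.15 & 0.978 & 3.46 & 35.50 & 0.974 & 4.61 & 29.09 & 0.940 & 0.051 \\
 & LG-ShadowNet & 9.90 & 32.42 & 0.982 & 3.18 & 34.01 & 0.976 & 4.25 & 29.31 & 0.947 & 0.049 \\
 & S3R-Net & 12.86 & - & - & 4.43 & - & - & 5.71 & - & - & - \\
 & Ours & 9.48 & 33.63 & 0.968 & 3.09 & 36.01 & 0.978 & 4.06 & 31.09 & 0.940 & 0.034 \\
\hline
\multicolumn{2}{l}{Shadow image} & 36.95 & 20.83 & 0.927 & 2.42 & 37.46 & 0.985 & 8.40 & 20.46 & 0.894 & 0.117 \\
\hline
\end{tabular}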
Table 2: Quantitative comparison of our method with state-of-the-art methods on the AISTD dataset.

5.1 Datasets and evaluation metrics

Dataset We use two widely adopted shadow removal datasets: ISTD [WLY18] and AISTD [LS19]. ISTD comprises 1870 triplets of shadow images, shadow-free images, and shadow masks, with 1330 triplets for training and 540 triplets for testing. AISTD is the adjusted version of ISTD, which further corrects the color inconsistency of the ISTD images.

Evaluation metrics In our experiments, following [LS21], we calculate the root mean square error (RMSE) in the LAB color space and employ the structural similarity (SSIM), peak signal-to-noise ratio (PSNR), and learned perceptual image patch similarity (LPIPS) as evaluation metrics for comparison. Higher PSNR and SSIM values and lower RMSE and LPIPS values indicate better performance. We report the metrics measured on the shadow region, the non-shadow region, and the whole image.
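For reference, region-restricted RMSE and PSNR can be sketched as below. This is a simplified illustration on flattened single-channel images with a binary mask and an assumed peak value of 255; it does not perform the LAB conversion, and published evaluation scripts may use different conventions for region-wise metrics.

```python
import math

def _masked_sq_errors(pred, target, mask):
    """Squared errors at pixels where mask is non-zero (all pixels if mask is None)."""
    if mask is None:
        mask = [1] * len(pred)
    return [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]

def rmse(pred, target, mask=None):
    """Root mean square error over the selected region."""
    errs = _masked_sq_errors(pred, target, mask)
    return math.sqrt(sum(errs) / len(errs))

def psnr(pred, target, mask=None, peak=255.0):
    """Peak signal-to-noise ratio in dB over the selected region."""
    errs = _masked_sq_errors(pred, target, mask)
    mse = sum(errs) / len(errs)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)
```

Passing the shadow mask, its complement, or `None` gives the shadow-region, non-shadow-region, and whole-image variants, respectively.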

5.2 Experimental settings

We implement our method in PyTorch [PGM19] on a single NVIDIA GeForce RTX 3090 GPU. In the first stage, we initialise SG-GAN from a Gaussian distribution with a mean of 0 and a standard deviation of 0.02. We train the network with the Adam optimiser, with the first and second momentum parameters set to 0.5 and 0.999, respectively. We train the whole model for 200 epochs: the base learning rate is set to $2\times10^{-4}$ for the first 100 epochs and is then linearly decayed to 0 over the remaining epochs. Additionally, the horizontal flipping and random cropping strategy proposed in [LYW21] is applied to the training data for augmentation. Network training involves both branches, which influence each other. In the second stage, to train the diffusion model, we fix the noise level at 50 and set the number of diffusion denoising steps to 100. The batch size is set to 8, and the training patches are $256\times256$ pixels. We use the Lion optimizer [CLH24] with $\beta_{1} = 0.9$ and $\beta_{2} = 0.99$. The initial learning rate is set to $3\times10^{-5}$ and decayed to $1\times10^{-7}$ by a cosine scheduler. We train DBRM for 400,000 iterations, which takes about 4 days on the GPU.
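The two learning-rate schedules described above can be written as plain functions of the epoch or iteration index. This is a sketch under assumptions: the exact epoch at which the linear decay reaches zero and the absence of warm-up or restarts in the cosine schedule are not stated in the text, and the function names are hypothetical.

```python
import math

def sg_gan_lr(epoch, base_lr=2e-4, constant_epochs=100, total_epochs=200):
    """Constant for the first 100 epochs, then linear decay to 0 at epoch 200."""
    if epoch < constant_epochs:
        return base_lr
    remaining = max(total_epochs - epoch, 0)
    return base_lr * remaining / (total_epochs - constant_epochs)

def dbrm_lr(step, lr_max=3e-5, lr_min=1e-7, total_steps=400_000):
    """Single-cycle cosine decay from lr_max to lr_min over total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

For example, `sg_gan_lr(150)` gives half of the base rate, and `dbrm_lr` reaches the midpoint between its two bounds after 200,000 iterations.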

5.3 Comparison with state-of-the-art methods

In this subsection, we compare our full model on the ISTD and AISTD datasets with several state-of-the-art methods. These include supervised methods, which are trained with paired shadow and shadow-free images: DHAN [CPS20], SP+M-Net [LS19], AEFNet [FZG21] and ShadowDiffusion [GWY23a]; and unsupervised methods, which are trained without paired shadow and shadow-free images: G2R-ShadowNet [LYW21], Mask-ShadowGAN [HJFH19], DC-ShadowNet [JST21], LG-ShadowNet [LYM21] and S3R-Net [KMP24]. All shadow removal results of the competing methods are quoted from the original papers or reproduced using their official implementations.

Figure 7: Visual comparison for the ablation study on each loss term in SG-GAN.

Table 1 shows the quantitative results on the ISTD dataset. The supervised methods share the same type of training data, namely shadow and shadow-free image pairs, from which they learn the mapping from shadow images to shadow-free images. However, their performance may degrade considerably on unseen scenes. Our method achieves results in the non-shadow region and on the whole image that are comparable to those of deep neural networks trained on paired images, and on some metrics it even surpasses some supervised methods. For instance, the LPIPS of our results on the whole image is better than that of SP+M-Net [LS19] and AEFNet [FZG21]. Moreover, our method significantly outperforms several unsupervised methods, namely Mask-ShadowGAN [HJFH19], LG-ShadowNet [LYM21], and DC-ShadowNet [JST21]. Compared with LG-ShadowNet [LYM21], the second-best of these methods, our method is better on most metrics: in particular, PSNR improves by about 2.1 dB in the non-shadow region and 2.2 dB on the whole image, although one SSIM value and one RMSE value are slightly worse. Additionally, our LPIPS and PSNR are clearly better than those of all the unsupervised methods in the table. When compared with the state-of-the-art unsupervised method S3R [KMP24], our method performs better on RMSE. Note that the code of S3R [KMP24] is not publicly available, so we can only report their results on the dataset used in their published paper. Qualitative results on four challenging samples drawn from the testing set of ISTD are shown in Figure 5. Compared with other methods, our method produces more realistic results with fewer artifacts and better preserves the texture details occluded by shadows. Moreover, the colour in the shadow region is more consistent with the surrounding area when using our method.
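To give a sense of scale for the PSNR gains quoted above, the standard definition of PSNR relates a difference in dB to a ratio of mean squared errors. The short derivation below is a worked illustration only: it holds exactly for a single image, whereas the reported scores are averages over the test set, so the resulting ratios should be read as approximate. The subscripts "ours" and "ref" are labels introduced here for the two methods being compared.

```latex
\begin{align}
\mathrm{PSNR} &= 10\log_{10}\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}, \\
\Delta\mathrm{PSNR} &= \mathrm{PSNR}_{\text{ours}} - \mathrm{PSNR}_{\text{ref}}
  = 10\log_{10}\frac{\mathrm{MSE}_{\text{ref}}}{\mathrm{MSE}_{\text{ours}}}, \\
\Delta\mathrm{PSNR} = 2.1\,\text{dB} &\;\Rightarrow\;
  \frac{\mathrm{MSE}_{\text{ref}}}{\mathrm{MSE}_{\text{ours}}} = 10^{0.21} \approx 1.62, \\
\Delta\mathrm{PSNR} = 2.2\,\text{dB} &\;\Rightarrow\;
  \frac{\mathrm{MSE}_{\text{ref}}}{\mathrm{MSE}_{\text{ours}}} = 10^{0.22} \approx 1.66 .
\end{align}
```

Under this reading, a gain of roughly 2 dB corresponds to a mean squared error that is lower by about 38 to 40 percent.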

We also retrain and test our model on the AISTD dataset; the comparative results are shown in Table 2 and Figure 6. Among the unsupervised methods, our method again performs best on AISTD, followed by LG-ShadowNet [LYM21]. In Figure 6, the outputs generated by the competing methods exhibit sharp shadow edges, whereas our results have significantly smoother shadow boundaries. Additionally, our results appear noticeably sharper in the shadow region. We believe these comparisons show that our approach produces the most visually pleasing results.

\begin{tabular}{lcccc}
\hline
Methods & RMSE$\downarrow$ & PSNR$\uparrow$ & SSIM$\uparrow$ & LPIPS$\downarrow$ \\
\hline
w/o DBRM & 7.27 & 25.44 & 0.935 & 0.054 \\
w/o S2F & 10.36 & 21.45 & 0.892 & 0.122 \\
w/o MSP & 7.99 & 24.89 & 0.902 & 0.065 \\
Ours & 6.29 & 27.94 & 0.930 & 0.038 \\
\hline
\end{tabular}
Table 3: Ablation study on the individual components of our method on the ISTD testing set.
Figure 8: Visual comparison for the ablation study on each component of our method.

5.4 Ablation study

To validate the efficacy of each key component of our method, we train and evaluate several model variants on the ISTD dataset. First, since DBRM is designed to suppress artifacts in the shadow removal results, we validate it by removing it from the complete model and retaining only SG-GAN. Additionally, we train SG-GAN without the S2F and MSP components to evaluate their individual contributions. The quantitative results are reported in Table 3. We then conduct ablation studies on the proposed loss functions by training SG-GAN without specific loss terms to demonstrate the effectiveness of each one. The quantitative results are reported in Table 4.

\begin{tabular}{lcccc}
\hline
Methods & RMSE$\downarrow$ & PSNR$\uparrow$ & SSIM$\uparrow$ & LPIPS$\downarrow$ \\
\hline
w/o $\mathcal{L}_{GAN_{s}}$ & 8.75 & 24.34 & 0.838 & 0.119 \\
w/o $\mathcal{L}_{GAN_{r}}$ & 10.05 & 21.64 & 0.900 & 0.115 \\
w/o $\mathcal{L}_{\textit{cycle}}$ & 8.40 & 24.58 & 0.825 & 0.126 \\
w/o $\mathcal{L}_{\textit{shadow}}$ & 8.03 & 24.79 & 0.877 & 0.111 \\
w/o $\mathcal{L}_{\textit{identity}}$ & 13.54 & 19.39 & 0.845 & 0.182 \\
w/o $\mathcal{L}_{\textit{clip}}$ & 7.41 & 25.07 & 0.927 & 0.068 \\
SG-GAN & 7.27 & 25.44 & 0.935 & 0.054 \\
\hline
\end{tabular}
Table 4: Ablation study on the loss functions of our SG-GAN on the ISTD testing set.
\begin{tabular}{lcccc}
\hline
Methods & RMSE$\downarrow$ & PSNR$\uparrow$ & SSIM$\uparrow$ & LPIPS$\downarrow$ \\
\hline
Ours w/o MSP & 6.40 & 27.16 & 0.917 & 0.046 \\
Ours & 6.29 & 27.94 & 0.930 & 0.038 \\
Mask-ShadowGAN & 6.96 & 25.72 & 0.925 & 0.064 \\
Mask-ShadowGAN w/ MSP & 6.73 & 26.19 & 0.933 & 0.060 \\
G2R-ShadowNet & 7.33 & 25.72 & 0.941 & 0.047 \\
G2R-ShadowNet w/ MSP & 6.83 & 25.87 & 0.952 & 0.053 \\
\hline
\end{tabular}
Table 5: Ablation studies on the effectiveness of the MSP on the ISTD testing set.

As shown in Table 3, the performance of our method drops on all metrics except SSIM when DBRM is omitted. Comparing row 2 with row 4, we find that the S2F branch is crucial, providing substantial gains on all metrics. Excluding MSP also leads to a drop in performance. In Table 4, we find that $\mathcal{L}_{GAN_{s}}$ is important and brings a performance improvement. Rows 2 and 5 show a significant decline in SG-GAN’s performance without $\mathcal{L}_{GAN_{r}}$ and $\mathcal{L}_{\textit{identity}}$, especially the latter. When $\mathcal{L}_{\textit{clip}}$, $\mathcal{L}_{\textit{cycle}}$, or $\mathcal{L}_{\textit{shadow}}$ is removed from the total loss, the performance of SG-GAN decreases slightly. As shown in Figure 8 and Figure 7, the qualitative results are generally consistent with the quantitative results in demonstrating the effectiveness of each component. Compared with the model trained with all components, variants trained with only a subset of them may exhibit noticeable artifacts. For instance, artifacts in the shadow region are very prominent in the result of SG-GAN w/o $\mathcal{L}_{\textit{shadow}}$ in Figure 7 and in the result of w/o S2F in Figure 8.
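The relative importance of the loss terms discussed above can be read directly from Table 4 by ranking the variants by their PSNR drop with respect to the full SG-GAN. The snippet below is a small illustration of that arithmetic; the values are copied from Table 4, while the names `TABLE4`, `FULL` and `psnr_drops` are hypothetical helpers introduced only for this example.

```python
# Values copied from Table 4 (ISTD testing set): (RMSE, PSNR, SSIM, LPIPS).
TABLE4 = {
    "w/o L_GAN_s": (8.75, 24.34, 0.838, 0.119),
    "w/o L_GAN_r": (10.05, 21.64, 0.900, 0.115),
    "w/o L_cycle": (8.40, 24.58, 0.825, 0.126),
    "w/o L_shadow": (8.03, 24.79, 0.877, 0.111),
    "w/o L_identity": (13.54, 19.39, 0.845, 0.182),
    "w/o L_clip": (7.41, 25.07, 0.927, 0.068),
}
FULL = (7.27, 25.44, 0.935, 0.054)  # full SG-GAN row

def psnr_drops(table, full):
    """Return (variant, PSNR drop in dB) pairs, sorted from largest to smallest drop."""
    drops = {name: round(full[1] - row[1], 2) for name, row in table.items()}
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)

for name, drop in psnr_drops(TABLE4, FULL):
    print(f"{name}: -{drop:.2f} dB")
```

The ranking places the identity loss first, followed by $\mathcal{L}_{GAN_{r}}$, with the clip loss last, which matches the ordering described in the text.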

5.5 Effectiveness of general-purpose MSP

We propose a general-purpose MSP to mitigate the impact of synthetic images on the performance of the shadow removal network. We validate its effectiveness experimentally on the ISTD dataset, not only on our own model but also by integrating MSP into two unsupervised methods, G2R-ShadowNet [LYW21] and Mask-ShadowGAN [HJFH19], which require synthetic shadows for training the shadow removal network.

Quantitative results are shown in Table 5. The results show that our method and the other two methods improve to varying degrees after integrating MSP, particularly in the case of G2R-ShadowNet [LYW21]. It is noteworthy that both our method and Mask-ShadowGAN [HJFH19] are trained in RGB space, whereas G2R-ShadowNet [LYW21] is trained in LAB space. This outcome suggests that our MSP is applicable across different color spaces.

6 Conclusion

In summary, we propose a novel adversarial diffusion framework for self-supervised shadow removal. Our method consists of two stages, a coarse processing stage and a refined restoration stage, whose main networks are SG-GAN and DBRM, respectively. In SG-GAN, shadows are first generated and then removed, which constructs the paired training sets used for refinement in DBRM. Additionally, we design a general-purpose MSP module to mitigate the impact of synthetic data on network performance. The coarse results are refined by the powerful generative capabilities of DBRM, which restores texture details and suppresses edge artifacts in shadow regions. Extensive experiments demonstrate the effectiveness of our proposed method, which achieves competitive performance on the ISTD and AISTD datasets compared with other state-of-the-art techniques.

References

  • [AHO11] Arbel E., Hel-Or H.: Shadow removal using intensity surfaces and texture anchor points. IEEE Computer Society, 6 (2011).
  • [And82] Anderson B. D.: Reverse-time diffusion equation models. Stochastic Processes and their Applications 12, 3 (1982), 313–326.
  • [CCZS22] Chen L., Chu X., Zhang X., Sun J.: Simple baselines for image restoration. In European conference on computer vision (2022), Springer, pp. 17–33.
  • [CLH24] Chen X., Liang C., Huang D., Real E., Wang K., Pham H., Dong X., Luong T., Hsieh C.-J., Lu Y., et al.: Symbolic discovery of optimization algorithms. Advances in Neural Information Processing Systems 36 (2024).
  • [CPS20] Cun X., Pun C.-M., Shi C.: Towards ghost-free shadow removal via dual hierarchical aggregation network and shadow matting gan. In Proceedings of the AAAI Conference on Artificial Intelligence (2020), vol. 34, pp. 10680–10687.
  • [DBMH22] De Bortoli V., Mathieu E., Hutchinson M., Thornton J., Teh Y. W., Doucet A.: Riemannian score-based generative modelling. Advances in Neural Information Processing Systems 35 (2022), 2406–2422.
  • [FHLD06] Finlayson G. D., Hordley S. D., Lu C., Drew M. S.: On the removal of shadows from images. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 1 (2006), 59–68.
  • [FSZY24] Fu Z., Song K., Zhou L., Yang Y.: Noise-aware image captioning with progressively exploring mismatched words. In Proceedings of the AAAI Conference on Artificial Intelligence (2024), vol. 38, pp. 12091–12099.
  • [FY19] Fang L., Yu F.: Moving object detection algorithm based on removed ghost and shadow visual background extractor. Laser & Optoelectronics Progress 56, 13 (2019), 131002.
  • [FZG21] Fu L., Zhou C., Guo Q., Juefei-Xu F., Yu H., Feng W., Liu Y., Wang S.: Auto-exposure fusion for single-image shadow removal. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2021), pp. 10571–10580.
  • [GDH13] Guo R., Dai Q., Hoiem D.: Paired regions for shadow detection and removal. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 12 (2013), 2956–2967.
  • [GWY23a] Guo L., Wang C., Yang W., Huang S., Wang Y., Pfister H., Wen B.: Shadowdiffusion: When degradation prior meets diffusion model for shadow removal. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2023), IEEE, pp. 14049–14058.
  • [GWY23b] Guo L., Wang C., Yang W., Wang Y., Wen B.: Boundary-aware divide and conquer: A diffusion-based solution for unsupervised shadow removal. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2023), pp. 13045–13054.
  • [HJA20] Ho J., Jain A., Abbeel P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33 (2020), 6840–6851.
  • [HJFH19] Hu X., Jiang Y., Fu C. W., Heng P. A.: Mask-shadowgan: Learning to remove shadows from unpaired data. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019).
  • [IZZE17] Isola P., Zhu J.-Y., Zhou T., Efros A. A.: Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (2017), pp. 1125–1134.
  • [JST21] Jin Y., Sharma A., Tan R. T.: Dc-shadownet: Single-image hard and soft shadow removal using unsupervised domain-classifier guided network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2021), pp. 5027–5036.
  • [KMP24] Kubiak N., Mustafa A., Phillipson G., Jolly S., Hadfield S.: S3r-net: A single-stage approach to self-supervised shadow removal. arXiv preprint arXiv:2404.12103 (2024).
  • [KOY22] Kim B., Oh Y., Ye J. C.: Diffusion adversarial representation learning for self-supervised vessel segmentation. arXiv preprint arXiv:2209.14566 (2022).
  • [LDR22] Lugmayr A., Danelljan M., Romero A., Yu F., Timofte R., Van Gool L.: Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2022), pp. 11461–11471.
  • [LKX24] Liu Y., Ke Z., Xu K., Liu F., Wang Z., Lau R. W.: Recasting regional lighting for shadow removal. arXiv e-prints (2024), arXiv–2402.
  • [LLK23] Lu S., Liu Y., Kong A. W.-K.: Tf-icon: Diffusion-based training-free cross-domain image composition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2023), pp. 2294–2305.
  • [LRJ23] Li X., Ren Y., Jin X., Lan C., Wang X., Zeng W., Wang X., Chen Z.: Diffusion models for image restoration and enhancement–a comprehensive survey. arXiv preprint arXiv:2308.09388 (2023).
  • [LS18] Li Z., Snavely N.: Learning intrinsic image decomposition from watching the world. In Proceedings of the IEEE conference on computer vision and pattern recognition (2018), pp. 9039–9048.
  • [LS19] Le H., Samaras D.: Shadow removal via shadow image decomposition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2019), pp. 8578–8587.
  • [LS21] Le H., Samaras D.: Physics-based shadow image decomposition for shadow removal. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 12 (2021), 9088–9101.
  • [LWL24] Lu S., Wang Z., Li L., Liu Y., Kong A. W.-K.: Mace: Mass concept erasure in diffusion models. arXiv preprint arXiv:2403.06135 (2024).
  • [LYM21] Liu Z., Yin H., Mi Y., Pu M., Wang S.: Shadow removal by a lightness-guided network with training on unpaired data. IEEE Transactions on Image Processing 30 (2021), 1853–1865.
  • [LYW21] Liu Z., Yin H., Wu X., Wu Z., Mi Y., Wang S.: From shadow generation to shadow removal. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2021), pp. 4927–4936.
  • [LZB22] Lu C., Zhou Y., Bao F., Chen J., Li C., Zhu J.: Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems 35 (2022), 5775–5787.
  • [MFL24] Mei K., Figueroa L., Lin Z., Ding Z., Cohen S., Patel V. M.: Latent feature-guided diffusion models for shadow removal. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2024), pp. 4313–4322.
  • [PGM19] Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., Lin Z., Gimelshein N., Antiga L., et al.: Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems 32 (2019).
  • [QTH17a] Qu L., Tian J., He S., Tang Y., Lau R. W. H.: Deshadownet: A multi-context embedding deep network for shadow removal. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
  • [QTH17b] Qu L., Tian J., He S., Tang Y., Lau R. W. H.: Deshadownet: A multi-context embedding deep network for shadow removal. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
  • [RBL22] Rombach R., Blattmann A., Lorenz D., Esser P., Ommer B.: High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2022), pp. 10684–10695.
  • [RKH21] Radford A., Kim J. W., Hallacy C., Ramesh A., Goh G., Agarwal S., Sastry G., Askell A., Mishkin P., Clark J., et al.: Learning transferable visual models from natural language supervision. In International conference on machine learning (2021), PMLR, pp. 8748–8763.
  • [SCC22] Saharia C., Chan W., Chang H., Lee C., Ho J., Salimans T., Fleet D., Norouzi M.: Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 conference proceedings (2022), pp. 1–10.
  • [SDWMG15] Sohl-Dickstein J., Weiss E., Maheswaranathan N., Ganguli S.: Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning (2015), PMLR, pp. 2256–2265.
  • [SHC22] Saharia C., Ho J., Chan W., Salimans T., Fleet D. J., Norouzi M.: Image super-resolution via iterative refinement. IEEE transactions on pattern analysis and machine intelligence 45, 4 (2022), 4713–4726.
  • [SSDK20] Song Y., Sohl-Dickstein J., Kingma D. P., Kumar A., Ermon S., Poole B.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020).
  • [WLY18] Wang J., Li X., Yang J.: Stacked conditional generative adversarial networks for jointly learning shadow detection and shadow removal. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018).
  • [XHL23] Xu K., Hancke G. P., Lau R. W.: Learning image harmonization in the linear color space. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2023), pp. 12570–12579.
  • [XKV21] Xiao Z., Kreis K., Vahdat A.: Tackling the generative learning trilemma with denoising diffusion gans. arXiv preprint arXiv:2112.07804 (2021).
  • [XSXM13] Xiao C., She R., Xiao D., Ma K. L.: Fast shadow removal using adaptive multi-scale illumination transfer. Computer Graphics Forum 32, 8 (2013), 207–218.
  • [YWZ22] Yang Y., Wei H., Zhu H., Yu D., Xiong H., Yang J.: Exploiting cross-modal prediction and relation consistency for semisupervised image captioning. IEEE Transactions on Cybernetics 54, 2 (2022), 890–902.
  • [YYB21] Yang Y., Yang J.-Q., Bao R., Zhan D.-C., Zhu H., Gao X.-R., Xiong H., Yang J.: Corporate relative valuation using heterogeneous multi-modal graph neural network. IEEE Transactions on Knowledge and Data Engineering 35, 1 (2021), 211–224.
  • [YZG22] Yang Y., Zhang J., Gao F., Gao X., Zhu H.: Domfn: A divergence-orientated multi-modal fusion network for resume assessment. In Proceedings of the 30th ACM International Conference on Multimedia (2022), pp. 1612–1620.
  • [ZCDH24] Zhao C., Cai W., Dong C., Hu C.: Wavelet-based fourier information interaction with frequency diffusion adjustment for underwater image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024).
  • [ZCHY24] Zhao C., Cai W., Hu C., Yuan Z.: Cycle contrastive adversarial learning with structural consistency for unsupervised high-quality image deraining transformer. Neural Networks (2024), 106428.
  • [ZLM24] Zhou D., Li Y., Ma F., Yang Z., Yang Y.: Migc: Multi-instance generation controller for text-to-image synthesis. arXiv preprint arXiv:2402.05408 (2024).
  • [ZLY20] Zhang L., Long C., Yan Q., Zhang X., Xiao C.: Cla-gan: A context and lightness aware generative adversarial network for shadow removal. Computer Graphics Forum (2020).
  • [ZXF22] Zhu Y., Xiao Z., Fang Y., Fu X., Xiong Z., Zha Z.-J.: Efficient model-driven network for shadow removal. In Proceedings of the AAAI conference on artificial intelligence (2022), vol. 36, pp. 3635–3643.
  • [ZYY23] Zhou D., Yang Z., Yang Y.: Pyramid diffusion models for low-light image enhancement. arXiv preprint arXiv:2305.10028 (2023).