Rethink Arbitrary Style Transfer with Transformer and Contrastive Learning

Zhanjie Zhang, Jiakai Sun, Guangyuan Li, Lei Zhao, Quanwei Zhang, Zehua Lan, Haolin Yin, Wei Xing, Huaizhong Lin, Zhiwen Zuo

College of Computer Science and Technology, Zhejiang University, No. 38, Zheda Road, Hangzhou 310000, China

College of Computer Science and Technology, Zhejiang Gongshang University, No. 18 Xuezheng Street, Hangzhou 310018, China

Abstract

Arbitrary style transfer has attracted widespread research attention and has numerous practical applications. Existing methods, which either employ cross-attention to incorporate deep style attributes into content attributes or use adaptive normalization to adjust content features, often fail to generate high-quality stylized images. In this paper, we introduce a technique to improve the quality of stylized images. First, we propose Style Consistency Instance Normalization (SCIN), a method to refine the alignment between content and style features. In addition, we develop an Instance-based Contrastive Learning (ICL) approach designed to learn the relationships among various styles, thereby enhancing the quality of the resulting stylized images. Recognizing that VGG networks are more adept at extracting classification features and are less well suited to capturing style features, we also introduce the Perception Encoder (PE) to capture style features. Extensive experiments demonstrate that our proposed method generates high-quality stylized images and effectively prevents artifacts compared with existing state-of-the-art methods.

keywords:

Arbitrary style transfer, Transformer, Contrastive learning

Journal: Computer Vision and Image Understanding

Fig. 1: Stylized examples of existing arbitrary style transfer methods. Although attention-based arbitrary style transfer methods can learn local texture and content-style correlation, they sometimes bring content features of the style image into the result (Row 1). Non-attention-based arbitrary style transfer methods fail to learn detailed texture and also generate artifacts.

1 Introduction

Style transfer refers to the process of generating a new image that retains the content of a given content image while incorporating the style of a given style image. Gatys et al. (Gatys et al., 2015; Cai et al., 2023) proposed a seminal online optimization method for style transfer. The method uses a fixed model (VGG) (Simonyan and Zisserman, 2014) to extract content and style features. It iteratively optimizes the content features to match the content of the input image and the style features to match the style of the reference image. Multi-style methods (Dumoulin et al., 2016; Chen et al., 2017) later expanded the capability of style transfer by using a single pre-trained model to perform style transfer with multiple style references. Arbitrary style transfer methods (Huang and Belongie, 2017) then became a research hot spot. These methods can generate stylized images from arbitrary content and style images with a single model.

Existing arbitrary style transfer methods can be divided into two categories: (a) attention-based style transfer methods and (b) non-attention-based style transfer methods. As representatives of the former, IEAST (Chen et al., 2021a), AdaAttN (Liu et al., 2021a), MAST (Deng et al., 2020), SANet (Park and Lee, 2019), 3DPS (Mu et al., 2022), RAST (Ma et al., 2023) and AesUST (Wang et al., 2022b) utilize attention mechanisms to locally fuse style features into content features. While attention-based arbitrary style transfer excels at learning local texture and content-style semantic correlation, these methods often introduce evident artifacts, making their results easily distinguishable from real paintings. Although StyTr2 (Deng et al., 2022) uses a transformer to avoid introducing artifacts, it drops high-frequency content and style information. As representatives of the latter, AdaIN (Huang and Belongie, 2017), DEAST (Zhang et al., 2022), WCT (Li et al., 2017), ArtFlow (An et al., 2021), LST (Li et al., 2019), Gating (Yang et al., 2022), Caster (Zhang et al., 2023b), UAST (Cheng et al., 2023), TeSTNerf (Chen et al., 2023), AAST (Hu et al., 2020), CCPL (Wu et al., 2022) and MCCNet (Deng et al., 2021) transform the content features to match the second-order global statistics of style features without considering the local distribution. They often integrate messy style textures and patterns into the content target. DiffuseIT (Kwon and Ye, 2022b) utilizes a pre-trained ViT model to guide the generation process of DDPM models in terms of content preservation and style information changes. Real artworks can be distinguished from stylized images by their artistic characteristics, such as colors, strokes, tones, and textures. The methods above cannot generate high-quality images that possess these artistic elements (see Fig. 1).

To solve these problems, we propose Style Consistency Instance Normalization (SCIN) to align content features with style features in terms of feature distribution, which helps to supply global style information. Specifically, we use a transformer (Vaswani et al., 2017; Lyu et al., 2023a, b; Li et al., 2022b, a) as a global style extractor to capture non-local, long-range dependencies of style information from the style image. The transformer outputs scale and bias parameters, which are used to adjust the global information of the content features and match the feature distribution of the style image. Existing style transfer methods often use a content loss and a style loss to ensure the content-to-stylization and style-to-stylization relations, respectively. However, they tend to neglect the stylization-to-stylization relations, which are also crucial for style transfer. Based on this analysis, we propose a novel Instance-based Contrastive Learning (ICL) that pulls multiple stylization embeddings closer to each other when they share the same content or style and pushes them apart otherwise. Besides, we have observed that existing style transfer methods use VGG as the feature extractor, which is trained on the ImageNet dataset (Deng et al., 2009) and is effective at extracting classification features but not suitable for extracting style features (see Fig. 8). Inspired by Inception Transformer (Si et al., ), we propose a Perception Encoder (PE) to extract style information and avoid paying too much attention to salient classification features.

To summarize, the main contributions of this paper are as follows:

1. We propose a novel Style Consistency Instance Normalization (SCIN) to capture long-range and non-local style correlation. It aligns the content feature with the style feature using learned parameters instead of the mean and variance computed by a fixed VGG.

2. Considering that existing methods often generate low-quality stylized images with artifacts or semantic errors, we introduce a novel Instance-based Contrastive Learning (ICL) to learn stylization-to-stylization relations and remove artifacts.

3. We analyze the defects of attention-based arbitrary style transfer caused by a fixed VGG and propose a novel Perception Encoder (PE) that captures style information and avoids paying too much attention to the salient classification features of style images.

4. Extensive experiments demonstrate that, compared with state-of-the-art methods, our proposed method learns detailed texture and global style correlation and removes artifacts.

2 Related Work

Arbitrary Style Transfer. Recent arbitrary style transfer methods can be divided into two categories: non-attention-based style transfer methods and attention-based style transfer methods. The common idea of the former category is to apply feature modification globally. Dumoulin et al. propose conditional instance normalization (CIN) (Gulrajani et al., 2017) to scale and shift the activations in the IN layer. Based on CIN, a simple network can transfer multiple styles. Huang et al. (Huang and Belongie, 2017) propose to use adaptive affine parameters for arbitrary style transfer. Jing et al. (Jing et al., 2020) propose Dynamic Instance Normalization (DIN), but it still depends on VGG’s feature space. AdaIN and DIN are very effective methods, but they depend on VGG and cannot learn global style feature maps. Park et al. (Park et al., 2019) project the style image onto a convolved embedding space to produce modulation parameters. However, SPADE is constrained by its convolutional neural network, which fails to learn global information and long-range dependence. Li et al. (Li et al., 2017) propose to use feature transforms, i.e., whitening and coloring, to directly match content feature statistics to those of a style image in the deep feature space.

For the latter, Park and Lee (Park and Lee, 2019) first introduce an attention module to learn local textures and content-style correlation. AdaAttN introduces a novel AdaAttN module for arbitrary style transfer. It takes both shallow and deep features into account for attention score calculation and normalizes content features such that feature statistics are well aligned with attention-weighted mean and variance maps of style features on a per-point basis. Deng et al. propose a transformer-based style transfer framework called StyTr2 to generate stylization results with well-preserved structures and details of the input content image. InST (Zhang et al., 2023a) proposes to learn artistic style directly from a single painting and guide diffusion to generate a stylized image. DiffuseIT (Kwon and Ye, 2022b) utilizes a pre-trained ViT model to guide the generation process of DDPM models in terms of content preservation and style information changes.

Contrastive Learning. Contrastive learning consists of three key ingredients: a query, positive examples, and negative samples. The goal of contrastive learning is to pull the query toward the positive examples and push it away from the negative samples. ContraGAN proposed a conditional contrastive learning loss function to learn sample-to-class and sample-to-sample relations. Chen et al. (Chen et al., 2021a) first build positive and negative samples based on a pre-trained model (VGG) to learn stylization-to-stylization relations. Zhang et al. (Zhang et al., 2022) introduce contrastive learning for style representation, using visual features comprehensively to represent style for arbitrary style transfer. Wu et al. (Wu et al., 2022) devise a generic contrastive coherence preserving loss to learn local patches. Recently, OpenAI’s Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) built an effective relationship between language and images. CLIPstyler (Kwon and Ye, 2022a) introduces a text-guided synthesis model that can transfer the style of an image according to specific text.

Image-to-Image Translation. Recently, image-to-image translation (Zhu et al., 2017; Lin et al., 2020; Chen et al., 2021b; Xu et al., 2021; Zhang et al., 2021, 2024b, 2024a; Li et al., 2024; Zuo et al., 2023) methods have achieved significant progress. These methods learn to generate stylized images from the style domain. However, such Per-Style Per-Model methods cannot meet the needs of arbitrary style transfer. GANs can learn the statistical information of the style domain and have achieved great success in eliminating artifacts in generated images. Based on this, Lin et al. (Lin et al., 2021) propose a novel feed-forward style transfer method named LapStyle. It uses a Drafting Network to transfer global style patterns at low resolution and adopts higher-resolution Revision Networks to revise local style patterns in a pyramid manner according to the outputs of multi-level Laplacian filtering of the content image. Besides, Chen et al. (Chen et al., 2021a) extend the GAN-based method to arbitrary style transfer. DualAST (Chen et al., 2021b) proposes to use a learnable projection network that maps the VGG feature to learn a dynamic parameter $\alpha$, readjusting affine parameters dynamically. Wang et al. (Wang and Yu, 2020) introduce white-box cartoon representations to decouple cartoon style transfer into three controllable components and use total-variation loss (Aly and Dubois, 2005; Sun et al., 2023, 2024) to impose spatial smoothness on stylized images.

3 Proposed Method

The overview of the proposed method is shown in Fig. 2. We use a pre-trained VGG network to extract content features and a Perception Encoder (PE) to extract style features. Then, we use cross-attention to fuse the style features into the content features, enabling the network to learn local style information better. Next, the Style Consistency Instance Normalization (SCIN) module aligns the content and style features so that the result carries global style information. The aligned features are fed into the decoder to reconstruct the stylized image. Additionally, we use a contrastive learning method to learn stylization-to-stylization relations, which complements the content loss and style loss that only consider the relationship between the stylized image and the content image or the style image separately. This allows us to generate high-quality stylized images.

3.1 Preliminaries

AdaIN (Huang and Belongie, 2017) proposes adaptive instance normalization to align the channel-wise mean and variance of the content feature $F_{c}$ with those of the style feature $F_{s}$. It adaptively computes the affine parameters from the style input:

$$\mathrm{AdaIN}(F_{c},F_{s})=\sigma(F_{s})\left(\frac{F_{c}-\mu(F_{c})}{\sigma(F_{c})}\right)+\mu(F_{s}), \tag{1}$$

where $\mu(x)$ and $\sigma(x)$ are the mean and standard deviation computed across spatial locations. Given an input batch $x\in{R}^{N\times C\times H\times W}$ , $\mu(x)$ and $\sigma(x)$ can be computed as below:

$$\mu_{nc}(x)=\frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W}x_{nchw}, \tag{2}$$

$$\sigma_{nc}(x)=\sqrt{\frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W}\left(x_{nchw}-\mu_{nc}(x)\right)^{2}+\epsilon}. \tag{3}$$
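As an illustration of Eqs. (1) to (3), the following is a minimal pure-Python sketch, not the reference implementation. It assumes a single sample whose feature map is stored as a list of `C` channels, each flattened to `H*W` values; the function names `channel_stats` and `adain` are hypothetical.

```python
import math

def channel_stats(channel, eps=1e-5):
    """Mean and standard deviation of one channel over its H*W values (Eqs. 2-3)."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return mean, math.sqrt(var + eps)

def adain(content, style, eps=1e-5):
    """Apply Eq. (1) channel by channel to flattened feature maps."""
    out = []
    for c_ch, s_ch in zip(content, style):
        mu_c, sigma_c = channel_stats(c_ch, eps)
        mu_s, sigma_s = channel_stats(s_ch, eps)
        # Normalize the content channel, then rescale with the style statistics.
        out.append([sigma_s * (v - mu_c) / sigma_c + mu_s for v in c_ch])
    return out

content = [[1.0, 2.0, 3.0, 4.0]]      # C = 1, H*W = 4
style = [[10.0, 20.0, 30.0, 40.0]]
stylized = adain(content, style)
```

After the call, each output channel carries (up to the effect of `eps`) the mean and standard deviation of the corresponding style channel, while the relative ordering of content values is preserved.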

3.2 Style Consistency Instance Normalization

Given the embedding of an input style sequence $Z_{s}=(\mathcal{E}_{s1},\mathcal{E}_{s2},\ldots,\mathcal{E}_{sL})$ , we first feed it into the transformer encoder. The input sequence is encoded into query ( $Q$ ), key ( $K$ ), and value ( $V$ ):

$$Q=Z_{s}W_{q},\quad K=Z_{s}W_{k},\quad V=Z_{s}W_{v}, \tag{4}$$

where $W_{q},W_{k},W_{v}\in R^{C\times d_{head}}$. The multi-head attention is then calculated by

$$F_{\mathrm{MSA}}(Q,K,V)=\operatorname{Concat}\left(\operatorname{Attention}_{1}(Q,K,V),\ldots,\operatorname{Attention}_{N}(Q,K,V)\right)W_{o}, \tag{5}$$

where $W_{o}\in R^{C\times C}$ are learnable parameters, $N$ is the number of attention heads, and $d_{head}=\frac{C}{N}$. The encoded style sequence $Y_{s}$ is then obtained as follows:

$$\begin{aligned}Y_{s}^{\prime}&=F_{\mathrm{MSA}}(Q,K,V)+Q,\\ Y_{s}&=F_{\mathrm{FFN}}\left(Y_{s}^{\prime}\right)+Y_{s}^{\prime},\end{aligned} \tag{6}$$

where $F_{\mathrm{FFN}}\left(Y_{s}^{\prime}\right)=\max\left(0,Y_{s}^{\prime}W_{1}+b_{1}\right)W_{2}+b_{2}$ and layer normalization (Ba et al., 2016) is applied after each block. Furthermore, we propose a new SCIN layer. Let $\gamma^{i}_{s}$ and $\beta^{i}_{s}$ denote the scaling and bias, derived from the style image $I_{s}$, for the $i$-th activation map:

$$\gamma^{i}_{s}=F^{i}_{FFN}(Y_{s}),\quad\beta^{i}_{s}=F^{i}_{FFN}(Y_{s}). \tag{7}$$
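The encoder step of Eqs. (4) to (6) can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions: a single attention head with $W_{o}$ taken as the identity, the standard scaled dot-product form $\operatorname{softmax}(QK^{\top}/\sqrt{d})V$ for each head (which the text does not spell out), and layer normalization omitted. All helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(z, w_q, w_k, w_v):
    """Project tokens as in Eq. (4), then apply scaled dot-product attention."""
    q, k, v = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), q, weights

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def ffn(y, w1, b1, w2, b2):
    """F_FFN(y) = max(0, y W1 + b1) W2 + b2."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(y, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]

identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # L = 3 style tokens, C = 2
attended, q, weights = self_attention(tokens, identity, identity, identity)
y_prime = add(attended, q)                        # first line of Eq. (6)
y = add(ffn(y_prime, identity, [0.0, 0.0], identity, [0.0, 0.0]), y_prime)
```

Identity weights are used only to keep the example easy to check by hand; in a trained model all projection matrices and biases are learned.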

Given a pair of content image $I_{c}$ and style image $I_{s}$ as input, the proposed SCIN layer can be modeled as:

$$\mathrm{SCIN}(F_{c},F_{s})=\gamma^{s}\times\operatorname{IN}\left(F_{c}\right)+\beta^{s}, \tag{8}$$

where $\gamma^{s},\beta^{s}\in{R}^{N\times C\times 1\times 1}$. Unlike AdaIN (Huang and Belongie, 2017), which computes its affine parameters directly from feature statistics, the scaling $\gamma^{i}_{s}$ and bias $\beta^{i}_{s}$ are learned from global style information.
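A minimal sketch of the SCIN layer in Eq. (8) is given below. It assumes that the encoded style sequence is mean-pooled over tokens before two linear heads predict one scale and one bias per content channel; the pooling step, the single-layer heads, and the names `scin`, `linear`, and `instance_norm` are illustrative assumptions rather than details stated in the text.

```python
import math

def instance_norm(channel, eps=1e-5):
    """Normalize one flattened channel to zero mean and unit variance."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return [(v - mean) / math.sqrt(var + eps) for v in channel]

def linear(vec, weight, bias):
    """Dense layer; weight holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]

def scin(content, style_tokens, w_gamma, b_gamma, w_beta, b_beta):
    """gamma * IN(F_c) + beta, with gamma and beta predicted from style tokens."""
    dim = len(style_tokens[0])
    # Assumption: mean-pool the encoded style sequence before the heads.
    pooled = [sum(tok[d] for tok in style_tokens) / len(style_tokens) for d in range(dim)]
    gamma = linear(pooled, w_gamma, b_gamma)   # one scale per content channel
    beta = linear(pooled, w_beta, b_beta)      # one bias per content channel
    return [[g * v + b for v in instance_norm(ch)]
            for ch, g, b in zip(content, gamma, beta)]

content = [[1.0, 2.0, 3.0, 4.0], [0.0, 10.0, 0.0, 10.0]]   # C = 2, H*W = 4
style_tokens = [[2.0, 4.0], [4.0, 8.0]]                     # encoded style sequence
w_gamma, b_gamma = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w_beta, b_beta = [[0.0, 1.0], [1.0, 0.0]], [0.0, 0.0]
out = scin(content, style_tokens, w_gamma, b_gamma, w_beta, b_beta)
```

In contrast to AdaIN, the per-channel mean and spread of the output here are set by the predicted `beta` and `gamma` rather than by statistics read off a VGG feature map.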

In our proposed method, we realign $F_{cs}$ with $\bar{F_{s}}$ using SCIN as follows:

$$\bar{F}_{cs}=\mathrm{SCIN}(F_{cs},\bar{F}_{s}). \tag{9}$$

In the decoder, every intermediate feature $F_{cs}$ is aligned with the style feature $F_{s}$. We redefine the reconstructed stylized image $I_{cs}$ as follows:

$$I_{cs}=D(\bar{F}_{cs},\mathrm{Transformer}(I^{1:x}_{s})), \tag{10}$$

where $x=4$ and $I^{1:x}_{s}$ represents the multi-scale style images shown in Fig. 2.

3.3 Instance-based Contrastive Learning

Contrastive learning (Wu et al., 2021; Santa Cruz et al., 2018; Chen et al., 2020; Han et al., 2021) has been used in many fields; in style transfer, it can preserve the content (Han et al., 2021) of the content image and enhance the style (Chen et al., 2021a) of the stylized image. We propose a novel ICL to learn stylization-to-stylization relations. Unlike previous contrastive learning based on VGG (Chen et al., 2021a), we utilize the image encoder of CLIP to obtain an instance-based latent code space to improve stylized images. CLIP (Radford et al., 2021) is an image-text model whose text encoder and image encoder project text and images into the same space. Every image therefore has its own CLIP latent code, which includes both content and style information, making this space more suitable than VGG for learning stylization-to-stylization relations. Let $b^{n}_{s}$ and $b^{n}_{c}$ denote a batch of $n$ style images and a batch of $n$ content images, respectively. For every style image $s_{i}$ in $b^{n}_{s}$ and content image $c_{j}$ in $b^{n}_{c}$, with $i\in[1,n]$ and $j\in[1,n]$, we use $s_{i}c_{j}$ to denote the corresponding stylized image. Over all content and style images, we build the stylized images $\{s_{1}c_{1},s_{1}c_{2},...,s_{1}c_{n};...;s_{n}c_{1},s_{n}c_{2},...,s_{n}c_{n}\}$. For every stylized image $s_{i}c_{j}$, we build “Positive Examples” $s_{m}c_{k}$ with $m=i,k\neq j$ and “Negative Samples” $s_{m}c_{k}$ with $m\neq i,k\neq j$. Based on these, the Instance-based Contrastive Learning loss is calculated by:

$$\begin{aligned}L_{contra}&=L_{pos}+L_{neg},\\ L_{pos}&=-\log\left(\frac{P_{s}}{P_{s}+N_{s}}\right),\quad L_{neg}=-\log\left(\frac{P_{c}}{P_{c}+N_{c}}\right),\\ P_{s}&=\exp\left(M_{s}(s_{i}c_{j})^{\top}M_{s}(s_{i}c_{j})/\tau\right),\\ N_{s}&=\sum\exp\left(M_{s}(s_{i}c_{j})^{\top}M_{s}(s_{i}c_{j})/\tau\right),\\ P_{c}&=\exp\left(M_{c}(s_{i}c_{j})^{\top}M_{c}(s_{i}c_{j})/\tau\right),\\ N_{c}&=\sum\exp\left(M_{c}(s_{i}c_{j})^{\top}M_{c}(s_{i}c_{j})/\tau\right),\end{aligned} \tag{11}$$

where $M_{s}=l_{s}(E_{clip}(\cdot))$ and $M_{c}=l_{c}(E_{clip}(\cdot))$, in which $E_{clip}$ represents the image encoder of CLIP, and $l_{c}$ and $l_{s}$ are the projection networks that produce the content code and style code. $\tau$ is a temperature hyper-parameter that controls the push and pull forces and is set to 0.3. In this way, Instance-based Contrastive Learning constrains the pixel-level output through the image latent code space.
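The loss can be illustrated with a small pure-Python sketch. It assumes that, in each exponent, the anchor embedding is compared with the embedding of a positive or negative example as described in the text, that several positives are handled by summing their exponentials, and that embeddings are plain vectors standing in for the projected CLIP codes. The names `contrastive_term` and `split_indices` are hypothetical, and indices are 0-based here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_term(anchor, positives, negatives, tau=0.3):
    """-log(P / (P + N)) with temperature-scaled, exponentiated dot products."""
    p = sum(math.exp(dot(anchor, e) / tau) for e in positives)
    n = sum(math.exp(dot(anchor, e) / tau) for e in negatives)
    return -math.log(p / (p + n))

def split_indices(i, j, n):
    """Index pairs (style, content) of positives and negatives for s_i c_j."""
    positives = [(i, k) for k in range(n) if k != j]
    negatives = [(m, k) for m in range(n) for k in range(n) if m != i and k != j]
    return positives, negatives

pos_idx, neg_idx = split_indices(0, 0, 3)

# Anchor aligned with its positive and orthogonal to its negative: small loss.
loss_close = contrastive_term([1.0, 0.0], [[1.0, 0.0]], [[0.0, 1.0]])
# Anchor aligned with its negative instead: large loss.
loss_far = contrastive_term([1.0, 0.0], [[0.0, 1.0]], [[1.0, 0.0]])
```

Under these assumptions, a full $L_{contra}$ would add one such term computed on style codes and one computed on content codes; a smaller $\tau$ sharpens the contrast between the two cases above.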

3.4 Perception Encoder

Transformers (Liu et al., 2021b; Cao et al., 2021; Yu et al., 2022; Plizzari et al., 2021; Sortino et al., 2023; Wang et al., 2022a) have demonstrated their superiority, especially in modeling long-range dependence and obtaining global information. However, transformers are weak at processing high-frequency information. We propose the Perception Encoder to address this problem; its effectiveness is verified in Fig. 8, and its detailed architecture is shown in Fig. 4. Given a style image $I_{s}^{3\times 256\times 256}$, it is first embedded into patches $F_{s}^{512\times 64\times 64}$. The patches are then divided into $F_{h}^{256\times 64\times 64}$ and $F_{l}^{256\times 64\times 64}$ along the channel dimension. We propose a parallel structure to learn the high-frequency components $F_{h1}^{128\times 64\times 64}$ and $F_{h2}^{128\times 64\times 64}$. $F_{h1}$ is embedded with a max-pooling and a linear layer (Szegedy et al., 2015), and $F_{h2}$ is fed into a linear and a depthwise convolution layer (Chollet, 2017; Mamalet and Garcia, 2012; Sandler et al., 2018). To calculate the attention map, we divide the input into style features $F_{s}^{3\times 16\times 16}$. We define this process as follows:

$$\begin{aligned}Y_{h1}&=\operatorname{FC}\left(\operatorname{MaxPool}\left(F_{h1}\right)\right),\\ Y_{h2}&=\operatorname{DwConv}\left(\operatorname{FC}\left(F_{h2}\right)\right),\end{aligned} \tag{12}$$

where FC represents the fully connected layer, and $Y_{h1}$ and $Y_{h2}$ denote the outputs of the high-frequency mixers. For the low-frequency mixer,

$$Y_{l}=\operatorname{Upsample}\left(F_{\mathrm{MSA}}\left(\operatorname{AvePooling}\left(F_{l}\right)\right)\right), \tag{13}$$

and the three outputs are then concatenated:

$$F_{s}=\operatorname{Concat}\left(Y_{l},Y_{h1},Y_{h2}\right). \tag{14}$$

Then we repeat the above process by projecting style images into style feature $F_{s}^{512\times 32\times 32}$ .
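To make the channel bookkeeping of one Perception Encoder stage concrete, the sketch below traces tensor shapes only; it performs no actual convolution or attention. It assumes that the high-frequency branches preserve spatial size and channel count, and that average pooling reduces the low-frequency branch by a hypothetical factor `pool` before upsampling restores it. Neither detail is specified in the text.

```python
def perception_encoder_shapes(channels=512, height=64, width=64, pool=2):
    """Trace (C, H, W) shapes through one stage of the encoder."""
    high = channels // 2            # channels routed to the high-frequency mixers
    low = channels - high           # channels routed to the low-frequency mixer
    h1 = high // 2                  # max-pooling + linear branch
    h2 = high - h1                  # linear + depthwise convolution branch
    shapes = {
        "F_s": (channels, height, width),
        "F_h": (high, height, width),
        "F_l": (low, height, width),
        "F_h1": (h1, height, width),
        "F_h2": (h2, height, width),
        # Assumption: attention runs on a pooled map that is upsampled afterwards.
        "F_l_pooled": (low, height // pool, width // pool),
    }
    # Concatenating the three mixer outputs restores the input channel count.
    shapes["F_s_out"] = (low + h1 + h2, height, width)
    return shapes

stage1 = perception_encoder_shapes(512, 64, 64)
stage2 = perception_encoder_shapes(512, 32, 32)
```

The two calls correspond to the $64\times 64$ stage described above and the repeated $32\times 32$ stage.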

3.5 Other loss functions

Perceptual Content and Style Loss. Following previous style transfer methods (Liu et al., 2021a; Deng et al., 2020; Chen et al., 2021a; Park and Lee, 2019), we use perceptual content and style losses. For layer $x$ of the VGG encoder, denoted $E^{x}_{VGG}$, these losses are calculated as follows:

$$\begin{aligned}L_{c}&=\sum^{L}_{x=4}\|E^{x}_{VGG}(I_{cs})-E^{x}_{VGG}(I_{c})\|_{2},\\ L_{s}&=\sum^{L}_{x=1}\left(\|\mu(E^{x}_{VGG}(I_{cs}))-\mu(E^{x}_{VGG}(I_{s}))\|_{2}+\|\theta(E^{x}_{VGG}(I_{cs}))-\theta(E^{x}_{VGG}(I_{s}))\|_{2}\right),\end{aligned} \tag{15}$$

where $\mu$ and $\theta$ denote the channel-wise mean and standard deviation. For $E^{x}_{VGG}$, we use $ReLU4\_1$ and $ReLU5\_1$ to compute $L_{c}$. In contrast, we use $ReLU1\_1$, $ReLU2\_1$, $ReLU3\_1$, $ReLU4\_1$ and $ReLU5\_1$ to compute the style loss, so that it attends to style feature information at different scales.
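The two losses can be sketched as follows, assuming the per-layer features have already been extracted (the VGG encoder itself is not included) and that each layer's feature is a list of channels flattened over space. The function names are hypothetical.

```python
import math

def l2(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_std(feature):
    """Channel-wise mean and standard deviation of a C x (H*W) feature."""
    means = [sum(ch) / len(ch) for ch in feature]
    stds = [math.sqrt(sum((v - m) ** 2 for v in ch) / len(ch))
            for ch, m in zip(feature, means)]
    return means, stds

def content_loss(stylized_feats, content_feats):
    """Sum over layers of the L2 distance between full feature maps."""
    total = 0.0
    for f_cs, f_c in zip(stylized_feats, content_feats):
        total += l2([v for ch in f_cs for v in ch], [v for ch in f_c for v in ch])
    return total

def style_loss(stylized_feats, style_feats):
    """Sum over layers of the distances between channel means and between stds."""
    total = 0.0
    for f_cs, f_s in zip(stylized_feats, style_feats):
        mu_cs, sd_cs = mean_std(f_cs)
        mu_s, sd_s = mean_std(f_s)
        total += l2(mu_cs, mu_s) + l2(sd_cs, sd_s)
    return total

feats_a = [[[1.0, 2.0, 3.0, 4.0]]]   # one layer, one channel
feats_b = [[[4.0, 3.0, 2.0, 1.0]]]   # same statistics, different spatial layout
```

Comparing `feats_a` with `feats_b` shows the difference between the two terms: the style loss sees identical channel statistics, while the content loss penalizes the changed spatial arrangement.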

Adversarial Loss. Inspired by the Generative Adversarial Network (GAN) (Zhu et al., 2017; Chen et al., 2021a), we use an adversarial loss, which can effectively bring the data distribution of the stylized images $I_{cs}$ closer to that of the style images $I_{s}$. The adversarial loss makes the stylized images $I_{cs}$ look more realistic and removes visually unnatural artifacts. It is computed as:

$$L_{Adv}=\underset{y\sim I_{s}}{E}\left[\log\left(D_{s}(y)\right)\right]+\underset{x\sim I_{cs}}{E}\left[\log\left(1-D_{s}(x)\right)\right]. \tag{16}$$
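Eq. (16) can be evaluated from discriminator outputs as in the sketch below. It assumes the discriminator $D_{s}$ returns probabilities strictly between 0 and 1 and that expectations are estimated by batch averages; the function name is hypothetical.

```python
import math

def adversarial_loss(d_real, d_fake):
    """E[log D(y)] + E[log(1 - D(x))] from discriminator probabilities."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that separates real style images from stylized ones well.
sharp = adversarial_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
# A discriminator that cannot tell them apart.
confused = adversarial_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```

In the usual GAN setup, the discriminator is trained to increase this value and the generator to decrease it, so a well-trained generator pushes the value toward the `confused` case.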

Identity Loss. Attention-based style transfer tends to lose content structure, and identity loss (Park and Lee, 2019; Lin et al., 2020; Zhao et al., 2020) is used to solve this problem. Following prior identity loss, $L_{identity}$ can be calculated as below:

$$\begin{aligned}L_{identity}={}&\lambda_{identity1}\left(\|I_{cs}-I_{c}\|_{2}+\|I_{cs}-I_{s}\|_{2}\right)\\ &+\sum_{x=1}^{L}\lambda_{identity2}\left(\|E^{x}_{VGG}(I_{cs})-E^{x}_{VGG}(I_{c})\|_{2}+\|E^{x}_{VGG}(I_{cs})-E^{x}_{VGG}(I_{s})\|_{2}\right),\end{aligned} \tag{17}$$

where $\lambda_{identity1}=50$, $\lambda_{identity2}=1$, and the layers $ReLU1\_1$, $ReLU2\_1$, $ReLU3\_1$, $ReLU4\_1$ and $ReLU5\_1$ are used.

3.6 Objective Loss Function

We combine all the above losses in a weighted sum to obtain the final objective loss function:

$$L=\lambda_{1}L_{s}+\lambda_{2}L_{c}+\lambda_{3}L_{identity}+\lambda_{4}L_{Adv}+\lambda_{5}L_{contra}, \tag{18}$$

where $\lambda_{1}=1$ , $\lambda_{2}=1$ , $\lambda_{3}=5$ , $\lambda_{4}=1$ , $\lambda_{5}=0.3$ .
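As a quick illustration of how the weights in Eq. (18) combine, the following sketch computes the weighted sum from individual loss values. The dictionary keys and the function name are hypothetical labels chosen for this example; the weights are the ones listed above.

```python
# Weights from Eq. (18): style, content, identity, adversarial, contrastive.
WEIGHTS = {"style": 1.0, "content": 1.0, "identity": 5.0, "adv": 1.0, "contra": 0.3}

def total_loss(losses, weights=WEIGHTS):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * losses[name] for name in weights)

# With every term equal to 1.0, the identity term dominates the sum.
example = total_loss({"style": 1.0, "content": 1.0, "identity": 1.0,
                      "adv": 1.0, "contra": 1.0})
```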

Table 1: Quantitative comparison with other state-of-the-art arbitrary style transfer methods. “Times” means the inference time at a resolution of $512\times 512$ pixels using a single RTX 2080.

Method	CF	GE+LP	Deception	Preference	Times
WikiArt	-	-	0.784	-	-
Ours	0.432	1.615	0.573	-	0.153
IEAST	0.412	1.405	0.476	0.381	0.064
AdaAttN	0.380	1.430	0.372	0.363	0.142
MAST	0.360	1.511	0.324	0.268	0.124
SANet	0.366	1.552	0.346	0.312	0.064
AdaIN	0.383	1.552	0.346	0.227	0.045
CAST	0.335	1.581	0.423	0.376	0.045
WCT	0.346	1.540	0.172	0.152	0.408
LST	0.408	1.459	0.408	0.350	0.036
StyTr2	0.336	1.602	0.545	0.392	2.781
AesUST	0.420	1.524	0.568	0.415	0.066
ArtFlow	0.411	1.505	0.356	0.350	0.382
AAST	0.368	1.461	0.307	0.263	1.153
CCPL	0.422	1.594	0.427	0.393	0.043
MCCNet	0.382	1.473	0.357	0.273	0.023
DiffuseIT	0.295	1.482	0.287	0.252	35.52

4 Experiments

4.1 Implementation Details

The content and style images come from MS-COCO (Lin et al., 2014) and WikiArt (Nichol, 2016). During training, all 82,783 content images and 79,433 style images are resized to $512\times 512$ pixels. Before being fed into the generator, they are cropped to $256\times 256$ pixels. For inference, we randomly collect style pictures, including oil paintings, watercolor paintings, and sketches. In addition, we collect pictures of buildings, forests, and everyday scenes as content images and feed them to our proposed model at their original size to generate stylized images.

4.2 Comparisons with SOTA Methods

As shown in Fig. 5, we compare the proposed method with state-of-the-art methods, including attention-based style transfer methods: IEAST (Chen et al., 2021a), AdaAttN (Liu et al., 2021a), MAST (Deng et al., 2020), SANet (Park and Lee, 2019), AesUST (Wang et al., 2022b), and StyTr2 (Deng et al., 2022); and non-attention-based style transfer methods: AdaIN (Huang and Belongie, 2017), DEAST (Zhang et al., 2022), WCT (Li et al., 2017), ArtFlow (An et al., 2021), LST (Li et al., 2019), AAST (Hu et al., 2020), MCCNet (Deng et al., 2021), CCPL (Wu et al., 2022), and DiffuseIT (Kwon and Ye, 2022b). IEAST, AdaAttN, MAST, SANet, and AesUST sometimes introduce undesired semantic structures from the style image into the stylization result. WCT has severe problems with content preservation. AdaIN and MCCNet suffer from blurred content structure and distorted style patterns. StyTr2, DEAST, ArtFlow, and LST fail to learn local texture from the style image. Although CCPL can effectively maintain the content structure, it cannot learn the content-style semantic correlation. For AAST, there is an obvious style deviation between the style image and the stylized image. DiffuseIT has problems with content structure consistency and style oversaturation. Compared with these methods, our proposed method generates stylized images that preserve content structure and style texture without introducing unexpected semantic structures. In addition, Fig. 6 shows further randomly chosen high-quality stylized images.

4.3 Quantitative Comparisons

CF, GE, and LP Scores. Wang et al. (Wang et al., 2021) proposed three evaluation metrics to evaluate the quality of style transfer: content fidelity (CF), global effects (GE), and local patterns (LP). Specifically, CF measures the faithfulness of content characteristics, GE assesses the stylization of global colors and textures, and LP assesses the stylization quality in terms of the similarity of local style. Higher scores mean a better style transfer result. In this section, we use 5 content and 10 style images to compare our proposed method with other state-of-the-art methods. The visually compared samples and metrics are shown in Fig. 5 and Tab. 1.

Deception Score. The deception score measures whether a user can distinguish stylized images from authentic art images. A higher score means that a higher percentage of stylized images were judged to be real art images. We randomly selected 20 synthesized images for each method and asked 50 subjects to guess whether each was a real artwork. To provide a reference point, we also randomly selected the same number of WikiArt images and showed them to the 50 subjects. As shown in Tab. 1, our proposed method obtains the highest deception score among the compared methods (0.573, against 0.784 for real WikiArt images), indicating that it generates more realistic stylized images.

Preference Score. The preference score measures how often users prefer our results over those of other SOTA methods. We conduct A/B test user studies to compare the stylization effects of our method with the SOTA methods. We randomly selected 10 content and 15 style images to synthesize 150 stylized images for each method. Then 20 content-style pairs are randomly selected for each participant, and we show them the stylized images generated by our method and the other 15 state-of-the-art methods. Next, we ask each participant to choose his or her favorite stylization result for each content-style pair. We finally collected 1600 votes from 80 participants and present the percentage of votes for each method in the Preference column of Tab. 1. The results show that users prefer the stylized images generated by our method.

4.4 Ablation Studies

In this section, we conduct an ablation analysis to assess the effectiveness of the proposed method, as depicted in Fig. 7. When $L_{Adv}$ is removed, the generated stylized images show disharmonious patterns and obvious artifacts such as repetitive textures, leading to a significant degradation in quality. This indicates that adversarial training is crucial for generating more harmonious and realistic stylized images. When $L_{contra}$ is removed, the quality of stylized images degrades in terms of content structure, texture, and brush stroke. The ablation study also shows that the SCIN module plays a crucial role in enhancing the style consistency between the stylized images and the style images: without the SCIN module, the stylized images show significant style inconsistency. If VGG is used as the style extractor, artifacts may still appear because it was trained to extract classification features. Using a trainable VGG structure to extract style information may not be sufficient either, resulting in lower-quality stylized images.

The impact of the Perception Encoder (PE) is shown in Fig. 8. Existing attention-based methods (Li et al., 2023b, a, c; Cui et al., 2022) usually use SANet as the backbone, which can learn detailed style texture. However, the attention map is calculated from VGG’s feature space (i.e., ReLU4_1 and ReLU5_1). VGG effectively captures salient classification features (e.g., eyes). This helps preserve the content structure of stylized images but does not encode style information. To demonstrate this, we use SANet as a baseline and find that the problem comes from the style feature map ReLU5_1 in VGG, so we visualize the ReLU5_1 feature map. “w/ Perception $E_{s}$” denotes using our proposed Perception Encoder instead of the fixed VGG. In this part of the experiment, we set the style and content weights to 1 as a baseline. The architecture of the Perception Encoder is shown in Fig. 4.

5 Conclusion

This paper proposes a unified network architecture that can generate high-quality stylized images. Specifically, we introduce a novel Style Consistency Instance Normalization (SCIN) method to align the content feature with the style feature in the feature space. Then, we propose an Instance-based Contrastive Learning (ICL) method to learn the stylization-to-stylization relations to improve stylized quality. Additionally, we analyze the limitations of using fixed VGG as a feature extractor and propose a Perception Encoder (PE) to capture style information more effectively. In the future, we plan to explore more general methods to improve the quality of style transfer further.

6 CRediT authorship contribution statement

Zhanjie Zhang: Conceptualization, Methodology, Software, Writing – original draft. Jiakai Sun: Conceptualization, Methodology, Writing – original draft. Guangyuan Li: Conceptualization, Methodology, Writing – original draft. Lei Zhao: Conceptualization, Methodology, Writing – review & editing. Quanwei Zhang: Software, Investigation, Data curation, Validation, Writing – review & editing. Zehua Lan: Software, Validation, Data curation, Writing – review & editing. Haolin Yin: Software, Validation, Visualization, Writing – review & editing. Wei Xing: Methodology, Supervision, Writing – review & editing. Huaizhong Lin: Methodology, Supervision, Writing – review & editing. Zhiwen Zuo: Supervision, Writing – review & editing.

7 Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

8 Acknowledgments

This work was supported by the Zhejiang Elite Program project (2022C01222), the National Natural Science Foundation of China (62172365), the Key Program of the National Social Science Foundation of China (19ZDA197), the Natural Science Foundation of Zhejiang Province (LY21F020005, 2021009, 2019011), and the MOE Frontier Science Center for Brain Science & Brain-Machine Integration (Zhejiang University).

References

Aly and Dubois (2005) Aly, H.A., Dubois, E., 2005. Image up-sampling using total-variation regularization with a new observation model. IEEE Transactions on Image Processing 14, 1647–1659.
An et al. (2021) An, J., Huang, S., Song, Y., Dou, D., Liu, W., Luo, J., 2021. Artflow: Unbiased image style transfer via reversible neural flows, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 862–871.
Ba et al. (2016) Ba, J.L., Kiros, J.R., Hinton, G.E., 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
Cai et al. (2023) Cai, Q., Ma, M., Wang, C., Li, H., 2023. Image neural style transfer: A review. Computers and Electrical Engineering 108, 108723.
Cao et al. (2021) Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M., 2021. Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537.
Chen et al. (2017) Chen, D., Yuan, L., Liao, J., Yu, N., Hua, G., 2017. Stylebank: An explicit representation for neural image style transfer, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1897–1906.
Chen et al. (2021a) Chen, H., Wang, Z., Zhang, H., Zuo, Z., Li, A., Xing, W., Lu, D., et al., 2021a. Artistic style transfer with internal-external learning and contrastive learning. Advances in Neural Information Processing Systems 34, 26561–26573.
Chen et al. (2021b) Chen, H., Zhao, L., Wang, Z., Zhang, H., Zuo, Z., Li, A., Xing, W., Lu, D., 2021b. Dualast: Dual style-learning networks for artistic style transfer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 872–881.
Chen et al. (2023) Chen, J., Ji, B., Zhanjie, Z., Tianyi, C., Zhiwen, Z., Lei, Z., Wei, X., Dongming, L., 2023. Testnerf: Text-driven 3d style transfer via cross-modal learning, in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, pp. 5788–5796.
Chen et al. (2020) Chen, X., Fan, H., Girshick, R., He, K., 2020. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
Cheng et al. (2023) Cheng, J., Wu, Y., Jaiswal, A., Zhang, X., Natarajan, P., Natarajan, P., 2023. User-controllable arbitrary style transfer via entropy regularization, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 433–441.
Chollet (2017) Chollet, F., 2017. Xception: Deep learning with depthwise separable convolutions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258.
Cui et al. (2022) Cui, X., Zhang, Z., Zhang, T., Yang, Z., Yang, J., 2022. Attention graph: Learning effective visual features for large-scale image classification. Journal of Algorithms & Computational Technology 16, 17483026211065375.
Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. Imagenet: A large-scale hierarchical image database, in: 2009 IEEE conference on computer vision and pattern recognition, Ieee. pp. 248–255.
Deng et al. (2021) Deng, Y., Tang, F., Dong, W., Huang, H., Ma, C., Xu, C., 2021. Arbitrary video style transfer via multi-channel correlation, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1210–1217.
Deng et al. (2022) Deng, Y., Tang, F., Dong, W., Ma, C., Pan, X., Wang, L., Xu, C., 2022. Stytr2: Image style transfer with transformers, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11326–11336.
Deng et al. (2020) Deng, Y., Tang, F., Dong, W., Sun, W., Huang, F., Xu, C., 2020. Arbitrary style transfer via multi-adaptation network, in: Proceedings of the 28th ACM international conference on multimedia, pp. 2719–2727.
Dumoulin et al. (2016) Dumoulin, V., Shlens, J., Kudlur, M., 2016. A learned representation for artistic style. arXiv preprint arXiv:1610.07629.
Gatys et al. (2015) Gatys, L.A., Ecker, A.S., Bethge, M., 2015. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576.
Gulrajani et al. (2017) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C., 2017. Improved training of wasserstein gans, in: Advances in neural information processing systems, pp. 5767–5777.
Han et al. (2021) Han, J., Shoeiby, M., Petersson, L., Armin, M.A., 2021. Dual contrastive learning for unsupervised image-to-image translation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 746–755.
Hu et al. (2020) Hu, Z., Jia, J., Liu, B., Bu, Y., Fu, J., 2020. Aesthetic-aware image style transfer, in: Proceedings of the 28th ACM International Conference on Multimedia, pp. 3320–3329.
Huang and Belongie (2017) Huang, X., Belongie, S., 2017. Arbitrary style transfer in real-time with adaptive instance normalization, in: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1510–1519.
Jing et al. (2020) Jing, Y., Liu, X., Ding, Y., Wang, X., Ding, E., Song, M., Wen, S., 2020. Dynamic instance normalization for arbitrary style transfer, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4369–4376.
Kwon and Ye (2022a) Kwon, G., Ye, J.C., 2022a. Clipstyler: Image style transfer with a single text condition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18062–18071.
Kwon and Ye (2022b) Kwon, G., Ye, J.C., 2022b. Diffusion-based image translation using disentangled style and content representation. arXiv preprint arXiv:2209.15264.
Li et al. (2022a) Li, G., Lv, J., Tian, Y., Dou, Q., Wang, C., Xu, C., Qin, J., 2022a. Transformer-empowered multi-scale contextual matching and aggregation for multi-contrast mri super-resolution, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20636–20645.
Li et al. (2022b) Li, G., Lyu, J., Wang, C., Dou, Q., Qin, J., 2022b. Wavtrans: Synergizing wavelet and cross-attention transformer for multi-contrast mri super-resolution, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 463–473.
Li et al. (2024) Li, G., Rao, C., Mo, J., Zhang, Z., Xing, W., Zhao, L., 2024. Rethinking diffusion model for multi-contrast mri super-resolution. arXiv preprint arXiv:2404.04785.
Li et al. (2023a) Li, G., Xing, W., Zhao, L., Lan, Z., Sun, J., Zhang, Z., Zhang, Q., Lin, H., Lin, Z., 2023a. Self-reference image super-resolution via pre-trained diffusion large model and window adjustable transformer, in: Proceedings of the 31st ACM International Conference on Multimedia, pp. 7981–7992.
Li et al. (2023b) Li, G., Xing, W., Zhao, L., Lan, Z., Zhang, Z., Sun, J., Yin, H., Lin, H., Lin, Z., 2023b. Dudoinet: Dual-domain implicit network for multi-modality mr image arbitrary-scale super-resolution, in: Proceedings of the 31st ACM International Conference on Multimedia, pp. 7335–7344.
Li et al. (2023c) Li, G., Zhao, L., Sun, J., Lan, Z., Zhang, Z., Chen, J., Lin, Z., Lin, H., Xing, W., 2023c. Rethinking multi-contrast mri super-resolution: Rectangle-window cross-attention transformer and arbitrary-scale upsampling, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 21230–21240.
Li et al. (2019) Li, X., Liu, S., Kautz, J., Yang, M.H., 2019. Learning linear transformations for fast image and video style transfer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3809–3817.
Li et al. (2017) Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H., 2017. Universal style transfer via feature transforms, in: Advances in neural information processing systems, pp. 386–396.
Lin et al. (2020) Lin, J., Pang, Y., Xia, Y., Chen, Z., Luo, J., 2020. Tuigan: Learning versatile image-to-image translation with two unpaired images, in: European Conference on Computer Vision, Springer. pp. 18–35.
Lin et al. (2021) Lin, T., Ma, Z., Li, F., He, D., Li, X., Ding, E., Wang, N., Li, J., Gao, X., 2021. Drafting and revision: Laplacian pyramid network for fast high-quality artistic style transfer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5141–5150.
Lin et al. (2014) Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft coco: Common objects in context, in: European conference on computer vision, Springer. pp. 740–755.
Liu et al. (2021a) Liu, S., Lin, T., He, D., Li, F., Wang, M., Li, X., Sun, Z., Li, Q., Ding, E., 2021a. Adaattn: Revisit attention mechanism in arbitrary neural style transfer, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 6649–6658.
Liu et al. (2021b) Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021b. Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022.
Lyu et al. (2023a) Lyu, J., Li, G., Wang, C., Cai, Q., Dou, Q., Zhang, D., Qin, J., 2023a. Multicontrast mri super-resolution via transformer-empowered multiscale contextual matching and aggregation. IEEE Transactions on Neural Networks and Learning Systems.
Lyu et al. (2023b) Lyu, J., Li, G., Wang, C., Qin, C., Wang, S., Dou, Q., Qin, J., 2023b. Region-focused multi-view transformer-based generative adversarial network for cardiac cine mri reconstruction. Medical Image Analysis 85, 102760.
Ma et al. (2023) Ma, Y., Zhao, C., Li, X., Basu, A., 2023. Rast: Restorable arbitrary style transfer via multi-restoration, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 331–340.
Mamalet and Garcia (2012) Mamalet, F., Garcia, C., 2012. Simplifying convnets for fast learning, in: Artificial Neural Networks and Machine Learning–ICANN 2012: 22nd International Conference on Artificial Neural Networks, Lausanne, Switzerland, September 11-14, 2012, Proceedings, Part II 22, Springer. pp. 58–65.
Mu et al. (2022) Mu, F., Wang, J., Wu, Y., Li, Y., 2022. 3d photo stylization: Learning to generate stylized novel views from a single image, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16273–16282.
Nichol (2016) Nichol, K., 2016. Painter by numbers, wikiart.
Park and Lee (2019) Park, D.Y., Lee, K.H., 2019. Arbitrary style transfer with style-attentional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5880–5888.
Park et al. (2019) Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y., 2019. Semantic image synthesis with spatially-adaptive normalization, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 2337–2346.
Plizzari et al. (2021) Plizzari, C., Cannici, M., Matteucci, M., 2021. Skeleton-based action recognition via spatial and temporal transformer networks. Computer Vision and Image Understanding 208, 103219.
Radford et al. (2021) Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al., 2021. Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR. pp. 8748–8763.
Sandler et al. (2018) Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.C., 2018. Mobilenetv2: Inverted residuals and linear bottlenecks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520.
Santa Cruz et al. (2018) Santa Cruz, R., Fernando, B., Cherian, A., Gould, S., 2018. Visual permutation learning. IEEE transactions on pattern analysis and machine intelligence 41, 3100–3114.
(52) Si, C., Yu, W., Zhou, P., Zhou, Y., Wang, X., Shuicheng, Y. Inception transformer, in: Advances in Neural Information Processing Systems.
Simonyan and Zisserman (2014) Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Sortino et al. (2023) Sortino, R., Palazzo, S., Rundo, F., Spampinato, C., 2023. Transformer-based image generation from scene graphs. Computer Vision and Image Understanding 233, 103721.
Sun et al. (2024) Sun, J., Jiao, H., Li, G., Zhang, Z., Zhao, L., Xing, W., 2024. 3dgstream: On-the-fly training of 3d gaussians for efficient streaming of photo-realistic free-viewpoint videos. arXiv preprint arXiv:2403.01444.
Sun et al. (2023) Sun, J., Zhang, Z., Chen, J., Li, G., Ji, B., Zhao, L., Xing, W., 2023. Vgos: Voxel grid optimization for view synthesis from sparse inputs. arXiv preprint arXiv:2304.13386.
Szegedy et al. (2015) Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., 2015. Going deeper with convolutions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9.
Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. Advances in neural information processing systems 30.
Wang et al. (2022a) Wang, L., Chen, J., Liu, Y., 2022a. Frame-level refinement networks for skeleton-based gait recognition. Computer Vision and Image Understanding 222, 103500.
Wang and Yu (2020) Wang, X., Yu, J., 2020. Learning to cartoonize using white-box cartoon representations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8090–8099.
Wang et al. (2022b) Wang, Z., Zhang, Z., Zhao, L., Zuo, Z., Li, A., Xing, W., Lu, D., 2022b. Aesust: Towards aesthetic-enhanced universal style transfer, in: Proceedings of the 30th ACM International Conference on Multimedia, pp. 1095–1106.
Wang et al. (2021) Wang, Z., Zhao, L., Chen, H., Zuo, Z., Li, A., Xing, W., Lu, D., 2021. Evaluate and improve the quality of neural style transfer. Computer Vision and Image Understanding 207, 103203.
Wu et al. (2021) Wu, H., Qu, Y., Lin, S., Zhou, J., Qiao, R., Zhang, Z., Xie, Y., Ma, L., 2021. Contrastive learning for compact single image dehazing, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10551–10560.
Wu et al. (2022) Wu, Z., Zhu, Z., Du, J., Bai, X., 2022. Ccpl: contrastive coherence preserving loss for versatile style transfer, in: European Conference on Computer Vision, Springer. pp. 189–206.
Xu et al. (2021) Xu, W., Long, C., Wang, R., Wang, G., 2021. Drb-gan: A dynamic resblock generative adversarial network for artistic style transfer, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6383–6392.
Yang et al. (2022) Yang, F., Chen, H., Zhang, Z., Zhao, L., Lin, H., 2022. Gating patternpyramid for diversified image style transfer. Journal of Electronic Imaging 31, 063007–063007.
Yu et al. (2022) Yu, W., Si, C., Zhou, P., Luo, M., Zhou, Y., Feng, J., Yan, S., Wang, X., 2022. Metaformer baselines for vision. arXiv preprint arXiv:2210.13452.
Zhang et al. (2021) Zhang, T., Zhang, Z., Jia, W., He, X., Yang, J., 2021. Generating cartoon images from face photos with cycle-consistent adversarial networks. Computers, Materials and Continua.
Zhang et al. (2023a) Zhang, Y., Huang, N., Tang, F., Huang, H., Ma, C., Dong, W., Xu, C., 2023a. Inversion-based style transfer with diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10146–10156.
Zhang et al. (2022) Zhang, Y., Tang, F., Dong, W., Huang, H., Ma, C., Lee, T.Y., Xu, C., 2022. Domain enhanced arbitrary image style transfer via contrastive learning. arXiv preprint arXiv:2205.09542.
Zhang et al. (2023b) Zhang, Z., Sun, J., Chen, J., Zhao, L., Ji, B., Lan, Z., Li, G., Xing, W., Xu, D., 2023b. Caster: Cartoon style transfer via dynamic cartoon style casting. Neurocomputing 556, 126654.
Zhang et al. (2024a) Zhang, Z., Zhang, Q., Lin, H., Xing, W., Mo, J., Huang, S., Xie, J., Li, G., Luan, J., Zhao, L., et al., 2024a. Towards highly realistic artistic style transfer via stable diffusion with step-aware and layer-aware prompt. arXiv preprint arXiv:2404.11474.
Zhang et al. (2024b) Zhang, Z., Zhang, Q., Xing, W., Li, G., Zhao, L., Sun, J., Lan, Z., Luan, J., Huang, Y., Lin, H., 2024b. Artbank: Artistic style transfer with pre-trained diffusion model and implicit style prompt bank, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7396–7404.
Zhao et al. (2020) Zhao, Y., Wu, R., Dong, H., 2020. Unpaired image-to-image translation using adversarial consistency loss, in: European Conference on Computer Vision, Springer. pp. 800–815.
Zhu et al. (2017) Zhu, J.Y., Park, T., Isola, P., Efros, A.A., 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE international conference on computer vision, pp. 2223–2232.
Zuo et al. (2023) Zuo, Z., Zhao, L., Li, A., Wang, Z., Zhang, Z., Chen, J., Xing, W., Lu, D., 2023. Generative image inpainting with segmentation confusion adversarial training and contrastive learning. arXiv preprint arXiv:2303.13133.

	$\displaystyle L_{c}=\sum^{L}_{i=4}\|E^{x}_{VGG}(I_{cs})-E^{x}_{VGG}(I_{c})\|_{2},$		(15)
	$\displaystyle L_{s}=\sum^{L}_{i=1}\|\mu(E^{x}_{VGG}(I_{cs}))-\mu(E^{x}_{VGG}(I_{s}))\|_{2}$
	$\displaystyle+\|\theta(E^{x}_{VGG}(I_{cs}))-\theta(E^{x}_{VGG}(I_{s}))\|_{2},$		(16)

	$\displaystyle L_{Identity}=\lambda_{identity1}(\|I_{cs}-I_{c}\|_{2}+\|I_{cs}-I_{s}\|_{2})$
	$\displaystyle+\sum_{i=0}^{L}\lambda_{identity1}(\|E^{x}_{VGG}(I_{cs})-E^{x}_{VGG}(I_{c})\|_{2}$
	$\displaystyle+\|E^{x}_{VGG}(I_{cs})-E^{x}_{VGG}(I_{s})\|_{2}),$		(17)
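
As an illustration of how the content loss $L_{c}$ and style loss $L_{s}$ above can be evaluated, the following is a minimal sketch under stated assumptions rather than the training code of this work. It assumes that the per-layer VGG features have already been extracted (random arrays stand in for them here), that $\mu(\cdot)$ is the channel-wise mean, and that $\theta(\cdot)$ is the channel-wise standard deviation, which the equations do not state explicitly. The helper names `l2`, `content_loss`, and `style_loss` are hypothetical.

```python
import numpy as np

def l2(a, b):
    """Euclidean norm of the difference between two arrays of equal shape."""
    return float(np.linalg.norm((a - b).ravel()))

def channel_mean(feat):
    """mu(.): per-channel mean of a (C, H, W) feature map."""
    return feat.mean(axis=(1, 2))

def channel_std(feat):
    """theta(.): per-channel standard deviation (an assumption, see lead-in)."""
    return feat.std(axis=(1, 2))

def content_loss(feats_cs, feats_c, first_layer=4):
    """Sum of feature distances over layers first_layer..L (layers are 1-indexed)."""
    pairs = zip(feats_cs[first_layer - 1:], feats_c[first_layer - 1:])
    return sum(l2(f_cs, f_c) for f_cs, f_c in pairs)

def style_loss(feats_cs, feats_s):
    """Sum over layers 1..L of the distances between channel-wise statistics."""
    total = 0.0
    for f_cs, f_s in zip(feats_cs, feats_s):
        total += l2(channel_mean(f_cs), channel_mean(f_s))
        total += l2(channel_std(f_cs), channel_std(f_s))
    return total

# Random stand-ins for the features of L = 5 encoder layers (hypothetical shapes).
rng = np.random.default_rng(0)
feats_c = [rng.standard_normal((4, 8, 8)) for _ in range(5)]
# Flipping each map vertically changes the spatial layout but not the statistics.
flipped = [f[:, ::-1, :] for f in feats_c]

same_content = content_loss(feats_c, feats_c)
flipped_content = content_loss(flipped, feats_c)
flipped_style = style_loss(flipped, feats_c)
print(same_content, flipped_content > 0.0, flipped_style < 1e-9)
```

The flipped example shows the different roles of the two terms: the content loss penalizes a change in spatial layout, while the style loss depends only on channel-wise statistics and is therefore unaffected by it.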