Lightweight texture transfer based on texture feature preset

ShiQi Jiang, JunJie Kang, YuJian Li

Guilin University of Electronic Technology, Guilin, China

Abstract

In texture transfer, reference texture images typically exhibit highly repetitive texture features, and the results produced from different content images under the same style also share remarkably similar texture patterns. Encoding such highly similar texture features often requires deep layers and a large number of channels, which makes it the main source of the model's parameter count, computational load, and inference time. We propose a lightweight texture transfer method based on texture feature preset (TFP). TFP exploits the high repetitiveness of texture features by providing preset, universal texture feature maps for a given style. These preset feature maps can be fused and decoded directly with the shallow color transfer feature maps of any content image to generate texture transfer results, so redundant texture information is not encoded repeatedly. The preset texture feature map is encoded from noise input images that follow a consistent distribution (the standard normal distribution). This consistent input distribution avoids the problem of texture transfer differentiation, and by randomly sampling different noise inputs we can obtain different texture features and texture transfer results under the same reference style. Compared with state-of-the-art techniques, TFP not only produces visually superior results but also reduces the model size by 3.2-3538 times and speeds up inference by 1.8-5.6 times.

Keywords: Texture Feature Preset; Lightweight; Texture Transfer

Journal: Neural Networks

1 Introduction

Style transfer is a highly attractive image processing technique that can transfer the unique colors and texture styles of artworks to content images. In recent years, methods for style transfer have been widely proposed, which can be roughly divided into two categories: online image optimization and model optimization.

The representative image optimization method is that of Gatys et al. (2016), which back-propagates gradients to the input image and iteratively optimizes that image directly. The style pattern is represented by the feature correlations of a deep convolutional neural network (VGG, Sengupta et al. (2019)). Subsequent work mainly focuses on different forms of the loss function (Kolkin et al. (2019); Risser et al. (2017)). However, this online optimization is slow, and its high time cost greatly reduces its practical application value. In contrast, model optimization methods remove the cost of online iteration through offline model training and forward inference. There are three main types of model optimization: (1) per-style models, which train a dedicated transformation model for a single artistic style (Johnson et al. (2016); Li and Wand (2016b); Ulyanov et al. (2016a, b)) and synthesize stylized images for that one style; (2) multi-style models (Chen et al. (2017); Dumoulin et al. (2016); Wang et al. (2017); Li et al. (2017a); Zhang and Dana (2018a)), which introduce various network architectures to handle multiple styles in one model; (3) arbitrary style models (Zhang and Dana (2018b); Li et al. (2017b); Wang et al. (2022, 2020); Shen et al. (2018); Jing et al. (2020)), which use mechanisms such as feature modulation and matching to transfer any artistic style.

Looking back at all the above methods, only Gatys (Gatys et al. (2016)), DcDae (ShiQi Jiang (2023b)), CTDP (ShiQi Jiang (2023a)), and IDD (ShiQi Jiang (2023c)) achieve high-quality texture transfer. Observing the transfer results of CTDP in Fig.3, we find that, for the same style image, the texture regions of different generated results are extremely similar. Encoding such highly similar texture features requires deeper layers and more channels, so this encoding is also the main source of the model's parameter count, computational complexity, and inference time. Therefore, although significant progress has been made in recent years, existing methods overlook the highly repetitive nature of texture features and still re-encode this redundant texture information for every input.

To address these challenges, we propose TFP, a lightweight texture transfer model based on texture feature preset. After training, the model provides a preset, well-encoded universal deep texture feature map for a single style. In the inference stage, the preset texture feature map can be directly fused and decoded with the shallow color transfer feature maps of any content image, omitting the repeated encoding of deep texture feature maps. While changing the original CTDP framework as little as possible, our texture feature preset scheme reduces the model size by 3.2 times in the inference stage and accelerates inference by 1.8 times.

Beyond the improvements in model size and inference speed, the preset texture features are generated from noise, so we can produce different texture feature maps in the inference stage (see Fig.8) by randomly sampling noise, and thus different texture transfer results for the same content image. In addition, the input distribution differentiation experiment in IDD (ShiQi Jiang (2023c)) showed that distribution differences within the input content image lead to differentiated texture suppression. We found a similar issue in the texture transfer task, where distribution differences within the same content image lead to texture transfer differentiation, as shown in the red box area in Fig.4. In our method, however, the deep texture feature maps are generated unconditionally from noise images that all follow the same distribution, which avoids the problem of texture transfer differentiation.

Compared to state-of-the-art models, our TFP not only produces visually superior results, but also has a volume that is 3.2-3538 times smaller and a speed that is 1.8-5.6 times faster. In summary, our contributions are as follows:

1. We propose a lightweight texture transfer framework based on texture feature preset, which uses content-independent noise input images to encode texture feature maps and fuses them with the shallow feature maps of any content image for decoding into the texture transfer result.

2. By presetting a deep texture feature map, we can skip the encoding of the deep texture feature map during the inference stage, greatly reducing the model's parameter count, computational complexity, and inference time.

3. To prevent the semantic content from being completely masked by texture features, we designed a semantic noise texture fusion loss.

4. To address the local texture loss in texture feature maps caused by feature fusion decoding, we added a semantic conditional texture generation branch.

5. Generating deep texture feature maps from noise avoids the texture transfer differentiation caused by distribution differences within the content image.

6. Because texture features are generated from noise, randomly sampling the input noise yields different texture feature maps, which can be applied to produce different texture transfer results.

7. Numerous qualitative and quantitative experiments show that our method quickly achieves high-quality texture transfer with the fewest parameters among the compared methods.

2 Related work

2.1 Neural Style Transfer

With the groundbreaking work of Gatys et al. (2016), the era of neural style transfer (NST) began. The visual appeal of style transfer has inspired subsequent researchers to improve it in many aspects, including efficiency (Johnson et al. (2016); Ulyanov et al. (2016a)), quality (Jing et al. (2018); Li and Wand (2016a); Gu et al. (2018); Xie et al. (2022); ShiQi Jiang (2023b)), diversity (Wang et al. (2021); Chen et al. (2021)), and user control (Zhang et al. (2019); Champandard (2016)). Despite significant progress, existing methods struggle to process high-resolution images because of complex network structures and limited hardware resources.

2.2 Lightweight Style Transfer

To address the above challenges, Wang et al. (2020) employed a model compression technique, known as collaborative distillation, to reduce the convolutional filters of VGG-19. While this method significantly reduced memory consumption, the pruned model was still not fast enough to run on 4K super-resolution images. Shen et al. (2018) and Jing et al. (2020) designed lightweight networks but still use pre-trained VGG models to extract style features, which brings high computational cost and slow inference.

To achieve high-resolution style transfer, Chen et al. (2022) divide the input image into small patches and use thumbnail instance normalization for patch-wise stylization to ensure style consistency between different patches. Although this method achieves 4K super-resolution style transfer, it does not fundamentally solve the problem of excessive forward inference time.

Recently, Wang et al. (2022) completely removed VGG and added a dual modulation strategy to inject color and texture structure information during the decoding phase. However, as shown in Fig.7, experiments show that removing VGG significantly reduces performance on the arbitrary style transfer task, mainly manifested as color leakage, content structure distortion, and pseudo texture structure transfer. The transfer results for different texture structures are extremely similar, because the encoding and decoding of texture structures collapse into a single, compromised and suboptimal texture structure.

3 Method

Given an arbitrary content image, our goal is to achieve fast texture transfer through preset texture feature maps. The main challenges of this task lie in three aspects: (1) how to prevent semantic information from being completely masked by texture features during fusion decoding; (2) how to solve the problem of missing local texture in texture feature maps; (3) how to solve the problem of texture transfer differentiation.

3.1 Overview of TFP

As shown in Fig.2, our TFP framework consists of four main components, namely the shallow encoder $Enc_{s}$, the shallow decoder $Dec_{s}$, the deep encoder $Enc_{d}$, and the fusion decoder $Dec_{f}$, together with a style discriminator $D_{s}$ that is only used during the training phase. Under this framework, $Enc_{s}$ is mainly responsible for encoding the semantics, details, and color transfer, while $Enc_{d}$ is mainly responsible for encoding deep texture feature maps from noise inputs. Paired encoders and decoders have a symmetrical lightweight structure, consisting of two standard convolutional layers at the beginning and end and several depthwise separable convolutional layers (DW, Howard et al. (2017)) in between. The complete forward inference pipeline of our framework is as follows:

(1) Extracting the shallow features $f^{c}_{s}$ of content image $C$ using a shallow encoder $Enc_{s}$ , denoted as $f^{c}_{s}:=Enc_{s}(C)$ .

(2) Extracting the deep features $f^{n}_{d}$ of noise image $N$ using a deep encoder $Enc_{d}$ , denoted as $f^{n}_{d}:=Enc_{d}(N)$ .

(3) Extracting the deep features $f^{c}_{d}$ of content image $C$ using a deep encoder $Enc_{d}$ , denoted as $f^{c}_{d}:=Enc_{d}(C)$ .

(4) Obtain color transfer output $CS^{c}_{c}$ by inputting shallow features $f^{c}_{s}$ into shallow decoder $Dec_{s}$ , denoted as $CS^{c}_{c}:=Dec_{s}(f^{c}_{s})$ .

(5) Obtain noise texture transfer result $CS^{n}_{t}$ by fusing and decoding the fusion features of $f^{c}_{s}$ and $f^{n}_{d}$ , denoted as $CS^{n}_{t}:=Dec_{f}(\lambda_{s}Dae(f^{c}_{s})+\lambda_{d}f^{n}_{d})$ .

(6) Obtain content texture transfer result $CS^{c}_{t}$ by fusing and decoding the fusion features of $f^{c}_{s}$ and $f^{c}_{d}$ , denoted as $CS^{c}_{t}:=Dec_{f}(\lambda_{s}Dae(f^{c}_{s})+\lambda_{d}f^{c}_{d})$ .

Here, $Dae$ denotes the Detail Attention-enhanced module (ShiQi Jiang (2023b)). Because this framework primarily distinguishes whether the input image is noise or a content image, we denote the input form with superscripts, where $c$ represents content input and $n$ represents noise input. $\lambda_{s}$ and $\lambda_{d}$ represent the fusion strengths of the shallow and deep feature maps, respectively.
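The six steps above can be sketched in a few lines of Python. This is only an illustration of the data flow, not the actual implementation: feature maps are flattened to plain lists, the five networks are passed in as arbitrary callables, and the names `fuse` and `tfp_forward` are hypothetical. The sketch also assumes that the shallow and deep feature maps have the same size, which the weighted sum in steps (5) and (6) requires.

```python
from typing import Callable, Dict, List

Feature = List[float]  # flattened stand-in for a C x H x W feature map
Net = Callable[[Feature], Feature]


def fuse(f_shallow: Feature, f_deep: Feature, lam_s: float, lam_d: float) -> Feature:
    """Weighted element-wise sum: lam_s * f_shallow + lam_d * f_deep."""
    if len(f_shallow) != len(f_deep):
        raise ValueError("shallow and deep feature maps must have the same size")
    return [lam_s * s + lam_d * d for s, d in zip(f_shallow, f_deep)]


def tfp_forward(content: Feature, noise: Feature,
                enc_s: Net, enc_d: Net, dec_s: Net, dec_f: Net, dae: Net,
                lam_s: float = 1.0, lam_d: float = 1.0) -> Dict[str, Feature]:
    """Run steps (1)-(6) and return the three outputs by name."""
    f_c_s = enc_s(content)   # (1) shallow features of the content image
    f_n_d = enc_d(noise)     # (2) deep features of the noise image
    f_c_d = enc_d(content)   # (3) deep features of the content image
    return {
        "CS_c_c": dec_s(f_c_s),                                  # (4) color transfer
        "CS_n_t": dec_f(fuse(dae(f_c_s), f_n_d, lam_s, lam_d)),  # (5) noise texture transfer
        "CS_c_t": dec_f(fuse(dae(f_c_s), f_c_d, lam_s, lam_d)),  # (6) content texture transfer
    }
```

Note that steps (5) and (6) share the same shallow term and the same fusion decoder; only the deep feature map differs, which is what later allows the deep term to be replaced by a preset.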

Training Losses. To achieve style transfer, as in previous methods (Gatys et al. (2016); Xie et al. (2022); Shen et al. (2018); Wang et al. (2022); Li et al. (2017c); Park and Lee (2019); Huang and Belongie (2017); ShiQi Jiang (2023b)), we use a pre-trained VGG-16 (Sengupta et al. (2019)) as our loss model to calculate content and style losses. We use the perceptual loss (Johnson et al. (2016)) as our branch content loss $\mathcal{L}_{bc}$, and the content losses of all three branches are calculated at the ${relu2\_1}$ layer of VGG-16. The branch style loss $\mathcal{L}_{bs}$ is defined by matching Gram matrices (Gatys et al. (2016)), and the three branches calculate the style loss at different levels. We additionally introduce a style discrimination loss $\mathcal{L}_{adv}$, similar to ShiQi Jiang (2023b), to ensure the overall color and texture matching of stylized images. Note that VGG-16 is only used during the training phase; the inference phase requires no loss calculation and involves no large networks. In summary, the overall objective of TFP is:

\mathcal{L}_{Full}=\lambda_{bc}\mathcal{L}_{bc}+\lambda_{bs}\mathcal{L}_{bs}+\lambda_{adv}\mathcal{L}_{adv}+\lambda_{mtv}\mathcal{L}_{mtv}+\lambda_{fdc}\mathcal{L}_{fdc}+\lambda_{stf}\mathcal{L}_{stf},

(1)

where the hyper-parameters $\lambda_{bc}$ , $\lambda_{bs}$ , $\lambda_{adv}$ , $\lambda_{mtv}$ , $\lambda_{fdc}$ and $\lambda_{stf}$ define the relative importance of each component in the total loss function.
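As a small illustration of Eq.1, the weighted sum can be written as follows. The weights are the values reported in the implementation details (Sec.4.1); the function name `full_loss` and the dictionary layout are hypothetical, and the individual loss terms are assumed to have been computed elsewhere as scalars.

```python
from typing import Dict

# Loss weights as reported in Sec.4.1, keyed by the subscripts used in Eq.1.
LOSS_WEIGHTS: Dict[str, float] = {
    "bc": 1e0, "bs": 1e5, "adv": 1e0, "mtv": 2e-5, "fdc": 1e0, "stf": 1e0,
}


def full_loss(terms: Dict[str, float], weights: Dict[str, float] = LOSS_WEIGHTS) -> float:
    """Return the weighted sum of the six loss terms of Eq.1."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)
```

The large weight on the style term and the small weight on the mtv term presumably compensate for the different numerical scales of the raw losses, although the source does not state this.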

3.2 Background

CTDP (ShiQi Jiang (2023a)) pioneered the design of a dual pipeline framework for color and texture, which can simultaneously generate color and texture transfer results. As shown in Fig.1(a), CTDP actually produces three results simultaneously, namely shallow color transfer result, deep texture transfer result, and fusion decoding result. It generally takes the fusion decoding result as the final style transfer output.

3.2.1 Texture Transfer Differentiation

We found that all previous schemes that encode the content image and generate texture transfer results under semantic conditions, including CTDP, suffer from texture transfer differentiation, as shown in the red box area in Fig.4. This phenomenon is similar to the texture suppression differentiation in IDD (ShiQi Jiang (2023c)), which is believed to be caused by the continuity of the input image. Discontinuous inputs generate noise features in the feature map, and these noise features evolve into texture structures through convolution operations. Continuous inputs generate no noise features in the feature map and therefore do not evolve into texture structure features. As shown in the first column of Fig.4, the extremely continuous areas in the content images generate no texture information at all, while the discontinuous parts outside the red box complete the texture transfer task effectively. This phenomenon once again supports the hypothesis of IDD.

3.2.2 Texture Similarity

Fig.3 displays four output images of the CTDP model, with the top and bottom rows showing the stylized results for two different reference styles. By zooming in on the red and yellow box areas, we found that, under the same reference style, the stylized outputs of different content images have extremely similar textures. During training and inference, generating such highly similar texture features requires more convolutional encoding, which means a deeper network and a larger number of channels, and this makes the generation of texture features the main source of model parameters and computation. Is it necessary to repeatedly encode such redundant texture information?

3.2.3 Noise Input

We found in our experiments that replacing the input content image of CTDP with a pure noise image (standard normal distribution with mean 0 and variance 1) generates a pure texture image, as shown in the second row of Fig.5; the first row shows the corresponding reference style image. That is, if the input is noise, the model produces a pure texture image without any semantics. This phenomenon leads us to the following conjectures:

(1) Because only the Gram matrix (Gatys et al. (2016)) is constrained for reference style images, such models essentially perform color and texture reconstruction;

(2) When the input is a noise image, the model essentially performs unconditional texture reconstruction, that is, unconditional texture generation;

(3) When the input is a content image, the model essentially performs conditional texture reconstruction, that is, content-conditional texture generation;

(4) If unconditional generation produces a pure texture image with no semantics or structure, can we achieve texture transfer by fitting the pure texture image onto the content image?

3.2.4 Conclusion

Overall, our experiments on CTDP have yielded the following conclusions:

(1) The distribution difference of input images can lead to texture transfer differentiation issues;

(2) Different content input images will produce texture transfer results with extremely high similarity in texture features;

(3) The CTDP model essentially performs texture reconstruction. Given noise images as input, the model unconditionally generates pure texture images; given content images as input, it conditionally generates texture transfer images that carry the content semantics.

3.3 Texture Feature Preset Framework

We attempt to utilize the high repeatability of texture features and the model’s ability to unconditionally generate pure texture images from noisy inputs to achieve texture feature preset (TFP) effects. TFP aims to provide preset texture feature maps for a single style, which can be fused and decoded with any shallow color transfer feature map to directly generate texture transfer results, thereby avoiding duplicate encoding of redundant texture information.

3.3.1 Pseudo Texture Feature Preset

Firstly, we attempt to directly execute the pseudo texture feature preset scheme on the pre-trained CTDP (ShiQi Jiang (2023a)) framework. As shown in Fig.1(b), the three column outputs are the decoding results of shallow color transfer feature maps, noise feature map decoding, and fusion decoding of two feature maps. Observation shows that the pseudo TFP fusion decoding of the CTDP model results in a state where two images are directly superimposed, presenting an erroneous texture transfer effect where the content information is completely masked by texture features.

3.3.2 Semantic Texture Fusion Loss

We believe that the reason for the incorrect preset method of the pseudo texture features mentioned above is the lack of constraints on the noise encoded texture feature map. In fact, the pure texture feature map generated by noise in the previous experiment is only a side effect product of CTDP (ShiQi Jiang (2023a)) framework training, and is not suitable for being directly used as the preset texture feature map.

To solve the fusion problem of shallow color transfer feature maps and deep noise texture feature maps, we designed a semantic texture fusion loss $\mathcal{L}_{stf}$. Because the texture transfer result must present both the semantic and structural information of the content image and the texture features of the reference style image, $\mathcal{L}_{stf}$ is designed on the basis of the style loss of Gatys et al. (2016) and the content perceptual loss of Johnson et al. (2016). Unlike previous schemes, which compute these losses only on outputs produced from content image inputs, our scheme computes them on the output $CS^{n}_{t}$ obtained with randomly sampled noise input images $n$:

CS^{n}_{t}:=Dec_{f}(\lambda_{s}Dae(Enc_{s}(c))+\lambda_{d}Enc_{d}(n)).

(2)
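The source does not give the exact form of the semantic texture fusion loss, only that it builds on the Gram-matrix style loss of Gatys et al. (2016) and the perceptual content loss of Johnson et al. (2016). The sketch below therefore shows the generic combination of those two terms, under stated assumptions: feature maps are given as `C` rows of flattened activations taken from a loss network, the Gram matrix is normalized by `C * H * W`, both terms use a mean squared error, and all function names and the two weights are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # C rows (channels), each holding H*W flattened activations


def gram_matrix(features: Matrix) -> Matrix:
    """Channel-by-channel inner products, divided by C * H * W (assumed normalization)."""
    c = len(features)
    n = len(features[0])
    scale = float(c * n)
    return [[sum(a * b for a, b in zip(features[i], features[j])) / scale
             for j in range(c)] for i in range(c)]


def mse(a: Matrix, b: Matrix) -> float:
    """Mean squared error between two equally shaped matrices."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def fusion_loss_sketch(out_feats: Matrix, content_feats: Matrix, style_feats: Matrix,
                       w_content: float = 1.0, w_style: float = 1.0) -> float:
    """Content term on raw features plus style term on Gram matrices.

    out_feats are assumed to be loss-network features of the noise-branch output.
    """
    content_term = mse(out_feats, content_feats)
    style_term = mse(gram_matrix(out_feats), gram_matrix(style_feats))
    return w_content * content_term + w_style * style_term
```

Because the Gram matrix sums over spatial positions, it is unchanged when the positions are permuted. This is why the style term constrains texture statistics without constraining layout, while the content term ties the output to the structure of the content image.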

3.3.3 Semantic Conditional Texture Generation Branch

As shown in the first row of Fig.6(b), the semantic texture fusion loss $\mathcal{L}_{stf}$ does prevent the content semantics from being masked by texture features. However, the second row shows that the texture map generated from noise suffers from local texture loss, which leads to poor fusion decoding performance.

We believe the local texture loss is a side effect of applying $\mathcal{L}_{stf}$ directly to the noise branch. Under the sole constraint of $\mathcal{L}_{stf}$, the model pursues the best semantic texture fusion effect and compromises the encoding of the noise-based texture feature maps, which results in more pronounced local texture loss in exchange for a lower $\mathcal{L}_{stf}$. Therefore, we introduced a semantic conditional texture generation branch, hoping that semantic conditional encoding based on content images can guide the model to encode deep texture feature maps that are beneficial for feature fusion.

As shown in Fig.6(c), the top and bottom rows show the results for two styles after training with the semantic conditional texture generation branch added. In the first row, the color matching of the fusion decoding result is higher, almost no content color is leaked, the artifacts in the noise output are reduced, and the scale of the texture features is increased. The result presents an artistic effect in which the content semantics are composed entirely of style texture patterns. In the second row, the fusion decoding and texture feature decoding results largely resolve the problem of local texture loss.

3.4 Fast Texture Transfer

The ultimate goal of our texture feature preset framework is to achieve faster inference by omitting the repeated encoding of deep texture features. Fig.2(b) shows the execution process of TFP in the inference stage, where the gray feature map is the deep texture feature map preset after training. We only need to perform shallow encoding on the content image $c\in R^{3\times H\times W}$ to obtain the shallow color transfer feature map, and then fuse it with the preset texture feature map $f^{n}_{tfp}$ and decode to output the texture transfer result quickly, denoted as:

CS^{n}_{t}:=Dec_{f}(\lambda_{s}Dae(Enc_{s}(c))+\lambda_{d}f^{n}_{tfp}).

(3)

In this process, the shallow color transfer feature map is responsible for providing the structure and detail information of the content and completing the encoding of color transfer, while the preset texture feature map is responsible for providing complex and highly repetitive texture patterns with reference styles. As shown in Tab.1, TFP can achieve the fastest inference speed of 3.1ms for a single image at 256 resolution, which is 1.8 times faster than the previous fastest model.
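The saving comes from computing the deep texture feature map once and reusing it for every content image. A minimal sketch of this caching pattern, with flattened lists standing in for feature maps and arbitrary callables standing in for the trained networks, might look as follows. The names `build_preset` and `PresetTextureTransfer` are hypothetical, and the sketch assumes the preset and the shallow feature map have the same size.

```python
from typing import Callable, List

Feature = List[float]  # flattened stand-in for a C x H x W feature map
Net = Callable[[Feature], Feature]


def build_preset(enc_d: Net, noise: Feature) -> Feature:
    """Encode the noise image once; the result plays the role of the preset f_tfp."""
    return enc_d(noise)


class PresetTextureTransfer:
    """Inference path of Eq.3: only the shallow encoder, Dae and fusion decoder run."""

    def __init__(self, enc_s: Net, dec_f: Net, dae: Net, preset: Feature,
                 lam_s: float = 1.0, lam_d: float = 1.0) -> None:
        self.enc_s, self.dec_f, self.dae = enc_s, dec_f, dae
        self.preset = preset
        self.lam_s, self.lam_d = lam_s, lam_d

    def __call__(self, content: Feature) -> Feature:
        shallow = self.dae(self.enc_s(content))
        if len(shallow) != len(self.preset):
            raise ValueError("preset and shallow feature map must have the same size")
        fused = [self.lam_s * s + self.lam_d * d for s, d in zip(shallow, self.preset)]
        return self.dec_f(fused)
```

With this structure the deep encoder does not need to be shipped or executed at inference time, which is consistent with the reduction in parameters and latency reported in Tab.1.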

3.5 Random Texture Generation

Since our deep texture feature maps are encoded from random noise maps rather than from content images, we can generate different texture feature maps during the inference stage by sampling different input noise and apply them to the texture transfer results. In Fig.8, the upper and lower rows are pure texture images directly decoded from the texture feature maps of two styles, and the two columns use different randomly sampled noise input images. The red and purple boxes are enlarged regions of the original image. The texture patterns decoded from texture feature maps of the same style are similar, but the combination and arrangement of features differ. Random noise thus yields different texture images with high similarity, and the different texture feature maps can also be used, through fusion decoding, to produce different texture transfer effects on a single content image.
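Sampling the noise input itself is straightforward. The sketch below draws a `channels x height x width` image from the standard normal distribution with a seeded generator from the Python standard library, so that each seed gives a reproducible but different input to the deep encoder. The function names are hypothetical, and the nested-list layout is only a stand-in for a tensor.

```python
import random
from typing import List, Tuple

Image = List[List[List[float]]]  # channels x height x width


def sample_noise(channels: int, height: int, width: int, seed: int) -> Image:
    """Draw every pixel independently from N(0, 1) using a seeded generator."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(width)]
             for _ in range(height)]
            for _ in range(channels)]


def mean_and_variance(image: Image) -> Tuple[float, float]:
    """Empirical mean and (population) variance over all pixels."""
    values = [v for channel in image for row in channel for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance
```

Two different seeds produce different noise images with the same statistics, which matches the observation above: the resulting textures share the same patterns but differ in arrangement.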

3.6 Texture Transfer Differentiation

As the red boxes in Fig.4 show, none of the previous schemes achieves good texture transfer in the boxed area, and all of them exhibit serious texture transfer differentiation. The background of the content image has varying degrees of continuity, and the boxed area, unlike the other areas, is extremely continuous and smooth. All previous solutions encode texture conditionally on the content image, and content images often contain local differences in distribution. In particular, extremely continuous input regions cannot generate texture representations, which results in texture transfer differentiation.

Our TFP scheme avoids this content-conditional texture encoding and instead relies entirely on unconditional texture encoding from inputs with a single noise distribution, generating a pure texture image with globally consistent texture patterns. As shown in the last column of Fig.4, the red box in our TFP result contains a texture pattern that is highly consistent with the rest of the background.

Method	(a) Params/$10^{6}$	(b) Storage/MB	(c) GFLOPs	(d) Time/ms	(e) Prefer/% (top 1)	(f) Prefer/% (top 3)
PAMA	35.389	138.275	89.802	17.5	0.13	0.11
SANet	20.911	81.703	66.924	9.2	0.06	0.05
Adain	7.01	27.397	47.459	6.2	0.06	0.05
Micro	0.472	1.866	2.765	5.5	0.06	0.05
CTDP	0.032	0.15	0.935	5.5	0.17	0.15
TFP(Ours)	0.01	0.05	0.545	3.1	0.36	0.48
TFP-L(Ours)	0.007	0.039	0.398	2.3	-	-

Table 1: Quantitative Comparison with State-of-the-Art Methods. Storage space is measured within the PyTorch model. GFLOPs and time are measured when both content and style are 4K images, and tested on the NVIDIA 3090 (24GB) GPU. The best results are highlighted in bold. OOM: Out of memory.

4 Experiments

4.1 Implementation Details

We used MS-COCO (Lin et al. (2014)) as the content images and extracted style images from Wikiart (Phillips and Mackintosh (2011)) to train our TFP model. In Eq.1, the values of $\lambda_{bc}$ , $\lambda_{bs}$ , $\lambda_{adv}$ , $\lambda_{mtv}$ , $\lambda_{fdc}$ and $\lambda_{stf}$ are set to 1e0, 1e5, 1e0, 2e-5, 1e0 and 1e0, respectively. We used the Adam optimizer (Kingma and Ba (2014)) with a learning rate of 0.001. During training, we first resize the content image to 512 and then randomly crop it to 256 $\times$ 256 pixels for augmentation. Style images are processed in a similar way, but all images in a batch are randomly cropped from the same reference style image. Unlike previous schemes, we also need to randomly sample a batch of noise images of size 256 $\times$ 256 as input. We conducted all experiments on an RTX 3090 GPU.
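The crop sampling described above can be sketched as follows. The sketch only computes crop boxes and leaves image loading and resizing aside; it assumes each image has already been resized so that both sides are at least the crop size, and the function names and the `(top, left, height, width)` box layout are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (top, left, height, width)
Size = Tuple[int, int]           # (height, width)


def random_crop_box(height: int, width: int, size: int, rng: random.Random) -> Box:
    """Pick a uniformly random size x size window inside a height x width image."""
    if size > height or size > width:
        raise ValueError("crop size must not exceed the image size")
    top = rng.randint(0, height - size)    # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return (top, left, size, size)


def sample_batch_boxes(content_sizes: Sequence[Size], style_size: Size,
                       crop: int = 256, seed: int = 0) -> Tuple[List[Box], List[Box]]:
    """One crop per content image, and one crop per batch slot from a single style image."""
    rng = random.Random(seed)
    content_boxes = [random_crop_box(h, w, crop, rng) for h, w in content_sizes]
    style_boxes = [random_crop_box(style_size[0], style_size[1], crop, rng)
                   for _ in content_sizes]
    return content_boxes, style_boxes
```

Every style crop in a batch comes from the same reference image, so the batch sees several views of one texture while the content crops come from different images.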

4.2 Comparisons with Prior Arts

Because our model can quickly generate color and texture transfer results simultaneously, we compared our TFP with state-of-the-art color transfer models and texture transfer models (arbitrary style transfer). For the compared schemes, we directly ran the code published by the authors with the default settings.

4.2.1 Qualitative Comparison

The qualitative comparison results of different texture transfer methods are shown in Fig.7.

We first compare with the most closely related work, CTDP (ShiQi Jiang (2023a)). In the CTDP results in the first, fifth, and sixth rows of Fig.7, a halo can be observed at the semantic edges of the aircraft, house, and clock tower. TFP alleviates this problem to some extent, and its results show no obvious or conflicting edge halo. In the CTDP result in the second row of Fig.7, there is significant texture transfer differentiation in the lower left corner: the smooth content input there leads CTDP to encode no texture information, while TFP avoids this problem. TFP achieves an effect comparable to CTDP in matching texture structure and color information.

Adain (Huang and Belongie (2017)), SANet (Park and Lee (2019)), PAMA (Luo et al. (2022)), and Micro (Wang et al. (2022)) all have significant issues with texture transfer quality. In the second row of Fig.7, all of these schemes show obvious texture transfer differentiation in the lower left corner. Although PAMA and Micro completely lose the texture there, they at least transfer color. Adain and SANet fail even to encode the color and leak the background color of the content image. In the first, third, fifth, and sixth rows of Fig.7, the results of all these schemes show very obvious halos at the edges of the main objects (the airplane, the bed and chair, the house, and the clock tower). The texture matching in all results of Adain and Micro is very low. In Micro, all results have similar texture patterns that do not correspond to the reference style. Adain produces different texture patterns, but they do not match the reference texture pattern well. PAMA and SANet match the texture patterns correctly only in the third and fourth rows, and even there their color and overall transfer effects are slightly inferior. For the other reference images, their texture and color matching is low and the transfer effect is poor.

In contrast, our TFP achieves state-of-the-art texture transfer. The global color and texture structure of all our results match the reference well, and TFP is the only scheme that avoids both the halo at the edges of the main objects and texture transfer differentiation.

4.2.2 Quantitative Comparison

Tab.1 shows the quantitative comparison between our model and state-of-the-art methods in the inference stage. Due to the lack of widely accepted quantitative evaluation metrics for style transfer tasks in the industry, we only compared model size and forward inference speed in this study. As shown in columns a and d of Tab.1, our TFP is 3.2-3538 times smaller and 1.8-5.6 times faster than existing models.
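The reported ranges follow directly from columns (a) and (d) of Tab.1. The short script below recomputes them; the numbers are copied from the table, and the helper name `ratios_vs` is hypothetical.

```python
from typing import Dict

# Columns (a) Params/10^6 and (d) Time/ms from Tab.1.
params_m: Dict[str, float] = {
    "PAMA": 35.389, "SANet": 20.911, "Adain": 7.01,
    "Micro": 0.472, "CTDP": 0.032, "TFP": 0.01,
}
time_ms: Dict[str, float] = {
    "PAMA": 17.5, "SANet": 9.2, "Adain": 6.2,
    "Micro": 5.5, "CTDP": 5.5, "TFP": 3.1,
}


def ratios_vs(table: Dict[str, float], ours: str = "TFP") -> Dict[str, float]:
    """Ratio of every other method's value to ours (values above 1 favor ours)."""
    return {name: value / table[ours] for name, value in table.items() if name != ours}


size_ratio = ratios_vs(params_m)
speed_ratio = ratios_vs(time_ms)

print(f"size:  {min(size_ratio.values()):.1f}x to {max(size_ratio.values()):.1f}x")
print(f"speed: {min(speed_ratio.values()):.1f}x to {max(speed_ratio.values()):.1f}x")
```

The smallest ratios come from CTDP (0.032 / 0.01 and 5.5 / 3.1) and the largest from PAMA (35.389 / 0.01 and 17.5 / 3.1), which gives the 3.2-3538 and 1.8-5.6 ranges quoted in the text.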

4.2.3 User Study

The evaluation of stylized results is highly subjective. Therefore, we conducted a user study on six methods (CTDP, AdaIN, PAMA, MicroAST, SANet, and ours). Each method generated five stylized images, and the resulting 30 images were randomly shuffled and presented to each participant. Participants had unlimited time to select their favorite image (top 1) and their three favorite images (top 3). We collected a total of 108 valid votes (three per person) from 36 participants, and the last two columns of Tab.1 show the percentage of preferred results for each method for top 1 and top 3, respectively. The results indicate that our stylized images are more attractive than those of competitors.

4.3 Ablation Study

The result without the semantic texture fusion loss is shown in Fig.9(b): the semantic information of the content image is completely masked by texture features, and the shallow color transfer feature map and the deep texture feature map are not well fused and decoded.

The result without the semantic conditional texture generation branch is shown in Fig.9(c): the texture part of the result suffers from local texture loss.

5 Conclusion

In this article, we propose a dual-pipeline lightweight framework called CTDP. For the first time, our dual pipelines can simultaneously generate the color and texture transfer results corresponding to a style image, and the weighted fusion of the dual-branch features makes it possible, also for the first time, to add texture features of controllable intensity to the color transfer results. In addition, we designed the mtv loss to suppress texture information when the model matches the Gram matrix, and we found that smoothing the input in our framework can almost completely eliminate texture features. Extensive experiments demonstrate the effectiveness of this method. Compared with the state of the art, our CTDP is the first model that can simultaneously achieve color and texture transfer. It not only produces visually superior results in both transfer tasks, but also has a color transfer branch with a model size as low as 20k.

References

Champandard (2016) Champandard, A.J., 2016. Semantic style transfer and turning two-bit doodles into fine artworks. arXiv preprint arXiv:1603.01768 .
Chen et al. (2017) Chen, D., Yuan, L., Liao, J., Yu, N., Hua, G., 2017. Stylebank: An explicit representation for neural image style transfer, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1897–1906.
Chen et al. (2021) Chen, H., Zhao, L., Zhang, H., Wang, Z., Zuo, Z., Li, A., Xing, W., Lu, D., 2021. Diverse image style transfer via invertible cross-space mapping, in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE Computer Society. pp. 14860–14869.
Chen et al. (2022) Chen, Z., Wang, W., Xie, E., Lu, T., Luo, P., 2022. Towards ultra-resolution neural style transfer via thumbnail instance normalization, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 393–400.
Dumoulin et al. (2016) Dumoulin, V., Shlens, J., Kudlur, M., 2016. A learned representation for artistic style. arXiv preprint arXiv:1610.07629 .
Gatys et al. (2016) Gatys, L.A., Ecker, A.S., Bethge, M., 2016. Image style transfer using convolutional neural networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2414–2423.
Gu et al. (2018) Gu, S., Chen, C., Liao, J., Yuan, L., 2018. Arbitrary style transfer with deep feature reshuffle, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8222–8231.
Howard et al. (2017) Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H., 2017. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 .
Huang and Belongie (2017) Huang, X., Belongie, S., 2017. Arbitrary style transfer in real-time with adaptive instance normalization, in: Proceedings of the IEEE international conference on computer vision, pp. 1501–1510.
Jing et al. (2020) Jing, Y., Liu, X., Ding, Y., Wang, X., Ding, E., Song, M., Wen, S., 2020. Dynamic instance normalization for arbitrary style transfer, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 4369–4376.
Jing et al. (2018) Jing, Y., Liu, Y., Yang, Y., Feng, Z., Yu, Y., Tao, D., Song, M., 2018. Stroke controllable fast style transfer with adaptive receptive fields, in: Proceedings of the European Conference on Computer Vision (ECCV), pp. 238–254.
Johnson et al. (2016) Johnson, J., Alahi, A., Fei-Fei, L., 2016. Perceptual losses for real-time style transfer and super-resolution, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, Springer. pp. 694–711.
Kingma and Ba (2014) Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 .
Kolkin et al. (2019) Kolkin, N., Salavon, J., Shakhnarovich, G., 2019. Style transfer by relaxed optimal transport and self-similarity, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10051–10060.
Li and Wand (2016a) Li, C., Wand, M., 2016a. Combining markov random fields and convolutional neural networks for image synthesis, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2479–2486.
Li and Wand (2016b) Li, C., Wand, M., 2016b. Precomputed real-time texture synthesis with markovian generative adversarial networks, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, Springer. pp. 702–716.
Li et al. (2017a) Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H., 2017a. Diversified texture synthesis with feed-forward networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3920–3928.
Li et al. (2017b) Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.H., 2017b. Universal style transfer via feature transforms. Advances in neural information processing systems 30.
Li et al. (2017c) Li, Y., Wang, N., Liu, J., Hou, X., 2017c. Demystifying neural style transfer. arXiv preprint arXiv:1701.01036 .
Lin et al. (2014) Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft coco: Common objects in context, in: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer. pp. 740–755.
Luo et al. (2022) Luo, X., Han, Z., Yang, L., 2022. Progressive attentional manifold alignment for arbitrary style transfer, in: Proceedings of the Asian Conference on Computer Vision, pp. 3206–3222.
Park and Lee (2019) Park, D.Y., Lee, K.H., 2019. Arbitrary style transfer with style-attentional networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5880–5888.
Phillips and Mackintosh (2011) Phillips, F., Mackintosh, B., 2011. Wiki art gallery, inc.: A case for critical thinking. Issues in Accounting Education 26, 593–608.
Risser et al. (2017) Risser, E., Wilmot, P., Barnes, C., 2017. Stable and controllable neural texture synthesis and style transfer using histogram losses. arXiv preprint arXiv:1701.08893 .
Sengupta et al. (2019) Sengupta, A., Ye, Y., Wang, R., Liu, C., Roy, K., 2019. Going deeper in spiking neural networks: Vgg and residual architectures. Frontiers in neuroscience 13, 95.
Shen et al. (2018) Shen, F., Yan, S., Zeng, G., 2018. Neural style transfer via meta networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8061–8069.
ShiQi Jiang (2023a) Jiang, S., Kang, J., Li, Y., 2023a. Color and texture dual pipeline lightweight style transfer. arXiv preprint arXiv:2310.01321 .
ShiQi Jiang (2023b) Jiang, S., Kang, J., Li, Y., 2023b. Degree-controllable lightweight fast style transfer with detail attention-enhanced. arXiv preprint arXiv:2306.16846 .
ShiQi Jiang (2023c) Jiang, S., Kang, J., Li, Y., 2023c. Dual pipeline style transfer with input distribution differentiation. arXiv preprint arXiv:2311.05432 .
Ulyanov et al. (2016a) Ulyanov, D., Lebedev, V., Vedaldi, A., Lempitsky, V., 2016a. Texture networks: Feed-forward synthesis of textures and stylized images. arXiv preprint arXiv:1603.03417 .
Ulyanov et al. (2016b) Ulyanov, D., Vedaldi, A., Lempitsky, V., 2016b. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022 .
Wang et al. (2020) Wang, H., Li, Y., Wang, Y., Hu, H., Yang, M.H., 2020. Collaborative distillation for ultra-resolution universal style transfer, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 1860–1869.
Wang et al. (2017) Wang, X., Oxholm, G., Zhang, D., Wang, Y.F., 2017. Multimodal transfer: A hierarchical deep convolutional neural network for fast artistic style transfer, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5239–5247.
Wang et al. (2021) Wang, Z., Zhao, L., Chen, H., Zuo, Z., Li, A., Xing, W., Lu, D., 2021. Divswapper: towards diversified patch-based arbitrary style transfer. arXiv preprint arXiv:2101.06381 .
Wang et al. (2022) Wang, Z., Zhao, L., Zuo, Z., Li, A., Chen, H., Xing, W., Lu, D., 2022. Microast: Towards super-fast ultra-resolution arbitrary style transfer. arXiv preprint arXiv:2211.15313 .
Xie et al. (2022) Xie, X., Li, Y., Huang, H., Fu, H., Wang, W., Guo, Y., 2022. Artistic style discovery with independent components, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19870–19879.
Zhang et al. (2019) Zhang, C., Zhu, Y., Zhu, S.C., 2019. Metastyle: Three-way trade-off among speed, flexibility, and quality in neural style transfer, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1254–1261.
Zhang and Dana (2018a) Zhang, H., Dana, K., 2018a. Multi-style generative network for real-time transfer, in: Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0–0.
Zhang and Dana (2018b) Zhang, H., Dana, K., 2018b. Multi-style generative network for real-time transfer, in: Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0–0.