License: arXiv.org perpetual non-exclusive license
arXiv:2312.03667v1 [cs.CV] 06 Dec 2023

WarpDiffusion: Efficient Diffusion Model for High-Fidelity Virtual Try-on

Xujie Zhang$^{1}$, Xiu Li$^{2}$, Michael Kampffmeyer$^{3}$, Xin Dong$^{2}$

Zhenyu Xie$^{1}$, Feida Zhu$^{2}$, Haoye Dong$^{4}$, Xiaodan Liang$^{*1}$

$^{1}$Shenzhen Campus of Sun Yat-Sen University, $^{2}$ByteDance

$^{3}$UiT The Arctic University of Norway, $^{4}$Carnegie Mellon University
{zhangxj59,xiezhy6}@mail2.sysu.edu.cn,[email protected],[email protected]
[email protected],[email protected],[email protected],[email protected]
Abstract

Image-based Virtual Try-On (VITON) aims to transfer an in-shop garment image onto a target person. While existing methods focus on warping the garment to fit the body pose, they often overlook the synthesis quality around the garment-skin boundary and realistic effects like wrinkles and shadows on the warped garments. These limitations greatly reduce the realism of the generated results and hinder the practical application of VITON techniques. Leveraging the notable success of diffusion-based models in cross-modal image synthesis, some recent diffusion-based methods have ventured to tackle this issue. However, they tend to either consume a significant amount of training resources or struggle to achieve realistic try-on effects and retain garment details. For efficient and high-fidelity VITON, we propose WarpDiffusion, which bridges the warping-based and diffusion-based paradigms via a novel informative and local garment feature attention mechanism. Specifically, WarpDiffusion incorporates local texture attention to reduce resource consumption and uses a novel auto-mask module that effectively retains only the critical areas of the warped garment while disregarding unrealistic or erroneous portions. Notably, WarpDiffusion can be integrated as a plug-and-play component into existing VITON methodologies, elevating their synthesis quality. Extensive experiments on high-resolution VITON benchmarks and an in-the-wild test set demonstrate the superiority of WarpDiffusion, surpassing state-of-the-art methods both qualitatively and quantitatively.

Footnote 1: Xiaodan Liang is the corresponding author. This work was done during Xujie Zhang’s internship at ByteDance.

Figure 1: WarpDiffusion surpasses current state-of-the-art methods (PF-AFN [9], FS-VTON [14], SDAFN [2], GP-VTON [36], DCI-VTON [11]), producing high-fidelity results with photo-realistic try-on effects, including wrinkles, shadows and skin in see-through areas.

1 Introduction

Virtual Try-On (VITON) is gaining significant traction in the e-commerce industry due to its potential to revolutionize online shopping with virtual changing room experiences. VITON achieves this by generating highly realistic images of a person wearing a specific garment from an in-shop garment image and a person image [12]. However, most existing approaches fail to meet the following synthesis quality requirements of today’s e-commerce scenarios: 1) Accurate Garment Deformation: The provided garment should be precisely deformed to fit the body pose of the target person, while preserving garment characteristics such as shape and texture. 2) Realistic Try-On Effect: The synthesized results should not only preserve the inherent characteristics of the garment but also incorporate realistic try-on effects, including wrinkles and shadows on the garment region. 3) Consistent Visual Quality: The visual quality of the results should be consistently high throughout the entire image, without any noticeable artifacts or blurriness in local regions such as the skin-garment boundary or neck region.

Most existing methods [31, 38, 9, 19, 36] employ a two-stage framework for try-on synthesis. In the first stage, a neural network is used to model the explicit deformation field, via a Thin Plate Spline (TPS) transformation [3] or appearance flow [42], for garment warping. In the second stage, Generative Adversarial Networks (GANs) [10] are used for try-on synthesis. However, these two-stage methods heavily rely on the warping module and often fail to meet the above-mentioned synthesis quality requirements. Specifically, the explicit deformation field used in the first stage often struggles to handle complex non-rigid deformations, which can negatively impact the subsequent try-on synthesis. For instance, in the first row of Fig. 1, when the designated person has a challenging body pose, such as crossed arms, baselines [14, 36, 2, 9, 11] all show excessive distortion where the arms intersect.

Besides, since the explicit deformation field directly performs warping in pixel space, the try-on appearance is solely determined by the original in-shop garment. As shown in the second row of Fig. 1, when the given garment is a flat image without any realistic wrinkles or shadows, these methods are unable to produce realistic results.

Furthermore, due to the limited training data and training instability, GAN-based generators struggle to learn the specific distribution for fine details, resulting in inferior synthesis quality for certain local regions. For instance, these methods often generate blurry or jagged artifacts around the skin-garment boundary, which is evident in the third row of Fig. 1. Further, none of the prior approaches are able to accurately capture the see-through nature of the garment in the belly region in this example.

Recently, in light of the significant success of diffusion models for image synthesis, some pioneering methods [16, 4] have attempted to explore the application of diffusion models to VITON. However, these methods do not incorporate any deformation mechanism, which leads to spatial misalignment between the in-shop garment and body pose, ultimately resulting in a loss of garment details.

| Method | Devices | Training time | Data |
|---|---|---|---|
| TryonDiffusion [43] | 32 TPU-v4 | 9 days | 4M |
| WarpDiffusion | 8 A100 | 1 day | 30K |

Table 1: WarpDiffusion requires significantly fewer training resources than TryonDiffusion [43].

To resolve this issue, TryOnDiffusion [43] introduces an implicit deformation field via cross-attention layers [30] to warp the input garment. However, while this approach leads to a scalable model for high-fidelity try-on synthesis, its training requires a prohibitive amount of resources, including 4 million paired data samples and thousands of GPU hours.

In order to leverage the generative capabilities of diffusion models for the VITON task in the face of limited resources, we introduce a novel, effective, and efficient model called WarpDiffusion. WarpDiffusion combines the efficiency of explicit warping modules with the generative capability of pre-trained text-to-image diffusion models (i.e., StableDiffusion [27]) to achieve high-fidelity try-on synthesis. It is worth noting that this bridging is not trivial and previous attempts fail to preserve garment details [11, 23]. We solve this by introducing a novel informative and local garment feature attention mechanism that ensures accurate garment deformation and synthesis of realistic effects, while addressing the imperfections of the explicit warping results. Meanwhile, this mechanism introduces almost no overhead when compared to the original StableDiffusion model. As shown in Table 1, the entire model can be trained on 8 NVIDIA A100 GPUs at a resolution of 1024, with a significantly lower training consumption than that of TryonDiffusion [43]. Furthermore, our WarpDiffusion is not reliant on a specific warping module and can be seamlessly integrated as a plug-and-play module into previous GAN-based try-on methods [9, 14, 2, 36].

Figure 2: The framework of our WarpDiffusion. During the training stage, the garment-agnostic image $I_{a}$, garment-agnostic mask $M_{a}$, skin mask $M_{s}$, and foreground mask $M_{f}$ are concatenated with the input noise $I^{\prime}_{t}$ and injected into the diffusion UNet. The in-shop garment $G$ is extracted via the Global Garment Feature Encoder and used in the cross-attention. Simultaneously, the warped garment $G_{w}$ and the warped garment mask $M_{w}$ are processed by the Auto Mask module to obtain an informative mask $M_{\text{info}}$. $M_{\text{info}}$ is then used to derive an improved warped garment, denoted as $G^{\prime}_{w}$, which is subsequently utilized in the local texture attention. During the inference step, we employ the warped result from any try-on model to generate realistic try-on results, showcasing WarpDiffusion’s plug-and-play capability.

Overall, our contributions can be summarized as follows:

  • We propose WarpDiffusion, a framework integrating both explicit warping and a diffusion model, to collaboratively utilize global and local garment information for high-fidelity VITON synthesis in real-world scenarios.

  • WarpDiffusion leverages the pre-trained StableDiffusion and a custom-designed local texture attention to enhance the synthesis quality of the body region and significantly reduces resource consumption during training.

  • WarpDiffusion introduces a novel Auto Mask Module for informative garment feature extraction, enabling the synthesis of realistic try-on effects such as wrinkles and shadows without loss of details.

  • Extensive experiments conducted on public try-on benchmarks and in-the-wild test sets demonstrate that WarpDiffusion outperforms existing state-of-the-art methods and can be easily integrated as a plug-and-play module to enhance the synthesis quality of existing approaches.

2 Related work

GAN-Based Virtual Try-on. Most existing VITON methods [31, 13, 39, 7, 38, 18, 9, 14, 5, 20, 6, 34, 41, 33, 19, 2, 22, 35, 8, 17, 36] follow a conventional two-stage Generative Adversarial Network (GAN)-based pipeline for realistic try-on synthesis. In this paradigm, the first stage employs an explicit warping module to deform the in-shop garment to the target shape, while the second stage uses a GAN-based generator to fuse the deformed garment onto the reference person. The synthesis quality of these GAN-based solutions heavily relies on the deformation quality of the first stage, prompting current methods to emphasize enhancing the non-rigid deformation ability of the warping module. However, most of the GAN-based solutions directly exploit the UNet-based [28] generator for try-on synthesis and seldom explore how to improve the capability of the generator, leading to try-on results with inferior visual quality. In this work, we mainly focus on exploring a diffusion-based generator for virtual try-on to improve the synthesis capability of existing methods in terms of the realism of the garment-skin boundary and realistic try-on effects in the garment region.

Diffusion-Based Virtual Try-on. Compared to GAN-based models, diffusion models have shown great success in high-fidelity conditional image generation [27, 29, 26] and editing [15, 21, 37]. Image-based virtual try-on can be considered as a special case of the image editing/inpainting task conditioned on the given garment image. Therefore a straightforward way is to extend text-to-image diffusion models to accept images as the condition. Although methods such as [37, 16, 4] have demonstrated their ability for virtual try-on, these methods fail to preserve the texture detail of the try-on results because of the spatial misalignment between the in-shop garment and the reference person.

The recently proposed TryonDiffusion [43] instead introduces an implicit warping mechanism for garment warping, which addresses the texture misalignment problem mentioned above and obtains promising try-on synthesis. However, the training of TryonDiffusion requires a large-scale dataset and thousands of GPU hours, which generally is prohibitive for most researchers.

Other recent approaches [23, 11] aim to merge traditional GAN-based methods with diffusion models. They use the explicit warping module to create the warped garment and use the diffusion model to fuse it with the reference person image. Although these methods share a similar framework with ours, we observed that both of them fail to preserve texture details. We argue that this is due to the way they use the results of the warping module and solve this with a novel informative and local garment feature attention mechanism.

3 Methods

3.1 Overview

Given an in-shop garment image $G$ and a person image $I$, image-based VITON aims to seamlessly fit $G$ onto $I$ to synthesize a photo-realistic try-on result $I^{\prime}$. To achieve this, WarpDiffusion first leverages a warping module to estimate the deformation field $F$, which is used to deform the garment image $G$ to the warped garment image $G_{w}$. Note that WarpDiffusion is designed to be agnostic to the choice of warping module and can thus act as a plug-and-play solution to any existing warping-based approach. Subsequently, WarpDiffusion integrates the garment image $G$, the warped garment $G_{w}$, and the garment-agnostic person information $I_{a}$ (which contains the face, hair, background and all areas that are not involved in the try-on process) to obtain the final try-on result $I^{\prime}$, which not only captures texture details in the garment image, but also generates realistic try-on effects like wrinkles and shadows.

To speed up network convergence and inherit its powerful generative ability, WarpDiffusion is built upon StableDiffusion (SD) [27] (see Fig. 2). More specifically, we modify the StableDiffusion inpainting model in the following ways to make it a VITON generator:

  • Besides the garment-agnostic image $I_{a}$ and the corresponding garment-agnostic mask $M_{a}$, we concatenate two additional features, namely the skin mask $M_{s}$ and the foreground mask $M_{f}$, with $z_{T}$, which is used as a hint of the try-on area (Pixel Aligned Feature Guidance).

  • We add the global feature extracted from garment $G$ into the timestep embedding, enabling the expression of high-level garment characteristics. Simultaneously, we replace the text prompts in SD with the warped garment $G_{w}$. The auto mask module processes the warped garment to retain the most informative features. Subsequently, a local texture attention mechanism is employed to fuse the improved garment feature into the try-on result, enhancing texture consistency (Garment Feature Guidance).

Figure 3: The Local Texture Attention mechanism. The local garment feature is extracted by masking the warped garment $G_{w}$ with the Auto Mask Module and is spatially aligned with the image feature. Those two features are then divided into non-overlapping windows with the same partition method, and cross attention is applied only between the corresponding windows.

3.2 Pixel Aligned Feature Guidance

Following the SD inpainting model, we concatenate the VAE encoded garment-agnostic image $I_{a}$ and the inpainting mask (resized according to the VAE spatial stride factor) to the noisy latent at each timestep. We also observed that adding two additional mask hints, the foreground mask and skin region mask, leads to more consistent visual quality. To prevent information leakage and deal with possible warping artifacts, we augment these mask regions by adding random dilation and erosion.
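The conditioning input described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the 3×3 structuring element, the maximum number of morphological iterations, and the choice to augment only the two hint masks (and not the inpainting mask) are all assumptions made for the example.

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square structuring element."""
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out

def erode(mask, iterations=1):
    """Binary erosion, written as dilation of the complement.
    Pixels outside the image are treated as foreground, so the
    mask is not eroded from the image border."""
    return ~dilate(~mask.astype(bool), iterations)

def augment_mask(mask, rng, max_iters=3):
    """Randomly dilate or erode a mask (max_iters is a hypothetical setting)."""
    k = int(rng.integers(0, max_iters + 1))
    if k == 0:
        return mask.astype(bool)
    op = dilate if rng.random() < 0.5 else erode
    return op(mask, k)

def build_unet_input(noisy_latent, agnostic_latent, inpaint_mask, hint_masks, rng):
    """Concatenate all pixel-aligned conditions along the channel axis.
    Latents are (C, h, w); masks are (h, w) at latent resolution."""
    dtype = noisy_latent.dtype
    hints = [augment_mask(m, rng).astype(dtype)[None] for m in hint_masks]
    return np.concatenate(
        [noisy_latent, agnostic_latent, inpaint_mask.astype(dtype)[None], *hints],
        axis=0,
    )

rng = np.random.default_rng(0)
h, w = 64, 48                       # a 512x384 image at a VAE stride of 8
z_t = rng.standard_normal((4, h, w))
z_a = rng.standard_normal((4, h, w))
m_a = np.zeros((h, w), bool); m_a[16:48, 8:40] = True
m_s = np.zeros((h, w), bool); m_s[10:20, 20:28] = True
m_f = np.zeros((h, w), bool); m_f[4:60, 6:42] = True
x = build_unet_input(z_t, z_a, m_a, [m_s, m_f], rng)
print(x.shape)
```

With 4-channel latents, the stacked input has 4 + 4 + 1 + 2 = 11 channels, so the first convolution of the UNet would need to accept that many input channels.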

A major difference between our method and previous methods [11, 23] is that we do not directly introduce the warped garment $G_{w}$ by concatenating it to the noisy input of the diffusion UNet. Our observation is that concatenated features are typically assumed to be pixel-aligned with the diffusion target, which results in two primary issues. Firstly, there is a clear mismatch between the warped garment $G_{w}$ and the garment from the ground-truth person image. When there is an error in predicting the deformation of the garment, this leads to degraded try-on results. Secondly, even when the predictions are accurate, the generated outcomes often retain all the information from the warped garment. The generator struggles to distinguish between wrinkles and textures. Consequently, when the clothing is too flat in the in-store image, it becomes challenging to produce a realistic try-on result.

In this work, we instead propose an Auto Mask Module that takes the warped garment and disregards areas that are unrealistic or even erroneous. Subsequently, we integrate the enhanced warped garment via a local cross attention mechanism (Sec. 3.3), which allows us to address the challenge of incomplete alignment and unrealistic effects.

3.3 Garment Feature Guidance

Global Garment Feature

To capture comprehensive information about the garment, such as garment style, color etc., our WarpDiffusion leverages a pretrained CLIP image encoder [25] to extract features of the in-shop garment $G$. The extracted feature vector is then compressed to the same dimensionality as the timestep embedding using an MLP and summed with it.
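As a rough sketch of this step, the snippet below projects a garment feature vector to the timestep-embedding width with a two-layer MLP and adds the result to the timestep embedding. The class name, the SiLU activation, the number of layers, and the dimensions (32 and 64) are placeholders chosen for illustration; the text above only states that an MLP is used and that the two vectors are summed.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

class GlobalGarmentProjector:
    """Hypothetical two-layer MLP mapping a CLIP image feature to the
    timestep-embedding dimension."""

    def __init__(self, clip_dim, time_dim, rng):
        self.w1 = rng.standard_normal((clip_dim, time_dim)) * 0.02
        self.b1 = np.zeros(time_dim)
        self.w2 = rng.standard_normal((time_dim, time_dim)) * 0.02
        self.b2 = np.zeros(time_dim)

    def __call__(self, clip_feat, t_emb):
        hidden = silu(clip_feat @ self.w1 + self.b1)
        garment_emb = hidden @ self.w2 + self.b2
        # The global garment feature is summed with the timestep embedding.
        return t_emb + garment_emb

rng = np.random.default_rng(0)
proj = GlobalGarmentProjector(clip_dim=32, time_dim=64, rng=rng)
clip_feat = rng.standard_normal((2, 32))   # stand-in for CLIP features of G
t_emb = rng.standard_normal((2, 64))       # stand-in for the timestep embedding
cond = proj(clip_feat, t_emb)
print(cond.shape)
```

Because the garment embedding enters through the same path as the timestep embedding, it conditions every block that already consumes the timestep signal, which suits global attributes such as style and color rather than spatial detail.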

Local Texture Attention

The CLIP image embedding alone cannot capture all the fine details of the garment. To solve this, TryonDiffusion [43] injects the un-warped garment in the cross attention layer. However, since their cross attention layer is responsible for estimating the deformation field, it must calculate cross attention between the whole garment and the entire image, which greatly increases the computational complexity. Instead, we use the warped garment as the input for the cross attention layer, enabling the model to focus solely on texture generation. Since the warped garment is almost aligned in space, we do not need to calculate the cross attention between the warped garment and the whole image, and can instead follow the process outlined in Figure 3. We first divide the original image features and warped garment features into windows through window partition. The attention is then only performed within each window region, reducing the overall computation considerably. After we perform the calculation window-by-window, we use the reverse operation to stack the windows back together and obtain the output. Note that when the window size is 1, this module is almost the same as concatenating, thus we use a relatively large window size (8 in our experiments, which is effectively similar to the prompt token length of 64).
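The window-restricted cross attention can be sketched in NumPy as below. This is a simplified single-head version without an output projection; the projection matrices `wq`, `wk`, `wv` and the feature sizes are placeholders, and only the window size of 8 is taken from the text.

```python
import numpy as np

def window_partition(x, ws):
    """(H, W, C) -> (num_windows, ws*ws, C) using non-overlapping windows."""
    H, W, C = x.shape
    assert H % ws == 0 and W % ws == 0, "feature map must be divisible by ws"
    x = x.reshape(H // ws, ws, W // ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, ws * ws, C)

def window_reverse(windows, ws, H, W):
    """Inverse of window_partition: (num_windows, ws*ws, C) -> (H, W, C)."""
    C = windows.shape[-1]
    x = windows.reshape(H // ws, W // ws, ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_texture_attention(img_feat, garment_feat, wq, wk, wv, ws=8):
    """Cross attention where image queries only see garment keys and
    values from the spatially corresponding window."""
    H, W, _ = img_feat.shape
    q = window_partition(img_feat @ wq, ws)        # (N, ws*ws, d)
    k = window_partition(garment_feat @ wk, ws)    # (N, ws*ws, d)
    v = window_partition(garment_feat @ wv, ws)    # (N, ws*ws, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    out = softmax(scores) @ v                      # attention per window
    return window_reverse(out, ws, H, W)

rng = np.random.default_rng(0)
H, W, C, d = 64, 48, 16, 16
img = rng.standard_normal((H, W, C))
gar = rng.standard_normal((H, W, C))
wq, wk, wv = (rng.standard_normal((C, d)) * 0.1 for _ in range(3))
out = local_texture_attention(img, gar, wq, wk, wv, ws=8)
print(out.shape)
```

With a window size of 8, each query attends to 8 × 8 = 64 garment tokens instead of all H × W positions, so the attention matrix per window is 64 × 64 regardless of the image resolution.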

Auto Mask Module

Even though we use the local texture attention to enhance the details of the try-on results, the model still lacks the ability to distinguish unrealistic or incorrect textures, resulting in unrealistic results. To address this, instead of retaining unrealistic and non-contributing elements produced by the warping module, our primary objective is to preserve only detailed and valuable textures. We therefore introduce the Auto Mask Module, which is designed to generate a mask over the informative regions. Specifically, we use a UNet to predict an informative mask $M_{\text{info}}$ from the VAE encoded warped garment $E(G_{w})$ and the warped garment mask $M_{w}$.

We make several hypotheses about the generated mask. First, it should have a high masking ratio to retain only the most informative features, which is achieved by adding an $L_{1}$ regularization term, denoted as $L_{\text{min}}$. $L_{\text{min}}$ represents the mean value of all features of $M_{\text{info}}$. Secondly, we have observed that incorporating some prior knowledge into the mask contributes to the stability of training. For this purpose, we utilize a Laplacian filter to process the warped garment, obtaining a mask that serves as the ground truth, which is aimed at retaining regions with higher gradients (usually corresponding to textures such as logos). We then calculate the Mean Squared Error (MSE) between this ground truth and $M_{\text{info}}$, denoted as $L_{\text{preserve}}$. Lastly, the masked feature should be capable of recovering the original image. This is accomplished through the joint training of the auto mask UNet with the diffusion UNet.
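The two mask regularizers can be illustrated with the following NumPy sketch. It assumes a grayscale garment and a predicted mask at the same resolution, a standard 4-neighbour Laplacian kernel, and a hypothetical threshold of 0.1 for binarizing the filter response; none of these choices are specified in the text.

```python
import numpy as np

# Standard 4-neighbour Laplacian kernel (an assumed choice of filter).
LAPLACIAN = np.array([[0.0, 1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0, 1.0, 0.0]])

def laplacian_response(gray):
    """Absolute Laplacian response of a 2-D image, using edge padding."""
    h, w = gray.shape
    padded = np.pad(gray, 1, mode="edge")
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return np.abs(out)

def laplacian_prior_mask(gray, threshold=0.1):
    """Pseudo ground-truth mask marking high-gradient regions.
    The threshold is a hypothetical hyperparameter."""
    return (laplacian_response(gray) > threshold).astype(float)

def auto_mask_regularizers(m_info, prior_mask):
    """L_min: mean of the predicted mask (pushes the mask towards sparsity).
    L_preserve: MSE between the predicted mask and the Laplacian prior."""
    l_min = float(np.mean(m_info))
    l_preserve = float(np.mean((m_info - prior_mask) ** 2))
    return l_min, l_preserve

# Toy garment: a flat background with one bright square "logo".
gray = np.zeros((32, 32))
gray[8:24, 8:24] = 1.0
prior = laplacian_prior_mask(gray)
m_info = np.full((32, 32), 0.5)     # stand-in for the UNet prediction
l_min, l_preserve = auto_mask_regularizers(m_info, prior)
print(l_min, l_preserve)
```

In this toy case the prior mask fires only along the outline of the square, since the flat interior and the flat background both have zero Laplacian response. The two terms pull in different directions: the sparsity term shrinks the mask, while the preservation term keeps it on around high-gradient texture.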

Figure 4: Qualitative comparisons on VITON-HD [5], DressCode [22], and the in-the-wild test set. The first and second rows illustrate outcomes on VITON-HD [5], the third and fourth rows depict results on the in-the-wild scenario, and the last two rows showcase results on DressCode [22]. Our method consistently produces more realistic images. Note, the available DCI-VTON model does not support the generation of dresses.

The final objective function for the Auto Mask Module is,

$$L_{\text{total}} = L_{\text{dm}} + L_{\text{preserve}} + L_{\text{min}}, \tag{1}$$

where $L_{\text{dm}}$ represents the loss of the original diffusion model. During training, we extract the ground truth garment from the person image and apply random elastic transformations and Gaussian blur to create the pseudo warped garment. The pixel-aligned features mentioned earlier are also derived from the ground truth target image by adding random noise. Additionally, to improve the model’s recognition of real warp noise, we augment the training dataset with warped garments from baseline warping-based methods (e.g., [36]), integrating them in the final five epochs.
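One way to build such a pseudo warped garment is sketched below in NumPy. It uses a common formulation of elastic deformation (a random displacement field smoothed by a Gaussian and scaled by a strength factor) with nearest-neighbour sampling, followed by a separable Gaussian blur. The function names and the values of `alpha`, `sigma`, and `blur_sigma` are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur of a 2-D array with edge padding."""
    k = gaussian_kernel1d(sigma)
    r = len(k) // 2
    h, w = img.shape
    padded = np.pad(img, ((r, r), (0, 0)), mode="edge")
    tmp = sum(k[i] * padded[i:i + h, :] for i in range(len(k)))
    padded = np.pad(tmp, ((0, 0), (r, r)), mode="edge")
    return sum(k[i] * padded[:, i:i + w] for i in range(len(k)))

def elastic_indices(h, w, alpha, sigma, rng):
    """Source pixel indices for a smooth random displacement field."""
    dx = gaussian_blur(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_blur(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + dy), 0, h - 1).astype(int)
    src_x = np.clip(np.rint(xs + dx), 0, w - 1).astype(int)
    return src_y, src_x

def pseudo_warp(garment, rng, alpha=8.0, sigma=4.0, blur_sigma=1.0):
    """garment: (H, W, C) float array cropped from the person image.
    Applies one shared elastic deformation, then blurs each channel."""
    h, w, c = garment.shape
    src_y, src_x = elastic_indices(h, w, alpha, sigma, rng)
    warped = garment[src_y, src_x]             # nearest-neighbour sampling
    channels = [gaussian_blur(warped[..., i], blur_sigma) for i in range(c)]
    return np.stack(channels, axis=-1)

rng = np.random.default_rng(0)
garment = rng.uniform(0.0, 1.0, (64, 48, 3))   # stand-in for a garment crop
g_pseudo = pseudo_warp(garment, rng)
print(g_pseudo.shape)
```

Because the pseudo warped garment is derived from the target image itself, the perturbation is what keeps the model from simply copying it; the strength settings therefore control how much misalignment and softness the generator learns to tolerate.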

4 Results

Datasets. We conduct extensive experiments on two public high-resolution VITON benchmarks, namely VITON-HD [5] and DressCode [22]. For each dataset, we separately conduct experiments under the resolutions of $512\times 384$ and $1024\times 768$. Specifically, VITON-HD consists of 13,679 image pairs of front-view upper-body woman and upper-body in-store garment images, which are further split into 11,647/2,032 training/testing pairs. DressCode contains 48,392/5,400 training/testing pairs of front-view full-body person and in-store garment images, which are composed of three subsets with different category pairs (i.e., upper, lower, dresses). For in-the-wild testing, we additionally gathered images from the Internet that were not part of the above datasets.

Implementation Details. We employ SD-inpainting 2.0 [1] as our pretrained model. Our 512 resolution model is trained on 8 Tesla V100 GPUs, with a batch size of 6 for each GPU. The model is trained for 120 epochs with learning rate 5e-5 over 3 days. For the 1024 resolution model, training is conducted on 8 NVIDIA A100 GPUs, configured with a batch size of 8 for each GPU. This model is trained for 100 epochs with a learning rate of 1e-5 over 1 day.

Figure 5: Qualitative comparisons to warping-based approaches when our model uses their warped garment in a plug-and-play manner. WarpDiffusion consistently improves the baseline results.

| Resolution | $512\times 384$ | | | | $1024\times 768$ | | | |
|---|---|---|---|---|---|---|---|---|
| Method | SSIM ↑ | FID ↓ | LPIPS ↓ | HE ↑ | SSIM ↑ | FID ↓ | LPIPS ↓ | HE ↑ |
| PF-AFN [9] | 0.885 | 9.616 | 0.087 | 0.06 | 0.898 | 9.807 | 0.096 | 0.03 |
| FS-VTON [14] | 0.881 | 9.735 | 0.091 | 0.09 | 0.896 | 9.667 | 0.097 | 0.05 |
| SDAFN [2] | 0.881 | 9.497 | 0.092 | 0.04 | 0.892 | 9.783 | 0.108 | 0.12 |
| GP-VTON [36] | 0.893 | 9.405 | 0.079 | 0.11 | 0.898 | 9.238 | 0.091 | 0.14 |
| DCI-VTON [11] | 0.868 | 9.166 | 0.096 | 0.12 | - | - | - | - |
| WarpDiffusion | 0.896 | 8.902 | 0.078 | 0.58 | 0.901 | 9.187 | 0.089 | 0.66 |

Table 2: Quantitative comparisons at $512\times 384$ and $1024\times 768$ resolution on the VITON-HD dataset [5].

Baselines and Evaluation Metrics. We perform quantitative and qualitative comparisons with four advanced GAN-based methods, namely PF-AFN [9], FS-VTON [14], SDAFN [2] and GP-VTON [36], and the latest diffusion-based method, DCI-VTON [11]. The GAN-based methods are trained from scratch on VITON-HD [5] and DressCode [22], while for the diffusion-based model [11], we directly use their released pretrained model. We are not able to conduct an extensive comparison with TryonDiffusion [43] as it is not publicly available now. For all baselines, we strictly follow the official instructions to run the training and testing scripts. Following previous work, we employ three widely used metrics (i.e., Structural SIMilarity index (SSIM) [32], Perceptual distance (LPIPS) [40], and Fréchet Inception Distance (FID) [24]) to evaluate the similarity between the synthesized and real images. Furthermore, we conduct a Human Evaluation (HE) study to evaluate the synthesis quality of different methods (more details on the Human Evaluation are included in the supplementary material). It should be noted that, unless otherwise stated, we leverage GP-VTON’s warping module in WarpDiffusion. However, to illustrate WarpDiffusion’s plug-and-play nature, we do provide results with alternative warping strategies in Fig. 5.

4.1 Comparison with the State-of-the-Art Methods

| Method | SSIM ↑ | FID ↓ | LPIPS ↓ | HE ↑ |
|---|---|---|---|---|
| PF-AFN [9] | 0.884 | 10.19 | 0.107 | 0.07 |
| FS-VTON [14] | 0.886 | 9.20 | 0.099 | 0.08 |
| SDAFN [2] | 0.892 | 8.69 | 0.089 | 0.11 |
| GP-VTON [36] | 0.894 | 8.71 | 0.091 | 0.15 |
| WarpDiffusion | 0.895 | 8.61 | 0.088 | 0.59 |

Table 3: Quantitative comparisons at $512\times 384$ resolution on the DressCode dataset [22].

Quantitative Results. As reported in Tab. 2 and Tab. 3, our WarpDiffusion consistently surpasses the baselines on all metrics for the VITON-HD [5] and DressCode [22] datasets, demonstrating that WarpDiffusion can obtain more precise warped garments and generate try-on results with better visual quality. In addition, we designed a human evaluation to compare the baselines and our generation results. A higher human evaluation score indicates that a larger fraction of participants preferred the results of a given method. As reported in the HE columns of Tab. 2 and Tab. 3, our proposed WarpDiffusion model is preferred over the baselines by a large margin. This means that our method generates better and more realistic results also when judged by humans.

Qualitative Results. Fig. 4 provides the qualitative comparison of WarpDiffusion with the state-of-the-art baselines on the VITON-HD [5] and DressCode [22] datasets as well as for the in-the-wild test set. The results demonstrate the superiority of our WarpDiffusion over the baselines (more results are included in the supplementary material). First of all, all of the baselines fail to generate realistic wrinkles. In addition, compared with the diffusion-based baseline, our method uses richer features, which makes our texture consistency better than that of DCI-VTON [11]. Finally, Fig. 5 shows the comparison between the results of our method and the original methods using the same warping schemes. It shows that our method introduces robustness to excessive deformations and noisy warping results to a certain extent and generates more realistic results. This demonstrates the superiority of our approach and the benefit of its plug-and-play nature.

Figure 6: Qualitative ablation study highlighting the value of our Auto Mask Module and Local Texture Attention. When Auto Mask is not applied, the try-on results will retain unrealistic wrinkles from the original garment while the full model fits the garment realistically to the person.

4.2 Ablation Study

To validate the effectiveness of the proposed Auto Mask Module, we conducted an ablation study by comparing WarpDiffusion with a version that does not include the auto mask module. The comparison was performed on the VITON-HD dataset [5], and we observed that the full WarpDiffusion model achieved improved metric scores. This indicates that the generated results are more aligned with the real distribution of the dataset, thus validating the impact of auto mask in preserving significant information while improving the ability to generate other areas.

To further validate the effectiveness of the local texture attention, we conducted another ablation study. In this ablation study, we compared WarpDiffusion with versions that only utilized the CLIP feature of the unwarped garment and removed the local texture attention. Once again, we compared the main metric scores and observed that the scores worsened when the local texture attention was removed. This indicates that the rich garment features play a significant role in the generation process.

We also provide qualitative results of these ablation studies in Fig. 6, which further demonstrate the benefits of our proposed method. Specifically, we observed that removing the auto mask module weakened the ability to generate new wrinkles and address excessive distortion. The model primarily preserved the existing wrinkles from the warped garment, resulting in less realistic results. Furthermore, when we eliminated the local texture attention, we noticed a significant reduction in texture consistency.

4.3 Visualization of Local Garment Feature

In this section, we further demonstrate the ability of our proposed Auto Mask Module to capture fine-grained texture. To provide a clear illustration, we use heatmaps to highlight the specific areas that the Auto Mask Module focuses on. As depicted in Fig. 7, the Auto Mask Module extracts the required information while disregarding wrinkles and other insignificant features. As a result, regions without significant textures are left to the more powerful generator, leading to more realistic results, while the overall texture consistency is preserved.

| Method | SSIM ↑ | FID ↓ | LPIPS ↓ |
|---|---|---|---|
| w/o Auto Mask Module | 0.894 | 9.10 | 0.079 |
| w/o Local Texture Attention | 0.831 | 9.23 | 0.102 |
| WarpDiffusion | 0.896 | 8.90 | 0.078 |

Table 4: Ablation results on the VITON-HD dataset [5] demonstrating the benefit of the proposed Auto Mask Module and Local Texture Attention.
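
To make the gaps in Table 4 easier to read, the following minimal sketch computes how much the full model improves over each ablated variant. The numbers are copied from Table 4; the helper name `gain_over` and the sign convention (positive means the full model is better) are illustrative assumptions, not part of the paper.

```python
# Values copied from Table 4 (ablation on VITON-HD).
ABLATION = {
    "w/o Auto Mask Module": {"SSIM": 0.894, "FID": 9.10, "LPIPS": 0.079},
    "w/o Local Texture Attention": {"SSIM": 0.831, "FID": 9.23, "LPIPS": 0.102},
    "WarpDiffusion": {"SSIM": 0.896, "FID": 8.90, "LPIPS": 0.078},
}

# SSIM is better when higher; FID and LPIPS are better when lower.
HIGHER_IS_BETTER = {"SSIM": True, "FID": False, "LPIPS": False}


def gain_over(variant, full="WarpDiffusion"):
    """Improvement of the full model over `variant` (hypothetical helper).

    Each value is signed so that a positive number means the full model
    is better on that metric, regardless of the metric's direction.
    """
    gains = {}
    for metric, higher in HIGHER_IS_BETTER.items():
        delta = ABLATION[full][metric] - ABLATION[variant][metric]
        gains[metric] = round(delta if higher else -delta, 3)
    return gains


for name in ("w/o Auto Mask Module", "w/o Local Texture Attention"):
    print(name, gain_over(name))
```

Under this convention, removing the local texture attention costs far more on every metric than removing the Auto Mask Module, which matches the qualitative observation that texture consistency degrades most in that variant.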
Figure 7: Visualization of the auto mask for warped garments.

5 Conclusion

In this work, we introduce WarpDiffusion, a plug-and-play model that generates high-fidelity VITON results with realistic try-on effects. To optimize resource utilization and enhance synthesis quality, WarpDiffusion integrates a pre-trained Stable Diffusion model with a novel local texture attention mechanism. To attain more realistic try-on effects, WarpDiffusion introduces the Auto Mask Module to refine the warped garment. Experiments conducted on high-resolution VITON benchmarks and a diverse in-the-wild test set affirm WarpDiffusion’s superior efficacy compared to existing methods.

Limitation and future work.

A limitation of our approach is that, when faced with very complex poses and more sophisticated try-on styles (such as layering or inner wear), the quality of the garment warping is not sufficient, leading to sub-optimal results. In addition, we use the Stable Diffusion backbone with a lossy-compression VAE and thus cannot fully preserve all high-frequency details. To address these limitations, we aim to explore how to control the try-on process in a more fine-grained manner.

References

  • [1] Stable diffusion 2.0 inpainting. https://huggingface.co/stabilityai/stable-diffusion-2-inpainting.
  • Bai et al. [2022] Shuai Bai, Huiling Zhou, Zhikang Li, Chang Zhou, and Hongxia Yang. Single stage virtual try-on via deformable attention flows. In European Conference on Computer Vision, 2022.
  • Bookstein [1989] Fred L. Bookstein. Principal warps: Thin-plate splines and the decomposition of deformations. IEEE Transactions on pattern analysis and machine intelligence, 11(6):567–585, 1989.
  • Chen et al. [2023] Xi Chen, Lianghua Huang, Yu Liu, Yujun Shen, Deli Zhao, and Hengshuang Zhao. Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481, 2023.
  • Choi et al. [2021] Seunghwan Choi, Sunghyun Park, Minsoo Lee, and Jaegul Choo. Viton-hd: High-resolution virtual try-on via misalignment-aware normalization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021.
  • Chopra et al. [2021] Ayush Chopra, Rishabh Jain, Mayur Hemani, and Balaji Krishnamurthy. Zflow: Gated appearance flow-based virtual try-on with 3d priors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
  • Dong et al. [2019] Haoye Dong, Xiaodan Liang, Xiaohui Shen, Bochao Wang, Hanjiang Lai, Jia Zhu, Zhiting Hu, and Jian Yin. Towards multi-pose guided virtual try-on network. In Proceedings of the IEEE/CVF international conference on computer vision, 2019.
  • Dong et al. [2022] Xin Dong, Fuwei Zhao, Zhenyu Xie, Xijin Zhang, Kang Du, Min Zheng, Xiang Long, Xiaodan Liang, and Jianchao Yang. Dressing in the wild by watching dance videos. In CVPR, pages 3480–3489, 2022.
  • Ge et al. [2021] Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, and Ping Luo. Parser-free virtual try-on via distilling appearance flows. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Gou et al. [2023] Junhong Gou, Siyu Sun, Jianfu Zhang, Jianlou Si, Chen Qian, and Liqing Zhang. Taming the power of diffusion models for high-quality virtual try-on with appearance flow. In Proceedings of the 31st ACM International Conference on Multimedia, 2023.
  • Han et al. [2018] Xintong Han, Zuxuan Wu, Zhe Wu, Ruichi Yu, and Larry S Davis. Viton: An image-based virtual try-on network. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018.
  • Han et al. [2019] Xintong Han, Xiaojun Hu, Weilin Huang, and Matthew R Scott. Clothflow: A flow-based model for clothed person generation. In Proceedings of the IEEE/CVF international conference on computer vision, 2019.
  • He et al. [2022] Sen He, Yi-Zhe Song, and Tao Xiang. Style-based global appearance flow for virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • Hertz et al. [2022] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  • Huang et al. [2023] Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023.
  • Huang et al. [2022] Zaiyu Huang, Hanhui Li, Zhenyu Xie, Michael Kampffmeyer, Qingling Cai, and Xiaodan Liang. Towards hard-pose virtual try-on via 3d-aware global correspondence learning. In NeurIPS, 2022.
  • Issenhuth et al. [2020] Thibaut Issenhuth, Jérémie Mary, and Clément Calauzenes. Do not mask what you do not need to mask: a parser-free virtual try-on. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16, 2020.
  • Lee et al. [2022] Sangyun Lee, Gyojung Gu, Sunghyun Park, Seunghwan Choi, and Jaegul Choo. High-resolution virtual try-on with misalignment and occlusion-handled conditions. In European Conference on Computer Vision, 2022.
  • Li et al. [2021] Kedan Li, Min Jin Chong, Jeffrey Zhang, and Jingen Liu. Toward accurate and realistic outfits visualization with attention to details. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021.
  • Mokady et al. [2023] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • Morelli et al. [2022] Davide Morelli, Matteo Fincato, Marcella Cornia, Federico Landi, Fabio Cesari, and Rita Cucchiara. Dress code: High-resolution multi-category virtual try-on. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • Morelli et al. [2023] Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia, Marco Bertini, and Rita Cucchiara. LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On. In Proceedings of the ACM International Conference on Multimedia, 2023.
  • Parmar et al. [2022] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, 2021.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, 2015.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 2022.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. [2018] Bochao Wang, Huabin Zheng, Xiaodan Liang, Yimin Chen, Liang Lin, and Meng Yang. Toward characteristic-preserving image-based virtual try-on network. In Proceedings of the European conference on computer vision, 2018.
  • Wang et al. [2004] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • Xie et al. [2021a] Zhenyu Xie, Zaiyu Huang, Fuwei Zhao, Haoye Dong, Michael Kampffmeyer, and Xiaodan Liang. Towards scalable unpaired virtual try-on via patch-routed spatially-adaptive gan. In NeurIPS, 2021a.
  • Xie et al. [2021b] Zhenyu Xie, Xujie Zhang, Fuwei Zhao, Haoye Dong, Michael C Kampffmeyer, Haonan Yan, and Xiaodan Liang. Was-vton: Warping architecture search for virtual try-on network. In ACMMM, pages 3350–3359, 2021b.
  • Xie et al. [2022] Zhenyu Xie, Zaiyu Huang, Fuwei Zhao, Haoye Dong, Michael Kampffmeyer, Xin Dong, Feida Zhu, and Xiaodan Liang. Pasta-gan++: A versatile framework for high-resolution unpaired virtual try-on. arXiv preprint arXiv:2207.13475, 2022.
  • Xie et al. [2023] Zhenyu Xie, Zaiyu Huang, Xin Dong, Fuwei Zhao, Haoye Dong, Xijin Zhang, Feida Zhu, and Xiaodan Liang. Gp-vton: Towards general purpose virtual try-on via collaborative local-flow global-parsing learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • Yang et al. [2023] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • Yang et al. [2020] Han Yang, Ruimao Zhang, Xiaobao Guo, Wei Liu, Wangmeng Zuo, and Ping Luo. Towards photo-realistic virtual try-on by adaptively generating-preserving image content. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020.
  • Yu et al. [2019] Ruiyun Yu, Xiaoqi Wang, and Xiaohui Xie. Vtnfp: An image-based virtual try-on network with body and clothing feature preservation. In Proceedings of the IEEE/CVF international conference on computer vision, 2019.
  • Zhang et al. [2018] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018.
  • Zhao et al. [2021] Fuwei Zhao, Zhenyu Xie, Michael Kampffmeyer, Haoye Dong, Songfang Han, Tianxiang Zheng, Tao Zhang, and Xiaodan Liang. M3d-vton: A monocular-to-3d virtual try-on network. In ICCV, pages 13239–13249, 2021.
  • Zhou et al. [2016] Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, and Alexei A Efros. View synthesis by appearance flow. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, 2016.
  • Zhu et al. [2023] Luyang Zhu, Dawei Yang, Tyler Zhu, Fitsum Reda, William Chan, Chitwan Saharia, Mohammad Norouzi, and Ira Kemelmacher-Shlizerman. Tryondiffusion: A tale of two unets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.

Supplementary Material

6 Introduction

In this supplementary material, we provide additional details of the architecture (Sec. 7) and the human evaluation (Sec. 8). Further, we include additional qualitative results of our method (Sec. 9) and ablated results (Sec. 10). Finally, we provide an analysis of the potential social impact and an analysis of failure cases (Sec. 11).

7 Architecture Details

The structure of WarpDiffusion resembles that of Stable Diffusion [1]; the main difference is that it accepts an 11-channel input. This input consists of 4 channels of noise, 4 channels of garment-agnostic representations, 1 channel for the mask, and 1 channel each for the foreground mask and the skin mask. Additionally, the cross-attention layers take as input the 4 channels of the warped garment enhanced by the auto-mask module, together with its corresponding mask.
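
As a quick consistency check on the channel count described above, the following minimal sketch lays out the five input groups and derives the slice each one would occupy in the concatenated tensor. The group names and the concatenation order are illustrative assumptions; the text only specifies the channel widths.

```python
from collections import OrderedDict

# Channel widths as listed in the text. The keys are hypothetical labels,
# and the ordering is an assumption made only for this illustration.
UNET_INPUT_CHANNELS = OrderedDict([
    ("noise", 4),
    ("garment_agnostic", 4),
    ("mask", 1),
    ("foreground_mask", 1),
    ("skin_mask", 1),
])


def channel_slices(groups):
    """Return ({name: slice}, total_channels) for channel-wise concatenation."""
    slices = {}
    start = 0
    for name, width in groups.items():
        slices[name] = slice(start, start + width)
        start += width
    return slices, start


slices, total = channel_slices(UNET_INPUT_CHANNELS)
print(total)  # 4 + 4 + 1 + 1 + 1 = 11
for name, s in slices.items():
    print(f"{name}: channels [{s.start}, {s.stop})")
```

The warped-garment features are not part of this tensor; per the text they enter through the cross-attention layers instead.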

As for the auto-mask module, it leverages a three-layer Feature Pyramid Network (FPN). Each layer consists of a convolution with a stride of 2, succeeded by two residual blocks.
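
To illustrate what three stride-2 levels imply for feature resolution, here is a minimal sketch that tracks the spatial size after each level. The kernel size of 3, the padding of 1, the assumption that the residual blocks keep the resolution unchanged, and the 128×96 example input are not stated in the paper and are chosen only for illustration.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Standard convolution output-size formula (kernel and padding are assumed)."""
    return (size + 2 * padding - kernel) // stride + 1


def pyramid_shapes(height, width, levels=3):
    """Spatial size after each pyramid level.

    Each level applies one stride-2 convolution; the two residual blocks
    that follow are assumed to preserve the spatial size.
    """
    shapes = []
    for _ in range(levels):
        height, width = conv_out(height), conv_out(width)
        shapes.append((height, width))
    return shapes


# Hypothetical 128x96 input feature map.
print(pyramid_shapes(128, 96))
```

With these assumptions each level halves both spatial dimensions, so the third level operates at one eighth of the input resolution.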

8 Human Evaluation Details

For the human evaluation, we designed two separate questionnaires, one for the VITON-HD dataset [5] and one for the DressCode dataset [22]. For the VITON-HD dataset, 100 volunteers were invited to complete a questionnaire composed of 25 assignments. For the DressCode dataset, 100 volunteers were invited to complete a questionnaire containing 15 assignments for each garment category (i.e., upper-body garments, lower-body garments, and dresses), for 45 assignments in total. In each assignment, given a person image and a garment image, the volunteers were asked to select the most realistic and accurate try-on result out of six options, which were generated by WarpDiffusion and the baseline methods (i.e., PF-AFN [9], FS-VTON [14], SDAFN [2], GP-VTON [36], and DCI-VTON [11]). The order of the generated results in each assignment was randomly shuffled. Fig. 10 shows the interface of the questionnaire for the VITON-HD dataset. The interface for the DressCode dataset is identical.
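
The protocol above can be sketched as follows: shuffle the six options per assignment, collect one vote per volunteer per assignment, and report each method's share of the votes. The function names, the seed, and the toy votes are hypothetical and do not reproduce the actual study or its results.

```python
import random
from collections import Counter

# The six options shown in each assignment, as listed in the text.
METHODS = ["PF-AFN", "FS-VTON", "SDAFN", "GP-VTON", "DCI-VTON", "WarpDiffusion"]


def shuffled_options(rng):
    """Return the six methods in a random display order for one assignment."""
    options = METHODS[:]
    rng.shuffle(options)
    return options


def preference_rates(votes):
    """Fraction of votes received by each method (methods with no votes get 0.0)."""
    counts = Counter(votes)
    total = len(votes)
    return {method: counts[method] / total for method in METHODS}


# Number of individual judgments implied by the questionnaire sizes.
viton_hd_judgments = 100 * 25        # 100 volunteers x 25 assignments
dresscode_judgments = 100 * 15 * 3   # 100 volunteers x 15 assignments x 3 categories

rng = random.Random(0)               # fixed seed only so the sketch is repeatable
print(shuffled_options(rng))

# Toy votes for illustration only; these are NOT results from the paper.
toy_votes = ["WarpDiffusion", "GP-VTON", "WarpDiffusion", "DCI-VTON"]
print(preference_rates(toy_votes))
```

Shuffling the display order per assignment keeps volunteers from associating a fixed position with a particular method.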

9 Additional Results

Visual Comparisons with SOTAs on the VITON-HD dataset [5]. Fig. 11 displays additional visual comparisons of WarpDiffusion with the baseline methods on the VITON-HD dataset.

Visual Comparisons with SOTAs on the DressCode dataset [22]. Fig. 12 displays additional visual comparisons of WarpDiffusion with the baseline methods on the DressCode dataset.

Virtual Try-on for High Resolution. Fig. 13 displays additional results of WarpDiffusion at the resolution of 1024.

10 Additional Ablation Study

To validate the effectiveness of the loss functions in the Auto Mask Module, we conducted an ablation experiment focusing specifically on this module. We trained two versions of WarpDiffusion: the first does not include $L_{\text{min}}$, while the second does not include $L_{\text{preserve}}$. Qualitative results are provided in Fig. 8. In the absence of $L_{\text{min}}$, the generated mask often retains a significant amount of information, leading to the preservation of undesired false wrinkles. Without $L_{\text{preserve}}$, on the other hand, the generated mask retains very little information, so the model is unable to generate correct try-on results.

Figure 8: Results for the Auto Mask Module ablation study.

Figure 9: Failure cases of our WarpDiffusion.

11 Potential Social Impacts and Limitations

Potential social impacts. As with most generative approaches, WarpDiffusion can be used for malicious purposes by generating images that infringe upon copyrights and/or privacy. Given these considerations, responsible use of the model is advocated.

Limitations. As shown in Fig. 9, our WarpDiffusion encounters challenges when faced with difficult poses and unfamiliar garments, making it difficult to generate satisfactory try-on results. In order to address this issue, we are exploring ways to enhance the model’s zero-shot generation capability and achieve finer control.

Figure 10: Interface of the questionnaire used to evaluate the final try-on results on the VITON-HD dataset [5].

Figure 11: Qualitative comparisons on VITON-HD [5]. Please zoom in for more details.
Figure 12: Qualitative comparisons on DressCode [22]. Please zoom in for more details.
Figure 13: Qualitative results at 1024 resolution. Please zoom in for more details.