Flash Diffusion: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation

Clément Chadebec, Onur Tasar, Eyal Benaroche, Benjamin Aubin
Jasper Research
Corresponding author - [email protected]
Contributed during an internship at Jasper Research

Abstract

In this paper, we propose Flash Diffusion, an efficient, fast, and versatile distillation method to accelerate the generation of pre-trained diffusion models. The method reaches state-of-the-art performance in terms of FID and CLIP-Score for few-step image generation on the COCO2014 and COCO2017 datasets, while requiring only several GPU hours of training and fewer trainable parameters than existing methods. Beyond its efficiency, the method's versatility is demonstrated across several tasks such as text-to-image, inpainting, face-swapping, and super-resolution, and with different backbones such as UNet-based denoisers (SD1.5, SDXL) or DiT (Pixart-$\alpha$), as well as adapters. In all cases, the method drastically reduces the number of sampling steps while maintaining very high-quality image generation. The official implementation is available at https://github.com/gojasper/flash-diffusion.

$$
\begin{aligned}
\mathcal{L}_{\mathrm{adv}} &= \frac{1}{2}\,\mathbb{E}_{z_{0},t^{\prime},\varepsilon}\left[\left\|D_{\nu}\big(f_{\theta}^{\mathrm{student}}(z_{t^{\prime}},t^{\prime})\big)-1\right\|^{2}\right]\,,\\
\mathcal{L}_{\mathrm{discriminator}} &= \frac{1}{2}\,\mathbb{E}_{z_{0},t^{\prime},\varepsilon}\left[\left\|D_{\nu}(z_{0})-1\right\|^{2}+\left\|D_{\nu}\big(f_{\theta}^{\mathrm{student}}(z_{t^{\prime}},t^{\prime})\big)-0\right\|^{2}\right]\,.
\end{aligned}
\tag{8}
$$
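To make the two least-squares terms of Equation (8) concrete, the following is a minimal sketch in plain Python. It assumes the discriminator $D_{\nu}$ returns one scalar score per sample, so the squared norm reduces to a squared difference, and it replaces the expectation with a batch average. The function names and the example scores are hypothetical and are not taken from the official implementation.

```python
def adversarial_loss(d_fake):
    """Student term: 0.5 * mean((D(student output) - 1)^2)."""
    return 0.5 * sum((d - 1.0) ** 2 for d in d_fake) / len(d_fake)


def discriminator_loss(d_real, d_fake):
    """Discriminator term: 0.5 * mean((D(z_0) - 1)^2 + (D(student output) - 0)^2)."""
    per_sample = [(r - 1.0) ** 2 + (f - 0.0) ** 2 for r, f in zip(d_real, d_fake)]
    return 0.5 * sum(per_sample) / len(per_sample)


# Hypothetical discriminator scores for a batch of four samples.
d_real = [0.9, 1.1, 0.8, 1.0]  # scores on real latents z_0
d_fake = [0.2, 0.0, 0.4, 0.1]  # scores on student predictions

print(adversarial_loss(d_fake))            # student is penalised when scores are far from 1
print(discriminator_loss(d_real, d_fake))  # discriminator wants real -> 1 and fake -> 0
```

The two terms pull in opposite directions on the student predictions: the student is rewarded when its outputs score close to 1, while the discriminator is rewarded when those same outputs score close to 0.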

$$
\begin{aligned}
\mathcal{L}_{\mathrm{DMD}} &= D_{KL}\big(p^{\mathrm{student}}_{\theta}\,\|\,p^{\mathrm{teacher}}_{\phi}\big)\\
&= \mathbb{E}_{z_{0},t,\varepsilon}\bigg[-\bigg(\log p^{\mathrm{teacher}}_{\phi}\big(f_{\theta}^{\mathrm{student}}(z_{t},t)\big)-\log p^{\mathrm{student}}_{\theta}\big(f_{\theta}^{\mathrm{student}}(z_{t},t)\big)\bigg)\bigg]\,.
\end{aligned}
\tag{9}
$$
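The second line of Equation (9) is an expectation, over student samples, of the negative log-density gap between teacher and student. The sketch below illustrates this with a toy setting that is an assumption of the example and not part of the method: both distributions are one-dimensional Gaussians with tractable densities, and samples are drawn directly from the student instead of being produced by $f_{\theta}^{\mathrm{student}}(z_{t},t)$. The Monte Carlo estimate is compared with the closed-form Gaussian KL divergence.

```python
import math
import random


def gaussian_log_pdf(x, mean, std):
    """Log-density of a 1D Gaussian N(mean, std^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - (x - mean) ** 2 / (2.0 * std * std)


def kl_monte_carlo(student, teacher, n_samples=100_000, seed=0):
    """Estimate KL(student || teacher) as E_x~student[-(log p_teacher(x) - log p_student(x))]."""
    rng = random.Random(seed)
    s_mean, s_std = student
    t_mean, t_std = teacher
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(s_mean, s_std)  # toy stand-in for a student prediction
        total += -(gaussian_log_pdf(x, t_mean, t_std) - gaussian_log_pdf(x, s_mean, s_std))
    return total / n_samples


def kl_closed_form(student, teacher):
    """Analytic KL divergence between two 1D Gaussians, used as a reference value."""
    s_mean, s_std = student
    t_mean, t_std = teacher
    return (math.log(t_std / s_std)
            + (s_std ** 2 + (s_mean - t_mean) ** 2) / (2.0 * t_std ** 2)
            - 0.5)


# Hypothetical (mean, std) pairs for the toy student and teacher.
student = (0.5, 1.2)
teacher = (0.0, 1.0)

print(kl_monte_carlo(student, teacher))  # sampled estimate
print(kl_closed_form(student, teacher))  # analytic reference
```

The estimate vanishes when the two distributions coincide, which is the fixed point that the distribution matching term drives the student towards.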

Method	# Train. Param. (M)	NFE	FID $\downarrow$	CLIP $\uparrow$
Teacher (SD1.5)	N/A	50	20.1	0.318
Teacher (SD1.5)	N/A	16	31.7	0.320
Prog. Distil. [45]	900	2	37.3	0.270
Prog. Distil. [45]	900	4	26.0	0.300
Prog. Distil. [45]	900	8	26.9	0.300
InstaFlow [36]	900	1	23.4	0.304
CFG Distil. [30]	850	16	24.2	0.300
Ours	26.4	2	22.6	0.306
Ours	26.4	4	22.5	0.311

Method	# Train. Param. (M)	NFE	FID $\downarrow$
DPM++$^{\dagger}$ [38]	N/A	8	22.44
UniPC$^{\dagger}$ [82]	N/A	8	23.30
UFOGen [76]	1,700	1	12.78
InstaFlow [36]	900	1	13.10
DMD$^{\dagger}$ [77]	1,700	1	14.93
LCM-LoRA$^{\dagger}$ [43]	67.5	1	77.90
LCM-LoRA$^{\dagger}$ [43]	67.5	2	24.28
LCM-LoRA$^{\dagger}$ [43]	67.5	4	23.62
Ours	26.4	2	12.27
Ours	26.4	4	12.41

Loss	FID $\downarrow$	CLIP $\uparrow$
$\mathcal{L}_{\mathrm{distil.}}$	27.12	29.85
$\mathcal{L}_{\mathrm{distil.}}+\mathcal{L}_{\mathrm{DMD}}$	26.88	30.45
$\mathcal{L}_{\mathrm{distil.}}+\mathcal{L}_{\mathrm{adv}}$	23.41	30.14
$\mathcal{L}_{\mathrm{distil.}}+\mathcal{L}_{\mathrm{DMD}}+\mathcal{L}_{\mathrm{adv}}$	22.64	30.61

$\pi(t)$	FID $\downarrow$	CLIP $\uparrow$
$\pi^{\mathrm{uniform}}(t)$	24.25	30.11
$\pi^{\mathrm{gaussian}}(t)$	35.89	28.15
$\pi^{\mathrm{sharp}}(t)$	23.35	30.58
$\pi^{\mathrm{ours}}(t)$	22.64	30.61

$\mathcal{L}_{\mathrm{adv.}}$	FID $\downarrow$	CLIP $\uparrow$
Hinge	25.02	30.17
WGAN	24.58	30.36
LSGAN	22.64	30.61
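For readers unfamiliar with the three adversarial formulations compared in the table above, the sketch below writes out their commonly used textbook forms for scalar discriminator scores. These forms are an assumption of the example: the exact variants used in the ablation are not specified here, and the WGAN version omits the Lipschitz constraint (weight clipping or gradient penalty) that it normally requires. All function names and scores are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def lsgan_losses(d_real, d_fake):
    """Least-squares GAN: push real scores to 1 and fake scores to 0."""
    d_loss = 0.5 * mean([(r - 1.0) ** 2 for r in d_real]) + 0.5 * mean([f ** 2 for f in d_fake])
    g_loss = 0.5 * mean([(f - 1.0) ** 2 for f in d_fake])
    return d_loss, g_loss


def hinge_losses(d_real, d_fake):
    """Hinge GAN: penalise real scores below +1 and fake scores above -1."""
    d_loss = mean([max(0.0, 1.0 - r) for r in d_real]) + mean([max(0.0, 1.0 + f) for f in d_fake])
    g_loss = -mean(d_fake)
    return d_loss, g_loss


def wgan_losses(d_real, d_fake):
    """Wasserstein GAN critic and generator losses, without any Lipschitz constraint."""
    d_loss = mean(d_fake) - mean(d_real)
    g_loss = -mean(d_fake)
    return d_loss, g_loss


# Hypothetical discriminator scores on real samples and on generated samples.
d_real = [1.0, 2.0]
d_fake = [-1.0, 0.0]

for name, fn in [("LSGAN", lsgan_losses), ("Hinge", hinge_losses), ("WGAN", wgan_losses)]:
    d_loss, g_loss = fn(d_real, d_fake)
    print(name, d_loss, g_loss)
```

The main practical difference is how each loss treats confident scores: the least-squares form keeps penalising scores that overshoot the target, whereas the hinge form stops penalising once the margin is met and the Wasserstein form is unbounded.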

Method	NFE	FID $\downarrow$	CLIP $\uparrow$
Teacher (SDXL)	40	18.42	0.339
LCM [42]	8	21.73	0.327
Turbo [61]	4	23.69	0.337
SDXL-lightning [33]	4	24.61	0.329
SDXL-lightning-LoRA [33]	4	25.13	0.328
Hyper-SD-LoRA [56]	4	27.76	0.333
Ours	4	21.62	0.327

1 Introduction

2 Related Works

Diffusion Models

Diffusion Distillation

3 Background

3.1 Diffusion Models

3.2 Consistency Models

4 Proposed Method

4.1 Distilling a Pretrained Diffusion Model

4.2 Timesteps Sampling

4.3 Adversarial Objective

4.4 Distribution Matching

4.5 Model Training

5 Experiments

5.1 Text-to-Image Quantitative Evaluation

5.2 Ablation Study

Influence of the loss terms

Influence of the timestep sampling

Influence of the guidance scale

5.3 On the Method’s Versatility

5.3.1 Backbones’ Study

Flash SDXL

Flash Pixart (DiT)

5.3.2 Conditionings’ Study

Inpainting

Super-Resolution

Face-Swapping

Adapters

6 Conclusion

References

Appendix A Experimental Details

A.1 Experimental Setup for Text-to-Image

Flash SD1.5

Flash SDXL

Flash Pixart (DiT)

A.2 Experimental Setup for Inpainting

A.3 Experimental Setup for Super-Resolution

A.4 Experimental Setup for Face-Swapping

Appendix B Additional Sampling Results

B.1 Flash SDXL

B.2 Flash Pixart (DiT)

B.3 Flash Inpainting

B.4 Flash Upscaler