
License: CC BY-NC-SA 4.0
arXiv:2404.05384v1 [cs.CV] 08 Apr 2024

Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance

Dazhong Shen$^{1}$, Guanglu Song$^{2}$, Zeyue Xue$^{3}$, Fu-Yun Wang$^{4}$, Yu Liu$^{1,2,*}$
$^{1}$Shanghai Artificial Intelligence Laboratory, $^{2}$SenseTime Research,
$^{3}$The University of Hong Kong, $^{4}$The Chinese University of Hong Kong
$^{*}$Corresponding author: [email protected]
Abstract

Classifier-Free Guidance (CFG) has been widely used in text-to-image diffusion models, where the CFG scale is introduced to control the strength of text guidance over the whole image space. However, we argue that a global CFG scale results in spatial inconsistency in semantic strengths and suboptimal image quality. To address this problem, we present a novel approach, Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance degrees for different semantic units in text-to-image diffusion models. Specifically, we first design a training-free semantic segmentation method to partition the latent image into relatively independent semantic regions at each denoising step. In particular, the cross-attention map in the denoising U-net backbone is renormalized for assigning each patch to the corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic units, we adaptively adjust the CFG scales across different semantic regions to rescale the text guidance degrees to a uniform level. Finally, extensive experiments demonstrate the superiority of S-CFG over the original CFG strategy on various text-to-image diffusion models, without requiring any extra training cost. Our code is available at https://github.com/SmilesDZgk/S-CFG.

1 Introduction

Recently, text-to-image generation has witnessed rapid development and various applications [33, 48, 30, 34, 31], where visually stunning images can be created by simply typing in a text prompt. In particular, after DDPM [12, 7] succeeded GANs [8, 3], diffusion models [40], such as Stable Diffusion [34] and DallE-3 [2], have emerged as the new state-of-the-art family of image-generative models.

The key feature of diffusion models is to approximate the true data distribution $p(x)$ by reversing a process that progressively perturbs the data with noise over a long iterative chain. To incorporate the text prompt $c$ into the final generation, it is necessary to increase the likelihood of $c$ given the current latent image $x_t$ at each reverse diffusion step $t$. Instead of training extra classifiers to model $p(c|x_t)$ at each diffusion step $t$ [7], classifier-free guidance (CFG) [11] has recently been proposed to estimate both the classifier score $\nabla_{x_t}\log p(c|x_t)$ and the diffusion score $\nabla_{x_t}\log p(x_t)$ with the same neural model, such as a U-net [35]. In particular, an empirical CFG scale is introduced to control the strength of the text guidance over the whole image space.

Figure 1: A motivating example. The top row shows images generated by Stable Diffusion with CFG and S-CFG, where the prompt is “a photo of an astronaut riding a horse” and the segmentation maps are manually labeled (Ground, Sky, Horse, Astronaut). The bottom row shows the average norm curves of the estimated classifier score $\nabla_{x_t}\log p(c|x_t)$ (solid line) and diffusion score $\nabla_{x_t}\log p(x_t)$ (dashed line) in each semantic region. The Y-axis is scaled by the dynamic variance parameter $\sigma_t$ for clearer illustration, which does not affect the conclusion.

However, we argue that a global CFG scale results in spatial inconsistency in semantic strengths during the denoising process and suboptimal quality of the final image. Figure 1 shows samples generated by Stable Diffusion [34]. The images can be segmented into four semantic regions corresponding to “astronaut”, “horse”, “sky” and “ground”. To compare the guidance degrees assigned to different semantic units, the plots in the bottom row show the average norm curves of the estimated classifier score $\nabla_{x_t}\log p(c|x_t)$ and diffusion score $\nabla_{x_t}\log p(x_t)$ in each semantic region at every time step. For the images generated with the original CFG strategy, the classifier score norm varies considerably across semantic units, while the norms of the diffusion scores are much closer. Intuitively, a larger classifier score implies a greater guidance degree received by the semantic unit. As a result, the final generated samples may exhibit spatially inconsistent image quality across semantic units. For instance, the “astronaut” region, which consistently attains the highest score ratio, displays intricate and finely detailed structures that starkly contrast with the “sky” and “ground” regions.

Along this line, in contrast to previous works, we propose to set customized CFG scales for different semantic regions of the latent image at each denoising step. In particular, we assume that the patches within each semantic region serve a similar semantic concept and that different regions are relatively independent. In this case, the classifier score $\nabla_{x_t}\log p(c|x_t)$ can be approximately decomposed into a combination of the scores conditioned on each independent semantic region. Therefore, customized CFG scales can be safely applied to each semantic region without disrupting the relative relations among interdependent patches. However, it is not trivial to conduct semantic segmentation on the latent image without access to the final generated image. Meanwhile, determining the customized CFG scales to balance semantic units is another challenge.

To this end, in this paper, we propose a novel approach, called Semantic-aware Classifier-Free Guidance (S-CFG), to control the text guidance degrees in text-to-image diffusion models dynamically and in a customized manner. Specifically, when modeling the conditional distribution $p(x|c)$, diffusion models take $c$ as another input and use self-attention and cross-attention layers to mix the image and text, which preserves the underlying semantic information. Along this line, we first design a training-free segmentation method for the latent images at each denoising step. In particular, the cross-attention map in the denoising U-net backbone is renormalized for assigning each patch to the corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic information, we rescale the classifier score $\nabla_{x_t}\log p(c|x_t)$ across different semantic regions to a uniform level with adaptive CFG scales. Finally, we conduct qualitative and quantitative analyses based on various diffusion models. The results demonstrate that S-CFG outperforms the original CFG strategy and obtains a robust improvement without any extra training cost. As a first look, the right part of Figure 1 shows reduced disparities among the norms of the classifier score $\nabla_{x_t}\log p(c|x_t)$ across semantic units in the image generated with S-CFG. As a result, more abundant clouds float in the “sky”, and the boundary between the “sky” and the “ground” is clearer.

2 Related Work

2.1 Image Diffusion Generative Models

Recently, diffusion models have emerged as an expressive and flexible family for image generation with remarkable image quality and various applications [30, 34, 31, 1, 18, 13, 25]. The general idea is to apply a forward diffusion process that adds small amounts of noise to the input data, then learn the reverse process with neural networks to gradually recover the original samples from the noisy data, step by step. Among them, the Denoising Diffusion Probabilistic Model (DDPM) [12] is the representative baseline, which carefully designed the noise schedule in pixel space for the forward process and the network architecture for the reverse process. As a result, diffusion models achieved better mode coverage and training stability compared to GANs [8, 3, 16]. To further reduce computational costs, subsequent studies, such as Stable Diffusion [34], combined DDPM with VAEs [19, 32, 38] by applying diffusion models to the lower-dimensional latent space of a VAE trained on large-scale image datasets. In general, diffusion models suffer from low inference speed compared to other generative models. However, this problem can be greatly alleviated by distillation strategies [42, 43] or advanced sampling strategies, such as DDIM [41, 52], DPMSolver [23, 24], PNDM [17], Euler [17], and DEIS [51], which can achieve a 10× to 100× speedup compared to the original DDPM sampler. Here, we further explore a better way for image generation based on diffusion models.

2.2 Text-guided Generation

Recently, text-guided generation in diffusion models has reached an unprecedented level, as in DallE-3 [2]. This generative power stems from three aspects. First, to represent the unstructured text, expressive language embedding models are used to embed each token in the given text, such as CLIP [28] in Stable Diffusion [34], and T5 [29] in Imagen [37]. Second, to facilitate the interaction between text and image information, diffusion models typically enhance the network backbone, such as the U-net backbone [35], with the cross-attention mechanism. This mechanism uses the image embedding as the query, while the key and value embeddings are derived from the text. Third, Classifier-Free Guidance (CFG) [11] has recently been widely adopted as a lightweight and robust technique to encourage text prompt adherence in generations. Instead of training extra classifiers [7, 22], CFG mixes the score estimates of the diffusion model with and without the conditional prompt. Some other works [21, 15] further separate a prompt into multiple concepts and generate an image by combining a set of diffusion models, each conditioned on a certain concept component. Here, we further emphasize the importance of varying CFG scales across different image semantic regions and design the semantic-aware CFG strategy to improve image quality.

2.3 Applications with Cross-Attention Maps

Cross-attention maps in the diffusion U-net backbone represent the spatial relation between image patches and prompt tokens. They provide valuable semantic information for image segmentation and can contribute to various applications. For example, some works [6, 5, 53, 47] introduce layout control in image generation by minimizing the difference between the cross-attention-based semantic segmentation and the given layout conditions. Prompt2Prompt [10] achieves image editing by simply replacing, adding, or re-weighting cross-attention maps. Attend-and-Excite [4] improves text alignment by optimizing the cross-attention maps during the inference process. Subsequent works further extend those ideas to image-to-image translation [27], text-driven image editing [45, 9], and compositional image generation [46]. In this paper, we further use cross-attention maps to improve image quality by segmenting latent images and customizing the guidance degrees of different semantic regions.

3 Preliminary

3.1 Diffusion Models

Given the image data space $\mathcal{X}$, diffusion models define a Markov chain, known as the forward process, to corrupt the real data $x_0 \in \mathcal{X}$ by progressively adding Gaussian noise from time step $0$ to $T$:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t \mathbf{I}), \tag{1}$$

where $\{\beta_t\}_{t=1:T}$ denotes the variance for each noise step, usually set as constants. Taking advantage of the properties of the Gaussian distribution, we can obtain $x_t$ at an arbitrary time step $t$ using the following closed form:

$$x_t = \sqrt{\overline{\alpha}_t}\, x_0 + \sqrt{1-\overline{\alpha}_t}\, \epsilon_t, \quad \epsilon_t \sim \mathcal{N}(0, \mathbf{I}), \tag{2}$$

where $\alpha_t = 1-\beta_t$ and $\overline{\alpha}_t = \prod_{s=1}^{t}\alpha_s$. Since $\overline{\alpha}_T \approx 0$, $x_T$ degrades to standard Gaussian noise.
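As an illustration of Equation 2, the following minimal NumPy sketch draws $x_t$ directly from $x_0$. The linear $\beta_t$ schedule, its end points, the number of steps, and the names `make_schedule` and `q_sample` are assumptions chosen for this example and are not specified in the text.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed linear beta schedule and the derived alpha / alpha_bar sequences."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)  # alpha_bar_t = prod_{s <= t} alpha_s
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot using Equation 2.

    `t` is a 1-based time step, so alpha_bars[t - 1] holds alpha_bar_t.
    """
    eps = rng.standard_normal(x0.shape)
    a_bar = alpha_bars[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = make_schedule()
x0 = rng.standard_normal((4, 8, 8))  # stand-in for a latent image
x_T, _ = q_sample(x0, t=1000, alpha_bars=alpha_bars, rng=rng)
print(alpha_bars[-1])  # close to 0, so x_T is almost pure Gaussian noise
```

Because the closed form only needs $\overline{\alpha}_t$, training can sample any time step without simulating the whole chain.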

The reverse denoising process aims to approximate the true posterior of each forward step via a time-dependent neural network parameterized by $\theta$:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_\theta(x_t, t)\mathbf{I}), \tag{3}$$

which can be used to generate an image $x_0 \sim p_\theta(x_0)$ by first sampling Gaussian noise $x_T \sim \mathcal{N}(0, \mathbf{I})$ and then denoising step by step from $x_{T-1}$ to $x_0$. In practice, to simplify model training, $\sigma_\theta(x_t, t)$ is set as a constant $\sigma_t$ [7] and $\mu_\theta(x_t, t)$ is parameterized as follows:

$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\overline{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right), \tag{4}$$

where the neural model $\epsilon_\theta$, such as a U-net [35], is trained to predict the noise $\epsilon_t$ added in each forward step, which also mirrors denoising score matching, i.e., $\epsilon_\theta(x_t, t) \approx -\sigma_t \nabla_{x_t}\log p(x_t)$.
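A single reverse step can be sketched as follows. The sketch uses the standard DDPM form of the mean from [12], in which the predicted noise is scaled by $\beta_t/\sqrt{1-\overline{\alpha}_t}$. The linear schedule, the choice of $\beta_t$ as the fixed variance, and the function names are assumptions made for illustration. The true noise stands in for the output of a trained network.

```python
import numpy as np

# Assumed linear schedule; index t - 1 holds beta_t, alpha_t and alpha_bar_t (t is 1-based).
betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predicted_mean(x_t, eps_pred, t):
    """Mean of p_theta(x_{t-1} | x_t) computed from a noise prediction (standard DDPM form)."""
    beta_t, alpha_t, a_bar_t = betas[t - 1], alphas[t - 1], alpha_bars[t - 1]
    return (x_t - beta_t / np.sqrt(1.0 - a_bar_t) * eps_pred) / np.sqrt(alpha_t)

def reverse_step(x_t, eps_pred, t, rng):
    """One ancestral sampling step x_t -> x_{t-1}, with the variance fixed to beta_t (assumed)."""
    mean = predicted_mean(x_t, eps_pred, t)
    if t == 1:
        return mean  # no noise is added at the final step
    return mean + np.sqrt(betas[t - 1]) * rng.standard_normal(x_t.shape)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8, 8))  # stand-in for a clean latent image
eps = rng.standard_normal(x0.shape)
t = 500
x_t = np.sqrt(alpha_bars[t - 1]) * x0 + np.sqrt(1.0 - alpha_bars[t - 1]) * eps

# The true noise is used as an "oracle" prediction in place of eps_theta(x_t, t).
x_prev = reverse_step(x_t, eps, t, rng)
```

With the oracle prediction, `predicted_mean` coincides with the mean of the true Gaussian posterior $q(x_{t-1}|x_t, x_0)$, which is why predicting the noise is sufficient to parameterize the reverse step.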

3.2 Classifier-free Guidance

The vanilla diffusion model described above is an unconditional generative model $p_\theta(x_0)$ that approximates the true data distribution $q(x_0)$. However, in practical scenarios, there is a growing demand to condition the generation on a label or text prompt $c$ [49]. To address this requirement, classifier guidance [7] incorporates an auxiliary classifier $p_\phi(c|x_t)$ to guide the sampling in each reverse denoising step, thereby increasing the likelihood of $c$ given $x_t$. Specifically, the diffusion score is modified as follows:

$$\begin{split}\hat{\epsilon}_\theta(x_t, c, t) &= \epsilon_\theta(x_t, t) - \gamma\sigma_t\nabla_{x_t}\log p_\phi(c|x_t) \\ &\approx -\sigma_t\nabla_{x_t}\log\left(p(x_t)\, p_\phi^{\gamma}(c|x_t)\right), \end{split} \tag{5}$$

where $\gamma$ is a scalar parameter that regulates the strength of the classifier guidance. While this method has demonstrated some performance improvements, training a robust classifier for all reverse steps, particularly for the highly noisy input at the initial step, poses a significant challenge and incurs additional training costs.

To avoid training a separate classifier model, classifier-free guidance [11] takes $c$ as another input of the denoising neural network to model the conditional diffusion score, i.e., $\epsilon_\theta(x_t, c, t) \approx -\sigma_t\nabla_{x_t}\log p(x_t|c)$, while the unconditional score $\epsilon_\theta(x_t, t)$ is jointly estimated by randomly dropping the text prompt with a certain probability at each training iteration. Then the gradient of the classifier $p_\phi(c|x_t)$ can be estimated as:

$$\begin{split}\nabla_{x_t}\log p(c|x_t) &= \nabla_{x_t}\log p_\theta(x_t|c) - \nabla_{x_t}\log p_\theta(x_t) \\ &= -\frac{1}{\sigma_t}\left(\epsilon_\theta(x_t, c, t) - \epsilon_\theta(x_t, t)\right). \end{split} \tag{6}$$
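The first equality in Equation 6 follows from Bayes' rule. Written out, and assuming that the prior $p(c)$ does not depend on $x_t$ so that its gradient vanishes, the intermediate step is:

```latex
\begin{aligned}
\log p(c \mid x_t) &= \log p(x_t \mid c) + \log p(c) - \log p(x_t), \\
\nabla_{x_t} \log p(c \mid x_t) &= \nabla_{x_t} \log p(x_t \mid c) + \underbrace{\nabla_{x_t} \log p(c)}_{=\,0} - \nabla_{x_t} \log p(x_t).
\end{aligned}
```

The second equality then follows by substituting the two score approximations $\epsilon_\theta(x_t, c, t) \approx -\sigma_t\nabla_{x_t}\log p(x_t|c)$ and $\epsilon_\theta(x_t, t) \approx -\sigma_t\nabla_{x_t}\log p(x_t)$.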

Along this line, the modified diffusion score in Equation 5 can be rewritten as:

$$\hat{\epsilon}_\theta(x_t, c, t) = \epsilon_\theta(x_t, t) + \gamma\left(\epsilon_\theta(x_t, c, t) - \epsilon_\theta(x_t, t)\right), \tag{7}$$

where $\gamma$ is also usually set as a global scalar parameter that controls the guidance degree of the condition. However, in this paper, we argue that the CFG scale should be spatially adaptive, allowing for balancing the inconsistency of semantic strengths for diverse semantic units in the image.
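The difference between a global and a spatially varying scale can be made concrete with a small NumPy sketch of Equations 6 and 7. The random arrays stand in for the two network outputs, and the hand-made `gamma_map` is purely hypothetical: it only shows that the same formula accepts a per-patch scale, and it is not the adaptive rule of S-CFG.

```python
import numpy as np

def classifier_score(eps_cond, eps_uncond, sigma_t):
    """Equation 6: implicit classifier score recovered from the two noise predictions."""
    return -(eps_cond - eps_uncond) / sigma_t

def cfg_combine(eps_cond, eps_uncond, gamma):
    """Equation 7. `gamma` may be a scalar (global CFG) or an array that broadcasts
    against the predictions, e.g. shape (1, H, W) for a per-patch scale map."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 8, 8))  # stand-in for eps_theta(x_t, t)
eps_cond = rng.standard_normal((4, 8, 8))    # stand-in for eps_theta(x_t, c, t)

# Global scale: every patch receives the same guidance strength.
eps_global = cfg_combine(eps_cond, eps_uncond, gamma=7.5)

# Hypothetical spatial scale map: stronger guidance on the left half only.
gamma_map = np.full((1, 8, 8), 5.0)
gamma_map[:, :, :4] = 10.0
eps_spatial = cfg_combine(eps_cond, eps_uncond, gamma_map)
```

Setting `gamma=1` recovers the purely conditional prediction and `gamma=0` the unconditional one, so the scale interpolates and extrapolates between the two.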

Figure 2: The overall framework of our S-CFG method. At each denoising step in diffusion models, the U-net backbone estimates both the diffusion score $\nabla_{x_t}\log p(x_t)$ and the conditional diffusion score $\nabla_{x_t}\log p(x_t|c)$, without and with the text prompt as input respectively, from which the classifier score $\nabla_{x_t}\log p(c|x_t)$ can be inferred. By extracting and exploiting the self-attention map $S^k_t$ and the cross-attention map $C^k_t$ in each attention layer of the U-net, we can obtain the region mask $m_{t,i}$ for each prompt token $i$. With the goal of unifying the classifier score norm across regions, the CFG scale map is then determined to control the semantic strengths spatially in the following step.

4 Methods

In this section, we introduce the technical details of Semantic-aware Classifier-Free Guidance (S-CFG); an overview of the framework is shown in Figure 2. At each denoising step in diffusion models, the current latent image is fed into the U-net backbone to estimate both the diffusion score and the conditional diffusion score, without and with the text prompt as input, respectively. With the extracted attention maps, we can derive region masks for the relatively independent semantic units. In particular, the cross-attention map is renormalized for assigning each patch to the corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic information, we set adaptive CFG scales on the region masks and obtain the scale map that rescales their classifier scores to a uniform level.

4.1 Semantic Map Generation

To control the amplification of diverse semantic units in a customized manner, we need to segment the latent image each time the CFG strategy defined in Equation 7 is applied, i.e., at each denoising step. However, this task is not trivial because the final image cannot be accessed during the generation process. Fortunately, the attention layers in the U-net backbone have been reported to contain valuable semantic information that captures relationships between the image and the text prompt [4, 44], which can be leveraged to efficiently extract semantic units.

Specifically, in most text-to-image diffusion models, the interaction between the text prompt and the generated image is performed through cross-attention mechanisms. In general, the denoising U-net consists of self-attention layers followed by cross-attention layers at certain resolutions. For example, SD places 16 self- and cross-attention layers at resolutions of 64, 32, 16, and 8. In the $k$-th attention layer, a self-attention map $S^{k}_{t}\in\mathbb{R}^{HW\times HW}$ and a cross-attention map $C^{k}_{t}\in\mathbb{R}^{HW\times L}$ are calculated over linear projections of the intermediate image spatial feature $z^{k}_{t}\in\mathbb{R}^{HW\times C}$ or the text embedding $e\in\mathbb{R}^{L\times D}$,

$$
\begin{aligned}
S_t^k &= \mathrm{Softmax}\left(\frac{Q_s(z^k_t)\,K_s(z^k_t)^{T}}{\sqrt{d}}\right),\\
C_t^k &= \mathrm{Softmax}\left(\frac{Q_c(z^k_t)\,K_c(e)^{T}}{\sqrt{d}}\right),
\end{aligned}
\tag{8}
$$

where $H$ and $W$ are the height and width at the current resolution, $L$ is the number of text tokens, $C$ is the number of image feature channels, $D$ is the token embedding dimension, and $Q_{*}(\cdot)$ and $K_{*}(\cdot)$ are linear projections with output dimension $d$.
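As a minimal sketch of Equation 8, the NumPy snippet below computes both attention maps for a single head. The random projection matrices, the toy sizes, and the names `attention_maps`, `Wq_s`, `Wk_s`, `Wq_c`, and `Wk_c` are illustrative assumptions, not part of any released implementation; real U-net layers use learned multi-head projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_maps(z, e, Wq_s, Wk_s, Wq_c, Wk_c):
    """z: (HW, C) image features, e: (L, D) text embedding.
    The W* matrices are hypothetical linear projections to dimension d."""
    d = Wq_s.shape[1]
    S = softmax((z @ Wq_s) @ (z @ Wk_s).T / np.sqrt(d))   # (HW, HW)
    C = softmax((z @ Wq_c) @ (e @ Wk_c).T / np.sqrt(d))   # (HW, L)
    return S, C

rng = np.random.default_rng(0)
HW, L, C_dim, D, d = 16, 5, 8, 12, 4
z = rng.normal(size=(HW, C_dim))
e = rng.normal(size=(L, D))
Wq_s, Wk_s, Wq_c = (rng.normal(size=(C_dim, d)) for _ in range(3))
Wk_c = rng.normal(size=(D, d))
S, C = attention_maps(z, e, Wq_s, Wk_s, Wq_c, Wk_c)
```

Each row of `S` is a distribution over patches and each row of `C` is a distribution over tokens, which is the property the segmentation steps rely on.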

4.1.1 Cross-Attention-based Semantic Segmentation

Intuitively, at each denoising step $t$, each row of $C_t^k$ defines a distribution over the text tokens, which is used to augment each patch with its most relevant textual tokens. Therefore, a higher probability $C_t^k[s,i]$ indicates a closer relationship between the current patch $s$ and the corresponding token $w_i$. Along this line, we propose to segment the latent image $x_t$ into the set of regions masked by $\{m_{t,1},\dots,m_{t,L}\}$, where the $i$-th mask $m_{t,i}\in\{0,1\}^{HW}$ corresponds to the semantic token $w_i$.

Specifically, we first employ a fusion process to obtain the final cross-attention map $C_t\in\mathbb{R}^{HW\times L}$. This fusion averages the cross-attention maps over the layers and heads at the two smallest resolutions, as these have been shown to contain the most substantial semantic information [10]. In particular, all attention maps are upsampled to the same size. Then, $C_t$ is renormalized along the spatial dimension, and the argmax operation is applied along the token dimension to determine the token that activates the current patch:

$$
\begin{aligned}
\hat{C}_t[s,i] &= \frac{C_t[s,i]}{\sum_{s'=1}^{HW} C_t[s',i]},\\
i_s &= \arg\max_i \hat{C}_t[s,i],
\end{aligned}
\tag{9}
$$

where $\hat{C}_t[s,i]$ estimates the likelihood that patch $s$ is assigned to token $w_i$. The corresponding region mask $m_{t,i}$ is derived by setting the elements in the patch set $\{s : i_s = i\}$ to 1 and all others to 0. Note that, in our practice, the renormalization in the above equation plays a crucial role in aligning tokens with image patches. Without it, $C_t$ tends to concentrate most of the attention on a single token, such as the START token, for all patches, which damages the semantic segmentation.
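The fusion and Equation 9 can be sketched in NumPy as follows. This is an illustration under stated assumptions: the text does not specify the upsampling method, so nearest-neighbour repetition is used here, and the function names and toy numbers are hypothetical. The toy map is built so that token 0, standing in for the START token, dominates every row.

```python
import numpy as np

def fuse_cross_attention(maps, out_size):
    """Average square cross-attention maps of shape (h, h, L) after
    nearest-neighbour upsampling to (out_size, out_size, L).
    Assumes out_size is a multiple of every h (an illustrative choice)."""
    upsampled = []
    for m in maps:
        factor = out_size // m.shape[0]
        upsampled.append(np.repeat(np.repeat(m, factor, axis=0), factor, axis=1))
    fused = np.mean(upsampled, axis=0)
    return fused.reshape(out_size * out_size, -1)   # (HW, L)

def segment(C_t):
    """Equation 9: renormalize over patches, then argmax over tokens."""
    C_hat = C_t / C_t.sum(axis=0, keepdims=True)
    i_s = C_hat.argmax(axis=1)                      # token index per patch
    num_tokens = C_t.shape[1]
    masks = (i_s[None, :] == np.arange(num_tokens)[:, None]).astype(np.uint8)
    return i_s, masks                               # masks: (L, HW)

# Toy map with 4 patches and 3 tokens; token 0 dominates every row.
C_t = np.array([[0.90, 0.08, 0.02],
                [0.85, 0.05, 0.10],
                [0.80, 0.15, 0.05],
                [0.95, 0.02, 0.03]])
naive = C_t.argmax(axis=1)      # without renormalization: token 0 everywhere
i_s, masks = segment(C_t)       # with renormalization: patches spread over tokens
```

In this toy case the plain argmax assigns every patch to token 0, while the renormalized version assigns patches to all three tokens, and each patch belongs to exactly one mask.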

The second column in Figure 3 shows an example result of the above semantic segmentation. The semantic maps successfully detect the rough locations of several important tokens, such as “astronaut” and “horse”. However, they often exhibit unclear object boundaries and may contain internal holes, particularly during the initial denoising steps. To alleviate this problem, we refine and complete the semantic map with self-attention maps, as described in the following section.

Figure 3: The latent image segmentation based on attention maps at different denoising steps. The first column shows the predicted image $x_0$ based on the current latent image $x_t$ and the noise estimation $\epsilon_\theta$ with Equation 2. The following three columns show the semantic segmentation maps with different strategies. Regions labeled by different colors correspond to different tokens. The last column shows the foreground mask detected by our approach.

4.1.2 Self-Attention-based Segmentation Completion

Specifically, we follow [44] and refine each cross-attention map $C_t^k$ by multiplying it with the corresponding self-attention map at each attention layer. The rationale is that self-attention maps estimate the correlation between patches, which allows cross-attention to compensate for incomplete activation regions and perform region completion. Meanwhile, note that $S_t^k$ can be interpreted as a transition matrix among all patches, where each element is nonnegative and each row sums to 1. We can therefore further enhance the region completion by propagating semantic information among patches, following the idea of feature propagation on graphs [20]. Thus, as in [54], we refine the cross-attention map $C_t^k$ as follows:

$$
\overline{C}_t^k = \frac{1}{R}\sum_{r=1}^{R}(S_t^k)^{r}\,C_t^k,
\tag{10}
$$

where $R$ is a hyper-parameter set to 4 in our experiments. With Equation 10, a refined version of the fused cross-attention map, i.e., $\overline{C}_t$, is computed and then substituted into Equation 9 to derive the refined segmentation masks. The fourth column in Figure 3 shows the corresponding results: the segmentation maps have clearer object boundaries and fewer internal holes, and are even better than those in the third column, which sets $R=1$.
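A small NumPy sketch of Equation 10 illustrates how propagation through the self-attention map can fill a hole. The function name and the hand-built 4-patch example are assumptions made for illustration. For brevity the example compares a plain row-wise argmax before and after refinement, whereas the method feeds the refined map into Equation 9.

```python
import numpy as np

def refine_cross_attention(S, C, R=4):
    """Equation 10: average of S^r C for r = 1..R."""
    refined = np.zeros_like(C, dtype=float)
    propagated = np.asarray(C, dtype=float)
    for _ in range(R):
        propagated = S @ propagated     # now holds S^r C
        refined += propagated
    return refined / R

# Patches 0-2 attend to each other uniformly (one object); patch 3 is isolated.
S = np.array([[1/3, 1/3, 1/3, 0.0],
              [1/3, 1/3, 1/3, 0.0],
              [1/3, 1/3, 1/3, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
# Patch 1 is a "hole": it prefers token 0 while its neighbours prefer token 1.
C = np.array([[0.2, 0.8],
              [0.7, 0.3],
              [0.2, 0.8],
              [0.9, 0.1]])
C_bar = refine_cross_attention(S, C, R=4)
```

After refinement, patch 1 follows its neighbours and prefers token 1, while the isolated patch 3 is unchanged. Because `S` is row-stochastic, the rows of `C_bar` still sum to 1.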

4.2 Semantic-Aware CFG

At each denoising step $t$, given the semantic units with masks $\{m_{t,1},\dots,m_{t,M}\}$, we now design the semantic-aware CFG strategy to control the strength of each semantic unit separately. In particular, note that image patches in different semantic units are usually more weakly related than patches within the same semantic unit. To simplify the discussion, we assume that different semantic units are independent of each other at any time step. Based on this assumption, we can derive the following expressions for the classifier $p(c|x_t)$:

$$
\begin{aligned}
p(c|x_t) &= \prod_{i=1}^{L} p(w_i \mid m_{t,i}\odot x_t),\\
\nabla_{x_t}\log p(w_i \mid m_{t,i}\odot x_t) &= m_{t,i}\odot\nabla_{x_t}\log p(c|x_t),
\end{aligned}
\tag{11}
$$

where $m_{t,i}$ is interpolated and reshaped to the same size as $x_t$, and $\odot$ is the element-wise product. (The detailed derivation can be found in the Appendix.) Then, instead of using a single scalar to control the guidance degrees of all semantic units, as in Equations 5 and 7, we define the composed diffusion score function as follows:

$$
\begin{aligned}
\hat{\epsilon}_\theta(x_t,c,t) ={}& \epsilon_\theta(x_t,t)\\
&+ \sum_{i=1}^{M}\gamma_{t,i}\,m_{t,i}\odot\left(\epsilon_\theta(x_t,c,t)-\epsilon_\theta(x_t,t)\right),
\end{aligned}
\tag{12}
$$

where each term in the sum corresponds to the log-density gradient in Equation 11 for the semantic token $w_i$, and $\gamma_{t,i}$ is the scalar parameter that controls the strength of the corresponding semantic information. In particular, when every $\gamma_{t,i}$ is set to the same value $\gamma$, the above equation reduces to the original CFG strategy in Equation 7, because the masks assign each patch to exactly one region.
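Equation 12 can be sketched in a few lines of NumPy. The tensor layout `(H, W, Ch)`, the function name `s_cfg_noise`, and the toy partition are assumptions for illustration; the snippet also checks the reduction to a single global scale when all regional scales are equal and the masks partition the image.

```python
import numpy as np

def s_cfg_noise(eps_uncond, eps_cond, masks, gammas):
    """Equation 12 for eps of shape (H, W, Ch), binary masks of shape
    (M, H, W), and per-region scales gammas of shape (M,)."""
    # Sum of gamma_i * m_i gives one guidance scale per patch.
    scale_map = np.einsum("i,ihw->hw", gammas, masks.astype(float))
    return eps_uncond + scale_map[..., None] * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
H, W, Ch, M = 4, 4, 3, 2
eps_uncond = rng.normal(size=(H, W, Ch))
eps_cond = rng.normal(size=(H, W, Ch))
labels = np.arange(H * W).reshape(H, W) % M          # toy partition into M regions
masks = (labels[None] == np.arange(M)[:, None, None])

# Equal scales everywhere: same result as one global guidance scale.
uniform = s_cfg_noise(eps_uncond, eps_cond, masks, np.full(M, 7.5))
plain_cfg = eps_uncond + 7.5 * (eps_cond - eps_uncond)

# Different scales per region.
regional = s_cfg_noise(eps_uncond, eps_cond, masks, np.array([3.0, 9.0]))
```

Collapsing the sum into a per-patch scale map is a design choice of this sketch; it is equivalent to summing the masked terms one by one because the masks are binary and applied element-wise.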

Figure 4: Quantitative evaluation results as trade-off curves of FID-30K versus CLIP Score for (a) SD-v1.5, (b) SD-v2.1, and (c) DeepFloyd IF.

4.2.1 Adaptive CFG Scale $\gamma_{t,i}$

Here, we further propose an approach to adaptively set the CFG scale $\gamma_{t,i}$. The primary objective is to achieve a balanced amplification of diverse semantic units during each denoising step. To this end, an intuitive idea is to rescale the classifier scores in different semantic regions to a benchmark scale, so that all semantic units undergo a comparable magnitude of change throughout the denoising process. Specifically, $\gamma_{t,i}$ is defined as follows:

$$
\begin{aligned}
\eta_t &= \|\epsilon_\theta(x_t,c,t)-\epsilon_\theta(x_t,t)\|_2 \in \mathbb{R}^{HW},\\
\gamma_{t,i} &= \gamma\,\frac{|m_{t,b}\odot\eta_t|}{|m_{t,i}\odot\eta_t|}\,\frac{|m_{t,i}|}{|m_{t,b}|},
\end{aligned}
\tag{13}
$$

where $\|\cdot\|_2$ is the vector 2-norm applied along the last dimension of a tensor, and $|\cdot|$ sums all elements of a vector or matrix. $\gamma$ is a hyper-parameter shared by all samples and time steps, as in the original CFG strategy. In particular, the mask $m_{t,b}\in\{0,1\}^{HW}$ is introduced to specify the benchmarking region. For example, when $m_{t,b}$ is set to 1 for every patch, the benchmark scale is the average patch norm over the whole latent image. Here we also introduce another benchmarking region for better performance, i.e., the foreground region, such as the union of the regions of “astronaut” and “horse” in Figure 1.
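A short check, not part of the original derivation, makes the balancing effect of Equation 13 explicit. Assuming region $i$ is non-empty and $|m_{t,i}\odot\eta_t|>0$, multiplying $\gamma_{t,i}$ by the average guidance norm inside region $i$ gives

$$
\gamma_{t,i}\,\frac{|m_{t,i}\odot\eta_t|}{|m_{t,i}|}
= \gamma\,\frac{|m_{t,b}\odot\eta_t|}{|m_{t,i}\odot\eta_t|}\,\frac{|m_{t,i}|}{|m_{t,b}|}\,\frac{|m_{t,i}\odot\eta_t|}{|m_{t,i}|}
= \gamma\,\frac{|m_{t,b}\odot\eta_t|}{|m_{t,b}|}.
$$

That is, after rescaling, every region has the same average guidance magnitude, namely $\gamma$ times the average patch norm over the benchmarking region, independent of $i$.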

To identify the foreground region, note that when estimating the unconditional score $\nabla_{x_t}\log p(x_t)$, an empty prompt $\emptyset$ is fed into the model, i.e., $\epsilon_\theta(x_t,\emptyset,t)$, where $\emptyset$ is usually represented as a list of padding tokens with a start token. Based on our approach in Section 4.1, we can detect the semantic region of the START token, $m_{t,\text{START}}$, which effectively indicates the background area in our implementation (see the last column in Figure 3). Therefore, we can align the benchmarking region with the foreground region by setting:

$$
m_{t,b} = 1 - m_{t,\text{START}}.
\tag{14}
$$
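Equations 13 and 14 can be sketched together in NumPy. The function name, the `(H, W, Ch)` layout, the `start_index` argument, and the `tiny` guard against empty regions are assumptions added for illustration and are not taken from the released code.

```python
import numpy as np

def adaptive_scales(eps_uncond, eps_cond, masks, gamma, start_index=None, tiny=1e-8):
    """Equations 13-14 for eps of shape (H, W, Ch) and binary masks (M, H, W).
    If start_index is None, the whole image is the benchmarking region;
    otherwise the complement of the START-token mask is used."""
    eta = np.linalg.norm(eps_cond - eps_uncond, axis=-1)       # (H, W)
    masks = masks.astype(float)
    m_b = np.ones_like(eta) if start_index is None else 1.0 - masks[start_index]
    bench_mean = (m_b * eta).sum() / max(m_b.sum(), tiny)      # |m_b . eta| / |m_b|
    region_sum = (masks * eta).sum(axis=(1, 2))                # |m_i . eta|
    region_size = masks.sum(axis=(1, 2))                       # |m_i|
    # The `tiny` guard for empty regions is an added assumption.
    return gamma * bench_mean * region_size / np.maximum(region_sum, tiny)

rng = np.random.default_rng(0)
H, W, Ch, M, gamma = 8, 8, 4, 3, 7.5
eps_uncond = rng.normal(size=(H, W, Ch))
eps_cond = rng.normal(size=(H, W, Ch))
labels = np.arange(H * W).reshape(H, W) % M      # toy partition; region 0 plays START
masks = (labels[None] == np.arange(M)[:, None, None])
gammas_fg = adaptive_scales(eps_uncond, eps_cond, masks, gamma, start_index=0)
gammas_all = adaptive_scales(eps_uncond, eps_cond, masks, gamma)
```

With `start_index=0` the benchmark is the foreground of Equation 14; with the default it is the whole image, which is the setting described for models without a start token.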

5 Experiments

Benchmark Models. We include two diffusion models as base models: Stable Diffusion (SD) [34], which operates in the latent image space, and DeepFloyd IF (IF) [39], which operates in the image pixel space. Specifically, we consider two versions of SD, SD-v1.5 and SD-v2.1, which differ in model size and generative quality. For the IF model, we use the middle-scale version, IF-M, which is constructed from multiple diffusion models. For simplicity, two model stages are used: the base diffusion model produces low-resolution samples, and an upscaling diffusion model boosts them to a higher resolution. Both stages can benefit from the CFG or S-CFG strategy. Additionally, the IF model uses T5-XXL as the text encoder, which does not use a start token. Therefore, instead of assigning the foreground region based on the start token, we set the benchmarking mask $m_{t,b}$ in Equation 13 to 1 for every patch. All three models are publicly accessible.

Meanwhile, two samplers are considered for all three models, i.e., DDIM [44] and DPMSolver++ [24], both of which are widely used in practice. Specifically, for DDIM, we follow [34] and set the number of sampling steps to 250 for the SD models, with the noise variance parameter set to 0. For the IF model, which employs learnable noise variance parameters, we adhere to the original noise settings and conduct DDIM sampling with 50 steps. For DPMSolver++, we set the number of sampling steps to 50.

Figure 5: Samples generated by different base models with CFG (left) or S-CFG (right).

5.1 Quantitative Evaluation

We compare the benchmark models with CFG and S-CFG on the MSCOCO 256 $\times$ 256 dataset. Two quantitative metrics are used: 1) FID-30K: the zero-shot Fréchet Inception Distance computed with 30K images and the corresponding captions, which measures the quality and diversity of images. 2) CLIP Score [28]: 5K captions are randomly selected as prompts, and the CLIP model is used to assess the alignment between the generated images and their corresponding text prompts. In particular, the trade-off between FID and CLIP scores under varying CFG scales has been widely reported [26]. Therefore, we present the trade-off curve across a range of the global scale $\gamma\in[2.0,~3.0,~5.0,~7.5,~10.0]$.
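For readers unfamiliar with the alignment metric, the sketch below shows one common way a CLIP-style score is computed: the mean cosine similarity between paired image and text embeddings. The exact variant used in the experiments is not stated in this section, so the formula is an assumption (some implementations rescale the similarity or clip negative values), and the embeddings here are random stand-ins for the outputs of a CLIP model.

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """Mean cosine similarity between paired embeddings, each of shape (N, d)."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return float((img * txt).sum(axis=1).mean())

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(5, 16))     # placeholder image embeddings
text_emb = rng.normal(size=(5, 16))      # placeholder text embeddings
score = clip_score(image_emb, text_emb)

# One point of a trade-off curve would be recorded per global scale.
gamma_grid = [2.0, 3.0, 5.0, 7.5, 10.0]
```

A trade-off curve pairs, for each value in `gamma_grid`, the FID computed over the generated images with a score such as this one.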

Based on the results presented in Figure 4, our S-CFG strategy outperforms the original CFG strategy in most experimental settings: the trade-off curve of S-CFG lies toward the bottom right of that of the original CFG strategy (see the Appendix for a full detailed table). This demonstrates the effectiveness and robustness of S-CFG, establishing its applicability in both latent image space and pixel space for diffusion models of different sizes. In addition, we find that the diffusion sampler may be crucial for generative quality, especially for the pixel-space model, i.e., IF, where a significant performance gap is observed between DDIM and DPMSolver++. Nevertheless, S-CFG still achieves a performance improvement.

5.2 Human-Level Evaluation

Here, 80 prompts are randomly selected from the MSCOCO validation dataset to generate images with CFG and S-CFG. Then, we ask 5 participants to assess both the image quality and the image-text alignment. For each aspect, human raters are asked to select the superior of two synthesized images, one from the original CFG strategy and the other from our S-CFG strategy. For fairness, we use the same random seed for generating both images. The voting results are summarized in Table 1. The majority of votes go to our S-CFG strategy for all base models, demonstrating its superiority in both evaluated aspects.

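Table 1: Human-level evaluation results.

| Model | Image Quality (CFG) | Image Quality (S-CFG) | Image-Text (CFG) | Image-Text (S-CFG) |
|---|---|---|---|---|
| SD-v1.5 | 26.78% | 73.22% | 23.20% | 76.80% |
| SD-v2.1 | 28.16% | 71.84% | 31.85% | 68.15% |
| IF | 32.39% | 67.61% | 29.17% | 70.83% |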

5.3 Qualitative Evaluation

In Figure 5, we show some samples generated by different models with CFG and S-CFG. For fairness, we use the same settings and random seed for both strategies. The results exhibit a notable enhancement in the models’ generative capacity in terms of semantic expressiveness and entity portrayal. For example, given the prompt “A boy is playing Pokemon”, S-CFG improves SD-v1.5 by rendering the boy with a natural appearance. In the case of “A person petting a small elephant statue”, S-CFG eliminates the irregular elephant trunk. Similar improvements in fine-grained structure completion can also be observed for SD-v2.1 and IF in the first two rows. Furthermore, for the prompts in the last rows, such as “A cat sitting … on a park bench”, “A plate of meat topped …” and “A man in a suit with a blue tie …”, S-CFG helps models generate images that accurately represent the semantic descriptions.

Figure 6: The ablation analysis by evaluating the performance of different components in S-CFG.

5.4 Ablation Analysis

Here, three variants of S-CFG are introduced: 1) S-CFG-mean sets the benchmarking mask $m_{t,b}$ to 1 for all patches. 2) S-CFG w/o sa is the variant without the segmentation completion based on self-attention maps. 3) S-CFG-sa is the variant with $R=1$ in Equation 10.
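The variants differ only in a few switches, which can be summarized as a small configuration sketch. The class and field names below are hypothetical and chosen for illustration; they do not mirror any released code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SCFGConfig:
    # Hypothetical switches describing one S-CFG variant.
    use_self_attention: bool = True   # apply Equation 10 before Equation 9
    R: int = 4                        # propagation steps in Equation 10
    benchmark: str = "foreground"     # "foreground" (Equation 14) or "all"

VARIANTS = {
    "S-CFG": SCFGConfig(),
    "S-CFG-mean": SCFGConfig(benchmark="all"),
    # R is unused when self-attention completion is disabled.
    "S-CFG w/o sa": SCFGConfig(use_self_attention=False),
    "S-CFG-sa": SCFGConfig(R=1),
}
```

Each variant changes exactly one field relative to the full method, which is what allows the ablation to attribute differences to a single component.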

The results in Figure 6, based on SD-v1.5, demonstrate that all variants of S-CFG consistently outperform the original CFG strategy. This observation strongly supports our core idea of setting customized CFG scales for different semantic regions throughout the denoising process. In addition, compared to the other variants, S-CFG-mean exhibits greater performance instability and fails to achieve the optimal CLIP Score at the lowest FID score. This verifies the advantage of using the foreground region described in Equation 14 as the benchmarking region. Meanwhile, S-CFG w/o sa falls short of S-CFG-sa and S-CFG, albeit by a relatively small margin, which highlights the effectiveness of self-attention-based segmentation completion. Furthermore, while S-CFG-sa and S-CFG demonstrate similar performance levels, Figure 3 shows that S-CFG exhibits superior segmentation capability, which should result in more accurate image generation. However, these improvements may not be fully captured by the current evaluation metrics.

Table 2: Performance comparisons of ControlNet with CFG and S-CFG, where the base model is SD-v1.5, the parameter $\gamma=3.0$, and the sampler is DPMSolver++ with 50 steps.

| Condition | FID (CFG) | FID (S-CFG) | CLIP Score (CFG) | CLIP Score (S-CFG) |
|---|---|---|---|---|
| Canny | 8.670 | 8.382 | 0.3006 | 0.3019 |
| Segmentation | 9.595 | 9.549 | 0.3004 | 0.3017 |

5.5 Downstream Tasks

Here, we extend the evaluations from foundational image generation to more specialized downstream tasks.

First, we incorporate S-CFG into ControlNet [50], a neural network architecture for adding various spatial conditioning controls to text-to-image diffusion models. Specifically, we utilize SD-v1.5 as the base model, with Canny edge maps and segmentation maps as the spatial conditions. Table 2 presents a performance comparison between CFG and S-CFG. The results show that S-CFG improves both FID and CLIP Score under both conditions. Some examples are illustrated in Figure 7, showcasing notable improvements in image realism. Specifically, in the Canny case of the duck toy, S-CFG enhances the structure of the duck’s mouth and rectifies color imbalances around the tail. Likewise, in the segmentation case of the house, ControlNet with CFG fails to synthesize the background sky, whereas S-CFG successfully addresses this issue.

We have also integrated S-CFG into DreamBooth [36], which enables the personalization of text-to-image diffusion models with specific subjects using only a few subject images. The examples presented in Figure 8 highlight the improvements in image quality and text-image alignment achieved by S-CFG. For instance, S-CFG enhances the appearance of the dog’s mouth and brings the length of the toy’s legs closer to the input images. Notably, in the second row, DreamBooth with CFG fails to align the image with the text prompt “river”, whereas S-CFG succeeds.

Figure 7: Samples generated by ControlNet with CFG (middle) or S-CFG (right).

Figure 8: Samples generated by DreamBooth with CFG (middle) or S-CFG (right). The token “sks” represents the shared subject among the input images.

6 Conclusion

This paper argues that classifier-free guidance (CFG) in text-to-image diffusion models suffers from spatial inconsistency in semantic strengths and suboptimal image quality. To address this, we propose Semantic-aware CFG (S-CFG), which customizes the guidance degrees for different semantic units. Specifically, we first design a training-free semantic segmentation method to partition the latent image into relatively independent semantic regions at each denoising step. Then, the CFG scales across regions are adaptively adjusted to rescale the classifier scores to a uniform level. Experiments on multiple diffusion models demonstrate the superiority of S-CFG.

7 Acknowledgments

This research was supported by grants from the National Key R&D Program of China (No. 2022ZD0119302).

References

  • Bar-Tal et al. [2022] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European conference on computer vision, pages 707–723. Springer, 2022.
  • Betker et al. [2023] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, and Yunxin Jiao. Improving image generation with better captions. openai.com, 2023.
  • Brock et al. [2018] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2018.
  • Chefer et al. [2023] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
  • Chen et al. [2024] Minghao Chen, Iro Laina, and Andrea Vedaldi. Training-free layout control with cross-attention guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5343–5353, 2024.
  • Couairon et al. [2023] Guillaume Couairon, Marlene Careil, Matthieu Cord, Stéphane Lathuilière, and Jakob Verbeek. Zero-shot spatial layout conditioning for text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2174–2183, 2023.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Goodfellow et al. [2020] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • Guo and Lin [2023] Qin Guo and Tianwei Lin. Focus on your instruction: Fine-grained and multi-instruction image editing by attention modulation. arXiv preprint arXiv:2312.10113, 2023.
  • Hertz et al. [2022] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  • Ho and Salimans [2021] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Ho et al. [2022] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249–2281, 2022.
  • Huang et al. [2023a] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36, 2023a.
  • Huang et al. [2023b] Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023b.
  • Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
  • Karras et al. [2022] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
  • Kim et al. [2022] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.
  • Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kipf and Welling [2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • Liu et al. [2022] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022.
  • Liu et al. [2023] Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, and Trevor Darrell. More control for free! image synthesis with semantic diffusion guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 289–299, 2023.
  • Lu et al. [2022a] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022a.
  • Lu et al. [2022b] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022b.
  • Lugmayr et al. [2022] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.
  • Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • Parmar et al. [2023] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  • Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, pages 8821–8831. PMLR, 2021.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Razavi et al. [2019] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019.
  • Reed et al. [2016] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International conference on machine learning, pages 1060–1069. PMLR, 2016.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  • Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  • Shen et al. [2021] Dazhong Shen, Chuan Qin, Chao Wang, Hengshu Zhu, Enhong Chen, and Hui Xiong. Regularizing variational autoencoder with diversity and uncertainty awareness. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 2964–2970. International Joint Conferences on Artificial Intelligence Organization, 2021. Main Track.
  • Shonenkov et al. [2023] Alex Shonenkov, Misha Konstantinov, Daria Bakshandaeva, Christoph Schuhmann, Ksenia Ivanova, and Nadiia Klokova. Deepfloyd if, 2023. https://www.deepfloyd.ai/deepfloyd-if.
  • Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, pages 2256–2265. PMLR, 2015.
  • Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • Song et al. [2023] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
  • Wang et al. [2024] Fu-Yun Wang, Zhaoyang Huang, Xiaoyu Shi, Weikang Bian, Guanglu Song, Yu Liu, and Hongsheng Li. Animatelcm: Accelerating the animation of personalized diffusion models and adapters with decoupled consistency learning. arXiv preprint arXiv:2402.00769, 2024.
  • Wang et al. [2023a] Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, and Dong Xu. Diffusion model is secretly a training-free open vocabulary semantic segmenter. arXiv preprint arXiv:2309.02773, 2023a.
  • Wang et al. [2023b] Kai Wang, Fei Yang, Shiqi Yang, Muhammad Atif Butt, and Joost van de Weijer. Dynamic prompt learning: Addressing cross-attention leakage for text-based image editing. arXiv preprint arXiv:2309.15664, 2023b.
  • Wang et al. [2023c] Ruichen Wang, Zekang Chen, Chen Chen, Jian Ma, Haonan Lu, and Xiaodong Lin. Compositional text-to-image synthesis with attention map control of diffusion models. arXiv preprint arXiv:2305.13921, 2023c.
  • Xie et al. [2023] Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, and Mike Zheng Shou. Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7452–7461, 2023.
  • Xu et al. [2018] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018.
  • Zhang et al. [2023a] Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. Text-to-image diffusion model in generative ai: A survey. arXiv preprint arXiv:2303.07909, 2023a.
  • Zhang et al. [2023b] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023b.
  • Zhang and Chen [2022] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902, 2022.
  • Zhang et al. [2022] Qinsheng Zhang, Molei Tao, and Yongxin Chen. gddim: Generalized denoising diffusion implicit models. arXiv preprint arXiv:2206.05564, 2022.
  • Zhao et al. [2023] Peiang Zhao, Han Li, Ruiyang Jin, and S Kevin Zhou. Loco: Locally constrained training-free layout-to-image synthesis. arXiv preprint arXiv:2311.12342, 2023.
  • Zhu and Koniusz [2020] Hao Zhu and Piotr Koniusz. Simple spectral graph convolution. In International conference on learning representations, 2020.

Supplementary Material

8 Deriving Equation 11

In this section, we provide a derivation of Equation 11 based on one assumption, which may not hold strictly: for any denoising step $t$, the semantic units corresponding to the token set $\{w_1, \dots, w_L\}$, with masks $\{m_{t,1}, \dots, m_{t,L}\}$, are independent of each other. Under this assumption, we can derive:

\[
\begin{split}
p(w_{i}|x_{t}) &= p\Big(w_{i}\,\Big|\,\sum_{j=1}^{L} m_{t,j}\odot x_{t}\Big)\\
&= \frac{\prod_{j=1}^{L} p(m_{t,j}\odot x_{t}|w_{i})\,p(w_{i})}{\prod_{j=1}^{L} p(m_{t,j}\odot x_{t})}\\
&= \frac{p(m_{t,i}\odot x_{t}|w_{i})\,p(w_{i})\prod_{j=1,j\neq i}^{L} p(m_{t,j}\odot x_{t})}{\prod_{j=1}^{L} p(m_{t,j}\odot x_{t})}\\
&= \frac{p(m_{t,i}\odot x_{t}|w_{i})\,p(w_{i})}{p(m_{t,i}\odot x_{t})}\\
&= p(w_{i}|m_{t,i}\odot x_{t}).
\end{split}
\]

Then, we can deduce Equation 11 as follows:

\[
\begin{split}
p(c|x_{t}) &= \prod_{i=1}^{L} p(w_{i}|x_{t})\\
&= \prod_{i=1}^{L} p(w_{i}|m_{t,i}\odot x_{t}).\\
\nabla_{x_{t}}\log p(w_{i}|m_{t,i}\odot x_{t}) &= \nabla_{m_{t,i}\odot x_{t}}\log p(w_{i}|m_{t,i}\odot x_{t})\\
&= \nabla_{m_{t,i}\odot x_{t}}\log p(w_{i}|x_{t})\\
&= \nabla_{m_{t,i}\odot x_{t}}\log p(c|x_{t})\\
&= m_{t,i}\odot\nabla_{x_{t}}\log p(c|x_{t}).
\end{split}
\]

Note that this independence assumption may not hold strictly in practice. However, it is intuitive that patches in different semantic regions are more independent of each other than patches within the same region. Meanwhile, based on the segmentation examples in Figure 3 and our experimental results, we believe that it is beneficial to segment the latent image and customize guidance degrees for different semantic regions.
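The last step of the derivation can be checked numerically on a toy example. The sketch below is not the paper's model: it assumes a hypothetical log-density that factorizes exactly over two disjoint regions (the names `log_p_regions`, `numerical_grad`, the masks, and the targets are all illustrative) and verifies that the gradient of one region's term equals the masked gradient of the full log-density.

```python
import numpy as np

def log_p_regions(x, masks, targets):
    """Toy stand-in for log p(c | x) that factorizes over disjoint regions.

    Region i contributes -0.5 * ||m_i * (x - target_i)||^2, so the regions
    do not interact (this is the independence assumption, made exact here).
    """
    return sum(-0.5 * np.sum((m * (x - t)) ** 2) for m, t in zip(masks, targets))

def numerical_grad(f, x, eps=1e-5):
    """Central finite-difference gradient, computed element by element."""
    g = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        d = np.zeros_like(x)
        d[idx] = eps
        g[idx] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 4))              # toy "latent image"

# Two disjoint binary masks that together cover the whole latent.
m1 = np.zeros((4, 4))
m1[:, :2] = 1.0
m2 = 1.0 - m1
masks = [m1, m2]
targets = [np.full((4, 4), 1.0), np.full((4, 4), -1.0)]

# Gradient of the full log-density, and of region 1's term alone.
grad_full = numerical_grad(lambda z: log_p_regions(z, masks, targets), x)
grad_region1 = numerical_grad(lambda z: log_p_regions(z, [m1], [targets[0]]), x)

# Under exact independence, grad of region 1's term equals m1 * grad_full.
max_gap = float(np.max(np.abs(grad_region1 - m1 * grad_full)))
print("max |grad_region1 - m1 * grad_full| =", max_gap)
```

In a real diffusion model the regions are only approximately independent, so the identity would hold only approximately.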

9 More Experimental Details

Benchmark Models. In our experiments, we use three publicly accessible diffusion models as benchmarks:

  • Stable Diffusion v1.5 (SD-v1.5), a diffusion model in the latent space of powerful pre-trained autoencoders (https://huggingface.co/runwayml/stable-diffusion-v1-5), which uses CLIP [28] as the text encoder and outputs images at a resolution of 512×512.

  • Stable Diffusion v2.1 (SD-v2.1), a variant of SD-v1.5 with a larger model size (https://huggingface.co/stabilityai/stable-diffusion-2-1), which can output images at a resolution of 768×768.

  • DeepFloyd IF (IF), a diffusion model in the pixel space (https://huggingface.co/DeepFloyd/IF-I-M-v1.0), which is constructed from multiple diffusion models with T5XXL as the text encoder. In particular, we use the first two stages of the middle-scale version, i.e., IF-I-M-v1.0 and IF-II-M-v1.0, which produce 64×64 images and upscale them to 256×256, respectively.

Quantitative Metric. Two quantitative metrics based on the MSCOCO validation dataset are used:

  • FID-30K, where the FID score is computed between 30K images generated with prompts selected from the validation set and the corresponding original images.

  • CLIP Score, where 5K captions are selected randomly for guiding image synthesis, and CLIP-VIT-G-14 (https://huggingface.co/laion/CLIP-ViT-g-14-laion2B-s34B-b88K) is used to compute the similarity between each generated image and the corresponding caption.
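The two metrics can be sketched in a few lines once features have been extracted. This is a minimal illustration, not the evaluation code used in the paper: it assumes the CLIP Score is the mean cosine similarity between paired image and text embeddings, and that FID is the usual Fréchet distance between Gaussians fitted to two feature sets. The function names and the random features below are placeholders for real CLIP embeddings and Inception features.

```python
import numpy as np

def clip_score(image_emb, text_emb):
    """Mean cosine similarity between paired image and text embeddings (rows)."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum(img * txt, axis=1)))

def frechet_distance(feat_a, feat_b):
    """Frechet distance between Gaussians fitted to two feature sets (rows = samples).

    d^2 = ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 Tr((C_a C_b)^{1/2})
    """
    mu_a, mu_b = feat_a.mean(axis=0), feat_b.mean(axis=0)
    cov_a = np.cov(feat_a, rowvar=False)
    cov_b = np.cov(feat_b, rowvar=False)
    # For PSD covariances, Tr((C_a C_b)^{1/2}) is the sum of the square roots
    # of the eigenvalues of C_a @ C_b; tiny negative values are clipped.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

# Placeholder features standing in for real image/text embeddings.
rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 16))
fake_feats = rng.normal(loc=0.1, size=(500, 16))
img_emb = rng.normal(size=(8, 32))
txt_emb = img_emb + 0.5 * rng.normal(size=(8, 32))

print("Frechet distance:", frechet_distance(real_feats, fake_feats))
print("CLIP-style score:", clip_score(img_emb, txt_emb))
```

Lower Fréchet distance and higher cosine similarity are better, which is why the two metrics are plotted against each other as a trade-off.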

Note that our metric settings may differ from those in the official reports of the SD and IF models. Somewhat unexpectedly, SD-v2.1 fails to outperform SD-v1.5 in our settings. We therefore add another comparison under a setting similar to their official report (https://huggingface.co/stabilityai/stable-diffusion-2), i.e., FID-10K and CLIP Score (CLIP-VIT-G-14) on the MSCOCO dataset with the 50-step DDIM sampler. The results are shown in Figure 9, where our S-CFG strategy also outperforms the original CFG strategy.

Figure 9: The trade-off curve of FID-10K vs. CLIP Score with the DDIM sampler. Panels: (a) SD-v1.5, (b) SD-v2.1.

10 Analysis on the Efficiency

Here, we provide an additional analysis of the time cost of our S-CFG strategy. Specifically, we use DPMSolver++ with 50 steps as the sampler to generate images with different base models. All programs run on a single A100 GPU. Table 3 shows the time cost for generating a sample, averaged over 10 runs. S-CFG adds only a small time cost (less than 3%) compared with the original CFG strategy.

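Table 3: The analysis on the time cost. The last column is the relative increase in time of S-CFG over CFG.

| Model   | CFG   | S-CFG | Increase |
|---------|-------|-------|----------|
| SD-v1.5 | 2.773 | 2.848 | 2.70%    |
| SD-v2.1 | 7.054 | 7.167 | 1.60%    |
| IF      | 8.595 | 8.847 | 2.93%    |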
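The percentages in the last column of Table 3 appear to be the relative increase of the S-CFG time over the CFG time. A short check, assuming that definition, with the timings copied from the table (the helper name `relative_overhead` is illustrative):

```python
# Average per-sample time for (CFG, S-CFG), copied from Table 3.
timings = {
    "SD-v1.5": (2.773, 2.848),
    "SD-v2.1": (7.054, 7.167),
    "IF": (8.595, 8.847),
}

def relative_overhead(cfg_time, scfg_time):
    """Extra time of S-CFG relative to CFG, in percent."""
    return (scfg_time - cfg_time) / cfg_time * 100.0

overheads = {name: relative_overhead(*pair) for name, pair in timings.items()}
for name, pct in overheads.items():
    print(f"{name}: {pct:.2f}%")
```

Under this definition the computed values match the reported 2.70%, 1.60%, and 2.93%.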

11 More Ablation Analysis

Here, we provide an additional ablation analysis of S-CFG on diffusion models with multiple stages, such as DeepFloyd IF [39]. We address the question: should the S-CFG strategy be used in all diffusion stages? Specifically, based on the IF model used in our paper, we compare the performance of three methods:

  • S-CFG-first, where the S-CFG strategy is only used in the first diffusion model, i.e., IF-I-M-v1.0.

  • S-CFG-second, where the S-CFG strategy is only used in the second diffusion model, i.e., IF-II-M-v1.0.

  • S-CFG, where the S-CFG strategy is used in both diffusion models.

In addition, the original CFG strategy is included as a baseline. We use DPMSolver++ as the sampler with 50 steps and vary the parameter $\gamma$ in [2.0, 3.0, 5.0, 7.5, 10.0]. The trade-off curve of FID-30K vs. CLIP Score is shown in Figure 10. S-CFG tends to achieve the best trade-off between FID-30K and CLIP Score, while S-CFG-first and S-CFG-second perform similarly.
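The ablation grid described above can be enumerated as a small configuration sweep. This is only an organizational sketch: the dictionary keys and the `build_runs` helper are hypothetical, and no actual sampling or model API is invoked.

```python
from itertools import product

GAMMAS = [2.0, 3.0, 5.0, 7.5, 10.0]

# For each variant: (use S-CFG in stage I, use S-CFG in stage II).
VARIANTS = {
    "CFG": (False, False),          # baseline: plain CFG in both stages
    "S-CFG-first": (True, False),   # S-CFG only in IF-I-M-v1.0
    "S-CFG-second": (False, True),  # S-CFG only in IF-II-M-v1.0
    "S-CFG": (True, True),          # S-CFG in both stages
}

def build_runs():
    """Return one configuration dict per (variant, gamma) pair."""
    runs = []
    for (name, (stage1, stage2)), gamma in product(VARIANTS.items(), GAMMAS):
        runs.append({
            "variant": name,
            "gamma": gamma,
            "stage1_guidance": "S-CFG" if stage1 else "CFG",
            "stage2_guidance": "S-CFG" if stage2 else "CFG",
            "sampler": "DPMSolver++",
            "steps": 50,
        })
    return runs

runs = build_runs()
print(len(runs), "runs; first:", runs[0])
```

Each of the four variants is evaluated at five guidance scales, giving the points of the trade-off curves.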

Figure 10: The ablation analysis of the S-CFG on the diffusion model with multiple stages.

12 More Evaluation on Effectiveness

Recently, a new metric called T2I-CompBench [14] was introduced to evaluate diffusion models, which assesses image quality from 6 aspects and aligns with human preference better. Here, we provide another comparison based on this metric. The results in Table 4 show that SD-v2.1 outperforms SD-v1.5 significantly, and S-CFG performs better than CFG.

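Table 4: Evaluation on T2I-CompBench, where $\gamma = 7.5$. Shape, Color, and Texture belong to Attribute Binding; Non-Spatial and Spatial belong to Object Relationship.

| Model         | Shape  | Color  | Texture | Non-Spatial | Spatial | Complex |
|---------------|--------|--------|---------|-------------|---------|---------|
| SD-v1.5+CFG   | 0.3664 | 0.3761 | 0.4286  | 0.3109      | 0.111   | 0.2969  |
| SD-v1.5+S-CFG | 0.3793 | 0.3879 | 0.4288  | 0.3111      | 0.1182  | 0.2993  |
| SD-v2.1+CFG   | 0.4518 | 0.549  | 0.5146  | 0.3096      | 0.1512  | 0.3154  |
| SD-v2.1+S-CFG | 0.4558 | 0.5649 | 0.5333  | 0.3104      | 0.1567  | 0.3168  |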
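To make the comparison in Table 4 explicit, the per-category gain of S-CFG over CFG can be computed directly from the reported scores. The snippet below only re-reads the table; the variable names are illustrative.

```python
CATEGORIES = ["Shape", "Color", "Texture", "Non-Spatial", "Spatial", "Complex"]

# Scores copied from Table 4 (gamma = 7.5), in the order of CATEGORIES.
scores = {
    "SD-v1.5": {
        "CFG": [0.3664, 0.3761, 0.4286, 0.3109, 0.111, 0.2969],
        "S-CFG": [0.3793, 0.3879, 0.4288, 0.3111, 0.1182, 0.2993],
    },
    "SD-v2.1": {
        "CFG": [0.4518, 0.549, 0.5146, 0.3096, 0.1512, 0.3154],
        "S-CFG": [0.4558, 0.5649, 0.5333, 0.3104, 0.1567, 0.3168],
    },
}

# Absolute gain of S-CFG over CFG for every model and category.
gains = {
    model: {
        cat: round(scfg - cfg, 4)
        for cat, cfg, scfg in zip(CATEGORIES, rows["CFG"], rows["S-CFG"])
    }
    for model, rows in scores.items()
}

for model, per_cat in gains.items():
    print(model, per_cat)
```

All twelve differences are positive, with the largest gains in the attribute-binding categories and the smallest in Non-Spatial.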

13 Detailed Table of Experiments

Here, we show the detailed tables for the experiments in Figures 4 and 6. In Tables 5 to 7, for each base model and sampler, the best FID-30K and the best CLIP Score over all $\gamma$ values are achieved by S-CFG (for IF with DPMSolver++, the best CLIP Score is tied with CFG).

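Table 5: The trade-off curve of SD-v1.5, where the best FID-30K and CLIP Score are highlighted.

| $\gamma$ | DDIM, CFG: FID-30K | DDIM, CFG: CLIP Score | DDIM, S-CFG: FID-30K | DDIM, S-CFG: CLIP Score | DPMSolver++, CFG: FID-30K | DPMSolver++, CFG: CLIP Score | DPMSolver++, S-CFG: FID-30K | DPMSolver++, S-CFG: CLIP Score |
|------|--------|--------|--------|--------|--------|--------|--------|--------|
| 2.0  | 8.696  | 0.2948 | 8.656  | 0.2972 | 8.991  | 0.2954 | 9.023  | 0.2964 |
| 3.0  | 7.904  | 0.3097 | 7.802  | 0.3107 | 7.760  | 0.3091 | 7.717  | 0.3099 |
| 5.0  | 10.366 | 0.3184 | 10.069 | 0.3196 | 10.026 | 0.3182 | 9.757  | 0.3187 |
| 7.5  | 13.008 | 0.3217 | 12.620 | 0.3228 | 12.466 | 0.3223 | 12.059 | 0.3226 |
| 10.0 | 14.682 | 0.3230 | 14.101 | 0.3231 | 14.107 | 0.3235 | 13.694 | 0.3236 |

Table 6: The trade-off curve of SD-v2.1, where the best FID-30K and CLIP Score are highlighted.

| $\gamma$ | DDIM, CFG: FID-30K | DDIM, CFG: CLIP Score | DDIM, S-CFG: FID-30K | DDIM, S-CFG: CLIP Score | DPMSolver++, CFG: FID-30K | DPMSolver++, CFG: CLIP Score | DPMSolver++, S-CFG: FID-30K | DPMSolver++, S-CFG: CLIP Score |
|------|--------|--------|--------|--------|--------|--------|--------|--------|
| 2.0  | 14.394 | 0.3053 | 13.892 | 0.3068 | 14.999 | 0.3040 | 14.864 | 0.3060 |
| 3.0  | 10.509 | 0.3191 | 10.227 | 0.3204 | 10.869 | 0.3187 | 10.797 | 0.3200 |
| 5.0  | 10.429 | 0.3286 | 10.137 | 0.3306 | 10.241 | 0.3291 | 10.016 | 0.3304 |
| 7.5  | 11.548 | 0.3331 | 11.278 | 0.3342 | 11.324 | 0.3339 | 10.944 | 0.3342 |
| 10.0 | 12.604 | 0.3357 | 12.371 | 0.3359 | 12.166 | 0.3356 | 11.833 | 0.3359 |

Table 7: The trade-off curve of IF, where the best FID-30K and CLIP Score are highlighted.

| $\gamma$ | DDIM, CFG: FID-30K | DDIM, CFG: CLIP Score | DDIM, S-CFG: FID-30K | DDIM, S-CFG: CLIP Score | DPMSolver++, CFG: FID-30K | DPMSolver++, CFG: CLIP Score | DPMSolver++, S-CFG: FID-30K | DPMSolver++, S-CFG: CLIP Score |
|------|--------|--------|--------|--------|--------|--------|--------|--------|
| 2.0  | 9.820  | 0.3076 | 9.309  | 0.299  | 7.242  | 0.2997 | 8.494  | 0.2926 |
| 3.0  | 13.804 | 0.3195 | 10.864 | 0.3152 | 7.799  | 0.3147 | 7.227  | 0.314  |
| 5.0  | 17.267 | 0.3257 | 14.473 | 0.3259 | 11.396 | 0.3233 | 9.67   | 0.3226 |
| 7.5  | 18.532 | 0.329  | 16.621 | 0.3288 | 13.968 | 0.327  | 12.402 | 0.3265 |
| 10.0 | 19.029 | 0.3296 | 17.634 | 0.3299 | 15.31  | 0.3280 | 13.99  | 0.3280 |

Table 8: The trade-off curve in the ablation analysis, where the best FID-30K and CLIP Score are highlighted. The experiment is based on SD-v1.5 with the 50-step DPMSolver++ sampler.

| $\gamma$ | S-CFG-mean: FID-30K | S-CFG-mean: CLIP Score | S-CFG w/o sa: FID-30K | S-CFG w/o sa: CLIP Score | S-CFG-sa: FID-30K | S-CFG-sa: CLIP Score | S-CFG: FID-30K | S-CFG: CLIP Score |
|------|--------|--------|--------|--------|--------|--------|--------|--------|
| 2.0  | 10.703 | 0.2869 | 9.110  | 0.2963 | 9.063  | 0.2966 | 9.023  | 0.2964 |
| 3.0  | 7.695  | 0.3044 | 7.811  | 0.3089 | 7.736  | 0.3099 | 7.717  | 0.3099 |
| 5.0  | 8.813  | 0.3162 | 9.822  | 0.3185 | 9.755  | 0.3185 | 9.757  | 0.3187 |
| 7.5  | 11.204 | 0.3213 | 12.102 | 0.3222 | 12.083 | 0.3227 | 12.059 | 0.3226 |
| 10.0 | 12.838 | 0.3233 | 13.722 | 0.3235 | 13.690 | 0.3235 | 13.694 | 0.3236 |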
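The trade-off curves can also be summarized programmatically by picking, for each method, the guidance scale with the lowest FID-30K and the one with the highest CLIP Score. The sketch below does this for the SD-v1.5 results with DPMSolver++ copied from Table 5; the helper name `best_operating_points` is illustrative.

```python
GAMMAS = [2.0, 3.0, 5.0, 7.5, 10.0]

# SD-v1.5 with DPMSolver++, copied from Table 5 (one value per gamma).
table5_dpm = {
    "CFG": {
        "fid": [8.991, 7.760, 10.026, 12.466, 14.107],
        "clip": [0.2954, 0.3091, 0.3182, 0.3223, 0.3235],
    },
    "S-CFG": {
        "fid": [9.023, 7.717, 9.757, 12.059, 13.694],
        "clip": [0.2964, 0.3099, 0.3187, 0.3226, 0.3236],
    },
}

def best_operating_points(curve):
    """Return (gamma, value) for the lowest FID and for the highest CLIP Score."""
    i_fid = min(range(len(GAMMAS)), key=lambda i: curve["fid"][i])
    i_clip = max(range(len(GAMMAS)), key=lambda i: curve["clip"][i])
    return {
        "best_fid": (GAMMAS[i_fid], curve["fid"][i_fid]),
        "best_clip": (GAMMAS[i_clip], curve["clip"][i_clip]),
    }

summary = {method: best_operating_points(c) for method, c in table5_dpm.items()}
for method, points in summary.items():
    print(method, points)
```

For both methods the lowest FID-30K occurs at a guidance scale of 3.0 and the highest CLIP Score at 10.0, which illustrates why the two metrics are reported as a trade-off rather than at a single scale.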

14 Additional Qualitative Samples

In this section, we present supplementary samples in Figure 11 generated by different base models with CFG and S-CFG. These additional samples further exhibit the superiority of S-CFG compared with the original CFG strategy.

Figure 11: More samples generated by different base models with CFG (left) or S-CFG (right).