Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance

Dazhong Shen$^{1}$, Guanglu Song$^{2}$, Zeyue Xue$^{3}$, Fu-Yun Wang$^{4}$, Yu Liu$^{1,2,*}$

$^{1}$Shanghai Artificial Intelligence Laboratory, $^{2}$SenseTime Research, $^{3}$The University of Hong Kong, $^{4}$The Chinese University of Hong Kong

$^{*}$Corresponding author: [email protected]

Abstract

Classifier-Free Guidance (CFG) has been widely used in text-to-image diffusion models, where the CFG scale is introduced to control the strength of text guidance on the whole image space. However, we argue that a global CFG scale results in spatial inconsistency on varying semantic strengths and suboptimal image quality. To address this problem, we present a novel approach, Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance degrees for different semantic units in text-to-image diffusion models. Specifically, we first design a training-free semantic segmentation method to partition the latent image into relatively independent semantic regions at each denoising step. In particular, the cross-attention map in the denoising U-net backbone is renormalized for assigning each patch to the corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic units, we adaptively adjust the CFG scales across different semantic regions to rescale the text guidance degrees into a uniform level. Finally, extensive experiments demonstrate the superiority of S-CFG over the original CFG strategy on various text-to-image diffusion models, without requiring any extra training cost. Our code is available at https://github.com/SmilesDZgk/S-CFG.

1 Introduction

Recently, text-to-image generation has witnessed rapid development and various applications [33, 48, 30, 34, 31], where visually stunning images can be created by simply typing in a text prompt. In particular, after DDPM [12, 7] succeeded GANs [8, 3], diffusion models [40], such as Stable Diffusion [34] and DallE-3 [2], have emerged as the new state-of-the-art family for image-generative models.

The key feature of diffusion models is to approximate the true data distribution $p(x)$ by reversing the process of perturbing the data with noise progressively in a long iterative chain. To incorporate the text prompt $c$ into the final generation, it is necessary to enhance the likelihood of $c$ given the current latent image $x_{t}$ at each reversed diffusion step $t$. Instead of training extra classifiers to model $p(c|x_{t})$ at each diffusion step $t$ [7], classifier-free guidance (CFG) [11] has recently been proposed to estimate both the classifier score $\nabla_{x_{t}}\log p(c|x_{t})$ and the diffusion score $\nabla_{x_{t}}\log p(x_{t})$ with the same neural model, such as U-net [35]. In particular, an empirical CFG scale is introduced to control the strength of the text guidance on the whole image space.

Figure 1: A motivation example. The top row shows images generated by Stable Diffusion with CFG and S-CFG, where the prompt is “a photo of an astronaut riding a horse” and the segmentation maps are manually labeled (Ground, Sky, Horse, Astronaut). The bottom row shows the average norm curves of the estimated classifier score $\nabla_{x_{t}}\log p(c|x_{t})$ (solid line) and diffusion score $\nabla_{x_{t}}\log p(x_{t})$ (dashed line) in each semantic region. The Y-axis unit is set to the dynamic variance parameter $\sigma_{t}$ for clearer illustration, which does not affect the conclusion.

However, we argue that a global CFG scale results in spatial inconsistency on varying semantic strengths during the denoising process and suboptimal quality of the final image. Figure 1 shows samples generated by Stable Diffusion [34]. The images can be segmented into four semantic regions corresponding to “astronaut”, “horse”, “sky” and “ground”. To compare the guidance degrees assigned to different semantic units, the plots in the bottom row illustrate the average norm curves of the estimated classifier score $\nabla_{x_{t}}\log p(c|x_{t})$ and diffusion score $\nabla_{x_{t}}\log p(x_{t})$ in each semantic region at every time step. For the images generated with the original CFG strategy, the classifier score norm varies considerably across semantic units, while the norms of the diffusion scores are much closer to each other. Intuitively, a larger classifier score implies a greater guidance degree received by the semantic unit. As a result, the final generated samples may exhibit spatial inconsistency in image quality across semantic units. For instance, the “astronaut” region, which consistently attains the highest score ratio, displays intricate and finely detailed structures that starkly contrast with the “sky” and “ground” regions.

Along this line, in contrast to previous works, we propose to set customized CFG scales for different semantic regions of the latent image at each denoising step. In particular, we assume that the patches within each semantic region serve a similar semantic concept and that different regions are relatively independent. In this case, the classifier score $\nabla_{x_{t}}\log p(c|x_{t})$ can be approximately decomposed into a combination of scores, each conditioned on one independent semantic region. Therefore, customized CFG scales can be safely applied to each semantic region without disrupting the relative relations among interdependent patches. However, it is not trivial to conduct semantic segmentation on the latent image without accessing the final generated image. Meanwhile, determining the customized CFG scales to balance semantic units is another challenge.

To this end, in this paper, we propose a novel approach, called Semantic-aware Classifier-Free Guidance (S-CFG), to dynamically customize the text guidance degrees in text-to-image diffusion models. Specifically, when modeling the conditional distribution $p(x|c)$, diffusion models take $c$ as another input with self-attention and cross-attention layers to mix up the image and text, which preserves the underlying semantic information. Along this line, we first design a training-free segmentation method for the latent images at each denoising step. In particular, the cross-attention map in the denoising U-net backbone is renormalized for assigning each patch to the corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic information, we rescale the classifier score $\nabla_{x_{t}}\log p(c|x_{t})$ across different semantic regions to a uniform level with the adaptive CFG scales. Finally, we conduct qualitative and quantitative analysis based on various diffusion models. The results demonstrate that S-CFG can outperform the original CFG strategy and obtain a robust improvement without any extra training cost. At first glance, the right part of Figure 1 demonstrates reduced disparities among the classifier score norms $\nabla_{x_{t}}\log p(c|x_{t})$ of different semantic units in the image with S-CFG. As a result, more abundant clouds float in the “sky”, and the boundary between the “sky” and the “ground” is clearer.

2 Related Work

2.1 Image Diffusion Generative Models

Recently, diffusion models have emerged as an expressive and flexible family for image generation with remarkable image quality and various applications [30, 34, 31, 1, 18, 13, 25]. The general idea is to apply a forward diffusion process that adds tiny noise to the input data, then learn the reverse process with neural networks to gradually recover the original samples from the noisy data, step by step. Among them, the Denoising Diffusion Probabilistic Model (DDPM) [12] is the representative baseline, which carefully designed the noise schedule on the pixel space during the forward process and the network architecture in the reverse process. As a result, diffusion models achieved better mode coverage and training stability compared to GANs [8, 3, 16]. To further reduce computational costs, subsequent studies turned to combining DDPM and VAE [19, 32, 38] by applying diffusion models to the lower-dimensional latent space of a VAE trained on large-scale image datasets, such as Stable Diffusion [34]. In general, diffusion models suffer the downside of low inference speed compared to other generative models. However, this problem can be greatly alleviated by distillation strategies [42, 43] or advanced sampling strategies, such as DDIM [41, 52], DPMSolver [23, 24], PNDM [17], Euler [17], and DEIS [51], which can achieve a 10X to 100X speedup compared to the original DDPM sampler. Here, we further explore a better way for image generation based on diffusion models.

2.2 Text-guided Generation

Recently, text-guided generation in diffusion models has reached an unprecedented level, as in DallE-3 [2]. This generative power stems from three aspects. First, to represent the unstructured text, expressive language embedding models are used to embed each token in the given text, such as CLIP [28] in Stable Diffusion [34], and T5 [29] in Imagen [37]. Second, to facilitate the interaction between text and image information, diffusion models typically enhance the network backbone, such as the U-net backbone [35], with the cross-attention mechanism. This mechanism uses the image embedding as the query, while the key and value embeddings are derived from the text. Third, Classifier-Free Guidance (CFG) [11] has recently been widely adopted as a lightweight and robust technique to encourage text prompt adherence in generations. Instead of training extra classifiers [7, 22], CFG mixes the score estimates of the diffusion model with and without the conditional prompt. Some other works [21, 15] further separate a prompt into multiple concepts and generate an image by combining a set of diffusion models, each conditioned on a certain concept component. Here, we further emphasize the importance of varying CFG scales across different image semantic regions and design the semantic-aware CFG strategy to improve image quality.

2.3 Applications with Cross-Attention Maps

Cross-attention maps in the diffusion U-net backbone represent the spatial relation between image patches and prompt tokens. They provide valuable semantic information for image segmentation and can contribute to various applications. For example, some works [6, 5, 53, 47] introduce layout control in image generation by minimizing the difference between the cross-attention-based semantic segmentation and the given layout conditions. Prompt2Prompt [10] achieves image editing by simply replacing, adding, or re-weighting cross-attention maps. Attend-and-Excite [4] improves the text alignment by optimizing the cross-attention maps during the inference process. Subsequent works further extend those ideas for image-to-image translation [27], text-driven image editing [45, 9], and compositional image generation [46]. In this paper, we further use cross-attention maps to improve image quality by segmenting latent images and customizing the guidance degrees of different semantic regions.

3 Preliminary

3.1 Diffusion Models

Given the image data space $\mathcal{X}$ , diffusion models define a Markov Chain, known as the forward process, to corrupt the real data $x_{0}\in\mathcal{X}$ by progressively adding Gaussian noise from time steps $0$ to $T$ :

$$q(x_{t}|x_{t-1})=\mathcal{N}\left(x_{t};\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}\mathbf{I}\right), \tag{1}$$

where $\{\beta_{t}\}_{t=1:T}$ denotes the variance of each noise step, usually set to fixed constants. Taking advantage of the properties of the Gaussian distribution, we can obtain $x_{t}$ at an arbitrary time step $t$ using the following closed form:

$$x_{t}=\sqrt{\overline{\alpha}_{t}}\,x_{0}+\sqrt{1-\overline{\alpha}_{t}}\,\epsilon_{t},\quad\epsilon_{t}\sim\mathcal{N}(0,\mathbf{I}), \tag{2}$$

where $\alpha_{t}=1-\beta_{t}$ and $\overline{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$ . $x_{T}$ will degrade to standard Gaussian noise with $\overline{\alpha}_{T}\approx 0$ .
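To make Equations 1 and 2 concrete, the following is a minimal NumPy sketch of the closed-form forward sampling. The linear $\beta_{t}$ schedule, the array shapes, and the names `make_alpha_bar` and `q_sample` are illustrative assumptions and are not settings taken from this paper.

```python
import numpy as np

def make_alpha_bar(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t from the closed form in Equation 2; t is 1-indexed as in the text."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t - 1]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

# Illustrative linear schedule (an assumption, not the paper's configuration).
T = 1000
betas = np.linspace(1e-4, 2e-2, T)
alpha_bar = make_alpha_bar(betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8, 8))        # toy "image": 4 channels, 8x8
x_T, _ = q_sample(x0, T, alpha_bar, rng)   # close to pure Gaussian noise
```

With this schedule $\overline{\alpha}_{T}$ is close to zero, so `x_T` retains almost nothing of `x0`, which matches the remark that $x_{T}$ degrades to standard Gaussian noise.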

The reverse denoising process aims to approximate the true posterior of each forward step via a time-dependent neural network parameterized by $\theta$ :

$$p_{\theta}(x_{t-1}|x_{t})=\mathcal{N}\left(x_{t-1};\mu_{\theta}(x_{t},t),\sigma_{\theta}(x_{t},t)\mathbf{I}\right), \tag{3}$$

which can be used to generate image $x_{0}\sim p_{\theta}(x_{0})$ by sampling Gaussian noise $x_{T}\sim\mathcal{N}(0,\textbf{I})$ first and denoising step-by-step from $x_{T-1}$ to $x_{0}$ . In practice, to simplify the model training, $\sigma_{\theta}(x_{t},t)$ is set as constant $\sigma_{t}$ [7] and $\mu_{\theta}(x_{t},t)$ is parameterized as follows:

$$\mu_{\theta}(x_{t},t)=\frac{1}{\sqrt{\alpha_{t}}}\left(x_{t}-\frac{\beta_{t}}{\sqrt{1-\overline{\alpha}_{t}}}\,\epsilon_{\theta}(x_{t},t)\right), \tag{4}$$

where the neural model $\epsilon_{\theta}$, such as U-net [35], is trained to predict the noise $\epsilon_{t}$ added in each forward step, which also mirrors denoising score matching, i.e., $\epsilon_{\theta}(x_{t},t)\approx-\sigma_{t}\nabla_{x_{t}}\log p(x_{t})$.
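As a sanity check on the mean parameterization, the sketch below evaluates the standard DDPM posterior mean, with $\sqrt{1-\overline{\alpha}_{t}}$ in the denominator of the noise coefficient, and verifies that at $t=1$ it recovers $x_{0}$ exactly when the true noise is supplied in place of $\epsilon_{\theta}$. The schedule and the function name `reverse_mean` are assumptions made for illustration.

```python
import numpy as np

def reverse_mean(x_t, eps_pred, t, betas, alpha_bar):
    """Mean of p_theta(x_{t-1} | x_t) in the standard DDPM form; t is 1-indexed."""
    beta_t = betas[t - 1]
    coef = beta_t / np.sqrt(1.0 - alpha_bar[t - 1])
    return (x_t - coef * eps_pred) / np.sqrt(1.0 - beta_t)

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 2e-2, 1000)      # illustrative schedule (assumption)
alpha_bar = np.cumprod(1.0 - betas)

x0 = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal(x0.shape)
# Forward closed form (Equation 2) at t = 1.
x1 = np.sqrt(alpha_bar[0]) * x0 + np.sqrt(1.0 - alpha_bar[0]) * eps
# Oracle noise stands in for the network prediction eps_theta(x_1, 1).
mu = reverse_mean(x1, eps, 1, betas, alpha_bar)
```

In a real sampler `eps_pred` would come from the trained network, and Gaussian noise scaled by $\sigma_{t}$ would be added to the mean for every step except the last.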

3.2 Classifier-free Guidance

The vanilla diffusion model described above is an unconditional generative model $p_{\theta}(x_{0})$ to approximate the true data distribution $q(x_{0})$ . However, in practical scenarios, there is a growing demand to condition the generation on a label or text prompt $c$ [49]. To address this requirement, classifier-guidance [7] incorporates an auxiliary classifier $p_{\phi}(c|x_{t})$ to guide the sampling in each reverse denoising step, thereby increasing the likelihood of $c$ given $x_{t}$ . Specifically, the diffusion score is modified as follows:

$$\begin{split}\hat{\epsilon}_{\theta}(x_{t},c,t)&=\epsilon_{\theta}(x_{t},t)-\gamma\sigma_{t}\nabla_{x_{t}}\log p_{\phi}(c|x_{t})\\ &\approx-\sigma_{t}\nabla_{x_{t}}\log\left(p(x_{t})\,p^{\gamma}_{\phi}(c|x_{t})\right),\end{split} \tag{5}$$

where $\gamma$ is a scalar parameter to regulate the strength of the classifier guidance. While this method has demonstrated some performance improvements, training a robust classifier for all reverse steps, particularly for the highly noisy input at the initial step, poses a significant challenge and incurs additional training costs.

To avoid training a separate classifier model, classifier-free guidance [11] takes $c$ as another input of the denoising neural network to model the conditional diffusion score, i.e., $\epsilon_{\theta}(x_{t},c,t)\approx-\sigma_{t}\nabla_{x_{t}}\log p(x_{t}|c)$, while the unconditional score $\epsilon_{\theta}(x_{t},t)$ is jointly estimated by randomly dropping the text prompt with a certain probability at each training iteration. Then the gradients for the classifier $p_{\phi}(c|x_{t})$ can be estimated as:

$$\begin{split}\nabla_{x_{t}}\log p(c|x_{t})&=\nabla_{x_{t}}\log p_{\theta}(x_{t}|c)-\nabla_{x_{t}}\log p_{\theta}(x_{t})\\ &=-\frac{1}{\sigma_{t}}\left(\epsilon_{\theta}(x_{t},c,t)-\epsilon_{\theta}(x_{t},t)\right).\end{split} \tag{6}$$

Along this line, the corresponding diffusion score in Equation 5 can be derived as:

$$\hat{\epsilon}_{\theta}(x_{t},c,t)=\epsilon_{\theta}(x_{t},t)+\gamma\left(\epsilon_{\theta}(x_{t},c,t)-\epsilon_{\theta}(x_{t},t)\right), \tag{7}$$

where $\gamma$ is also usually set as a global scalar parameter to control the guidance degree of the condition. However, in this paper, we argue that the CFG scale should be spatially adaptive, allowing for balancing the inconsistency of semantic strengths for diverse semantic units in the image.
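The global CFG combination in Equation 7 amounts to a single line of array arithmetic. The sketch below assumes the two noise predictions are already available as arrays of the same shape; the name `cfg_combine` and the toy shapes are illustrative.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, gamma):
    """Equation 7: one global scale gamma applied at every spatial location."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 8, 8))   # stand-in for eps_theta(x_t, t)
eps_cond = rng.standard_normal((4, 8, 8))     # stand-in for eps_theta(x_t, c, t)
guided = cfg_combine(eps_uncond, eps_cond, gamma=7.5)
```

Setting `gamma=1` returns the conditional prediction unchanged and `gamma=0` returns the unconditional one; values above 1 extrapolate along the direction of the estimated classifier score.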

4 Methods

In this section, we introduce the technical details of Semantic-aware Classifier-Free Guidance (S-CFG), an overview of which is shown in Figure 2. At each denoising step in diffusion models, the current latent image is fed into the U-net backbone to estimate both the diffusion score and the conditional diffusion score, without and with the text prompt as input, respectively. With the extracted attention maps, we can derive region masks for the relatively independent semantic units. In particular, the cross-attention map is renormalized for assigning each patch to the corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic information, we set adaptive CFG scales on diverse region masks and obtain the scale map to rescale their classifier scores into a uniform level.

4.1 Semantic Map Generation

To control the amplification of diverse semantic units individually, we need to segment the latent image whenever the CFG strategy defined in Equation 7 is applied, i.e., at each denoising step. However, this task is not trivial because the final image cannot be accessed during the generation process. Fortunately, the attention layers in the U-net backbone have been reported to contain valuable semantic information for capturing relationships between image and text prompts [4, 44], which can be leveraged to efficiently extract semantic units.

Specifically, for most text-to-image diffusion models, the interaction between the text prompt and the generated image is performed using cross-attention mechanisms. In general, the denoising U-net network consists of self-attention layers followed by cross-attention layers at certain resolutions. For example, SD places 16 self- and cross-attention layers at the resolutions of 64, 32, 16, and 8. In the $k$-th attention layer, a self-attention map $S^{k}_{t}\in\mathbb{R}^{HW\times HW}$ and a cross-attention map $C^{k}_{t}\in\mathbb{R}^{HW\times L}$ are calculated over linear projections of the intermediate image spatial feature $z^{k}_{t}\in\mathbb{R}^{HW\times C}$ or text embedding $e\in\mathbb{R}^{L\times D}$,

$$\begin{split}S_{t}^{k}&={\rm Softmax}\left(\frac{Q_{s}(z^{k}_{t})K_{s}(z^{k}_{t})^{T}}{\sqrt{d}}\right),\\ C_{t}^{k}&={\rm Softmax}\left(\frac{Q_{c}(z^{k}_{t})K_{c}(e)^{T}}{\sqrt{d}}\right),\end{split} \tag{8}$$

where $H$ and $W$ are the current resolutions, $L$ is the number of text tokens, $C$ is the image feature channel, $D$ is the token embedding dimension, and $Q_{*}(\cdot)$ and $K_{*}(\cdot)$ are linear projections with the dimension of output as $d$ .
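The two maps in Equation 8 can be sketched for a single attention head as follows. The random projection matrices stand in for the learned $Q_{*}$ and $K_{*}$ projections, and all sizes and names here are assumptions chosen only to make the shapes explicit.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention_maps(z, e, Wq_s, Wk_s, Wq_c, Wk_c):
    """Single-head version of Equation 8.

    z: (HW, C) image features, e: (L, D) token embeddings.
    Returns S: (HW, HW) self-attention and C_map: (HW, L) cross-attention.
    """
    d = Wq_s.shape[1]
    S = softmax((z @ Wq_s) @ (z @ Wk_s).T / np.sqrt(d))
    C_map = softmax((z @ Wq_c) @ (e @ Wk_c).T / np.sqrt(d))
    return S, C_map

rng = np.random.default_rng(0)
HW, n_ch, L, D, d = 16, 8, 5, 6, 4            # toy sizes (assumptions)
z = rng.standard_normal((HW, n_ch))
e = rng.standard_normal((L, D))
Wq_s, Wk_s, Wq_c = (rng.standard_normal((n_ch, d)) for _ in range(3))
Wk_c = rng.standard_normal((D, d))
S, C_map = attention_maps(z, e, Wq_s, Wk_s, Wq_c, Wk_c)
```

Each row of `S` is a distribution over patches and each row of `C_map` is a distribution over tokens, which is the property the segmentation steps in the following subsections rely on.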

4.1.1 Cross-Attention-based Semantic Segmentation

Intuitively, at each denoising step $t$, each row in $C_{t}^{k}$ defines a distribution over the text tokens, which is used to augment each patch with its most relevant textual tokens. Therefore, a higher probability $C_{t}^{k}[s,i]$ indicates a closer relationship between the current patch $s$ and the corresponding token $w_{i}$. Along this line, we propose to segment the latent image $x_{t}$ as the set of regions masked by $\{m_{t,1},...,m_{t,L}\}$, where the $i$-th masked region $m_{t,i}\in\{0,1\}^{HW}$ corresponds to the semantic token $w_{i}$.

Specifically, we first employ a fusion process to obtain the final cross-attention map $C_{t}\in\mathbb{R}^{HW\times L}$ . This fusion involves averaging the cross-attention layers and heads with the smallest two resolutions, as these have been shown to contain the most substantial semantic information [10]. In particular, all attention maps are upsampled into the same size. Then, $C_{t}$ is renormalized along the spatial dimension, and the argmax operation is applied on the token dimension to determine the activation of the current patch, denoted as:

$$\begin{split}\hat{C}_{t}[s,i]&=\frac{C_{t}[s,i]}{\sum_{s^{\prime}=1}^{HW}C_{t}[s^{\prime},i]},\\ i_{s}&=\arg\max_{i}\hat{C}_{t}[s,i],\end{split} \tag{9}$$

where $\hat{C}_{t}[s,i]$ estimates the possibility assigned to the patch $s$ for the token $w_{i}$ . The corresponding region mask $m_{t,i}$ can be derived by setting the element in the patch set $\{s:i_{s}=i\}$ as 1, and 0 for others. Note that the renormalization in the above equation plays a crucial role in aligning the token with the image patch in our practice. Without the renormalization, $C_{t}$ would tend to concentrate most of the attention on a single token, such as the START token, for all patches, damaging the semantic segmentation.
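The effect of the spatial renormalization in Equation 9 can be seen on a small hand-made example. In the sketch below, the $4\times 3$ map `C` is fabricated for illustration so that the first token (playing the role of START) dominates every row; the function name is also an assumption.

```python
import numpy as np

def segment_from_cross_attention(C):
    """Equation 9. C: (HW, L) fused cross-attention map.

    Returns the token index i_s per patch and binary masks of shape (L, HW).
    """
    C_hat = C / C.sum(axis=0, keepdims=True)    # renormalize over patches, per token
    token_of_patch = C_hat.argmax(axis=1)       # argmax over the token dimension
    L = C.shape[1]
    masks = (token_of_patch[None, :] == np.arange(L)[:, None]).astype(np.uint8)
    return token_of_patch, masks

# Toy map: 4 patches, 3 tokens; token 0 takes most of the attention everywhere.
C = np.array([[0.90, 0.08, 0.02],
              [0.90, 0.02, 0.08],
              [0.94, 0.03, 0.03],
              [0.80, 0.15, 0.05]])

naive = C.argmax(axis=1)                        # without renormalization: [0, 0, 0, 0]
token_of_patch, masks = segment_from_cross_attention(C)   # with it: [1, 2, 0, 1]
```

Without renormalization every patch is assigned to the dominant token, whereas dividing each column by its spatial sum lets the weaker tokens claim the patches where they are relatively strongest. Because the assignment is an argmax, the resulting masks partition the patches.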

The second column in Figure 3 shows an example result of the above semantic segmentation. We can see that the semantic maps successfully detect the rough locations of several important tokens, such as “astronaut” and “horse”. However, it is worth noting that they often exhibit unclear object boundaries and may contain internal holes, particularly during the initial denoising steps. To alleviate this problem, we propose to refine and complete the semantic map with self-attention maps in the following section.

4.1.2 Self-Attention-based Segmentation Completion

Specifically, we follow [44] and refine each cross-attention map $C_{t}^{k}$ by multiplying it with the corresponding self-attention map at each attention layer. The underlying logic is rooted in the ability of self-attention maps to estimate the correlation between patches, enabling cross-attention to compensate for incomplete activation regions and perform region completion. Meanwhile, note that $S_{t}^{k}$ can be interpreted as a transition matrix among all patches, where each element is nonnegative and the sum of each row equals 1. We can also enhance the region completion by transmitting semantic information among patches, following the idea of feature propagation on graphs [20]. Therefore, following [54], we refine the cross-attention map $C_{t}^{k}$ as follows:

$$\overline{C}_{t}^{k}=\frac{1}{R}\sum_{r=1}^{R}(S_{t}^{k})^{r}C_{t}^{k}, \tag{10}$$

where $R$ is a hyper-parameter set to 4 in our experiments. Using Equation 10, a refined version of the cross-attention map, i.e., $\overline{C}_{t}$, is computed and then fed into Equation 9 to derive refined segmentation masks. The fourth column in Figure 3 shows the corresponding results, where the segmentation maps improve, with clearer object boundaries and fewer internal holes, and are even better than the third column, which sets $R=1$.
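The propagation in Equation 10 can be computed without forming matrix powers explicitly, by applying the self-attention map repeatedly. The following is a minimal sketch with random row-stochastic inputs; the function name and toy sizes are assumptions.

```python
import numpy as np

def refine_cross_attention(S, C, R=4):
    """Equation 10: average of S^r C for r = 1..R.

    S: (HW, HW) row-stochastic self-attention, C: (HW, L) cross-attention.
    """
    out = np.zeros_like(C, dtype=float)
    prop = C.astype(float)
    for _ in range(R):
        prop = S @ prop          # now holds S^r C for the current r
        out += prop
    return out / R

rng = np.random.default_rng(0)
HW, L = 16, 5
S = rng.random((HW, HW))
S /= S.sum(axis=1, keepdims=True)   # make rows sum to 1, like a softmax output
C = rng.random((HW, L))
C /= C.sum(axis=1, keepdims=True)

C_bar = refine_cross_attention(S, C, R=4)
C_bar_R1 = refine_cross_attention(S, C, R=1)   # the R = 1 variant
```

Since `S` is row-stochastic, each row of the refined map is a convex combination of rows of `C`, so rows of `C_bar` still sum to 1 and can be passed to the renormalization and argmax of Equation 9 unchanged.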

4.2 Semantic-Aware CFG

At each denoising step $t$ , given the semantic units with masks $\{m_{t,1},...,m_{t,M}\}$ , we turn to design the semantic-aware CFG strategy to control the strength of each semantic unit separately. In particular, note that the image patches in the different semantic units usually have a more distant relationship than that among the same semantic unit. To simplify the discussion, we assume that different semantic units are independent of each other at any time step. Based on this assumption, we can derive the following expressions about the classifier $p(c|x_{t})$ :

$$\begin{split}p(c|x_{t})&=\prod_{i=1}^{L}p(w_{i}|m_{t,i}\odot x_{t}),\\ \nabla_{x_{t}}\log p(w_{i}|m_{t,i}\odot x_{t})&=m_{t,i}\odot\nabla_{x_{t}}\log p(c|x_{t}),\end{split} \tag{11}$$

where $m_{t,i}$ is interpolated and reshaped to the same size as $x_{t}$ and $\odot$ is the element-wise product. (The detailed derivation can be found in the Appendix.) Then, instead of using a single scalar to control the guidance degrees of all semantic units, as in Equations 5 and 7, we define the composed diffusion score function as follows:

$$\begin{split}\hat{\epsilon}_{\theta}(x_{t},c,t)&=\epsilon_{\theta}(x_{t},t)\\ &\quad+\sum_{i=1}^{M}\gamma_{t,i}\,m_{t,i}\odot\left(\epsilon_{\theta}(x_{t},c,t)-\epsilon_{\theta}(x_{t},t)\right),\end{split} \tag{12}$$

where each term in the sum is the estimated classifier score for one semantic token $w_{i}$, and $\gamma_{t,i}$ is the scalar parameter that strengthens the corresponding semantic information. In particular, when every $\gamma_{t,i}$ is set to the same value $\gamma$, the above equation reduces to the original CFG strategy in Equation 7.
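Equation 12 can be read as replacing the scalar $\gamma$ with a per-patch scale map $\sum_{i}\gamma_{t,i}m_{t,i}$. The sketch below assumes the masks partition the image (as they do when produced by an argmax) and uses nearest-neighbour upsampling as one possible way to bring low-resolution masks to the size of $x_{t}$; the function names and the interpolation choice are assumptions.

```python
import numpy as np

def upsample_mask(mask_hw, factor):
    """Nearest-neighbour upsampling of an (H, W) mask by an integer factor."""
    return np.kron(mask_hw, np.ones((factor, factor), dtype=mask_hw.dtype))

def s_cfg_combine(eps_uncond, eps_cond, masks, gammas):
    """Equation 12. eps_*: (C, H, W); masks: (M, H, W) in {0, 1}; gammas: (M,)."""
    scale_map = np.tensordot(gammas, masks, axes=1)   # (H, W): sum_i gamma_i * m_i
    return eps_uncond + scale_map[None] * (eps_cond - eps_uncond)

# Two regions on a 2x2 attention grid: left column and right column.
masks_lr = np.array([[[1, 0], [1, 0]],
                     [[0, 1], [0, 1]]], dtype=np.uint8)
masks = np.stack([upsample_mask(m, 2) for m in masks_lr])   # (2, 4, 4)

eps_uncond = np.zeros((3, 4, 4))
eps_cond = np.ones((3, 4, 4))
out = s_cfg_combine(eps_uncond, eps_cond, masks, np.array([2.0, 5.0]))
same = s_cfg_combine(eps_uncond, eps_cond, masks, np.array([7.5, 7.5]))
```

In this toy case the left half of `out` is scaled by 2 and the right half by 5, while `same` coincides with the global CFG result for $\gamma=7.5$, illustrating the reduction to Equation 7.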

4.2.1 Adaptive CFG Scale $\gamma_{t,i}$

Here, we further propose an approach to adaptively set the CFG scale $\gamma_{t,i}$ . The primary objective is to achieve a balanced amplification of diverse semantic units during each denoising step. To achieve this, an intuitive idea is to rescale the classifier scores in different semantic regions to a benchmark scale. This ensures that all semantic units undergo a comparable magnitude of change throughout the denoising process. Specifically, $\gamma_{t,i}$ is defined as follows:

$$\begin{split}\eta_{t}&=\|\epsilon_{\theta}(x_{t},c,t)-\epsilon_{\theta}(x_{t},t)\|_{2}\in\mathbb{R}^{HW},\\ \gamma_{t,i}&=\gamma\,\frac{|m_{t,b}\odot\eta_{t}|}{|m_{t,i}\odot\eta_{t}|}\,\frac{|m_{t,i}|}{|m_{t,b}|},\end{split} \tag{13}$$

where $\|\cdot\|_{2}$ is the 2-norm operator of vectors used on the last dimension of a tensor, and $|\cdot|$ is the sum operator of a vector or matrix. $\gamma$ is a hyper-parameter shared for all samples and time steps, like that in the original CFG strategy. In particular, the mask $m_{t,b}\in\{0,1\}^{HW}$ is introduced to assign the benchmarking region. For example, when setting $m_{t,b}$ as 1 for any patch, the average patch norm of the current latent image is the benchmark scale. Here we also introduce another benchmark region for better performance, i.e., the foreground region, such as the union of the regions of “astronaut” and “horse” in Figure 1.

Specifically, when estimating the unconditional score $\nabla_{x_{t}}\log p(x_{t})$, an empty prompt $\emptyset$ is fed into the model, i.e., $\epsilon_{\theta}(x_{t},\emptyset,t)$, where $\emptyset$ is usually represented as a list of padding tokens with a start token. Based on our approach in Section 4.1, we can detect the semantic region of the START token $m_{t,\text{START}}$, which effectively indicates the background area in our implementation (see the last column in Figure 3). Therefore, we can align the benchmarking region with the foreground region by setting:

$$m_{t,b}=1-m_{t,\text{START}}. \tag{14}$$
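Equation 13 says that each region's scale is $\gamma$ times the ratio between the mean classifier-score norm in the benchmark region and the mean norm in that region. The sketch below works through a small example with the foreground benchmark of Equation 14; the function name, the toy layout, and the guard for empty or zero-norm regions are assumptions of this sketch, not details stated in the paper.

```python
import numpy as np

def adaptive_scales(eps_uncond, eps_cond, masks, bench_mask, gamma, tiny=1e-8):
    """Equation 13. eps_*: (C, H, W); masks: (M, H, W); bench_mask: (H, W)."""
    eta = np.linalg.norm(eps_cond - eps_uncond, axis=0)           # (H, W) patch norms
    bench_mean = (bench_mask * eta).sum() / max(float(bench_mask.sum()), 1.0)
    region_size = masks.sum(axis=(1, 2))                           # |m_i|
    region_mean = (masks * eta[None]).sum(axis=(1, 2)) / np.maximum(region_size, 1)
    # Guard (assumption): empty or zero-norm regions keep the global scale.
    ok = (region_size > 0) & (region_mean > tiny)
    return np.where(ok, gamma * bench_mean / np.maximum(region_mean, tiny), gamma)

H, W, gamma = 2, 4, 2.0
masks = np.zeros((3, H, W), dtype=np.uint8)
masks[0, :, :2] = 1      # region of the START token (background)
masks[1, :, 2] = 1       # e.g. "astronaut"
masks[2, :, 3] = 1       # e.g. "horse"

eps_uncond = np.zeros((1, H, W))
eps_cond = np.zeros((1, H, W))
eps_cond[0, :, :2] = 1.0     # per-patch norm 1 in the background
eps_cond[0, :, 2] = 4.0      # norm 4 in region 1
eps_cond[0, :, 3] = 2.0      # norm 2 in region 2

bench = 1 - masks[0]         # Equation 14: foreground = everything but START
gammas = adaptive_scales(eps_uncond, eps_cond, masks, bench, gamma)
# gammas -> [6.0, 1.5, 3.0]
```

The foreground mean norm is $(2\cdot 4+2\cdot 2)/4=3$, so after rescaling every region has the same effective mean guidance magnitude $\gamma\cdot 3=6$: weakly guided regions are amplified and strongly guided ones are damped.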

5 Experiments

Benchmark Models. We include two diffusion models as base models: Stable Diffusion (SD) [34], which operates in the latent image space, and DeepFloyd IF (IF) [39], which operates in the image pixel space. Specifically, we consider two versions of SD: SD-v1.5 and SD-v2.1, which differ in terms of model sizes and generative qualities. For the IF model, we use the middle-scale version, IF-M, which is constructed using multiple diffusion models. To maintain simplicity, two model stages are used, where the base diffusion model produces low-resolution samples and an upscale diffusion model boosts them to a higher resolution. Both stages can benefit from the CFG or S-CFG strategy. Additionally, the IF model uses T5XXL as the text encoder, which does not use a start token. Therefore, instead of assigning the foreground region based on the start token, we set the benchmarking mask $m_{t,b}$ in Equation 13 as 1 for every patch. All three models are publicly accessible.

Meanwhile, two samplers are considered for all three models, i.e., DDIM [44] and DPMSolver++ [24], both of which are widely used in practice. Specifically, for DDIM, we follow [34] and set the number of sampling steps to 250 for SD models, with the noise variance parameter set to 0. For the IF model, which employs learnable noise variance parameters, we adhere to the original noise settings and conduct DDIM sampling with 50 steps. For DPMSolver++, we set the number of sampling steps to 50.

5.1 Quantitative Evaluation

We compare the benchmark models with CFG and S-CFG on the MSCOCO 256 $\times$ 256 dataset. Two quantitative metrics are used: 1) FID-30K: the zero-shot Fréchet Inception Distance computed with 30K images and the corresponding captions, which measures the quality and diversity of images. 2) CLIP Score [28]: 5K captions are randomly selected as prompts, and the CLIP model is used to assess the alignment between the generated images and their corresponding text prompts. In particular, the trade-off between FID and CLIP scores has been widely reported with varying CFG scales [26]. Therefore, we present the trade-off curve across a range of the global scale $\gamma\in\{2.0, 3.0, 5.0, 7.5, 10.0\}$.
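For reference, the two metrics are usually defined as follows; we state the common textbook forms here as an assumption, since the exact evaluation code is not described in this section. FID fits Gaussians with means $\mu_{r},\mu_{g}$ and covariances $\Sigma_{r},\Sigma_{g}$ to Inception features of real and generated images, and the CLIP Score is the cosine similarity between the CLIP image embedding $f_{I}$ and text embedding $f_{T}$, averaged over prompts (some implementations additionally rescale or clip this value).

```latex
\begin{aligned}
\mathrm{FID} &= \|\mu_{r}-\mu_{g}\|_{2}^{2}
  + \mathrm{Tr}\!\left(\Sigma_{r}+\Sigma_{g}-2\left(\Sigma_{r}\Sigma_{g}\right)^{1/2}\right),\\
\mathrm{CLIP\ Score} &= \frac{1}{N}\sum_{n=1}^{N}
  \frac{f_{I}(x^{(n)})^{\top} f_{T}(c^{(n)})}{\|f_{I}(x^{(n)})\|_{2}\,\|f_{T}(c^{(n)})\|_{2}}.
\end{aligned}
```

Lower FID and higher CLIP Score are better, which is why a trade-off curve lying toward the bottom right (with CLIP Score on the horizontal axis and FID on the vertical axis) indicates an improvement.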

Based on the results presented in Figure 4, our S-CFG strategy outperforms the original CFG strategy across most experimental settings: its trade-off curve generally lies toward the bottom right of that of the original CFG strategy (see the Appendix for the full table). This demonstrates the effectiveness and robustness of S-CFG, establishing its applicability in both latent image space and pixel space for diffusion models with different model sizes. In addition, we find that the diffusion sampler may be crucial for generative quality, especially for the pixel space model, i.e., IF, where a significant performance gap is observed between DDIM and DPMSolver++. Even so, S-CFG still achieves a performance improvement.

5.2 Human-Level Evaluation

Here, 80 prompts are randomly selected from the MSCOCO validation dataset to generate images with CFG and S-CFG. Then, we asked 5 participants to assess both the image quality and the image-text alignment. For each aspect, human raters are asked to select the better of two synthesized images, one from the original CFG strategy and the other from our S-CFG strategy. For fairness, we use the same random seed for generating both images. The voting results are summarized in Table 1. The majority of votes go to our S-CFG strategy for all base models, demonstrating superiority in both evaluated aspects.

Table 1: Human-level evaluation results.

Model	Image Quality: CFG	Image Quality: S-CFG	Image-Text Alignment: CFG	Image-Text Alignment: S-CFG
SD-v1.5	26.78%	73.22%	23.20%	76.80%
SD-v2.1	28.16%	71.84%	31.85%	68.15%
IF	32.39%	67.61%	29.17%	70.83%

5.3 Qualitative Evaluation

In Figure 5, we show some samples generated by different models with CFG and S-CFG. For fairness, we use the same setting and random seed for different strategies. The results exhibit a notable enhancement in the model’s generative capacity from the aspects of semantic expressiveness and entity portrayal. For example, when given the prompt “A boy is playing Pokemon”, S-CFG improves SD-v1.5 by ensuring the boy’s appearance in a normal manner. In the case of “A person petting a small elephant statue”, S-CFG eliminates the irregular elephant’s trunk. Similar improvement in fine-grained structure completion can also be observed for SD-v2.1 and IF in the first two rows. Furthermore, for scenarios in the last rows, such as “A cat sitting … on a park bench”, “A plate of meat topped …” and “A man in a suit with a blue tie …”, S-CFG helps models generate images that accurately represent the semantic descriptions.

5.4 Ablation Analysis

Here, three variants of S-CFG are introduced: 1) S-CFG-mean sets the benchmarking mask $m_{t,b}$ as 1 for all patches. 2) S-CFG w/o sa is the variant without the segmentation completion based on self-attention maps. 3) S-CFG-sa is the variant with $R=1$ in Equation 10.

The results in Figure 6 based on SD-v1.5 demonstrate that all variants of S-CFG consistently outperform the original CFG strategy. This observation strongly supports our core idea of setting customized CFG scales for different semantic regions throughout the denoising process. In addition, compared to the other variants, S-CFG-mean exhibits increased performance instability and fails to achieve the optimal CLIP Score at the lowest FID score. This verifies the advantage of using the foreground region described in Equation 14 as the benchmarking region. Meanwhile, S-CFG w/o sa falls short of S-CFG-sa and S-CFG, albeit by a relatively small margin. This outcome highlights the effectiveness of self-attention-based segmentation completion. Furthermore, while S-CFG-sa and S-CFG demonstrate similar performance levels, Figure 3 shows that S-CFG exhibits superior segmentation capability, which should result in more accurate image generation. However, these improvements may not be fully captured by the current evaluation metrics.

Table 2: Performance comparisons of ControlNet with CFG and S-CFG, where the base model is SD-v1.5, the parameter $\gamma=3.0$, and the sampler is DPMSolver++ with 50 steps.

	FID		CLIP Score
	CFG	S-CFG	CFG	S-CFG
Canny	8.670	8.382	0.3006	0.3019
Segmentation	9.595	9.549	0.3004	0.3017

5.5 Downstream tasks

Here, we extend the evaluations from foundational image generation to more specialized downstream tasks.

First, we incorporate S-CFG into ControlNet [50], which is a neural network architecture for adding various spatial conditioning controls to text-to-image diffusion models. Specifically, we utilize SD-v1.5 as the base model, incorporating image canny edge and image segmentation as the spatial conditions. Table 2 presents a performance comparison between CFG and S-CFG. The results demonstrate consistent improvement with the incorporation of S-CFG. Some examples are illustrated in Figure 7, showcasing notable improvements in image realism. Specifically, in the canny case of the duck toy, S-CFG enhances the structure of the duck’s mouth and rectifies color imbalances around the tail. Likewise, in the segmentation case of the house, the ControlNet with CFG fails to synthesize the background sky, whereas S-CFG successfully addresses this issue.

We have also integrated S-CFG into DreamBooth [36], which enables the personalization of text-to-image diffusion models with specific subjects using only a few subject images. The examples presented in Figure 8 highlight the improvements in image quality and text-image alignment achieved by S-CFG. For instance, S-CFG enhances the appearance of the dog’s mouth and brings the length of the toy’s legs closer to the input images. Notably, in the second row, DreamBooth with CFG fails to align the image with the text prompt “river”, whereas S-CFG succeeds.

6 Conclusion

This paper argues that classifier-free guidance (CFG) in text-to-image diffusion models suffers from spatial inconsistency in semantic strengths and suboptimal image quality. To this end, we proposed Semantic-aware CFG (S-CFG), customizing the guidance degrees for different semantic units. Specifically, we first design a training-free semantic segmentation method to partition the latent image into relatively independent semantic regions at each denoising step. Then, the CFG scales across regions are adaptively adjusted to rescale the classifier scores into a uniform level. Experiments on multiple diffusion models demonstrated the superiority of S-CFG.

7 Acknowledgments

This research was supported by grants from the National Key R&D Program of China (No. 2022ZD0119302).

References

Bar-Tal et al. [2022] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European conference on computer vision, pages 707–723. Springer, 2022.
Betker et al. [2023] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, and Yunxin Jiao. Improving image generation with better captions. openai.com, 2023.
Brock et al. [2018] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2018.
Chefer et al. [2023] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
Chen et al. [2024] Minghao Chen, Iro Laina, and Andrea Vedaldi. Training-free layout control with cross-attention guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5343–5353, 2024.
Couairon et al. [2023] Guillaume Couairon, Marlene Careil, Matthieu Cord, Stéphane Lathuilière, and Jakob Verbeek. Zero-shot spatial layout conditioning for text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2174–2183, 2023.
Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
Goodfellow et al. [2020] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
Guo and Lin [2023] Qin Guo and Tianwei Lin. Focus on your instruction: Fine-grained and multi-instruction image editing by attention modulation. arXiv preprint arXiv:2312.10113, 2023.
Hertz et al. [2022] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
Ho and Salimans [2021] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
Ho et al. [2022] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249–2281, 2022.
Huang et al. [2023a] Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation. Advances in Neural Information Processing Systems, 36, 2023a.
Huang et al. [2023b] Lianghua Huang, Di Chen, Yu Liu, Yujun Shen, Deli Zhao, and Jingren Zhou. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778, 2023b.
Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
Karras et al. [2022] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 35:26565–26577, 2022.
Kim et al. [2022] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2426–2435, 2022.
Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
Kipf and Welling [2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
Liu et al. [2022] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B Tenenbaum. Compositional visual generation with composable diffusion models. In European Conference on Computer Vision, pages 423–439. Springer, 2022.
Liu et al. [2023] Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, and Trevor Darrell. More control for free! image synthesis with semantic diffusion guidance. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 289–299, 2023.
Lu et al. [2022a] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems, 35:5775–5787, 2022a.
Lu et al. [2022b] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models. arXiv preprint arXiv:2211.01095, 2022b.
Lugmayr et al. [2022] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.
Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
Parmar et al. [2023] Gaurav Parmar, Krishna Kumar Singh, Richard Zhang, Yijun Li, Jingwan Lu, and Jun-Yan Zhu. Zero-shot image-to-image translation. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
Raffel et al. [2020] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, pages 8821–8831. PMLR, 2021.
Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
Razavi et al. [2019] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019.
Reed et al. [2016] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International conference on machine learning, pages 1060–1069. PMLR, 2016.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
Shen et al. [2021] Dazhong Shen, Chuan Qin, Chao Wang, Hengshu Zhu, Enhong Chen, and Hui Xiong. Regularizing variational autoencoder with diversity and uncertainty awareness. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 2964–2970. International Joint Conferences on Artificial Intelligence Organization, 2021. Main Track.
Shonenkov et al. [2023] Alex Shonenkov, Misha Konstantinov, Daria Bakshandaeva, Christoph Schuhmann, Ksenia Ivanova, and Nadiia Klokova. Deepfloyd if, 2023. https://www.deepfloyd.ai/deepfloyd-if.
Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, pages 2256–2265. PMLR, 2015.
Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
Song et al. [2023] Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. arXiv preprint arXiv:2303.01469, 2023.
Wang et al. [2024] Fu-Yun Wang, Zhaoyang Huang, Xiaoyu Shi, Weikang Bian, Guanglu Song, Yu Liu, and Hongsheng Li. Animatelcm: Accelerating the animation of personalized diffusion models and adapters with decoupled consistency learning. arXiv preprint arXiv:2402.00769, 2024.
Wang et al. [2023a] Jinglong Wang, Xiawei Li, Jing Zhang, Qingyuan Xu, Qin Zhou, Qian Yu, Lu Sheng, and Dong Xu. Diffusion model is secretly a training-free open vocabulary semantic segmenter. arXiv preprint arXiv:2309.02773, 2023a.
Wang et al. [2023b] Kai Wang, Fei Yang, Shiqi Yang, Muhammad Atif Butt, and Joost van de Weijer. Dynamic prompt learning: Addressing cross-attention leakage for text-based image editing. arXiv preprint arXiv:2309.15664, 2023b.
Wang et al. [2023c] Ruichen Wang, Zekang Chen, Chen Chen, Jian Ma, Haonan Lu, and Xiaodong Lin. Compositional text-to-image synthesis with attention map control of diffusion models. arXiv preprint arXiv:2305.13921, 2023c.
Xie et al. [2023] Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, and Mike Zheng Shou. Boxdiff: Text-to-image synthesis with training-free box-constrained diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7452–7461, 2023.
Xu et al. [2018] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018.
Zhang et al. [2023a] Chenshuang Zhang, Chaoning Zhang, Mengchun Zhang, and In So Kweon. Text-to-image diffusion model in generative ai: A survey. arXiv preprint arXiv:2303.07909, 2023a.
Zhang et al. [2023b] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023b.
Zhang and Chen [2022] Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902, 2022.
Zhang et al. [2022] Qinsheng Zhang, Molei Tao, and Yongxin Chen. gddim: Generalized denoising diffusion implicit models. arXiv preprint arXiv:2206.05564, 2022.
Zhao et al. [2023] Peiang Zhao, Han Li, Ruiyang Jin, and S Kevin Zhou. Loco: Locally constrained training-free layout-to-image synthesis. arXiv preprint arXiv:2311.12342, 2023.
Zhu and Koniusz [2020] Hao Zhu and Piotr Koniusz. Simple spectral graph convolution. In International conference on learning representations, 2020.

Supplementary Material

8 Deriving Equation 11

In this section, we derive Equation 11 under one assumption that may not hold strictly: for any denoising step $t$, the semantic units corresponding to the token set $\{w_{1},...,w_{L}\}$, with masks $\{m_{t,1},...,m_{t,L}\}$, are independent of each other. Under this assumption, we can derive:

\begin{split}
p(w_{i}|x_{t})&=p(w_{i}|\sum_{j=1}^{L}m_{t,j}\odot x_{t})\\
&=\frac{\prod_{j=1}^{L}p(m_{t,j}\odot x_{t}|w_{i})p(w_{i})}{\prod_{j=1}^{L}p(m_{t,j}\odot x_{t})}\\
&=\frac{p(m_{t,i}\odot x_{t}|w_{i})p(w_{i})\prod_{j=1,j\neq i}^{L}p(m_{t,j}\odot x_{t})}{\prod_{j=1}^{L}p(m_{t,j}\odot x_{t})}\\
&=\frac{p(m_{t,i}\odot x_{t}|w_{i})p(w_{i})}{p(m_{t,i}\odot x_{t})}\\
&=p(w_{i}|m_{t,i}\odot x_{t}).
\end{split}

Then, we can deduce Equation 11 as follows:

\begin{split}
p(c|x_{t})&=\prod_{i=1}^{L}p(w_{i}|x_{t})\\
&=\prod_{i=1}^{L}p(w_{i}|m_{t,i}\odot x_{t}).\\
\nabla_{x_{t}}\log p(w_{i}|m_{t,i}\odot x_{t})&=\nabla_{m_{t,i}\odot x_{t}}\log p(w_{i}|m_{t,i}\odot x_{t})\\
&=\nabla_{m_{t,i}\odot x_{t}}\log p(w_{i}|x_{t})\\
&=\nabla_{m_{t,i}\odot x_{t}}\log p(c|x_{t})\\
&=m_{t,i}\odot\nabla_{x_{t}}\log p(c|x_{t}).
\end{split}

Note that this independence assumption may not hold strictly in practice. However, it is intuitive that patches in different semantic regions are more independent of each other than patches within the same region. Moreover, the segmentation examples in Figure 3 and our experimental results suggest that it is beneficial to segment the latent image and customize the guidance degrees for different semantic regions.
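The final identity of the derivation can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the masks are disjoint binary vectors, the per-token log-likelihood `log_p_token` is a made-up quadratic (not the model's actual classifier score), and gradients are approximated with central differences. It compares $\nabla_{x_{t}}\log p(w_{i}|m_{t,i}\odot x_{t})$ with $m_{t,i}\odot\nabla_{x_{t}}\log p(c|x_{t})$ for every token.

```python
# Toy check: grad_x log p(w_i | m_i * x) == m_i * grad_x log p(c | x),
# assuming log p(c | x) = sum_i log p(w_i | m_i * x) with disjoint binary masks.

def masked(mask, x):
    """Element-wise product m * x for plain Python lists."""
    return [m * v for m, v in zip(mask, x)]

def log_p_token(i, z):
    """Hypothetical per-token log-likelihood; entries with z = 0 contribute 0."""
    a = 0.5 + i
    return -sum(a * v * v - (i + 1) * v for v in z)

def log_p_prompt(x, masks):
    """Log-likelihood of the whole prompt under the independence assumption."""
    return sum(log_p_token(i, masked(m, x)) for i, m in enumerate(masks))

def num_grad(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of f at x."""
    grad = []
    for k in range(len(x)):
        hi = x[:k] + [x[k] + eps] + x[k + 1:]
        lo = x[:k] + [x[k] - eps] + x[k + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

x = [0.3, -1.2, 0.7, 2.0, -0.5, 1.1]      # a flattened toy "latent"
masks = [[1, 1, 0, 0, 0, 0],               # region of token 0
         [0, 0, 1, 1, 1, 0],               # region of token 1
         [0, 0, 0, 0, 0, 1]]               # region of token 2

grad_full = num_grad(lambda v: log_p_prompt(v, masks), x)

max_err = 0.0
for i, m in enumerate(masks):
    grad_token = num_grad(lambda v, i=i, m=m: log_p_token(i, masked(m, v)), x)
    rhs = masked(m, grad_full)
    max_err = max(max_err, max(abs(a - b) for a, b in zip(grad_token, rhs)))

print("largest deviation between both sides:", max_err)
```

With overlapping masks the two sides would generally differ, which is one way to see where the independence assumption enters.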

9 More Experimental Details

Benchmark Models. In our experiments, we use three diffusion models as benchmarks, all of which are publicly accessible:

• Stable Diffusion v1.5 (SD-v1.5), a diffusion model in the latent space of powerful pre-trained autoencoders (https://huggingface.co/runwayml/stable-diffusion-v1-5), which uses CLIP [28] as the text encoder and outputs images at a resolution of 512 $\times$ 512.
• Stable Diffusion v2.1 (SD-v2.1), a larger variant of SD-v1.5 (https://huggingface.co/stabilityai/stable-diffusion-2-1), which can output images at a resolution of 768 $\times$ 768.
• DeepFloyd IF (IF), a diffusion model in the pixel image space (https://huggingface.co/DeepFloyd/IF-I-M-v1.0), which is constructed from multiple diffusion models with T5XXL as the text encoder. In particular, we use the first two stages of the middle-scale version, i.e., IF-I-M-v1.0 and IF-II-M-v1.0, which produce a 64 $\times$ 64 resolution image and upscale it to 256 $\times$ 256 resolution, respectively.

Quantitative Metrics. Two quantitative metrics based on the MSCOCO validation dataset are used:

• FID-30K, where the FID score is computed between 30K images generated from prompts selected from the validation set and the corresponding original images.
• CLIP Score, where 5K captions are randomly selected to guide image synthesis, and CLIP-VIT-G-14 (https://huggingface.co/laion/CLIP-ViT-g-14-laion2B-s34B-b88K) is used to compute the similarity between each generated image and its corresponding caption.
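As a rough illustration of the CLIP Score computation, the sketch below averages an image-caption similarity over a set of pairs. It assumes the similarity is the cosine similarity between image and text embeddings, which the text above does not state explicitly, and it uses small hand-written vectors in place of real CLIP-VIT-G-14 embeddings; the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_clip_score(image_embeddings, text_embeddings):
    """Average similarity over matched (generated image, caption) pairs."""
    scores = [cosine_similarity(img, txt)
              for img, txt in zip(image_embeddings, text_embeddings)]
    return sum(scores) / len(scores)

# Placeholder embeddings; in practice these would come from the CLIP encoders.
image_embeddings = [[0.2, 0.9, 0.1], [0.7, 0.1, 0.6]]
text_embeddings = [[0.1, 0.8, 0.3], [0.6, 0.3, 0.5]]

print(mean_clip_score(image_embeddings, text_embeddings))
```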

Note that our metric settings may differ from those in the official reports of the SD and IF models. Somewhat surprisingly, SD-v2.1 fails to outperform SD-v1.5 in our settings. We therefore add another comparison between them under a setting similar to their official report (https://huggingface.co/stabilityai/stable-diffusion-2), i.e., FID-10k and CLIP Score (CLIP-VIT-G-14) on the MSCOCO dataset with the 50-step DDIM sampler. The results are shown in Figure 9, where our S-CFG strategy also outperforms the original CFG strategy.

10 Analysis on the Efficiency

Here, we provide an additional analysis of the time cost of our S-CFG strategy. Specifically, we use DPMSolver++ with 50 steps as the sampler to generate images with different base models. All programs run on a single A100 GPU. Table 3 shows the average time cost of generating a sample over 10 runs. S-CFG adds only a small time cost compared with the original CFG strategy.

Table 3: Average time cost of generating a sample with CFG and S-CFG. The last column is the relative increase in time cost for S-CFG.

	CFG	S-CFG	Overhead
SD-v1.5	2.773	2.848	2.70%
SD-v2.1	7.054	7.167	1.60%
IF	8.595	8.847	2.93%
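The relative figures in the last column of Table 3 follow directly from the two timing columns. The short sketch below recomputes them as (S-CFG time minus CFG time) divided by CFG time; the dictionary layout and variable names are illustrative only.

```python
# Average time cost per sample from Table 3: (CFG, S-CFG).
times = {
    "SD-v1.5": (2.773, 2.848),
    "SD-v2.1": (7.054, 7.167),
    "IF": (8.595, 8.847),
}

# Relative increase in time cost for S-CFG, in percent.
overhead = {
    name: round(100 * (s_cfg - cfg) / cfg, 2)
    for name, (cfg, s_cfg) in times.items()
}

for name, pct in overhead.items():
    print(f"{name}: {pct:.2f}%")
```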

11 More Ablation Analysis

Here, we provide an additional ablation analysis of S-CFG on diffusion models with multiple stages, such as DeepFloyd IF [39]. We aim to answer the question: should the S-CFG strategy be used in all diffusion stages? Specifically, based on the IF model used in our paper, we compare the performance of three methods:

• S-CFG-first, where the S-CFG strategy is only used in the first diffusion model, i.e., IF-I-M-v1.0.
• S-CFG-second, where the S-CFG strategy is only used in the second diffusion model, i.e., IF-II-M-v1.0.
• S-CFG, where the S-CFG strategy is used in both diffusion models.

In addition, the original CFG strategy is included as a baseline. We use DPMSolver++ as the sampler with 50 steps and vary the parameter $\gamma$ over [2.0, 3.0, 5.0, 7.5, 10.0]. The trade-off curve of FID-30K versus CLIP Score is shown in Figure 10. S-CFG tends to achieve the best trade-off between FID-30K and CLIP Score, while S-CFG-first and S-CFG-second perform similarly.

12 More Evaluation on Effectiveness

Recently, a new metric called T2I-CompBench [14] was introduced to evaluate diffusion models, which assesses image quality from 6 aspects and aligns with human preference better. Here, we provide another comparison based on this metric. The results in Table 4 show that SD-v2.1 outperforms SD-v1.5 significantly, and S-CFG performs better than CFG.

Table 4: Evaluation on T2I-CompBench, where $\gamma=7.5$.

Model	Attribute Binding			Object Relationship		Complex
Model	Shape	Color	Texture	Non-Spatial	Spatial	Complex
SD-v1.5+CFG	0.3664	0.3761	0.4286	0.3109	0.111	0.2969
SD-v1.5+S-CFG	0.3793	0.3879	0.4288	0.3111	0.1182	0.2993
SD-v2.1+CFG	0.4518	0.549	0.5146	0.3096	0.1512	0.3154
SD-v2.1+S-CFG	0.4558	0.5649	0.5333	0.3104	0.1567	0.3168

13 Detailed Table of Experiments

Here, we provide the detailed tables for the experiments in Figures 4 and 6. In Tables 5 to 7, for every base model and sampler, the best FID-30K and the best CLIP Score achieved by S-CFG over the tested $\gamma$ values are at least as good as those achieved by CFG.

Table 5: The trade-off curve of SD-v1.5, where the best FID-30k and CLIP Score are highlighted.

	DDIM				DPMSolver++
	CFG		S-CFG		CFG		S-CFG
$\gamma$	FID-30K	CLIP Score	FID-30K	CLIP Score	FID-30K	CLIP Score	FID-30K	CLIP Score
2.0	8.696	0.2948	8.656	0.2972	8.991	0.2954	9.023	0.2964
3.0	7.904	0.3097	7.802	0.3107	7.760	0.3091	7.717	0.3099
5.0	10.366	0.3184	10.069	0.3196	10.026	0.3182	9.757	0.3187
7.5	13.008	0.3217	12.620	0.3228	12.466	0.3223	12.059	0.3226
10.0	14.682	0.3230	14.101	0.3231	14.107	0.3235	13.694	0.3236
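To make the trade-off in Table 5 easier to read, the sketch below picks, for each sampler and strategy, the lowest FID-30K and the highest CLIP Score together with the $\gamma$ at which each occurs. The numbers are copied from Table 5; the data layout and the helper `summarize` are illustrative only.

```python
GAMMAS = [2.0, 3.0, 5.0, 7.5, 10.0]

# (FID-30K, CLIP Score) per gamma, copied from Table 5 (SD-v1.5).
TABLE5 = {
    ("DDIM", "CFG"): [(8.696, 0.2948), (7.904, 0.3097), (10.366, 0.3184),
                      (13.008, 0.3217), (14.682, 0.3230)],
    ("DDIM", "S-CFG"): [(8.656, 0.2972), (7.802, 0.3107), (10.069, 0.3196),
                        (12.620, 0.3228), (14.101, 0.3231)],
    ("DPMSolver++", "CFG"): [(8.991, 0.2954), (7.760, 0.3091), (10.026, 0.3182),
                             (12.466, 0.3223), (14.107, 0.3235)],
    ("DPMSolver++", "S-CFG"): [(9.023, 0.2964), (7.717, 0.3099), (9.757, 0.3187),
                               (12.059, 0.3226), (13.694, 0.3236)],
}

def summarize(rows):
    """Return (best FID, its gamma, best CLIP Score, its gamma) for one curve."""
    best_fid, fid_gamma = min((fid, g) for (fid, _), g in zip(rows, GAMMAS))
    best_clip, clip_gamma = max((clip, g) for (_, clip), g in zip(rows, GAMMAS))
    return best_fid, fid_gamma, best_clip, clip_gamma

summary = {key: summarize(rows) for key, rows in TABLE5.items()}
for (sampler, strategy), result in summary.items():
    print(sampler, strategy, result)
```

In this table the lowest FID-30K is reached at $\gamma=3.0$ and the highest CLIP Score at $\gamma=10.0$ for both strategies, so the two metrics pull $\gamma$ in opposite directions.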

Table 6: The trade-off curve of SD-v2.1, where the best FID-30k and CLIP Score are highlighted.

	DDIM				DPMSolver++
	CFG		S-CFG		CFG		S-CFG
$\gamma$	FID-30K	CLIP Score	FID-30K	CLIP Score	FID-30K	CLIP Score	FID-30K	CLIP Score
2.0	14.394	0.3053	13.892	0.3068	14.999	0.3040	14.864	0.3060
3.0	10.509	0.3191	10.227	0.3204	10.869	0.3187	10.797	0.3200
5.0	10.429	0.3286	10.137	0.3306	10.241	0.3291	10.016	0.3304
7.5	11.548	0.3331	11.278	0.3342	11.324	0.3339	10.944	0.3342
10.0	12.604	0.3357	12.371	0.3359	12.166	0.3356	11.833	0.3359

Table 7: The trade-off curve of IF, where the best FID-30k and CLIP Score are highlighted.

	DDIM				DPMSolver++
	CFG		S-CFG		CFG		S-CFG
$\gamma$	FID-30K	CLIP Score	FID-30K	CLIP Score	FID-30K	CLIP Score	FID-30K	CLIP Score
2.0	9.820	0.3076	9.309	0.299	7.242	0.2997	8.494	0.2926
3.0	13.804	0.3195	10.864	0.3152	7.799	0.3147	7.227	0.314
5.0	17.267	0.3257	14.473	0.3259	11.396	0.3233	9.67	0.3226
7.5	18.532	0.329	16.621	0.3288	13.968	0.327	12.402	0.3265
10.0	19.029	0.3296	17.634	0.3299	15.31	0.3280	13.99	0.3280

Table 8: The trade-off curve in the ablation analysis, where the best FID-30k and CLIP Score are highlighted. The experiment is based on SD-v1.5 with the 50-step DPMSolver++ sampler.

	S-CFG-mean		S-CFG w/o sa		S-CFG-sa		S-CFG
$\gamma$	FID-30K	CLIP Score	FID-30K	CLIP Score	FID-30K	CLIP Score	FID-30K	CLIP Score
2.0	10.703	0.2869	9.110	0.2963	9.063	0.2966	9.023	0.2964
3.0	7.695	0.3044	7.811	0.3089	7.736	0.3099	7.717	0.3099
5.0	8.813	0.3162	9.822	0.3185	9.755	0.3185	9.757	0.3187
7.5	11.204	0.3213	12.102	0.3222	12.083	0.3227	12.059	0.3226
10.0	12.838	0.3233	13.722	0.3235	13.690	0.3235	13.694	0.3236

14 Additional Qualitative Samples

In this section, we present supplementary samples in Figure 11 generated by different base models with CFG and S-CFG. These additional samples further exhibit the superiority of S-CFG compared with the original CFG strategy.