ControlNet-XS: Designing an Efficient and Effective Architecture
for Controlling Text-to-Image Diffusion Models

Denis Zavadski*, Johann-Friedrich Feiden*, Carsten Rother*

* Computer Vision and Learning Lab, IWR, Heidelberg University

Abstract

The field of image synthesis has made tremendous strides forward in recent years. Besides defining the desired output image with text-prompts, an intuitive approach is to additionally use spatial guidance in the form of an image, such as a depth map. For this, a recent and highly popular approach is to use a controlling network, such as ControlNet [72], in combination with a pre-trained image generation model, such as Stable Diffusion [51]. When evaluating the design of existing controlling networks, we observe that they all suffer from the same problem of a delay in information flowing between the generation and controlling process. This, in turn, means that the controlling network must have generative capabilities. In this work we propose a new controlling architecture, called ControlNet-XS, which does not suffer from this problem, and hence can focus on the given task of learning to control. In contrast to ControlNet, our model needs only a fraction of the parameters, and hence is about twice as fast during inference and training. Furthermore, the generated images are of higher quality and the control is of higher fidelity. All code and pre-trained models will be made publicly available.

Figure 1: Image synthesis with the production-quality model of Stable Diffusion XL [40], using text-prompts as well as depth control (left) and canny-edge control (right). We achieve these results with a new controlling network called ControlNet-XS. In contrast to the well-known ControlNet [72], our design requires only a small fraction of the parameters while at the same time improving image quality and providing control with higher fidelity. Furthermore, it is about two times faster than ControlNet with respect to inference and training time.

1 Introduction

Using Generative Artificial Intelligence to synthesize new images is a topic that has received considerable attention in social media, press, research and industry. It started off in $2014$ with the use of Generative Adversarial Networks (GAN) [14] that were able to synthesize small-sized images of a given class [43], e.g. celebrity faces. Today, we have commercial and non-commercial products, such as Midjourney [1] and Stable Diffusion XL [40], which are able to generate large-sized images (up to $1024\times 1024$) with almost arbitrary content, e.g. ranging from professional photographs to manga art. This can be considered a truly disruptive technology, whose development is still far from complete. One ongoing development is to create control tools, with which users can steer the image generation process towards their desired output. A common control mechanism is text-prompts. Another choice is to use a guidance image, which defines the desired output image in an abstract form such as a sketch or a depth map. This control mechanism is known as image-to-image translation [38, 55, 65, 21, 66, 34, 75]. The control mechanisms can also be combined by adding image guidance to a Text-to-Image model [72, 33, 74, 20]. Our work falls into this class of methods. There are in general two different conceptual choices for implementing the image guidance in a Text-to-Image model.

On the one hand, there is the approach of fine-tuning a generative process with a new control mechanism at hand, e.g. [65]. Such methods use annotated guidance images as additional input to fine-tune a pre-trained network. However, such an end-to-end learning approach is challenging since there is often a large imbalance between the original training data for the generative process, e.g. $\sim 3$ B images for training Stable Diffusion [51], and the data with known control, e.g. only $1$ M images in [72]. Such an imbalance can lead to effects like “catastrophic forgetting” [32], which means that known properties of the generative model disappear after fine-tuning. Additionally, fine-tuning often requires access to a large computing cluster.

On the other hand, there are approaches that lock the parameters of the generative network and then train a separate controlling network. The outputs of the controlling network and the generation network are added internally in order to enable control. This design choice is currently the most popular [72, 33, 74, 20], and, to the best of our knowledge, ControlNet [72] defines the current state-of-the-art. Its authors have observed that the following training strategy works best. The controlling network first copies part of the generative network. The connections from the controlling network to the generative network are initialized by so-called zero-convolutions, to make sure that the generative capabilities of the generative network do not diminish at the start of training. In training, the generative model is locked and the control model gently moves from being a pure image generation process towards a process that takes the given control into account. They have shown that this strategy is essential for success, since a randomly initialized, small-sized ControlNet performed poorly. In contrast, we manage to train a small, randomly initialized controlling network successfully by questioning one important design choice in the architecture, which is present in all previous works to the best of our knowledge.

The problem of previous works is that there is a significant delay in transferring information from the generation process to the control process, and vice versa. We show that by removing this delay we are able to use a control network that is drastically smaller than ControlNet [72] and even performs better in terms of image quality and control with higher fidelity. The reason for this improvement is that by resolving this delay, the network can focus on the given task: Learning to control the generation process.

To understand this better, let us consider an analogy to this problem. Assume that the generative process is an autonomous vehicle. The vehicle is a complex system with a large number of parameters and many different sources of input. Assume that the control input is a specific address it must navigate to. The controlling model is the satellite navigation system in the vehicle. The navigator is a much simpler system than the vehicle itself. Furthermore, it needs only little information, i.e. the current position of the vehicle, and communicates with the vehicle through simple commands like “turn left at the next junction”. A crucial requirement for both systems to operate smoothly together is that there is no delay in information flowing in both directions. For instance, if the navigator were to know the position of the vehicle with a delay of ten seconds, then the vehicle would already have passed the junction where it should have turned left! Designing a smart navigator system that receives information which is outdated by ten seconds is possible, but much more challenging. The navigator has to guess where the vehicle would drive to in the next ten seconds in order to give commands that are useful for the vehicle at that point of time in the future. This is exactly what happens in ControlNet [72]. Due to the delay in information flowing between the two networks, the controlling network needs generative power to predict what the generation network may do next, before the generation network receives the control signal. We conjecture that this is the reason why a small-sized ControlNet, trained from scratch, performed poorly [72]. In our work we are able to use a small-sized control network, trained from scratch, since we remove the delay in information flow.

In summary, our contributions are as follows: (1) An efficient, small-sized architecture for controlling a pre-trained Text-to-Image diffusion model that does not suffer from the problem of delay in information flow; (2) Superior results compared to the state-of-the-art in terms of both image quality (FID score) and control with higher fidelity; (3) Controlling the production-quality Stable Diffusion XL model [40] with $2.6$ B parameters by a control network that has only $20$ M parameters; (4) Release of all code and pre-trained models.

2 Related Work

2.1 Image Generation and Translation

Generative Adversarial Networks (GANs) [14] are probably the most established generative models for unconditional [25, 26, 24], class-conditional [58] and text-conditional [57, 22, 49, 50, 63, 67, 71, 76] image generation. While achieving state-of-the-art results for particular semantic classes [23, 64, 54], generalising GANs to synthesise images of arbitrary content remains an active area of research. Although GigaGAN [22] demonstrates state-of-the-art performance, generic Text-to-Image synthesis is still dominated by autoregressive networks like DALL-E [46], Parti [70], CogView [9], Make-A-Scene [11] and Diffusion Models in particular. Since their introduction, Image Diffusion Models [60] rapidly became one of the best performing model-families [8, 37, 18, 17, 27, 61]. Diffusion models learn to transform a point in a simple high-dimensional distribution, such as a Gaussian, to a complex distribution, like the space of all images. This transformation is done by iteratively applying a U-Net that gradually removes the Gaussian noise in the image. In theory, this process allows one to model arbitrarily complex data distributions [60].

Conditioning Image Synthesis Models. To achieve a desired output, one popular choice is to use text-prompts as guidance. This is done by conditioning a generic image synthesis model on a textual embedding, provided by pre-trained text-encoders like BERT [7], T5 [44] or CLIP [42]. This combination has led to impressive results for complex Text-to-Image generation tasks, by models like Stable Diffusion [51], DALL-E2 [45], DALL-E3 [3], Stable Diffusion XL [40], Imagen [56] and many others [36, 68]. However, text-prompts provide very little control for spatial guidance within an image. To address this problem, one line of work proposed to personalise the generation by enabling the insertion of specific instances into an image [53, 12, 69]. Another direction is to manipulate only a given mask within an image, such as in-painting [2, 13].

Image-to-Image Translation. Many methods use images instead of, or in addition to, textual prompts to condition generative networks. These images describe the desired scene in an abstract form, such as a semantic segmentation, a depth map or a sketch. One approach for implementing this idea is to train the full generative model, from the beginning, with the additional image conditioning [10, 55, 68]. Alternatively, heavy fine tuning of a pre-trained model [65] can be done. These approaches have the drawback that substantial computational resources are required, and that new types of image conditioning cannot easily be added. Due to these limitations, controlling pre-trained networks has become a popular design choice.

2.2 Controlling Pre-Trained Networks

With large-scale, pre-trained models, it has become popular to leave the base model unaltered in order to keep its generative capabilities. A separate controlling network is trained to control the base model. Typically, the internal outputs of the controlling network are simply added to the internal outputs of the base model. An example is a popular technique from natural language processing called “Low-Rank-Adaptation” (LoRA) [19]. The weight matrix of the controlling network is a multiplication of two low-rank matrices, which is a simple way to manage the number of parameters of the controlling network. Following upon this, “orthogonal fine-tuning” makes use of a learnable transformation of weights in order to adapt new concepts [41]. An alternative concept for controlling is to use so-called adapters, which insert new trainable modules, e.g. neural blocks, to the pre-trained generative network. However, in contrast to LoRA, inference time increases. Adapters are popular in natural language processing [62, 39, 30] and have been transferred to image generation models [52, 48], vision transformers (ViT) [28] and are also used for dense predictions in the form of ViT-Adapters [6].
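As a rough illustration of the low-rank idea mentioned above, the following sketch compares the parameter count of a full weight update with that of a product of two low-rank factors. The layer dimensions and the rank are hypothetical values chosen only for illustration; they are not taken from [19] or from this work, and the helper names are our own.

```python
# Minimal sketch of a low-rank weight update (pure Python, no dependencies).
# d_out, d_in and rank below are hypothetical illustration values.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_param_counts(d_out, d_in, rank):
    """Return (full, low_rank) parameter counts for a d_out x d_in weight update."""
    full = d_out * d_in
    low_rank = rank * (d_out + d_in)
    return full, low_rank

# Tiny example: a 3x2 factor B times a 2x4 factor A gives a 3x4 update of rank <= 2.
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
delta_w = matmul(B, A)

full, low_rank = lora_param_counts(d_out=768, d_in=768, rank=4)
print(full, low_rank)  # 589824 6144
```

Under these assumed dimensions, the two factors hold roughly one percent of the parameters of the full update, which is the sense in which the rank offers a simple handle on the size of the trainable part.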

Image Control Models. The idea of adding image guidance to Text-to-Image generation models has recently become popular [33, 72, 74, 20]. The T2I-Adapter [33] trains an additional controlling encoder that adds an intermediate representation to the intermediate feature maps of the locked, pre-trained encoder of Stable Diffusion. Similarly, ControlNet [72] copies the whole encoder network of Stable Diffusion as controlling network and fine-tunes it with an additional image input. However, both of these approaches [72, 33], as well as related work [74, 20], share a common problem. The controlling network receives outdated information from the pre-trained generative network, here Stable Diffusion, and therefore may struggle to send appropriate control signals to the generative network. We resolve this problem, and as a consequence can use a more efficient and effective controlling network.

3 Method

We start with a brief overview of the different components of the Stable Diffusion [51] architecture, which serves as our generative model (Sec. 3.1). The pre-trained generative model is controlled by a controlling network. In Sec. 3.2 we introduce the controlling network ControlNet [72], and explain why it suffers from the problem of delayed information flow. Lastly, we introduce our controlling network, called ControlNet-XS, that eliminates this problem (Sec. 3.3). In Sec. 3.4, we describe the respective training procedure.

[Figure panel: ControlNet [72]]

3.1 Stable Diffusion

Stable Diffusion [51] is a U-Net based diffusion model for Text-to-Image generation. As a conditional diffusion model, it receives a text embedding from a separate text encoder, as well as a learned time embedding. The output image is reconstructed from noise, by iteratively running the U-Net over typically $50$ time-steps. The U-Net generator is composed of a sequence of neural blocks involving cross-attention mechanisms for text conditioning. The image signal is processed by the encoder in four layers with diminishing resolution and three neural blocks per layer. Through the mirrored structure of the decoder and one middle block in between, the U-Net has a total of 25 neural blocks. The output of each neural block can be influenced individually by a controlling network.
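The block count stated above follows directly from the described layout. The short sketch below only restates that arithmetic; the constant names are our own.

```python
# Block count of the Stable Diffusion U-Net as described in the text.
ENCODER_LAYERS = 4      # resolution levels in the encoder
BLOCKS_PER_LAYER = 3    # neural blocks per level
MIDDLE_BLOCKS = 1       # single block between encoder and decoder

encoder_blocks = ENCODER_LAYERS * BLOCKS_PER_LAYER
decoder_blocks = encoder_blocks          # the decoder mirrors the encoder
total_blocks = encoder_blocks + MIDDLE_BLOCKS + decoder_blocks
print(total_blocks)  # 25
```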

3.2 ControlNet

ControlNet [72] starts with a pre-trained generative model, here the U-Net of Stable Diffusion [51]. The control model copies the encoder of the U-Net, and hence has a representation that is capable of generating images by itself. The control encoder receives the control information, e.g. in the form of a depth map, as well as the intermediate, noisy generated image. It outputs control signals that are fed into the different decoder blocks of the generative process (see Fig. 2(a)). The connections from the control model to the generation model are initialised by so-called zero-convolutions, which have the effect that the generative capabilities of the controlled U-Net are not diminished at the beginning of training. In training, the encoder can learn to provide useful control signals to the generative process. The training objective is the one of Stable Diffusion, i.e. image denoising (see Sec. 3.4). While these sound like reasonable design choices at first glance, the design has the severe problem of a delay in the flow of information, as illustrated in Fig. 2. Due to the delayed information flow, the control model has two jobs at once: a) it has to process the input control information in order to make it useful for the generation process; b) it has to anticipate what the encoder of the generation process is going to do. In our design of the control network we remove the delay in information flow. In this way, the control model can focus on the first job.
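To make the role of the zero-convolutions concrete, the following sketch shows that a connection whose weights and bias are initialised to zero adds nothing to the features of the generative model at the start of training. It is a simplified illustration, assuming a 1x1 convolution and additive injection; the feature values, shapes and function names are hypothetical and not taken from any implementation.

```python
# Sketch: a zero-initialised 1x1 convolution contributes nothing at the start of training.
# Feature maps are nested lists indexed as [channel][row][col]; all names are illustrative.

def conv1x1(features, weight, bias):
    """1x1 convolution: out[o][y][x] = bias[o] + sum_i weight[o][i] * features[i][y][x]."""
    height, width = len(features[0]), len(features[0][0])
    return [
        [
            [bias[o] + sum(weight[o][i] * features[i][y][x] for i in range(len(features)))
             for x in range(width)]
            for y in range(height)
        ]
        for o in range(len(weight))
    ]

def add_maps(a, b):
    """Element-wise sum of two feature maps of identical shape."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)] for ca, cb in zip(a, b)]

channels = 2
base_features = [[[0.3, -1.2], [0.7, 2.0]], [[1.5, 0.0], [-0.4, 0.9]]]       # frozen generative model
control_features = [[[5.0, 4.0], [3.0, 2.0]], [[-1.0, -2.0], [-3.0, -4.0]]]  # control model output

zero_weight = [[0.0] * channels for _ in range(channels)]
zero_bias = [0.0] * channels

control_signal = conv1x1(control_features, zero_weight, zero_bias)
combined = add_maps(base_features, control_signal)
print(combined == base_features)  # True
```

Once training updates the weights away from zero, the same connection starts to pass a non-trivial control signal, so the control is introduced gradually rather than disturbing the generative model from the first step.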

3.3 ControlNet-XS

Given the insight of delayed information flow in ControlNet [72], we design an architecture that remedies this problem. The idea is to let the two encoders, from the generation and control model, communicate directly with each other. Fig. 2 illustrates the improvement in information flow in our design. We see that the encoders in both processes share the same information, illustrated by colour coding. We refer to Appendix A for a detailed illustration of our architecture, while a less detailed version is shown in Fig. 2(c) (Type B). The connections between the two encoders are processed by zero-convolutions before concatenating them to the input of the respective neural block in the other encoder. During training, this has the same advantage as in ControlNet, i.e. it does not diminish the generative capabilities of the generative process at the start of training.

Our new design allows us to drastically reduce the size of the controlling network. Even a model with as few as $1.7$ M parameters performs on a par with ControlNet [72] with $361$ M parameters. Note that in ControlNet [72], a version of ControlNet with fewer parameters, called ControlNet-light, was evaluated but found to perform worse. We call our efficient and effective architecture ControlNet-XS. The second major difference between ControlNet and ControlNet-XS is that we do not need to copy the pre-trained encoder of the generative U-Net. We only utilise the general structure of the encoder, with random weights. The reason for this is that ControlNet-XS, unlike ControlNet, does not need any generative power, since there is no delay in information flow.

3.4 Training ControlNet-XS

For training we follow the same scheme as the original ControlNet [72] and keep all weights of the generative model frozen, i.e. updating only the weights of ControlNet-XS. Because we do not rely on ControlNet-XS being able to approximate a large-scale generative model, we do not need to start from pre-trained weights as initialisation. Instead, we initialise all parameters randomly. We train two versions of ControlNet-XS, one with edge control and one with depth control. As training data, we use one million images from the Laion-Aesthetics dataset [59]. As in ControlNet [72], we either extract canny-edges, or predict the respective depth maps using MiDaS [47]. The standard diffusion model objective remains unchanged:

$$\mathcal{L}=\mathbb{E}_{z_{0},t,c_{t},c_{c},\epsilon\sim\mathcal{N}(0,1)}\bigl[\|\epsilon-\epsilon_{\theta}(z_{t},t,c_{t},c_{c})\|_{2}^{2}\bigr], \tag{1}$$

with the target image $z_{0}$, the noisy image $z_{t}$, the timestep $t$, the text conditioning $c_{t}$ and the control conditioning $c_{c}$.
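The objective in Eq. (1) can be sketched for a single training sample as follows. This is an illustration only: the forward noising step uses the standard DDPM formulation, which is an assumption since the text does not spell it out, and `zero_predictor` is a hypothetical placeholder for $\epsilon_{\theta}$, which in practice would be the frozen U-Net together with the trainable control network.

```python
import math
import random

def add_noise(z0, eps, alpha_bar_t):
    """Standard DDPM forward step (assumed here): z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]

def denoising_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise, i.e. Eq. (1) for one sample."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

def zero_predictor(z_t, t, c_t, c_c):
    """Hypothetical stand-in for eps_theta(z_t, t, c_t, c_c); always predicts zero noise."""
    return [0.0] * len(z_t)

rng = random.Random(0)
z0 = [rng.uniform(-1.0, 1.0) for _ in range(8)]   # flattened target latent z_0
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]     # epsilon ~ N(0, 1)
z_t = add_noise(z0, eps, alpha_bar_t=0.5)         # noisy latent at some timestep t

loss_untrained = denoising_loss(eps, zero_predictor(z_t, t=10, c_t="a prompt", c_c="depth map"))
loss_perfect = denoising_loss(eps, eps)           # a perfect noise prediction gives zero loss
```

The expectation in Eq. (1) corresponds to averaging this per-sample loss over training images, timesteps and noise draws; only the parameters of the control network would receive gradient updates.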

	Quality		Both	Control
Method	CLIP-Sc $\uparrow$	CLIP-Ae $\uparrow$	FID $\downarrow$	LPIPS $\downarrow$	MSE-d $\downarrow$
CN (361M)	28.96	6.08	19.01	0.532	29.1
CN-XS A (53M)	29.00	6.02	17.11	0.492	20.9
CN-XS B (55M)	29.21	6.09	16.36	0.468	19.6
CN-XS C (117M)	29.14	6.10	16.24	0.476	20.2

Table 1: Ablation study for four different architectures illustrated in Fig. 3. We see that all ControlNet-XS (CN-XS) architectures outperform ControlNet [72] (CN), both in terms of quality (FID) and control (MSE-depth). We select Type B as our final ControlNet-XS architecture for all experiments, since it performs best, on average, and has fewer parameters than Type C.

4 Experiments

We examine our ControlNet-XS model in terms of different network sizes, architecture configurations, and compare it to state-of-the-art approaches. To better understand the differences in performance of various architectural designs, we conduct a sensitivity analysis. Furthermore, we examine whether large control models induce a bias on the generative model. Finally, to demonstrate the versatility of our approach, we apply it to a larger generative model, namely Stable Diffusion XL. In order to compare to other approaches, we use Stable Diffusion version 1.5 [51] as the generative model in all remaining experiments. All evaluations are shown for depth control only, and we refer to Appendix C for results with respect to edge control.

4.1 Evaluation Metrics

We judge the performance of ControlNet-XS and its competitors by evaluating the fidelity of the control and the quality of the generated images, to ensure that the image quality does not degrade. For quality evaluation, we use the CLIP-Score [15], which approximates the similarity between a given text-prompt and an image, and the CLIP-Aesthetics score [59], which approximates the aesthetic appearance of an image as perceived by humans. The fidelity of the control is evaluated implicitly with the Learned Perceptual Image Patch Similarity (LPIPS) [73] and explicitly by a distance measure between two images, which is the MSE for depth control, denoted by MSE-depth, and the Hausdorff distance for canny-edges, denoted by MSE-HDD. Here, the first image is the reference control image, e.g. the depth map, and the second image is the extracted depth map from the generated image. The extraction algorithm is the same as the one used to generate the training data, i.e. MiDaS [47] for depth extraction. Note that for improved readability, the MSE-depth values are scaled by $10^{3}$. We also evaluate the Fréchet Inception Distance (FID) [16], a metric that compares the distributions of intermediate features of a pre-trained network applied to generated and original images. For this and all other metrics we use the COCO validation dataset [29] of $5000$ images. The FID score measures both quality and control. Note that it measures the fidelity of the control since the control signal comes from a target image of the COCO validation set, and hence the features of the generated image are expected to be similar to the features of the target image.
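A minimal sketch of the MSE-depth measure described above is given below. It assumes that both depth maps have the same resolution and share a common value range (here $[0,1]$, which is our assumption and not stated in the text); the function name and the toy values are hypothetical.

```python
def mse_depth(reference, extracted, scale=1e3):
    """Mean squared error between two depth maps (lists of rows), multiplied by `scale`."""
    if len(reference) != len(extracted) or any(
        len(r) != len(e) for r, e in zip(reference, extracted)
    ):
        raise ValueError("depth maps must have the same shape")
    diffs = [
        (r - e) ** 2
        for row_r, row_e in zip(reference, extracted)
        for r, e in zip(row_r, row_e)
    ]
    return scale * sum(diffs) / len(diffs)

# Toy 2x2 example: the reference control image vs. the depth map extracted from a generated image.
reference_depth = [[0.0, 0.5], [1.0, 0.5]]
extracted_depth = [[0.1, 0.5], [0.9, 0.5]]
score = mse_depth(reference_depth, extracted_depth)   # approximately 5.0 after scaling by 10^3
```

In the evaluation described above, the second argument would be the MiDaS prediction on the generated image, and the score would be averaged over the validation set.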

4.2 Ablation Study: Architecture

We conduct an ablation study for three types of architectures of ControlNet-XS, see Fig. 3b-d. Architecture Type A eliminates the problem of delayed information flow by having information flowing from the encoder of the generation process to the encoder of the controlling process. For Type B we additionally have information flowing in the other direction, i.e. from controlling network to generation network. This makes sure that the generative encoder does not perform “uncontrolled processing” by instantly adjusting the feature maps. Finally, Type C evaluates whether the full mirroring of the generative U-Net, and hence more tailored control of the generative decoder, has any advantages. Tab. 1 shows a quantitative comparison of the three architecture types together with the original ControlNet. Note that our new ControlNet-XS architectures have considerably fewer parameters (53M, 55M and 117M) than ControlNet (361M). For two metrics, quality (FID) and control (MSE-depth), all ControlNet-XS architectures are clearly superior to ControlNet. We attribute this to the improvement in information flow. For the other metrics there is no clear trend. Furthermore, Type B performs better than Type A for all measures. The performance of Types B and C is on par. However, Type C has the drawback of effectively doubling the model size. We explain this lack of quantitative improvement of Type C in a sensitivity analysis (Sec. 4.3). We choose Type B as our final architecture for ControlNet-XS, and use it in all remaining experiments.

4.3 Sensitivity Analysis

In the following analysis, we want to understand by how much each individual block of the generative U-Net is affected by the control network. The study is shown in Fig. 4. We see that certain blocks are affected more than others. In particular, blocks in the encoder are affected considerably more than blocks in the decoder. We conjecture that this is the reason why our Type C architecture (Fig. 2(d)) with a mirrored generative decoder does not lead to a clear improvement in performance (Tab. 1).

4.4 Ablation Study: Network Size

We evaluate whether changing the parameter size of ControlNet-XS influences the performance, and if so, by how much we can reduce the size until we notice an apparent decrease in quality and control. Tab. 2 shows the results for ControlNet-XS with $491$ M, $55$ M, $11.7$ M and $1.7$ M parameters, respectively, as well as the Stable Diffusion baseline without any control. We provide the MSE-depth score and LPIPS score for Stable Diffusion, which serve as upper bounds. Note that this bound is not very large, since the text-prompts themselves can already generate images with related depth maps. However, as expected, enforcing control does reduce these scores considerably. We roughly see the same trend when varying the sizes of ControlNet-XS for all scores: The performance increases slightly from $491$ M to $55$ M and decreases afterwards for smaller model sizes, down to $1.7$ M. Hence, we choose the 55M model as our best model, and show qualitative results in Fig. 5. In terms of control, it means that smaller models have reduced fidelity of the control. We show qualitative results of this effect in Fig. 6. The decrease in performance for smaller model sizes can be explained as follows. Control models with few parameters perform more similarly to the “uncontrolled” generative model, i.e. Stable Diffusion, which performs worse in general. Note that the CLIP-Aesthetics score is highest for Stable Diffusion, hence our 1.7M model performs best for this score. Note that control models with more parameters are more powerful and hence can considerably affect the overall performance. In Sec. 4.6 we analyse biases induced by large control models.

	Quality		Both	Control
Method	CLIP-Sc $\uparrow$	CLIP-Ae $\uparrow$	FID $\downarrow$	LPIPS $\downarrow$	MSE-d $\downarrow$
Stable Diffusion	28.40	6.16	22.69	(0.618)	(69.7)
CN (361M)	28.96	6.08	19.01	0.532	29.1
T2I (77M)	28.80	5.98	20.29	0.526	31.4
CN-XS (491M)	29.09	6.07	16.91	0.487	21.4
CN-XS (55M)	29.21	6.09	16.36	0.468	19.6
CN-XS (11.7M)	28.83	6.10	17.90	0.525	28.6
CN-XS (1.7M)	28.73	6.12	18.45	0.526	29.9

Table 2: Quantitative evaluation with respect to competitors and change in model size of ControlNet-XS. We observe that our best model, ControlNet-XS (CN-XS) with $55$ M parameters, outperforms the two competitors, i.e. ControlNet (CN) [72] and T2I-Adapter (T2I) [33], for every single metric. Furthermore, for ControlNet-XS models with few parameters, e.g. 1.7M, we notice that the fidelity of the control diminishes, see MSE-depth score.

4.5 Quantitative Comparison

We compare our ControlNet-XS in Tab. 2 to both state-of-the-art models ControlNet [72] and T2I-Adapter [33]. Our best model with $55$ M parameters outperforms both competitors for every single metric. With respect to quality measures (CLIP-Score, CLIP-Aesthetics) we improve by a small margin, while in terms of MSE-depth and FID score we are clearly superior. Please note that all model sizes of ControlNet-XS, even with just $1.7$ M parameters, perform either on a par or better than the competitive approaches. We conjecture that this improvement stems from solving the delayed information flow problem. Tab. 3 compares inference and training times of ControlNet-XS and ControlNet. For both, we increase the speed by about a factor of $2$ .

Method	Inference $\downarrow$	Training $\downarrow$
ControlNet ( $361$ M)	1min 11sec	$\sim$ $500$ h (A100)
ControlNet-XS ( $55$ M)	38sec	$\sim$ $200$ h (A100)

Table 3: Comparison of inference and training times of our ControlNet-XS and ControlNet [72], trained to control depth. Inference times are averaged over seven runs and we evaluate for 50 DDIM steps with a batch size of 10. The training time is given in NVIDIA A100 GPU hours.
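The speed-up factors implied by Table 3 can be checked with a few lines of arithmetic. The numbers are taken directly from the table (the training times are approximate); the variable names are our own.

```python
# Speed-up factors implied by Table 3.
controlnet_inference_s = 1 * 60 + 11   # 1min 11sec
controlnet_xs_inference_s = 38         # 38sec
controlnet_train_h = 500               # approximate A100 GPU hours
controlnet_xs_train_h = 200            # approximate A100 GPU hours

inference_speedup = controlnet_inference_s / controlnet_xs_inference_s
training_speedup = controlnet_train_h / controlnet_xs_train_h
print(f"inference: {inference_speedup:.2f}x, training: {training_speedup:.2f}x")
# inference: 1.87x, training: 2.50x
```

Both values are consistent with the statement that ControlNet-XS is faster by about a factor of two.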

Method	FID $\downarrow$	CLIP-Sc $\uparrow$	CLIP-Ae $\uparrow$
Stable Diffusion	22.69	28.40	6.16
ControlNet-XS ( $55$ M)	72.20	28.83	4.46
ControlNet-XS (1.7M)	58.95	28.57	4.65

Table 4: Bias induced by low-quality training data. We train two versions of ControlNet-XS (55M and 1.7M) with low resolution training data. For this, we down-scale the RGB training images by a factor of $8$ and then up-scale them back to the original resolution using bicubic interpolation. We see that the FID and CLIP-Aesthetics scores are heavily affected by this change in training data. The bias is less pronounced in the smaller $1.7$ M model.

4.6 Bias Induced by Large Control Models

In the following, we analyse two scenarios in which the generative model is biased when combined with a large control model. We have already discussed above that ControlNet [72] needs to be of large size in order to have sufficient generative power to mitigate the problem of delayed information flow. In Fig. 7, we analyse a bias of ControlNet which we refer to as semantic bias for depth control. It is important to note that adjusting the strength of the control does not reduce the bias induced by the control model. The second bias shows a limitation of existing control models and is discussed in Tab. 4. We see that lower quality training data for the control model leads to lower quality of the generated images.

	Quality		Both	Control
Method	CLIP-Sc $\uparrow$	CLIP-Ae $\uparrow$	FID $\downarrow$	LPIPS $\downarrow$	MSE-d $\downarrow$
SD XL (2.6B)	27.06	5.84	59.47	(0.668)	(123.2)
T2I (77M)	27.96	5.67	61.03	0.627	49.0
CN-XS (400M)	29.51	6.13	19.28	0.528	27.2
CN-XS (104M)	29.45	6.23	19.12	0.511	26.2
CN-XS (20M)	29.41	6.19	18.75	0.505	22.6

Table 5: Quantitative evaluation with Stable Diffusion XL. All versions of ControlNet-XS (CN-XS) are able to control the generative model with $2.6$ B parameters, as seen by the low MSE-depth score. Furthermore, all versions of ControlNet-XS outperform the T2I-Adapter [33] by a large margin. This is again a consequence of the problem of delayed information flow in the T2I-Adapter. Also, the T2I-Adapter does not reduce the FID score of the uncontrolled Stable Diffusion XL model. This is slightly surprising since the MSE-depth score of the T2I-Adapter, i.e. $49.0$, is lower than the upper bound, i.e. $123.2$, of the uncontrolled model. We see that varying the model size has only a minor effect on the performance. This is in contrast to the results produced with Stable Diffusion (Tab. 2). Hence we choose the smallest ControlNet-XS with $20$ M parameters as our best model, given that it induces the smallest bias on the performance of the generative model. Note that this model has less than $1$% of the parameters of the generative model.
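The size ratio stated in the caption of Table 5 can be verified as follows, using the rounded parameter counts given in the text; the variable names are our own.

```python
# Relative size of the smallest ControlNet-XS compared to Stable Diffusion XL.
generative_params = 2.6e9   # Stable Diffusion XL, as stated in the text
control_params = 20e6       # smallest ControlNet-XS in Table 5

fraction = control_params / generative_params
print(f"{100 * fraction:.2f}% of the generative model's parameters")
# 0.77% of the generative model's parameters
```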

4.7 Evaluation with Stable Diffusion XL

We evaluate our ControlNet-XS model with Stable Diffusion XL [40] as generative model. Stable Diffusion XL has about $2.6$ B parameters and hence is over three times larger than its predecessor Stable Diffusion. Tab. 5 presents and discusses quantitative results with respect to model size and the T2I-Adapter [33]. Note that there is no official version of ControlNet available for Stable Diffusion XL. Qualitative results are shown in Fig. 1.

5 Limitations & Societal Impact

We see the main limitation of this work, and of previous works, in the problem that a controlling network can add unwanted biases to the performance of the generative model (Sec. 4.6). We mitigate this problem by drastically reducing the size of the controlling network. However, ideally a controlling network would only do the job of controlling the output without inducing any unwanted biases. To achieve this, a better understanding of the generative model may be of help. Our sensitivity analysis (Sec. 4.3) is a first step in this direction.

As stated in the introduction, synthetic image generation is a disruptive technology. Hence it is of paramount importance to conduct research on misuse of this technology, such as producing deep fakes. Examples of such research are deep fake image detection [35, 31] and embedding watermarks into images [4].

6 Conclusion

We presented ControlNet-XS, a network for controlling pre-trained Text-to-Image Diffusion Models. Extensive experiments validated its superior performance with respect to the state of the art, such as ControlNet [72], despite having a considerably smaller number of parameters. There are many avenues for future research. One direction is to better understand the mechanisms of the generative model, both from a mathematical standpoint and by conducting empirical tests. We believe that this is important for offering users even better and more application-specific control tools in the future.

7 Acknowledgements

We thank Nicolas Bender for his valuable feedback and his help in conducting experiments. The project has been supported by the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) funded by the German Academic Exchange Service (DAAD). The project has also been supported by the Trilateral DFG Research Program (Germany-France-Japan). The project was also supported by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant INST 35/1597-1 FUGG.

References

[1] Midjourney, 2023. https://www.midjourney.com/.
[2] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European Conference on Computer Vision, pages 707–723, 2022.
[3] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving Image Generation with Better Captions. 2023.
[4] Said Boujerfaoui, Rabia Riad, Hassan Douzi, Frederic Ros, and Rachid Harba. Image watermarking between conventional and learning-based techniques: A literature review. Electronics, 12(1):74, 2022.
[5] John Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986.
[6] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534, 2022.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
[8] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
[9] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering Text-to-Image Generation via Transformers. In M Ranzato, A Beygelzimer, Y Dauphin, P S Liang, and J Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 19822–19835. Curran Associates, Inc., 2021.
[10] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
[11] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 89–106, Cham, 2022. Springer Nature Switzerland.
[12] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[13] Vidit Goel, Elia Peruzzo, Yifan Jiang, Dejia Xu, Nicu Sebe, Trevor Darrell, Zhangyang Wang, and Humphrey Shi. PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models, 2023.
[14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
[15] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
[16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
[17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
[18] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249–2281, 2022.
[19] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models, 2021.
[20] Minghui Hu, Jianbin Zheng, Daqing Liu, Chuanxia Zheng, Chaoyue Wang, Dacheng Tao, and Tat-Jen Cham. Cocktail: Mixing Multi-Modality Control for Text-Conditional Image Generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
[21] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
[22] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10124–10134, 2023.
[23] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
[24] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. Advances in Neural Information Processing Systems, 34, 2021.
[25] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
[26] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
[27] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021.
[28] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring Plain Vision Transformer Backbones for Object Detection. In Computer Vision – ECCV 2022, pages 280–296. Springer Nature Switzerland, 2022.
[29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[30] Yuning Mao, Lambert Mathias, Rui Hou, Amjad Almahairi, Hao Ma, Jiawei Han, Wen-tau Yih, and Madian Khabsa. UNIPELT: A Unified Framework for Parameter-Efficient Language Model Tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics Volume 1: Long Papers, volume 1, pages 6253–6264. ACL, 2022.
[31] Momina Masood, Mariam Nawaz, Khalid Mahmood Malik, Ali Javed, Aun Irtaza, and Hafiz Malik. Deepfakes generation and detection: state-of-the-art, open challenges, countermeasures, and way forward. Applied Intelligence, 53(4):3974–4026, Feb 2023.
[32] Michael McCloskey and Neal J Cohen. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. volume 24, pages 109–165. Academic Press, 1989.
[33] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
[34] Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. Image to image translation for domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4500–4509, 2018.
[35] Thanh Thi Nguyen, Quoc Viet Hung Nguyen, Dung Tien Nguyen, Duc Thanh Nguyen, Thien Huynh-The, Saeid Nahavandi, Thanh Tam Nguyen, Quoc-Viet Pham, and Cuong M. Nguyen. Deep learning for deepfakes creation and detection: A survey. Computer Vision and Image Understanding, 223:103525, 2022.
[36] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[37] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
[38] Yingxue Pang, Jianxin Lin, Tao Qin, and Zhibo Chen. Image-to-Image Translation: Methods and Applications. CoRR, abs/2101.08629, 2021.
[39] Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. AdapterFusion: Non-Destructive Task Composition for Transfer Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 487–503. ACL, 2021.
[40] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, 2023.
[41] Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. Controlling Text-to-Image Diffusion by Orthogonal Finetuning. In NeurIPS, 2023.
[42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, and Others. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763, 2021.
[43] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2016.
[44] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[45] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022.
[46] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. CoRR, abs/2102.1, 2021.
[47] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020.
[48] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Efficient Parametrization of Multi-Domain Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[49] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International conference on machine learning, pages 1060–1069. PMLR, 2016.
[50] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. Advances in neural information processing systems, 29, 2016.
[51] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[52] A Rosenfeld and J K Tsotsos. Incremental Learning Through Deep Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3):651–663, 2020.
[53] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
[54] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, and Michael Bernstein. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
[55] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-Image Diffusion Models. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Proceedings. ACM, 2022.
[56] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In S Koyejo, S Mohamed, A Agarwal, D Belgrave, K Cho, and A Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 36479–36494. Curran Associates, Inc., 2022.
[57] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. arXiv preprint arXiv:2301.09515, 2023.
[58] Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets. In ACM SIGGRAPH 2022 conference proceedings, pages 1–10, 2022.
[59] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, and Mitchell Wortsman. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
[60] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[61] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[62] Asa Cooper Stickland and Iain Murray. BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 5986–5995. PMLR, 2019.
[63] Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, and Changsheng Xu. Df-gan: A simple and effective baseline for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16515–16525, 2022.
[64] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
[65] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is All You Need for Image-to-Image Translation, 2022.
[66] Youya Xia, Josephine Monica, Wei-Lun Chao, Bharath Hariharan, Kilian Q Weinberger, and Mark Campbell. Image-to-Image Translation for Autonomous Driving from Coarsely-Aligned Image Pairs. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7756–7762. IEEE, 2023.
[67] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018.
[68] Xingqian Xu, Zhangyang Wang, Gong Zhang, Kai Wang, and Humphrey Shi. Versatile Diffusion: Text, Images and Variations All in One Diffusion Model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7754–7765, 2023.
[69] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
[70] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, and Others. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.
[71] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 5907–5915, 2017.
[72] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023.
[73] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
[74] Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K Wong. Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models. arXiv preprint arXiv:2305.16322, 2023.
[75] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
[76] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5802–5810, 2019.

Supplementary Material

In the following, we provide details about implementation specifics and training parameters. Note, in the main article we showed quantitative evaluations for depth-image control only. Here we include quantitative evaluations on edge-image control, as well as further qualitative results for ControlNet-XS applied to Stable Diffusion [51] and Stable Diffusion XL [40].

Appendix A Architecture

In Fig. 8, we illustrate the interaction between the generative model Stable Diffusion (left) and ControlNet-XS (right). The interactions for each block between the generative and controlling encoders are depicted in Fig. 9.

Appendix B Training Details

We have trained two ControlNet-XS models with edge and depth control for both Stable Diffusion [51] and Stable Diffusion XL [40]. As training data, we used one million images from the LAION Aesthetics dataset [59] for both controls. For edges, we extracted random edges using Canny edge detection [5] with random thresholds. For depth control, we approximated the depths using the MiDaS [47] approach. In Tab. 6, we summarise the training setting for all models.
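The edge conditioning described above relies on Canny edge detection with random thresholds. The paper does not state the sampling ranges, so the following is only a minimal sketch of how such a threshold pair could be drawn. The ranges `low_range` and `high_margin` and the function name `sample_canny_thresholds` are assumptions made for illustration and are not taken from the authors' training code.

```python
import random


def sample_canny_thresholds(rng, low_range=(50, 150), high_margin=(50, 150)):
    """Draw a (low, high) hysteresis threshold pair with low < high.

    Both ranges are illustrative assumptions; the paper only states that
    the thresholds are random.
    """
    low = rng.randint(*low_range)            # inclusive on both ends
    high = low + rng.randint(*high_margin)   # guarantees high > low
    return low, high


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pairs = [sample_canny_thresholds(rng) for _ in range(5)]
for low, high in pairs:
    print(f"low={low:3d}  high={high:3d}")

# With OpenCV installed, each pair could be passed to the edge detector as
#     edges = cv2.Canny(gray_uint8_image, low, high)
# where gray_uint8_image is a hypothetical 8-bit grayscale training image.
```

Sampling a new pair per training image varies the density of the extracted edges, which is one plausible reading of "random edges" in the text above.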

Condition	Control Model	Generative Model	Training Hours [A100]	Learning Rate	Batch Size
Edges	ControlNet-XS (55M)	Stable Diffusion (860M)	$\sim 200$	1e-5	16
Depth Maps	ControlNet-XS (55M)	Stable Diffusion (860M)	$\sim 200$	1e-5	16
Edges	ControlNet-XS (20M)	Stable Diffusion XL (2.6B)	$\sim 250$	1e-4	40
Depth Maps	ControlNet-XS (20M)	Stable Diffusion XL (2.6B)	$\sim 250$	1e-4	40

Table 6: Training details for ControlNet-XS with edge and depth control, applied to Stable Diffusion [51] and Stable Diffusion XL [40]. The size of ControlNet-XS does not have to increase in correspondence with the size of the controlled generative model.

Appendix C Quantitative Results for Edge Control

In Tab. 7, we conduct an ablation study for three types of architectures of ControlNet-XS (Figs. 2(b), 2(c) and 2(d)) for edge control. The HDD score is scaled by $10^{-1}$ . As we concluded for depth control, we can confirm that Type B is the best architecture choice for ControlNet-XS with edge control. We will use Type B for further experiments. In Tab. 8, we evaluate the effect that model size has on the performance of ControlNet-XS with 491M, 55M, 11.7M and 1.7M parameters, respectively. We also compare our models to the two competing state-of-the-art approaches ControlNet [72] and T2I-Adapter [33] for edge control. Our best model with 55M parameters outperforms both competitors in terms of quality (FID) and control (LPIPS and HDD), while still being able to generate high quality results.

	Quality		Both	Control
Method	CLIP-Sc $\uparrow$	CLIP-Ae $\uparrow$	FID $\downarrow$	LPIPS $\downarrow$	HDD $\downarrow$
CN-XS A (53M)	29.04	5.85	17.40	0.452	15.46
CN-XS B (55M)	29.61	5.98	15.13	0.417	15.22
CN-XS C (117M)	29.41	5.99	15.34	0.405	15.18

Table 7: Ablation study for the different ControlNet-XS architectures illustrated in Figs. 2(b), 2(c) and 2(d) with edge control. We see that with additional, immediate corrective connections in Type B and Type C, the performance considerably increases for all metrics. We choose Type B as our final ControlNet-XS architecture, since it performs on a par, on average, with Type C but has fewer parameters.

	Quality		Both	Control
Method	CLIP-Sc $\uparrow$	CLIP-Ae $\uparrow$	FID $\downarrow$	LPIPS $\downarrow$	HDD $\downarrow$
Stable Diffusion	28.40	6.16	22.69	(0.618)	(18.87)
CN (361M)	29.01	6.17	21.18	0.544	18.52
T2I (77M)	29.14	5.66	18.34	0.459	16.66
CN-XS (491M)	29.48	6.06	15.90	0.429	15.75
CN-XS (55M)	29.61	5.98	15.13	0.417	15.22
CN-XS (11.7M)	29.10	6.04	16.56	0.474	15.49
CN-XS (1.7M)	29.02	6.10	17.07	0.482	15.57

Table 8: Quantitative evaluation for edge control with respect to competitors and change in model size of ControlNet-XS. We observe that our best model, ControlNet-XS (CN-XS) with $55$ M parameters, outperforms the two competitors, i.e. ControlNet (CN) [72] and T2I-Adapter (T2I) [33], for every metric besides the CLIP-Aesthetic score. Furthermore, for ControlNet-XS models with few parameters, e.g. 1.7M, we notice that the fidelity of the control reduces, see FID, LPIPS and HDD scores.
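The comparison stated in the caption of Tab. 8 can be reproduced directly from the table entries. The following sketch copies the three relevant rows from Tab. 8 and checks, per metric, whether ControlNet-XS (55M) is better than both competitors; the helper names `wins` and `metrics_won_against_all` are chosen for illustration only.

```python
# For each metric: True if a higher value is better, False if lower is better.
HIGHER_IS_BETTER = {
    "CLIP-Sc": True, "CLIP-Ae": True, "FID": False, "LPIPS": False, "HDD": False,
}

# Rows copied from Tab. 8 (edge control).
TABLE_8 = {
    "CN (361M)":   {"CLIP-Sc": 29.01, "CLIP-Ae": 6.17, "FID": 21.18, "LPIPS": 0.544, "HDD": 18.52},
    "T2I (77M)":   {"CLIP-Sc": 29.14, "CLIP-Ae": 5.66, "FID": 18.34, "LPIPS": 0.459, "HDD": 16.66},
    "CN-XS (55M)": {"CLIP-Sc": 29.61, "CLIP-Ae": 5.98, "FID": 15.13, "LPIPS": 0.417, "HDD": 15.22},
}


def wins(ours: str, other: str, metric: str) -> bool:
    """True if `ours` is strictly better than `other` on `metric`."""
    a, b = TABLE_8[ours][metric], TABLE_8[other][metric]
    return a > b if HIGHER_IS_BETTER[metric] else a < b


def metrics_won_against_all(ours: str, competitors: list) -> list:
    """Metrics on which `ours` beats every listed competitor."""
    return [m for m in HIGHER_IS_BETTER
            if all(wins(ours, c, m) for c in competitors)]


won = metrics_won_against_all("CN-XS (55M)", ["CN (361M)", "T2I (77M)"])
print("Better than both competitors on:", won)

# Relative FID reduction of ControlNet-XS (55M) with respect to ControlNet.
fid_reduction = 1.0 - TABLE_8["CN-XS (55M)"]["FID"] / TABLE_8["CN (361M)"]["FID"]
print(f"Relative FID reduction vs. ControlNet: {100 * fid_reduction:.1f} %")
```

The only metric missing from the resulting list is the CLIP-Aesthetic score, on which ControlNet scores higher, in agreement with the caption.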

Appendix D Additional Qualitative Results

We provide additional results for controlled image generation using edge and depth control with our ControlNet-XS applied to Stable Diffusion [51] in Fig. 10 and Stable Diffusion XL [40] in Fig. 11.

Appendix E Prompt Information for Figures

In Tab. 9, we show the text-prompts which were used for the generation of the images provided in the paper. If not stated otherwise, all generated images were sampled with 50 DDIM-steps and a classifier-free-guidance scale of 9.5.
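To make the two sampling settings concrete, the following sketch illustrates what a guidance scale of 9.5 and 50 DDIM steps mean in a commonly used formulation. It assumes the widespread convention in which the guided noise prediction is the unconditional prediction plus the scaled difference to the conditional one, and a schedule of 1000 training timesteps with uniform spacing. Both assumptions, as well as the function names, are illustrative and are not confirmed by the paper.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale=9.5):
    """Combine unconditional and conditional noise predictions element-wise.

    Assumed convention: eps = eps_uncond + scale * (eps_cond - eps_uncond),
    so scale = 1 returns the conditional prediction unchanged.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def ddim_timesteps(num_train_steps=1000, num_sampling_steps=50):
    """Evenly spaced subset of training timesteps, ordered from noisy to clean.

    The value of 1000 training steps is an assumption for illustration.
    """
    stride = num_train_steps // num_sampling_steps
    return list(range(0, num_train_steps, stride))[::-1]


# Toy two-element "noise predictions" to show the effect of the scale.
guided = classifier_free_guidance([0.0, 1.0], [1.0, 1.0], scale=9.5)
steps = ddim_timesteps()

print("guided prediction:", guided)
print("number of sampling steps:", len(steps))
print("first and last timestep:", steps[0], steps[-1])
```

With a large scale such as 9.5, components in which the conditional and unconditional predictions differ are amplified strongly, while components in which they agree are left untouched.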

Figure	Text-Prompt
Figure 1 (left)	cinematic, beautiful, photo of a guy, street photography, colourful, highly detailed, photorealistic
Figure 1 (right)	cinematic cupcake, blueberry flavoured cupcake, delicious, highly detailed, photorealistic
Figure 4(b) (left)	cinematic, advertising shot, shoe in a city street, photorealistic shoe, colourful, highly detailed
Figure 4(b) (right)	photo of dirty old sneakers, muddy, highly detailed, award winning image
Figure 4(d) (left)	photo of a white tiger swimming in a jungle river, best quality
Figure 4(d) (right)	photo of a tiger swimming in a jungle river, best quality
Figure 6	aerial image of a city with a big highway intersection
Figure 7	high quality photo of a delicious cake, 4k image
Figure 12 (c-e)	Photo of a big house with stores at the first floor, cars parked, 4k
Figure 12 (h-j)	close-up of a young woman, detailed, beautiful, street photography, photorealistic, detailed, Kodak ektar 100, natural, candid shot
Figures 13, 14 and 15	high quality photo of a delicious cake, 4k image

Table 9: Text-prompts that were used to generate the images in the main article as well as in the supplementary material.

Appendix F Fidelity of the Control

Here, we provide additional examples with respect to fidelity of the control, see also main article Fig. 6 and Sec. 4.3. In Fig. 12, we qualitatively evaluate the effect of decreased fidelity of the control with smaller ControlNet-XS models. We additionally illustrate the effect which the complexity of control images can have on the fidelity to control.

Appendix G Semantic Bias

Here we provide additional results for Fig. 7 in the main article, about the semantic bias of ControlNet [72]. We show results for different control strengths $\alpha$ for ControlNet (Fig. 13), ControlNet-XS with 55M parameters (Fig. 14) and ControlNet-XS with 11.7M parameters (Fig. 15). We notice that ControlNet-XS has less semantic bias than ControlNet. We also notice that the semantic bias cannot be removed by adapting the control strength $\alpha$ .
