License: CC BY 4.0
arXiv:2312.06573v1 [cs.CV] 11 Dec 2023

ControlNet-XS: Designing an Efficient and Effective Architecture
for Controlling Text-to-Image Diffusion Models

Denis Zavadski*
[email protected]
Johann-Friedrich Feiden*
[email protected]
Carsten Rother*
[email protected]
* Computer Vision and Learning Lab, IWR, Heidelberg University
Abstract

The field of image synthesis has made tremendous strides forward in recent years. Besides defining the desired output image with text-prompts, an intuitive approach is to additionally use spatial guidance in the form of an image, such as a depth map. For this, a recent and highly popular approach is to use a controlling network, such as ControlNet [72], in combination with a pre-trained image generation model, such as Stable Diffusion [51]. When evaluating the design of existing controlling networks, we observe that they all suffer from the same problem: a delay in the information flowing between the generation and controlling processes. This, in turn, means that the controlling network must have generative capabilities. In this work we propose a new controlling architecture, called ControlNet-XS, which does not suffer from this problem and hence can focus on the given task of learning to control. In contrast to ControlNet, our model needs only a fraction of the parameters and hence is about twice as fast at inference and training time. Furthermore, the generated images are of higher quality and the control is of higher fidelity. All code and pre-trained models will be made publicly available.

Figure 1: Image synthesis with the production-quality model Stable Diffusion XL [40], using text-prompts as well as depth control (left) and canny-edge control (right). We achieve these results with a new controlling network called ControlNet-XS. In contrast to the well-known ControlNet [72], our design requires only a small fraction of the parameters while at the same time improving image quality and providing control with higher fidelity. Furthermore, it is about two times faster than ControlNet with respect to inference and training time.

1 Introduction

Using Generative Artificial Intelligence to synthesize new images is a topic that has received considerable attention in social media, the press, research and industry. It started off in 2014 with the use of Generative Adversarial Networks (GANs) [14] that were able to synthesize small-sized images of a given class [43], e.g. celebrity faces. Today, we have commercial and non-commercial products, such as Midjourney [1] and Stable Diffusion XL [40], which are able to generate large-sized images (up to 1024×1024) with almost arbitrary content, e.g. ranging from professional photographs to manga art. This can be considered a truly disruptive technology, whose development is still far from complete. One ongoing development is to create control tools with which users can steer the image generation process towards their desired output. A common control mechanism is text-prompts. Another choice is to use a guidance image, which defines the desired output image in an abstract form such as a sketch or a depth map. This control mechanism is known as image-to-image translation [38, 55, 65, 21, 66, 34, 75]. The control mechanisms can also be combined by adding image guidance to a Text-to-Image model [72, 33, 74, 20]. Our work falls into this class of methods. There are, in general, two different conceptual choices for implementing the image guidance in a Text-to-Image model.

On the one hand, there is the approach of fine-tuning a generative process with a new control mechanism at hand, e.g. [65]. Such methods use annotated guidance images as additional input to fine-tune a pre-trained network. However, such an end-to-end learning approach is challenging since there is often a large imbalance between the original training data for the generative process, e.g. ∼3B images for training Stable Diffusion [51], and the data with known control, e.g. only 1M images in [72]. Such an imbalance can lead to effects like “catastrophic forgetting” [32], which means that known properties of the generative model disappear after fine-tuning. Additionally, fine-tuning often requires access to a large computing cluster.

On the other hand, there are approaches that lock the parameters of the generative network and then train a separate controlling network. The outputs of the controlling network and the generation network are added internally in order to enable control. This design choice is currently the most popular [72, 33, 74, 20], and, to the best of our knowledge, ControlNet [72] defines the current state-of-the-art. Its authors observed that the following training strategy works best. The controlling network first copies part of the generative network. The connections from the controlling network to the generative network are initialized by so-called zero-convolutions, to make sure that the generative capabilities of the generative network do not diminish at the start of training. In training, the generative model is locked and the control model gently moves from being a pure image generation process towards a process that takes the given control into account. They have shown that this strategy is essential for success, since a randomly initialized, small-sized ControlNet performed poorly. In contrast, we manage to train a small, randomly initialized control network successfully by questioning one important design choice in the architecture, which, to the best of our knowledge, is present in all previous works.

The problem of previous works is that there is a significant delay in transferring information from the generation process to the control process, and vice versa. We show that by removing this delay we are able to use a control network that is drastically smaller than ControlNet [72] and even performs better in terms of image quality and control with higher fidelity. The reason for this improvement is that by resolving this delay, the network can focus on the given task: Learning to control the generation process.

To understand this better, let us consider an analogy. Assume that the generative process is an autonomous vehicle. The vehicle is a complex system with a large number of parameters and many different sources of input. Assume that the control input is a specific address it must navigate to. The controlling model is the satellite navigation system in the vehicle. The navigator is a much simpler system than the vehicle itself. Furthermore, it only needs little information, i.e. the current position of the vehicle, and communicates with the vehicle through simple commands like “turn left at the next junction”. A crucial requirement for both systems to operate smoothly together is that there is no delay in information flowing in either direction. For instance, if the navigator were to know the position of the vehicle with a delay of ten seconds, then the vehicle would already have passed the junction where it should have turned left! Designing a smart navigator system that receives information which is outdated by ten seconds is possible, but much more challenging. The navigator has to guess where the vehicle will drive to in the next ten seconds in order to give commands that are useful for the vehicle at that point of time in the future. This is exactly what happens in ControlNet [72]. Due to the delay in information flowing between the two networks, the controlling network needs generative power to predict what the generation network may do next, before the generation network receives the control signal. We conjecture that this is the reason why a small-sized ControlNet, trained from scratch, performed poorly [72]. In our work we are able to use a small-sized control network, trained from scratch, since we remove the delay in information flow.

In summary, our contributions are as follows: (1) An efficient, small-sized architecture for controlling a pre-trained Text-to-Image diffusion model that does not suffer from the problem of delay in information flow; (2) Superior results compared to the state-of-the-art in terms of both image quality (FID score) and control fidelity; (3) Controlling the production-quality Stable Diffusion XL model [40] with 2.6B parameters by a control network that has only 20M parameters; (4) Release of all code and pre-trained models.

2 Related Work

2.1 Image Generation and Translation

Generative Adversarial Networks (GANs) [14] are probably the most established generative models for unconditional [25, 26, 24], class-conditional [58] and text-conditional [57, 22, 49, 50, 63, 67, 71, 76] image generation. While achieving state-of-the-art results for particular semantic classes [23, 64, 54], generalising GANs to synthesise images of arbitrary content remains an active area of research. Although GigaGAN [22] demonstrates state-of-the-art performance, generic Text-to-Image synthesis is still dominated by autoregressive networks like DALL-E [46], Parti [70], CogView [9] and Make-A-Scene [11], and by Diffusion Models in particular. Since their introduction, Image Diffusion Models [60] have rapidly become one of the best performing model families [8, 37, 18, 17, 27, 61]. Diffusion models learn to transform a point in a simple high-dimensional distribution, such as a Gaussian, to a complex distribution, like the space of all images. This transformation is done by iteratively applying a U-Net that gradually removes the Gaussian noise in the image. In theory, this process allows arbitrarily complex data distributions to be modelled [60].

Conditioning Image Synthesis Models. To achieve a desired output, one popular choice is to use text-prompts as guidance. This is done by conditioning a generic image synthesis model on a textual embedding, provided by pre-trained text encoders like BERT [7], T5 [44] or CLIP [42]. This combination has led to impressive results for complex Text-to-Image generation tasks, by models like Stable Diffusion [51], DALL-E2 [45], DALL-E3 [3], Stable Diffusion XL [40], Imagen [56] and many others [36, 68]. However, text-prompts provide very little spatial control within an image. To address this problem, one line of work proposed to personalise the generation by enabling the insertion of specific instances into an image [53, 12, 69]. Another direction is to manipulate only a given masked region within an image, such as in in-painting [2, 13].

Image-to-Image Translation. Many methods use images instead of, or in addition to, textual prompts to condition generative networks. These images describe the desired scene in an abstract form, such as a semantic segmentation, a depth map or a sketch. One approach for implementing this idea is to train the full generative model, from the beginning, with the additional image conditioning [10, 55, 68]. Alternatively, heavy fine tuning of a pre-trained model [65] can be done. These approaches have the drawback that substantial computational resources are required, and that new types of image conditioning cannot easily be added. Due to these limitations, controlling pre-trained networks has become a popular design choice.

2.2 Controlling Pre-Trained Networks

With large-scale, pre-trained models, it has become popular to leave the base model unaltered in order to keep its generative capabilities. A separate controlling network is trained to control the base model. Typically, the internal outputs of the controlling network are simply added to the internal outputs of the base model. An example is a popular technique from natural language processing called “Low-Rank Adaptation” (LoRA) [19]. The weight matrix of the controlling network is a product of two low-rank matrices, which is a simple way to manage the number of parameters of the controlling network. Building upon this, “orthogonal fine-tuning” makes use of a learnable transformation of weights in order to adapt to new concepts [41]. An alternative concept for controlling is to use so-called adapters, which insert new trainable modules, e.g. neural blocks, into the pre-trained generative network. However, in contrast to LoRA, inference time increases. Adapters are popular in natural language processing [62, 39, 30] and have been transferred to image generation models [52, 48] and vision transformers (ViT) [28], and are also used for dense predictions in the form of ViT-Adapters [6].
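The low-rank idea behind LoRA can be illustrated with a minimal sketch in plain Python. The matrix sizes, the toy values and the helper names (`matmul`, `add`, `effective_weight`) are assumptions chosen for illustration and are not taken from [19]: a frozen weight matrix W of size d×k is combined with the product of two small trainable factors B (d×r) and A (r×k), so that only r·(d+k) values are trained instead of d·k.

```python
# Toy illustration of a LoRA-style low-rank update; all sizes and values are made up.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][n] * b[n][j] for n in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


d, k, r = 4, 4, 1  # output size, input size, rank (hypothetical values)

# Frozen base weights (identity here, purely for illustration).
W = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]
# Trainable low-rank factors: B is d x r, A is r x k.
B = [[0.0] for _ in range(d)]            # zeros, so the update is zero before training
A = [[0.1 * (j + 1) for j in range(k)]]


def effective_weight(W, B, A):
    """Frozen weights plus the low-rank update B @ A."""
    return add(W, matmul(B, A))


full_params = d * k        # parameters of a dense d x k update
lora_params = r * (d + k)  # parameters of the two low-rank factors

print(effective_weight(W, B, A) == W)  # base model is unchanged while B is zero
print(full_params, lora_params)
```

Because the update is a plain matrix sum, it can be folded into W after training, which is consistent with the remark above that adapters, unlike LoRA, increase inference time.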

Image Control Models. The idea of adding image guidance to Text-to-Image generation models has recently become popular [33, 72, 74, 20]. The T2I-Adapter [33] trains an additional controlling encoder that adds an intermediate representation to the intermediate feature maps of the locked, pre-trained encoder of Stable Diffusion. Similarly, ControlNet [72] copies the whole encoder network of Stable Diffusion as controlling network and fine-tunes it with an additional image input. However, both of these approaches [72, 33], as well as related work [74, 20], share a common problem. The controlling network receives outdated information from the pre-trained generative network, here Stable Diffusion, and therefore may struggle to send appropriate control signals to the generative network. We resolve this problem, and as a consequence can use a more efficient and effective controlling network.

3 Method

We start with a brief overview of the different components of the Stable Diffusion [51] architecture, which serves as our generative model (Sec. 3.1). The pre-trained generative model is controlled by a controlling network. In Sec. 3.2 we introduce the controlling network ControlNet [72], and explain why it suffers from the problem of delayed information flow. Lastly, we introduce our controlling network, called ControlNet-XS, that eliminates this problem (Sec. 3.3). In Sec. 3.4, we describe the respective training procedure.

(a) ControlNet [72]
(b) ControlNet-XS (Ours)
Figure 2: The problem of delayed information flow. We visualise the information flowing in the first three steps of the process (t = 0, 1, 2), where in both (a) and (b) the generative process is on the left and the controlling process on the right. We focus in the following on the information produced by the encoder of the generative model (Enc. G), which is colour coded for each time-step (green, orange, red). The thickness of the arrows approximates the number of bits of information flowing. The information flowing from one time-step to the next (thin arrows) is smaller by a factor of 405 than that flowing within each time-step (thick arrows). (a) The delay in information flow for ControlNet [72]. We see that the ControlNet encoder (Enc. C) learns about the information produced by the generative encoder (Enc. G) only with a delay of one time-step. This implies that the control encoder (Enc. C) has to “guess” what the generative process might have done in the meantime when producing its control signal. Since little information flows between two time-steps (thin arrows), the task of “guessing” is even more challenging. (b) By having a high-capacity flow of information between the two encoders (Enc. G and Enc. C), this problem is resolved in our ControlNet-XS. This enables us to drastically reduce the number of parameters compared to ControlNet [72], while improving performance.

3.1 Stable Diffusion

Stable Diffusion [51] is a U-Net based diffusion model for Text-to-Image generation. As a conditional diffusion model, it receives a text embedding from a separate text encoder, as well as a learned time embedding. The output image is reconstructed from noise by iteratively running the U-Net over typically 50 time-steps. The U-Net generator is composed of a sequence of neural blocks involving cross-attention mechanisms for text conditioning. The image signal is processed by the encoder in four layers with diminishing resolution and three neural blocks per layer. Through the mirrored structure of the decoder and one middle block in between, the U-Net has a total of 25 neural blocks. The output of each neural block can be influenced individually by a controlling network.
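The block count stated above follows from simple arithmetic, sketched here in Python. The constant and function names are hypothetical; the numbers (four layers, three blocks per layer, one middle block, mirrored decoder) come from the paragraph above, and the 1-based index ranges follow the caption of Fig. 4.

```python
# Count the neural blocks of the U-Net as described in the text.
ENCODER_LAYERS = 4     # resolution levels in the encoder
BLOCKS_PER_LAYER = 3   # neural blocks per level
MIDDLE_BLOCKS = 1      # single block between encoder and decoder

encoder_blocks = ENCODER_LAYERS * BLOCKS_PER_LAYER
decoder_blocks = encoder_blocks  # the decoder mirrors the encoder
total_blocks = encoder_blocks + MIDDLE_BLOCKS + decoder_blocks


def block_role(index):
    """Map a 1-based block index to the part of the U-Net it belongs to."""
    if not 1 <= index <= total_blocks:
        raise ValueError(f"block index must be in 1..{total_blocks}")
    if index <= encoder_blocks:
        return "encoder"
    if index <= encoder_blocks + MIDDLE_BLOCKS:
        return "middle"
    return "decoder"


print(total_blocks)
print(block_role(3), block_role(13), block_role(25))
```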

(a) ControlNet [72]
(b) ControlNet-XS (Type A)
(c) ControlNet-XS (Type B)
(d) ControlNet-XS (Type C)
Figure 3: Architectural choices. Different designs for controlling a U-Net based generation process with a controlling network. In each example, the generation process is on the left-hand side and the control process on the right-hand side. (a) The architecture of ControlNet [72]. (b-d) Three new architectures (Type A-C) proposed in this work. They vary in terms of information flowing between the two encoders, as well as information flowing between the two decoders. We verify experimentally that model Type B performs better than Type A, and is on a par with Type C. We choose Type B as our final architecture, since it has fewer parameters than Type C.

3.2 ControlNet

ControlNet [72] starts with a pre-trained generative model, here the U-Net of Stable Diffusion [51]. The control model copies the encoder of the U-Net, and hence has a representation that is capable of generating images by itself. The control encoder receives the control information, e.g. in the form of a depth map, as well as the intermediate, noisy generated image. It outputs control signals that are fed into the different decoder blocks of the generative process (see Fig. 2(a)). The connections from the control model to the generation model are initialised by so-called zero-convolutions, which have the effect that the generative capabilities of the controlled U-Net are not diminished at the beginning of training. In training, the control encoder can learn to provide useful control signals to the generative process. The training objective is the one of Stable Diffusion, i.e. image denoising (see Sec. 3.4). While these seem like reasonable design choices at first glance, the design has the severe problem of a delay in the flow of information, as illustrated in Fig. 2. Due to the delayed information flow, the control model has two jobs at once: a) it has to process the input control information in order to make it useful for the generation process; b) it has to anticipate what the encoder of the generation process is going to do. In our design of the control network we remove the delay in information flow. In this way, the control model can focus on the first job.

3.3 ControlNet-XS

Given the insight of delayed information flow in ControlNet [72], we design an architecture that remedies this problem. The idea is to let the two encoders, from the generation and control model, communicate directly with each other. Fig. 2 illustrates the improvement in information flow in our design. We see that the encoders in both processes share the same information, illustrated by colour coding. We refer to Appendix A for a detailed illustration of our architecture, while a less detailed version is shown in Fig. 3(c) (Type B). The connections between the two encoders are processed by zero-convolutions before being concatenated to the input of the respective neural block in the other encoder. During training, this has the same advantage as in ControlNet, i.e. it does not diminish the generative capabilities of the generative process at the start of training.
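The effect of a zero-convolution at the start of training can be sketched in plain Python. In ControlNet [72] a zero-convolution is a 1×1 convolution whose weights and bias are initialised to zero; the function names, tensor layout (channels × height × width as nested lists) and toy values below are assumptions for illustration only, not the authors' implementation.

```python
# Sketch of a zero-initialised 1x1 convolution on a tiny feature map.

def conv1x1(features, weight, bias):
    """Apply a 1x1 convolution.

    features: [C_in][H][W], weight: [C_out][C_in], bias: [C_out].
    """
    height, width = len(features[0]), len(features[0][0])
    return [[[bias[o] + sum(weight[o][c] * features[c][y][x]
                            for c in range(len(features)))
              for x in range(width)]
             for y in range(height)]
            for o in range(len(weight))]


def make_zero_conv(c_in, c_out):
    """Return zero-initialised weight and bias for a 1x1 convolution."""
    return [[0.0] * c_in for _ in range(c_out)], [0.0] * c_out


# Hypothetical feature maps: 2 control channels and 1 generative channel, each 2x2.
control_features = [[[0.5, -1.0], [2.0, 3.0]],
                    [[1.0, 1.0], [1.0, 1.0]]]
generative_features = [[[0.1, 0.2], [0.3, 0.4]]]

weight, bias = make_zero_conv(c_in=2, c_out=1)
signal = conv1x1(control_features, weight, bias)

# Concatenate along the channel axis, as described for the encoder connections.
block_input = generative_features + signal

print(signal)            # all zeros before any training step
print(len(block_input))  # original channel plus one (still silent) control channel
```

At initialisation the exchanged signal is identically zero, so the appended channels carry no control information yet; the connection only becomes active once training moves the weights away from zero.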

Our new design allows us to drastically reduce the size of the controlling network. Even a model with as few as 1.7M parameters performs on a par with ControlNet [72] with 361M parameters. Note that in the ControlNet work [72], a version of ControlNet with fewer parameters, called ControlNet-light, was evaluated but found to perform worse. We call our efficient and effective architecture ControlNet-XS. The second major difference between ControlNet and ControlNet-XS is that we do not need to copy the pre-trained encoder of the generative U-Net. We only utilise the general structure of the encoder, with random weights. The reason for this is that, unlike ControlNet, our ControlNet-XS does not need any generative power, since there is no delay in information flow.

3.4 Training ControlNet-XS

For training we follow the same scheme as the original ControlNet [72] and keep all weights of the generative model frozen, i.e. we update only the weights of ControlNet-XS. Because we do not rely on ControlNet-XS being able to approximate a large-scale generative model, we do not need to start from pre-trained weights as initialisation. Instead, we initialise all parameters randomly. We train two versions of ControlNet-XS, one with edge control and one with depth control. As training data, we use one million images from the Laion-Aesthetics dataset [59]. As in ControlNet [72], we either extract canny-edges, or predict the respective depth maps using MiDaS [47]. The standard diffusion model objective remains unchanged:

\[
\mathcal{L} = \mathbb{E}_{z_0, t, c_t, c_c, \epsilon \sim \mathcal{N}(0,1)} \bigl[ \| \epsilon - \epsilon_\theta(z_t, t, c_t, c_c) \|_2^2 \bigr], \tag{1}
\]

with the target image $z_0$, the noisy image $z_t$, the timestep $t$, the text conditioning $c_t$ and the control conditioning $c_c$.
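As a rough illustration of Eq. (1), the following Python sketch estimates the objective by averaging the squared L2 distance between the sampled noise and the predicted noise over a few samples. The callable `predict_noise` is a hypothetical stand-in for the controlled U-Net, the toy samples are invented, and the noisy inputs z_t are assumed to be given rather than derived from z_0 by a noise schedule.

```python
# Monte-Carlo estimate of the noise-prediction objective on toy data.

def squared_l2(eps, eps_pred):
    """Squared L2 norm of the difference between two flat vectors."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred))


def diffusion_loss(samples, predict_noise):
    """Average ||eps - eps_theta(z_t, t, c_t, c_c)||^2 over the given samples.

    Each sample is a tuple (z_t, t, c_t, c_c, eps).
    """
    total = 0.0
    for z_t, t, c_t, c_c, eps in samples:
        eps_pred = predict_noise(z_t, t, c_t, c_c)
        total += squared_l2(eps, eps_pred)
    return total / len(samples)


def predict_zero(z_t, t, c_t, c_c):
    """Hypothetical stand-in model that always predicts zero noise."""
    return [0.0] * len(z_t)


# Invented toy samples: (noisy latent, timestep, text prompt, control input, true noise).
samples = [
    ([0.3, -0.2], 10, "a photo of a cat", "depth map 1", [1.0, 2.0]),
    ([0.0, 0.5], 20, "a photo of a dog", "depth map 2", [0.0, 2.0]),
]

loss = diffusion_loss(samples, predict_zero)
print(loss)
```

In the setting described above, only the parameters of the control network would receive gradients from this loss, since the generative model stays frozen.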

Metric type        Quality                  Both     Control
Method             CLIP-Sc ↑   CLIP-Ae ↑    FID ↓    LPIPS ↓    MSE-d ↓
CN (361M)          28.96       6.08         19.01    0.532      29.1
CN-XS A (53M)      29.00       6.02         17.11    0.492      20.9
CN-XS B (55M)      29.21       6.09         16.36    0.468      19.6
CN-XS C (117M)     29.14       6.10         16.24    0.476      20.2
Table 1: Ablation study for four different architectures illustrated in Fig. 3. We see that all ControlNet-XS (CN-XS) architectures outperform ControlNet [72] (CN), both in terms of quality (FID) and control (MSE-depth). We select Type B as our final ControlNet-XS architecture for all experiments, since it performs best, on average, and has fewer parameters than Type C.

4 Experiments

We examine our ControlNet-XS model in terms of different network sizes, architecture configurations, and compare it to state-of-the-art approaches. To better understand the differences in performance of various architectural designs, we conduct a sensitivity analysis. Furthermore, we examine whether large control models induce a bias on the generative model. Finally, to demonstrate the versatility of our approach, we apply it to a larger generative model, namely Stable Diffusion XL. In order to compare to other approaches, we use Stable Diffusion version 1.5 [51] as the generative model in all remaining experiments. All evaluations are shown for depth control only, and we refer to Appendix C for results with respect to edge control.

4.1 Evaluation Metrics

We judge the performance of ControlNet-XS and its competitors by evaluating both the fidelity of the control and the quality of the generated images, the latter to ensure that image quality does not degrade. For quality evaluation, we use the CLIP-Score [15], which approximates the similarity between a given text-prompt and an image, and the CLIP-Aesthetics score [59], which approximates the aesthetic appearance of an image as perceived by humans. The fidelity of the control is evaluated implicitly with the Learned Perceptual Image Patch Similarity (LPIPS) [73] and explicitly by a distance measure between two images, which is the MSE for depth control, denoted by MSE-depth, and the Hausdorff distance for canny-edges, denoted by MSE-HDD. Here, the first image is the reference control image, e.g. the depth map, and the second image is the control image extracted from the generated image, e.g. its depth map. The extraction algorithm is the same as the one used to generate the training data, i.e. MiDaS [47] for depth extraction. Note that for improved readability, the MSE-depth values are scaled by 10^3. We also evaluate the Fréchet Inception Distance (FID) [16], a metric that compares the distributions of intermediate features of a pre-trained network applied to generated and original images. For this and all other metrics we use the COCO validation dataset [29] of 5000 images. The FID score measures both quality and control. Note that it measures the fidelity of the control since the control signal comes from a target image of the COCO validation set, and hence the features of the generated image are expected to be similar to the features of the target image.
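The MSE-depth measure described above can be sketched as follows. This is a minimal illustration, assuming depth maps are given as equally sized nested lists of floats; the value range and any normalisation of the depth maps are not specified in the text, and the function name `mse_depth` and the toy maps are hypothetical.

```python
# Mean squared error between a reference depth map and an extracted one,
# scaled by 10^3 as described in the text.

def mse_depth(reference, extracted, scale=1e3):
    """Return the scaled mean squared error between two 2D depth maps."""
    if len(reference) != len(extracted) or any(
        len(r) != len(e) for r, e in zip(reference, extracted)
    ):
        raise ValueError("depth maps must have the same shape")
    count = sum(len(row) for row in reference)
    squared_error = sum(
        (a - b) ** 2
        for ref_row, ext_row in zip(reference, extracted)
        for a, b in zip(ref_row, ext_row)
    )
    return scale * squared_error / count


# Invented 2x2 example: control depth map and depth re-extracted from a generated image.
reference_depth = [[0.0, 0.5], [1.0, 0.25]]
extracted_depth = [[0.1, 0.5], [0.9, 0.25]]

print(mse_depth(reference_depth, extracted_depth))
```

In the evaluation described above, the second map would come from running the same depth estimator on the generated image, and the score would be averaged over the validation images.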

Figure 4: Sensitivity analysis. We show the MSE-depth error with respect to all 25 blocks (x-axis) of the generative U-Net. The MSE-depth error is computed between the extracted depth maps of two generated images: i) a generated image for which all control connections were active, ii) a generated image where the control for one individual block was turned off. The plot is an average over 500 images. Note that blocks 1-12 belong to the encoder, block 13 is the middle block and blocks 14-25 belong to the decoder. We see that the 3rd block contributes most to the control, followed by the 1st, 5th and 6th block. In general, the blocks in the encoder appear to be far more essential for controlling than the blocks of the decoder.

4.2 Ablation Study: Architecture

We conduct an ablation study for three types of architectures of ControlNet-XS, see Fig. 3(b-d). Architecture Type A eliminates the problem of delayed information flow by having information flow from the encoder of the generation process to the encoder of the controlling process. For Type B we additionally have information flowing in the other direction, i.e. from the controlling network to the generation network. By instantly adjusting the feature maps, this makes sure that the generative encoder does not perform “uncontrolled processing”. Finally, Type C evaluates whether fully mirroring the generative U-Net, and hence more tailored control of the generative decoder, has any advantages. Tab. 1 shows a quantitative comparison of the three architecture types together with the original ControlNet. Note that our new ControlNet-XS architectures have considerably fewer parameters (53M, 55M and 117M) than ControlNet (361M). For two metrics, quality (FID) and control (MSE-depth), all ControlNet-XS architectures are clearly superior to ControlNet, which we attribute to the improvement in information flow. For the other metrics there is no clear trend. Furthermore, Type B performs better than Type A for all measures. The performance of Type B and Type C is on par. However, Type C has the drawback of effectively doubling the model size. We explain this lack of quantitative improvement of Type C in a sensitivity analysis (Sec. 4.3). We choose Type B as our final architecture for ControlNet-XS, and use it in all remaining experiments.

4.3 Sensitivity Analysis

In the following analysis, we want to understand by how much each individual block of the generative U-Net is affected by the control network. The study is shown in Fig. 4. We see that certain blocks are affected more than others. In particular, blocks in the encoder are affected considerably more than blocks in the decoder. We conjecture that this is the reason why our Type C architecture (Fig. 3(d)) with a mirrored generative decoder does not lead to a clear improvement in performance (Tab. 1).
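The procedure behind such a sensitivity analysis can be sketched as a simple loop: generate once with all control connections active, then once per block with that block's control turned off, and compare the extracted depth maps. The sketch below is an assumption-laden toy, not the authors' code: `toy_generate_depth` is a hypothetical stand-in for "generate an image and extract its depth map", its per-block shift is arbitrary and does not reproduce any measured value, and a single sample is used instead of an average over many images.

```python
# Toy sketch of a per-block sensitivity analysis.

def mse(a, b):
    """Mean squared error between two flat depth vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def sensitivity(generate_depth, num_blocks=25):
    """Return {block index: error} when the control of one block is disabled."""
    reference = generate_depth(disabled_block=None)  # all control connections active
    return {
        block: mse(reference, generate_depth(disabled_block=block))
        for block in range(1, num_blocks + 1)
    }


def toy_generate_depth(disabled_block=None):
    """Hypothetical stand-in: returns a fake depth map as a flat list.

    The shift of 0.01 per block index is arbitrary and purely illustrative.
    """
    depth = [0.5, 0.5, 0.5, 0.5]
    if disabled_block is not None:
        depth = [value + 0.01 * disabled_block for value in depth]
    return depth


scores = sensitivity(toy_generate_depth)
most_sensitive = max(scores, key=scores.get)
print(len(scores), most_sensitive)
```

With a real pipeline, `generate_depth` would run the controlled diffusion model with a fixed seed and prompt so that the only difference between the two images is the disabled control connection.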

(a) Control Canny-Edges
(b) Generations
(c) Control Depth Map
(d) Generations
Figure 5: Images generated by ControlNet-XS (55M) and Stable Diffusion as generative model with two different text-prompts.

4.4 Ablation Study: Network Size

We evaluate whether changing the number of parameters of ControlNet-XS influences the performance and, if so, by how much we can reduce the size before we notice an apparent decrease in quality and control. Tab. 2 shows the results for ControlNet-XS with 491M, 55M, 11.7M and 1.7M parameters, respectively, as well as the Stable Diffusion baseline without any control. We provide the MSE-depth score and LPIPS score for Stable Diffusion, which serve as an upper bound. Note that this bound is not very large, since the text-prompts themselves can already generate images with related depth maps. However, as expected, enforcing control does reduce these scores considerably. We see roughly the same trend for all scores when varying the size of ControlNet-XS: the performance increases slightly from 491M to 55M and decreases afterwards for smaller model sizes, down to 1.7M. Hence, we choose the 55M model as our best model, and show qualitative results in Fig. 5. In terms of control, this means that smaller models have reduced control fidelity. We show qualitative results of this effect in Fig. 6. The decrease in performance for smaller model sizes can be explained as follows. Control models with few parameters perform more similarly to the “uncontrolled” generative model, i.e. Stable Diffusion, which performs worse in general. Note that the CLIP-Aesthetics score is highest for Stable Diffusion, hence our 1.7M model performs best for this score. Note also that control models with more parameters are more powerful and hence can considerably affect the overall performance. In Sec. 4.6 we analyse biases induced by large control models.

Metric type        Quality                  Both     Control
Method             CLIP-Sc ↑   CLIP-Ae ↑    FID ↓    LPIPS ↓    MSE-d ↓
Stable Diffusion   28.40       6.16         22.69    (0.618)    (69.7)
CN (361M)          28.96       6.08         19.01    0.532      29.1
T2I (77M)          28.80       5.98         20.29    0.526      31.4
CN-XS (491M)       29.09       6.07         16.91    0.487      21.4
CN-XS (55M)        29.21       6.09         16.36    0.468      19.6
CN-XS (11.7M)      28.83       6.10         17.90    0.525      28.6
CN-XS (1.7M)       28.73       6.12         18.45    0.526      29.9
Table 2: Quantitative evaluation with respect to competitors and change in model size of ControlNet-XS. We observe that our best model, ControlNet-XS (CN-XS) with 55M parameters, outperforms the two competitors, i.e. ControlNet (CN) [72] and T2I-Adapter (T2I) [33], for every single metric. Furthermore, for ControlNet-XS models with few parameters, e.g. 1.7M, we notice that the fidelity of the control diminishes, see MSE-depth score.

4.5 Quantitative Comparison

We compare our ControlNet-XS in Tab. 2 to both state-of-the-art models, ControlNet [72] and T2I-Adapter [33]. Our best model with 55M parameters outperforms both competitors for every single metric. With respect to quality measures (CLIP-Score, CLIP-Aesthetics) we improve by a small margin, while in terms of MSE-depth and FID score we are clearly superior. Please note that all model sizes of ControlNet-XS, even with just 1.7M parameters, perform either on a par with or better than the competing approaches. We conjecture that this improvement stems from solving the delayed information flow problem. Tab. 3 compares inference and training times of ControlNet-XS and ControlNet. For both, we increase the speed by about a factor of 2.
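The "about a factor of 2" statement can be checked with a few lines of Python using the numbers reported in Tab. 3 (1 min 11 sec versus 38 sec for inference, roughly 500 versus 200 A100 GPU hours for training) and the parameter counts of the two models. The dictionary and function names are hypothetical, and the training hours are approximate values as given in the table.

```python
# Ratios between ControlNet and ControlNet-XS, using the values reported in Tab. 3.

inference_seconds = {"ControlNet": 60 + 11, "ControlNet-XS": 38}
training_gpu_hours = {"ControlNet": 500, "ControlNet-XS": 200}  # approximate values
parameters_millions = {"ControlNet": 361, "ControlNet-XS": 55}


def ratio(values):
    """How many times larger the ControlNet value is than the ControlNet-XS value."""
    return values["ControlNet"] / values["ControlNet-XS"]


print(f"inference speed-up: {ratio(inference_seconds):.2f}x")
print(f"training speed-up:  {ratio(training_gpu_hours):.2f}x")
print(f"parameter ratio:    {ratio(parameters_millions):.2f}x")
```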

Method                 Inference ↓   Training ↓
ControlNet (361M)      1min 11sec    ~500h (A100)
ControlNet-XS (55M)    38sec         ~200h (A100)
Table 3: Comparison of inference and training times of our ControlNet-XS and ControlNet [72], trained to control depth. Inference times are averaged over seven runs and we evaluate for 50 DDIM steps with a batch size of 10. The training time is given in NVIDIA A100 GPU hours.
Method                 FID ↓    CLIP-Sc ↑   CLIP-Ae ↑
Stable Diffusion       22.69    28.40       6.16
ControlNet-XS (55M)    72.20    28.83       4.46
ControlNet-XS (1.7M)   58.95    28.57       4.65
Table 4: Bias induced by low-quality training data. We train two versions of ControlNet-XS (55M and 1.7M) with low resolution training data. For this, we down-scale the RGB training images by a factor of 8 and then up-scale them back to the original resolution using bicubic interpolation. We see that the FID and CLIP-Aesthetic scores are heavily affected by this change in training data. The bias is less pronounced in the smaller 1.7M model.

4.6 Bias Induced by Large Control Models

In the following, we analyse two scenarios in which the generative model is biased when combined with a large control model. We have already argued above that ControlNet [72] needs to be large in order to have sufficient generative power to mitigate the problem of delayed information flow. In Fig. 7, we analyse a bias of ControlNet which we refer to as semantic bias for depth control. It is important to note that adjusting the strength of the control does not reduce the bias induced by the control model. The second bias shows a limitation of existing control models and is reported in Tab. 4: lower-quality training data for the control model leads to lower quality of the generated images.

Column groups: Quality (CLIP-Sc, CLIP-Ae), Both (FID), Control (LPIPS, MSE-d)
Method          CLIP-Sc ↑   CLIP-Ae ↑   FID ↓    LPIPS ↓   MSE-d ↓
SD XL (2.6B)    27.06       5.84        59.47    (0.668)   (123.2)
T2I (77M)       27.96       5.67        61.03    0.627     49.0
CN-XS (400M)    29.51       6.13        19.28    0.528     27.2
CN-XS (104M)    29.45       6.23        19.12    0.511     26.2
CN-XS (20M)     29.41       6.19        18.75    0.505     22.6
Table 5: Quantitative evaluation with Stable Diffusion XL. All versions of ControlNet-XS (CN-XS) are able to control the generative model with 2.6B parameters, as seen by the low MSE-depth score. Furthermore, all versions of ControlNet-XS outperform the T2I-Adapter [33] by a large margin. This is again a consequence of the problem of delayed information flow in the T2I-Adapter. Also, the T2I-Adapter does not reduce the FID score of the uncontrolled Stable Diffusion XL model. This is slightly surprising since the MSE-depth score of the T2I-Adapter, i.e. 49.0, is lower than the upper bound of the uncontrolled model, i.e. 123.2. We see that varying the model size has only a minor effect on the performance. This is in contrast to the results produced with Stable Diffusion (Tab. 2). Hence we choose the smallest ControlNet-XS with 20M parameters as our best model, given that it induces the smallest bias on the performance of the generative model. Note that this model has less than 1% of the parameters of the generative model.
Panels: (a) Control, (b) Original Image, (c) ControlNet-XS (55M), (d) ControlNet-XS (11.7M), (e) ControlNet-XS (1.7M)
Figure 6: The fidelity of the control decreases with smaller model sizes of ControlNet-XS. In the 55M parameter model the complex structure of the street junction is identical to the one in the original image, as well as the skyscrapers in the upper-left corner. Smaller models with 11.7M and 1.7M parameters, respectively, are still guided by the control but less rigorously.
Figure 7: Semantic bias for depth control. All images are generated with a control depth map of a street scene and an unrelated text-prompt: “high quality photo of a delicious cake, 4k image”. Note that these are not contradicting control inputs, but the inputs rather challenge the generative process to produce a creative solution with a cake in the form of a street scene. We see that ControlNet-XS with 11.7M parameters is able to produce impressive results, followed by results of the 55M model. In contrast, ControlNet [72] is not able to produce satisfying results, even when adjusting the control strength α. Note that α = 0.825 is the default for ControlNet. With this default value, ControlNet shows house-facade textures, while ControlNet-XS shows typical cake textures such as “sponge”, “marzipan” or “icing”. (Note that the output signals of the controlling network are added with a global weighting α to the output signals of the generation network at the respective neural blocks. This weighting can be adjusted at test time.) We conjecture that the reason is a semantic bias induced by large control models. A large control model can use its power to add semantic meaning to input depth maps. This semantic bias cannot be removed by adapting α. Here α = 0.4 was the “sweet spot” where ControlNet suddenly transitions from producing images of a cake to images of a street scene. More results are available in Appendix G.
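
The global weighting α described in the caption of Fig. 7 amounts to a scaled addition of feature maps. The NumPy sketch below illustrates this for a single block; the function name `apply_control`, the toy tensor sizes and the (channels, height, width) layout are assumptions made for illustration and are not taken from the actual implementation.

```python
import numpy as np


def apply_control(gen_features, ctrl_features, alpha):
    """Add the control signal, scaled by a global weight alpha, to the generator features."""
    if gen_features.shape != ctrl_features.shape:
        raise ValueError("generator and control features must have the same shape")
    return gen_features + alpha * ctrl_features


# Toy feature maps with layout (channels, height, width); the sizes are arbitrary.
rng = np.random.default_rng(0)
gen = rng.standard_normal((4, 8, 8))
ctrl = rng.standard_normal((4, 8, 8))

no_control = apply_control(gen, ctrl, alpha=0.0)    # identical to the uncontrolled features
weak_control = apply_control(gen, ctrl, alpha=0.4)  # the "sweet spot" value quoted for Fig. 7
```

Since α only rescales the added signal, it changes how strongly the control features contribute but not what they encode, which is consistent with the observation that the semantic bias cannot be removed by adapting α.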

4.7 Evaluation with Stable Diffusion XL

We evaluate our ControlNet-XS model with Stable Diffusion XL [40] as generative model. Stable Diffusion XL has about 2.6B parameters and hence is over three times larger than its predecessor Stable Diffusion. Tab. 5 presents and discusses quantitative results with respect to model size and the T2I-Adapter [33]. Note that there is no official version of ControlNet available for Stable Diffusion XL. Qualitative results are shown in Fig. 1.
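
The size relations quoted here and in Tab. 5 can be verified with a few lines of Python. The parameter counts are the rounded figures given in the paper (860M for Stable Diffusion as listed in Tab. 6, 2.6B for Stable Diffusion XL, 20M for the smallest ControlNet-XS); the variable names are chosen here for illustration.

```python
# Rounded parameter counts quoted in the paper (Tab. 5 and Tab. 6).
sd_params = 860e6      # Stable Diffusion
sdxl_params = 2.6e9    # Stable Diffusion XL
cnxs_params = 20e6     # smallest ControlNet-XS used with Stable Diffusion XL

size_ratio = sdxl_params / sd_params           # how much larger SDXL is than Stable Diffusion
control_fraction = cnxs_params / sdxl_params   # ControlNet-XS size relative to SDXL

print(f"SDXL / SD: {size_ratio:.2f}x")                      # 3.02x
print(f"CN-XS (20M) / SDXL: {control_fraction:.2%}")        # 0.77%
```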

5 Limitations & Societal Impact

We see the main limitation of this work, and previous works, in the problem that a controlling network can add unwanted biases to the performance of the generative model (Sec. 4.6). We mitigate this problem by drastically reducing the size of the controlling network. However, ideally a controlling network only does the job of controlling the output, without inducing any unwanted biases. To achieve this, a better understanding of the generative model may be of help. Our sensitivity analysis (Sec. 4.3) is a first step in this direction.

As stated in the introduction, synthetic image generation is a disruptive technology. Hence it is of paramount importance to conduct research on misuse of this technology, such as producing deep fakes. Examples of such research are deep fake image detection [35, 31] and embedding watermarks into images [4].

6 Conclusion

We presented ControlNet-XS, a network for controlling pre-trained Text-to-Image Diffusion Models. Extensive experiments validated the superior performance with respect to the state-of-the-art, such as ControlNet [72], despite a considerably smaller number of parameters. There are many avenues for future research. One direction is to better understand the mechanisms of the generative model, both from a mathematical standpoint and by conducting empirical tests. We believe that this is important for offering even better and application-specific control tools to the user in the future.

7 Acknowledgements

We thank Nicolas Bender for his valuable feedback and his help in conducting experiments. The project has been supported by the Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA) funded by the German Academic Exchange Service (DAAD). The project has also been supported by the Trilateral DFG Research Program (Germany-France-Japan). The project was also supported by the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant INST 35/1597-1 FUGG.

References

  • [1] Midjourney, 2023. https://www.midjourney.com/.
  • [2] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European Conference on Computer Vision, pages 707–723, 2022.
  • [3] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving Image Generation with Better Captions. 2023.
  • [4] Said Boujerfaoui, Rabia Riad, Hassan Douzi, Frederic Ros, and Rachid Harba. Image watermarking between conventional and learning-based techniques: A literature review. Electronics, 12(1):74, 2022.
  • [5] John Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986.
  • [6] Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. arXiv preprint arXiv:2205.08534, 2022.
  • [7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
  • [8] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • [9] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, and Jie Tang. CogView: Mastering Text-to-Image Generation via Transformers. In M Ranzato, A Beygelzimer, Y Dauphin, P S Liang, and J Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 19822–19835. Curran Associates, Inc., 2021.
  • [10] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
  • [11] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. In Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision – ECCV 2022, pages 89–106, Cham, 2022. Springer Nature Switzerland.
  • [12] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  • [13] Vidit Goel, Elia Peruzzo, Yifan Jiang, Dejia Xu, Nicu Sebe, Trevor Darrell, Zhangyang Wang, and Humphrey Shi. PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models, 2023.
  • [14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [15] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021.
  • [16] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • [17] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [18] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. The Journal of Machine Learning Research, 23(1):2249–2281, 2022.
  • [19] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models, 2021.
  • [20] Minghui Hu, Jianbin Zheng, Daqing Liu, Chuanxia Zheng, Chaoyue Wang, Dacheng Tao, and Tat-Jen Cham. Cocktail: Mixing Multi-Modality Control for Text-Conditional Image Generation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • [21] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [22] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10124–10134, 2023.
  • [23] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • [24] Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. Advances in Neural Information Processing Systems, 34, 2021.
  • [25] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  • [26] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
  • [27] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021.
  • [28] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring Plain Vision Transformer Backbones for Object Detection. In Computer Vision – ECCV 2022, pages 280–296. Springer Nature Switzerland, 2022.
  • [29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [30] Yuning Mao, Lambert Mathias, Rui Hou, Amjad Almahairi, Hao Ma, Jiawei Han, Wen-tau Yih, and Madian Khabsa. UNIPELT: A Unified Framework for Parameter-Efficient Language Model Tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics Volume 1: Long Papers, volume 1, pages 6253–6264. ACL, 2022.
  • [31] Momina Masood, Mariam Nawaz, Khalid Mahmood Malik, Ali Javed, Aun Irtaza, and Hafiz Malik. Deepfakes generation and detection: state-of-the-art, open challenges, countermeasures, and way forward. Applied Intelligence, 53(4):3974–4026, Feb 2023.
  • [32] Michael McCloskey and Neal J Cohen. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. volume 24, pages 109–165. Academic Press, 1989.
  • [33] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  • [34] Zak Murez, Soheil Kolouri, David Kriegman, Ravi Ramamoorthi, and Kyungnam Kim. Image to image translation for domain adaptation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4500–4509, 2018.
  • [35] Thanh Thi Nguyen, Quoc Viet Hung Nguyen, Dung Tien Nguyen, Duc Thanh Nguyen, Thien Huynh-The, Saeid Nahavandi, Thanh Tam Nguyen, Quoc-Viet Pham, and Cuong M. Nguyen. Deep learning for deepfakes creation and detection: A survey. Computer Vision and Image Understanding, 223:103525, 2022.
  • [36] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • [37] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
  • [38] Yingxue Pang, Jianxin Lin, Tao Qin, and Zhibo Chen. Image-to-Image Translation: Methods and Applications. CoRR, abs/2101.08629, 2021.
  • [39] Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. AdapterFusion: Non-Destructive Task Composition for Transfer Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 487–503. ACL, 2021.
  • [40] Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, 2023.
  • [41] Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. Controlling Text-to-Image Diffusion by Orthogonal Finetuning. In NeurIPS, 2023.
  • [42] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, and Others. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763, 2021.
  • [43] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2016.
  • [44] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  • [45] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022.
  • [46] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. CoRR, abs/2102.1, 2021.
  • [47] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020.
  • [48] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Efficient Parametrization of Multi-Domain Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [49] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International conference on machine learning, pages 1060–1069. PMLR, 2016.
  • [50] Scott E Reed, Zeynep Akata, Santosh Mohan, Samuel Tenka, Bernt Schiele, and Honglak Lee. Learning what and where to draw. Advances in neural information processing systems, 29, 2016.
  • [51] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • [52] A Rosenfeld and J K Tsotsos. Incremental Learning Through Deep Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3):651–663, 2020.
  • [53] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
  • [54] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, and Michael Bernstein. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
  • [55] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-Image Diffusion Models. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Proceedings. ACM, 2022.
  • [56] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In S Koyejo, S Mohamed, A Agarwal, D Belgrave, K Cho, and A Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 36479–36494. Curran Associates, Inc., 2022.
  • [57] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. arXiv preprint arXiv:2301.09515, 2023.
  • [58] Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets. In ACM SIGGRAPH 2022 conference proceedings, pages 1–10, 2022.
  • [59] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, and Mitchell Wortsman. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294, 2022.
  • [60] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • [61] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • [62] Asa Cooper Stickland and Iain Murray. BERT and PALs: Projected Attention Layers for Efficient Adaptation in Multi-Task Learning. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 5986–5995. PMLR, 2019.
  • [63] Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, and Changsheng Xu. Df-gan: A simple and effective baseline for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16515–16525, 2022.
  • [64] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
  • [65] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is All You Need for Image-to-Image Translation, 2022.
  • [66] Youya Xia, Josephine Monica, Wei-Lun Chao, Bharath Hariharan, Kilian Q Weinberger, and Mark Campbell. Image-to-Image Translation for Autonomous Driving from Coarsely-Aligned Image Pairs. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 7756–7762. IEEE, 2023.
  • [67] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018.
  • [68] Xingqian Xu, Zhangyang Wang, Gong Zhang, Kai Wang, and Humphrey Shi. Versatile Diffusion: Text, Images and Variations All in One Diffusion Model. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 7754–7765, 2023.
  • [69] Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen. Paint by example: Exemplar-based image editing with diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18381–18391, 2023.
  • [70] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, and Others. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022.
  • [71] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 5907–5915, 2017.
  • [72] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023.
  • [73] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  • [74] Shihao Zhao, Dongdong Chen, Yen-Chun Chen, Jianmin Bao, Shaozhe Hao, Lu Yuan, and Kwan-Yee K Wong. Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models. arXiv preprint arXiv:2305.16322, 2023.
  • [75] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pages 2223–2232, 2017.
  • [76] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5802–5810, 2019.

Supplementary Material

In the following, we provide details about implementation specifics and training parameters. Note, in the main article we showed quantitative evaluations for depth-image control only. Here we include quantitative evaluations on edge-image control, as well as further qualitative results for ControlNet-XS applied to Stable Diffusion [51] and Stable Diffusion XL [40].

Appendix A Architecture

In Fig. 8, we illustrate the interaction between the generative model Stable Diffusion (left) and ControlNet-XS (right). The interactions for each block between the generative and controlling encoders are depicted in Fig. 9.

Figure 8: ControlNet-XS architecture applied to the Stable Diffusion [51] U-Net. The U-Net generation process is shown on the left. The U-Net encoder consists of three neural blocks per resolution, followed by a middle block and a decoder with corresponding architecture. The encoder and decoder are connected by the common U-Net skip-connections. ControlNet-XS mirrors the structure of the encoder but with significantly fewer parameters. Both encoders process the image signal, a text conditioning and a timestep embedding. ControlNet-XS additionally receives a control signal. The intermediate feature maps of the generative encoder are communicated from each block of the generative encoder to ControlNet-XS. The connections from ControlNet-XS to the generative U-Net provide an additive correction to all intermediate feature maps. All connections between both networks contain zero-convolutions to ease training.
Figure 9: Detailed view of the connections between a generative encoder block and a ControlNet-XS block. The computed features are processed by zero-convolutions and added to the computed features of the counterpart. Features from the generative blocks can be either added or concatenated to the features of the control block. Because ControlNet-XS is trained from scratch, we utilise concatenation to process the information from the generative encoder blocks.
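
To make the role of the zero-convolutions concrete, the following NumPy sketch models one pair of connections between a generative block and a control block. It is a simplified illustration under several assumptions that the captions do not state: the zero-convolution is modelled as a 1x1 convolution, the channel counts (8 and 4) and spatial size are arbitrary, and the class name `ZeroConv1x1` is hypothetical.

```python
import numpy as np


class ZeroConv1x1:
    """A 1x1 convolution whose weights and bias are initialised to zero (assumed kernel size)."""

    def __init__(self, in_channels, out_channels):
        self.weight = np.zeros((out_channels, in_channels))
        self.bias = np.zeros(out_channels)

    def __call__(self, x):
        # x has layout (channels, height, width); a 1x1 convolution is a
        # per-pixel linear map over the channel axis.
        out = np.einsum("oc,chw->ohw", self.weight, x)
        return out + self.bias[:, None, None]


rng = np.random.default_rng(0)
gen_features = rng.standard_normal((8, 16, 16))   # output of a generative encoder block
ctrl_features = rng.standard_normal((4, 16, 16))  # output of the smaller control block

# Control -> generator: additive correction through a zero-convolution.
ctrl_to_gen = ZeroConv1x1(in_channels=4, out_channels=8)
corrected = gen_features + ctrl_to_gen(ctrl_features)

# Generator -> control: zero-convolved features are concatenated along the channel axis.
gen_to_ctrl = ZeroConv1x1(in_channels=8, out_channels=8)
ctrl_input = np.concatenate([ctrl_features, gen_to_ctrl(gen_features)], axis=0)
```

At initialisation every zero-convolution outputs zeros, so `corrected` equals the unmodified generator features and the pre-trained model's behaviour is preserved at the start of training; the connections only begin to carry information once their weights are updated.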

Appendix B Training Details

We have trained two ControlNet-XS models, one with edge control and one with depth control, for both Stable Diffusion [51] and Stable Diffusion XL [40]. As training data, we used one million images from the LAION Aesthetics dataset [59] for both controls. For edges, we extracted edge maps using Canny edge detection [5] with random thresholds. For depth control, we approximated the depths using the MiDaS [47] approach. In Tab. 6, we summarise the training settings for all models.

Condition    Control Model         Generative Model            Training Hours [A100]   Learning Rate   Batch Size
Edges        ControlNet-XS (55M)   Stable Diffusion (860M)     ~200                    1e-5            16
Depth Maps   ControlNet-XS (55M)   Stable Diffusion (860M)     ~200                    1e-5            16
Edges        ControlNet-XS (20M)   Stable Diffusion XL (2.6B)  ~250                    1e-4            40
Depth Maps   ControlNet-XS (20M)   Stable Diffusion XL (2.6B)  ~250                    1e-4            40
Table 6: Training details for ControlNet-XS with edge and depth control, applied to Stable Diffusion [51] and Stable Diffusion XL [40]. The size of ControlNet-XS does not have to grow with the size of the controlled generative model.

Appendix C Quantitative Results for Edge Control

In Tab. 7, we conduct an ablation study for three types of architectures of ControlNet-XS (Figs. 2(b), 2(c) and 2(d)) for edge control. The HDD score is scaled by $10^{-1}$. As we concluded for depth control, we can confirm that Type B is the best architecture choice for ControlNet-XS with edge control. We will use Type B for further experiments. In Tab. 8, we evaluate the effect that model size has on the performance of ControlNet-XS with 491M, 55M, 11.7M and 1.7M parameters, respectively. We also compare our models to the two competing state-of-the-art approaches ControlNet [72] and T2I-Adapter [33] for edge control. Our best model with 55M parameters outperforms both competitors in terms of quality (FID) and control (LPIPS and HDD), while still being able to generate high quality results.

Column groups: Quality (CLIP-Sc, CLIP-Ae), Both (FID), Control (LPIPS, HDD)
Method           CLIP-Sc ↑   CLIP-Ae ↑   FID ↓    LPIPS ↓   HDD ↓
CN-XS A (53M)    29.04       5.85        17.40    0.452     15.46
CN-XS B (55M)    29.61       5.98        15.13    0.417     15.22
CN-XS C (117M)   29.41       5.99        15.34    0.405     15.18
Table 7: Ablation study for the different ControlNet-XS architectures illustrated in Figs. 2(b), 2(c) and 2(d) with edge control. We see that with additional, immediate corrective connections in Type B and Type C, the performance considerably increases for all metrics. We choose Type B as our final ControlNet-XS architecture, since it performs on a par, on average, with Type C but has fewer parameters.
Column groups: Quality (CLIP-Sc, CLIP-Ae), Both (FID), Control (LPIPS, HDD)
Method             CLIP-Sc ↑   CLIP-Ae ↑   FID ↓    LPIPS ↓   HDD ↓
Stable Diffusion   28.40       6.16        22.69    (0.618)   (18.87)
CN (361M)          29.01       6.17        21.18    0.544     18.52
T2I (77M)          29.14       5.66        18.34    0.459     16.66
CN-XS (491M)       29.48       6.06        15.90    0.429     15.75
CN-XS (55M)        29.61       5.98        15.13    0.417     15.22
CN-XS (11.7M)      29.10       6.04        16.56    0.474     15.49
CN-XS (1.7M)       29.02       6.10        17.07    0.482     15.57
Table 8: Quantitative evaluation for edge control with respect to competitors and change in model size of ControlNet-XS. We observe that our best model, ControlNet-XS (CN-XS) with 55M parameters, outperforms the two competitors, i.e. ControlNet (CN) [72] and T2I-Adapter (T2I) [33], for every metric besides the CLIP-Aesthetic score. Furthermore, for ControlNet-XS models with few parameters, e.g. 1.7M, we notice that the fidelity of the control decreases, see FID, LPIPS and HDD scores.

Appendix D Additional Qualitative Results

We provide additional results for controlled image generation using edge and depth control with our ControlNet-XS applied to Stable Diffusion [51] in Fig. 10 and Stable Diffusion XL [40] in Fig. 11.

Appendix E Prompt Information for Figures

In Tab. 9, we show the text-prompts which were used for the generation of the images provided in the paper. If not stated otherwise, all generated images were sampled with 50 DDIM-steps and a classifier-free-guidance scale of 9.5.
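For readers unfamiliar with the guidance scale mentioned above, the following minimal sketch shows the commonly used classifier-free-guidance combination of a conditional and an unconditional noise prediction. This is the standard formulation and an assumption on our side about the sampler convention; the function name `cfg_combine` and the toy array shapes are hypothetical.

```python
import numpy as np


def cfg_combine(eps_uncond, eps_cond, guidance_scale=9.5):
    """Commonly used classifier-free-guidance rule (assumed convention).

    A scale of 1.0 returns the conditional prediction unchanged;
    larger scales push the result further along the text-conditioned direction.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


# Toy stand-ins for the two noise predictions of a denoising network (hypothetical shapes).
eps_uncond = np.zeros((4, 8, 8))
eps_cond = np.ones((4, 8, 8))

guided = cfg_combine(eps_uncond, eps_cond)  # uses the scale of 9.5 stated in the text
```

In a sampler with 50 DDIM steps, such a combination would be evaluated once per step before the DDIM update is applied.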

| Figure | Text-Prompt |
|---|---|
| Figure 1 (left) | cinematic, beautiful, photo of a guy, street photography, colourful, highly detailed, photorealistic |
| Figure 1 (right) | cinematic cupcake, blueberry flavoured cupcake, delicious, highly detailed, photorealistic |
| Figure 4(b) (left) | cinematic, advertising shot, shoe in a city street, photorealistic shoe, colourful, highly detailed |
| Figure 4(b) (right) | photo of dirty old sneakers, muddy, highly detailed, award winning image |
| Figure 4(d) (left) | photo of a white tiger swimming in a jungle river, best quality |
| Figure 4(d) (right) | photo of a tiger swimming in a jungle river, best quality |
| Figure 6 | aerial image of a city with a big highway intersection |
| Figure 7 | high quality photo of a delicious cake, 4k image |
| Figure 12 (c-e) | Photo of a big house with stores at the first floor, cars parked, 4k |
| Figure 12 (h-j) | close-up of a young woman, detailed, beautiful, street photography, photorealistic, detailed, Kodak ektar 100, natural, candid shot |
| Figures 13, 14 and 15 | high quality photo of a delicious cake, 4k image |
Table 9: Text-prompts that were used to generate the images in the main article as well as in the supplementary material.
(a) Control
(b) Original Image
(c) photo of white sports car, mountains, summer, award winning image, photorealistic, 4k
(d) photo of red sports car, mountains, winter, award winning image, photorealistic, 4k, snowing
(e) Control
(f) Original Image
(g) high resolution image of a cute white kitten, high quality, award winning image
(h) high resolution image of a cute black kitten, high quality, award winning image
(i) Control
(j) Original Image
(k) cinematic, luxury apartment, colourful, highly detailed
(l) cinematic, cyberpunk apartment out of steel and concrete, colourful, highly detailed
(m) Control
(n) Original Image
(o) photo of a beautiful young woman, award winning picture, professional photo, black and white
(p) photo of a beautiful young woman, award winning picture, professional photo
Figure 10: Images generated by ControlNet-XS (55M) and Stable Diffusion 1.5 [51] as generative model with two different text-prompts. The generated images have a resolution of $768 \times 768$.
(a) Control
(b) Original Image
(c) cinematic, winter, tree on a field, dramatic sky, highly detailed, photorealistic
(d) cinematic, autumn, tree on a field, dramatic sky, highly detailed, photorealistic
(e) Control
(f) Original Image
(g) cinematic, highly detailed, castle, beautiful sky, summer, photorealistic, 4k
(h) cinematic, highly detailed, snowy castle, beautiful sky, snowy winter, photorealistic, 4k
(i) Control
(j) Original Image
(k) cinematic statue of an owl on a rock, highly detailed, photorealistic
(l) cinematic white snow owl on a rock, highly detailed, photorealistic
(m) Control
(n) Original Image
(o) cinematic, highly detailed, milky cocktail on a plate, photorealistic
(p) cinematic, highly detailed, winter tea with spices on a plate, photorealistic
Figure 11: Images generated by ControlNet-XS (20M) and Stable Diffusion XL [40] as generative model with two different text-prompts. The generated images have a resolution of $1024 \times 1024$.

Appendix F Fidelity of the Control

Here, we provide additional examples with respect to fidelity of the control, see also main article Fig. 6 and Sec. 4.3. In Fig. 12, we qualitatively evaluate the effect of decreased fidelity of the control with smaller ControlNet-XS models. We additionally illustrate the effect which the complexity of control images can have on the fidelity to control.

(a) Control
(b) Original Image
(c) ControlNet-XS 55M
(d) ControlNet-XS 11.7M
(e) ControlNet-XS 1.7M
(f) Control
(g) Original Image
(h) ControlNet-XS 55M
(i) ControlNet-XS 11.7M
(j) ControlNet-XS 1.7M
Figure 12: The fidelity of the control reduces with smaller model sizes of ControlNet-XS for complex, detailed control maps. The generated street scenes (c-e) are with respect to a detailed control depth map (a). We see that in the 55M parameter model, the complex structures are identical to the original image. Smaller models with 11.7M and 1.7M parameters, respectively, are still guided by the control but less rigorously. Depth maps with fewer details such as the faces in (f) are processed with a similar fidelity of the control by all model sizes (h-j).

Appendix G Semantic Bias

Here we provide additional results for Fig. 7 in the main article, about the semantic bias of ControlNet [72]. We show results for different control strengths $\alpha$ for ControlNet (Fig. 13), ControlNet-XS with 55M parameters (Fig. 14) and ControlNet-XS with 11.7M parameters (Fig. 15). We notice that ControlNet-XS has less semantic bias than ControlNet. We also notice that the semantic bias cannot be removed by adapting the control strength $\alpha$.

Figure 13: Semantic bias for depth control. All images are generated with a control depth map of a street scene and an unrelated text-prompt: “high quality photo of a delicious cake, 4k image”. Note that these are not contradicting control inputs, but the inputs rather challenge the generative process to produce a creative solution with a cake in the form of a street scene. We see that ControlNet [72] is not able to produce satisfying results, even when adjusting the control strength $\alpha$. Note that $\alpha = 0.825$ is the default for ControlNet. With this default value, ControlNet shows proper house facade textures, while ControlNet-XS shows typical cake textures such as “sponge”, “marzipan” or “icing”. (Note that the output signals of the controlling network are added with a global weighting $\alpha$ to the output signals of the generation network at the respective neural blocks. This weighting can be adjusted at test time.) We conjecture that the reason is a semantic bias induced by large control models. A large control model can use its power to add semantic meaning to input depth maps. This semantic bias cannot be removed by adapting $\alpha$. Here, $\alpha = 0.4$ was the “sweet spot” where ControlNet suddenly transitions from producing images of a cake to images of a street scene.
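The global weighting described in the caption of Fig. 13 can be sketched as follows. This is a minimal illustration of the weighted addition only, assuming one feature tensor per neural block; the function name `add_control`, the list-of-arrays representation and the toy shapes are hypothetical and do not reflect the actual implementation.

```python
import numpy as np


def add_control(gen_features, ctrl_features, alpha=0.825):
    """Add control outputs to generation outputs, block by block, scaled by alpha.

    gen_features / ctrl_features: lists with one array per neural block
    (hypothetical representation). alpha is the global control strength and
    can be changed at test time without retraining.
    """
    if len(gen_features) != len(ctrl_features):
        raise ValueError("expected one control output per generation block")
    return [g + alpha * c for g, c in zip(gen_features, ctrl_features)]


# Toy example with two blocks of different spatial size.
gen = [np.ones((4, 16, 16)), np.ones((8, 8, 8))]
ctrl = [2.0 * np.ones((4, 16, 16)), 2.0 * np.ones((8, 8, 8))]

no_control = add_control(gen, ctrl, alpha=0.0)    # generation features unchanged
half_control = add_control(gen, ctrl, alpha=0.5)  # 1 + 0.5 * 2 = 2 everywhere
```

Setting `alpha` to zero recovers the unmodified generation features, which is why a sweep over $\alpha$ interpolates between uncontrolled and fully controlled generation.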
Figure 14: Semantic bias for depth control with ControlNet-XS (55M). Please find a detailed explanation in Fig. 13. We see that for $\alpha = 0.6$ the control is picked up. For all $\alpha \in [0.6, 1]$ a street scene with a cake texture is shown. The cake texture is more detailed than with ControlNet-XS (11.7M) (see Fig. 15).
Figure 15: Semantic bias for depth control with ControlNet-XS (11.7M). Please find a detailed explanation in Fig. 13. We see that for $\alpha = 0.825$ the control is picked up. For both $\alpha = 0.825$ and $\alpha = 1$, a street scene with a cake texture is shown. The cake texture is less detailed than with ControlNet-XS (55M) (see Fig. 14). This can be expected since we have seen above (e.g. Fig. 12) that smaller models have reduced fidelity of the control.