License: CC BY 4.0
arXiv:2404.03913v1 [cs.CV] 05 Apr 2024

Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models

Gihyun Kwon¹  Simon Jenni²  Dingzeyu Li²  Joon-Young Lee²
Jong Chul Ye¹  Fabian Caba Heilbron²

KAIST¹   Adobe²
[gihyun, jong.ye]@kaist.ac.kr  [jenni, dinli, jolee, caba]@adobe.com
Abstract
This work was done while Gihyun Kwon was an intern at Adobe Research.

While there has been significant progress in customizing text-to-image generation models, generating images that combine multiple personalized concepts remains challenging. In this work, we introduce Concept Weaver, a method for composing customized text-to-image diffusion models at inference time. Specifically, the method breaks the process into two steps: creating a template image aligned with the semantics of input prompts, and then personalizing the template using a concept fusion strategy. The fusion strategy incorporates the appearance of the target concepts into the template image while retaining its structural details. The results indicate that our method can generate multiple custom concepts with higher identity fidelity compared to alternative approaches. Furthermore, the method is shown to seamlessly handle more than two concepts and closely follow the semantic meaning of the input prompt without blending appearances across different subjects.

Figure 1: Concept Weaver’s Generation Results. Our method, Concept Weaver, can inject the appearance of arbitrary off-the-shelf concepts (from a Bank of Concepts) to generate realistic images.

1 Introduction

Text-to-image generation models have shown impressive capabilities [21, 23, 28] in the last few years. Existing open source [21] and commercial solutions such as Adobe Firefly have enabled aspiring creatives to generate images with unprecedented quality by simply crafting text prompts. Progress has also been attained in developing models that can customize images for your own subjects or visual concepts [11, 3, 22, 25]. These technologies have opened the door for new ways of content creation, where aspiring creators can craft stories with personalized characters under different scenes and styles.

While there has been significant progress in customizing text-to-image generation models, generating images that combine multiple personalized concepts remains challenging. Several approaches [11, 25] offer the ability to jointly train models for multiple concepts or merge customized models, enabling the creation of scenes with more than one personalized concept. However, these approaches often fail to generate semantically related concepts (e.g., cat and dog) and struggle to scale to three or more concepts. More recently, Mix-of-show [4] has addressed the issue of multi-concept generation with disentangled Low-Rank Adaptation (LoRA) [9] weight merging and regional guidance at the sampling stage. However, the model still suffers from mixed concepts due to the difficulty of weight merging.

In this paper, we propose a tuning-free method for composing customized text-to-image diffusion models at inference time. We illustrate our key idea in Figure 2, where the goal is to generate images featuring more than two custom concepts. Specifically, rather than generating a personalized image from scratch, we break the process into two steps: first, we create a template image that aligns with the semantics of the input prompt, and then we personalize this template image using a novel concept fusion strategy. The fusion strategy takes as input the non-personalized template image along with region concept guidance (obtained automatically) to generate an edited image that retains the template’s structural details while incorporating the target concepts’ appearance and style. This fusion approach injects concept details into specific spatial regions, allowing us to compose multiple concepts (from the Bank of Concepts) in generated images without blending appearances across different subjects.

Our empirical evaluations show that the proposed method is able to generate multiple custom concepts with higher concept fidelity. First, as shown in Section 4, we observe that our method can compose images without blending appearances for semantically related concepts (e.g., cats and dogs). Second, we notice that our model can seamlessly handle more than two concepts, e.g., two subjects and a custom background, while the baseline approaches struggle. Finally, we find that the images generated by our method closely follow the semantic meaning of the input prompt, achieving high CLIP scores [11]. Our method is also robust to the choice of personalization scheme: it can be used with both full fine-tuning and Low-Rank Adaptation, the latter being more computationally efficient.

2 Related Work

Text-to-image Diffusion Models.

Text-to-image generation models have made significant progress, starting from early GAN-based models [2, 29] to recent diffusion-based models [23, 21, 28, 20]. Various open source models and commercial models like Adobe Firefly have contributed to this development. The recent introduction of Stable Diffusion models [21] has led to the exploration of various applications such as mask-based image editing [1], image translation [26, 18, 16], and style transfer based on text [30]. Moreover, the attention-based structure of Stable Diffusion has inspired different editing methods [26, 7, 17].

Figure 2: Concept Weaver’s Method. First, we fine-tune a text-to-image model for each target concept in the bank (Step 1). Then we source a template image (Step 2). Given the template image, we apply the inversion process with simultaneous feature extraction to save its structural information (Step 3). In Step 4, we extract region masks from the template image with off-the-shelf models [10]. With extracted features and masks, we generate the multi-concept image in Step 5.

Diffusion Model Customization.

Building on the advancements of these T2I models, research on customizing T2I models using user-prepared images or visual concepts has gained attention. The seminal work of Textual Inversion [3] has focused on finding optimized textual embeddings for custom concepts to generate concept-reflecting images. Subsequent research has improved performance by finding extended textual embeddings [27, 12] or fine-tuning model parameters [22, 11], enabling more efficient and flexible customization.

Extending the previous single-concept frameworks, customization involving multiple concepts has also been attempted. These approaches include joint training to embed multiple concepts simultaneously [11, 5], weight merging of single-concept customized model parameters [11, 25], and spatial guidance [13]. However, these approaches face challenges when the number of concepts increases or when the concepts are semantically close, resulting in the disappearance or blending of specific concepts. To address this, the recent Mix-of-show [4] applies regional guidance during the sampling process using merged weights to resolve the issue of concept blending. However, the approach still requires additional optimization steps for weight merging and may experience fluctuations in quality due to its sensitivity to regional guidance.

3 The Concept Weaver’s Method

In this section, we introduce Concept Weaver, an innovative method designed to generate high-quality images that incorporate multiple custom concepts. Traditional models often struggle with generating complex, multi-concept images in a single step. Concept Weaver addresses this by employing a cascading generation process, which we illustrate in Figure 2. Consider the prompt: “A [C1] dog and a [C2] cat playing with a ball, [C3] mountain background”, where [C1], [C2], and [C3] denote custom concepts. Our approach begins by personalizing text-to-image models for each concept (Step 1). Next, we select a non-personalized ‘template image’ using the given prompt, either from a text-to-image model or a real-world source (Step 2). In the third step, we extract latent representations from this template to aid in later editing. The fourth step involves identifying and isolating the specific regions of the template image that correspond to the target subjects. Finally, our key contribution (Step 5) combines these latent representations, targeted spatial regions, and personalized models to reconstruct the template image, infusing it with the specified concepts. We present each of these key steps in detail next.

Step 1: Concept Bank Training.

In this step we fine-tune a pretrained text-to-image model to embed each of the target concepts in the bank. Among the various customization strategies, we leverage Custom Diffusion [11] as it does not change any residual network or self-attention layers. In practice, Custom Diffusion only fine-tunes the cross-attention layers of the U-Net model $\epsilon_{\theta}$. Specifically, with the text condition $p\in R^{s\times d}$ and self-attention feature $f\in R^{(h\times w)\times c}$, the cross-attention layer consists of $Q=W^{q}f$, $K=W^{k}p$, $V=W^{v}p$.

We only fine-tune the ‘key’ and the ‘value’ weight parameters $W^{k},W^{v}$ of the cross-attention layers. Also, we use modifier tokens [V*], which are placed ahead of the concept word (e.g., [V*] dog) and operate as a constraint on the general concept. We augment the fine-tuning process with robust data augmentation techniques. Since any personalization approach can be incorporated as long as it only modifies the cross-attention layers, the approach extends naturally to efficient LoRA [9]-based fine-tuning. We demonstrate this flexibility in our experiments.
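
To make the split between frozen and trainable weights concrete, the following is a minimal NumPy sketch of a single cross-attention layer. It is not the Custom Diffusion implementation: the scaled dot-product softmax is the standard attention form rather than something stated in this section, the dimensions are arbitrary, and a row-vector convention is assumed (so `f @ W_q` plays the role of $Q=W^{q}f$).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(f, p, W_q, W_k, W_v):
    """f: (h*w, c) image features; p: (s, d) text condition."""
    Q = f @ W_q  # (h*w, d_attn), frozen projection
    K = p @ W_k  # (s, d_attn), fine-tuned per concept
    V = p @ W_v  # (s, d_attn), fine-tuned per concept
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (h*w, s)
    return attn @ V, attn

rng = np.random.default_rng(0)
hw, c, s, d, d_attn = 16, 8, 5, 12, 8  # illustrative sizes only
f = rng.normal(size=(hw, c))
p = rng.normal(size=(s, d))
W_q = rng.normal(size=(c, d_attn))
W_k = rng.normal(size=(d, d_attn))
W_v = rng.normal(size=(d, d_attn))

out, attn = cross_attention(f, p, W_q, W_k, W_v)
# Only these two matrices would receive gradient updates during personalization.
trainable = {"W_k": W_k, "W_v": W_v}
```

Because the concept-specific parameters live entirely in `W_k` and `W_v`, swapping concepts at inference time amounts to swapping these two matrices per cross-attention layer.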

Step 2: Template Image Generation. One of our key insights is to cascade the multi-concept generation process: we start from a template image that can be customized/personalized with the target concepts in the given prompt. To source a template image, we can rely on existing text-to-image models or on real images, if given. The template should include the semantic objects (or characters) and the specific background described in the prompt. In practice, we generate template images using the Stable Diffusion [21] model, version $\geq$ 2.0.

Figure 3: Image Inversion and Multi-Concept Fusion. (a) To extract and save the structural information of template images, we save the intermediate latent of images during the DDIM forward process. With the fully inverted noise, we extract the feature outputs from denoising U-Net during the DDIM reverse process. (b) From the noisy inverted latent, we start the multi-concept fusion generation. We denoise the noisy image with fine-tuned personalized models. After obtaining multiple cross-attention layer features, we fuse the different features from each masked region. In this step, we inject the pre-calculated self-attention and resnet features into the networks.

Step 3: Inversion and Feature Extraction. After sourcing a template image, we apply an inversion process to obtain a latent representation that will help guide our generation process. In this stage, we borrow the image inversion and feature extraction schemes proposed in plug-and-play diffusion (PNP) [26]. More specifically, as shown in Figure 3 (a), from the source image $x_{src}$ we generate the noisy latent $z_{T}$ with the DDIM [24] forward process. From the inverted latent $z_{T}$, we can accurately reconstruct the source image using a reverse DDIM process [24]. We provide more details about the inversion process in the supplementary material. During the reverse reconstruction process, we extract the features $f_{t}^{l}$ from the U-Net’s $l$-th layer at each timestep $t$. These features include intermediate outputs from residual layers and self-attention activations. As proposed in PNP diffusion, we extract the ResNet output from $l=4$ and self-attention maps from $l=4,7,9$. Inspired by the recent negative prompt inversion [6], we used the reference text condition $p_{src}$ during the inversion process.
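
The inversion and reconstruction loop can be illustrated with a toy sketch of the deterministic DDIM update. Everything model-specific here is an assumption made for illustration: `toy_eps` stands in for the U-Net noise prediction, the `alphas` schedule is hypothetical, and the saved "features" are just the noise predictions rather than real ResNet or self-attention activations.

```python
import numpy as np

def ddim_step(z, eps, a_from, a_to):
    """Deterministic DDIM update between cumulative alphas a_from -> a_to."""
    x0_pred = (z - np.sqrt(1.0 - a_from) * eps) / np.sqrt(a_from)
    return np.sqrt(a_to) * x0_pred + np.sqrt(1.0 - a_to) * eps

def toy_eps(z, t):
    """Stand-in for the U-Net; it depends on t only (an assumption)."""
    return np.full_like(z, 0.1 * np.sin(t))

T = 50
alphas = np.linspace(0.999, 0.01, T + 1)  # hypothetical cumulative-alpha schedule
x_src = np.random.default_rng(0).normal(size=(4, 4))

# Forward (inversion) pass: x_src -> z_T.
z = x_src.copy()
for t in range(T):
    z = ddim_step(z, toy_eps(z, t), alphas[t], alphas[t + 1])
z_T = z.copy()

# Reverse (reconstruction) pass: z_T -> x_rec, saving per-timestep features.
features = {}
for t in reversed(range(T)):
    eps = toy_eps(z, t)
    features[t] = eps.copy()  # placeholder for f_t^l
    z = ddim_step(z, eps, alphas[t + 1], alphas[t])
x_rec = z
```

The round trip is exact here only because the toy predictor ignores `z`; with a real network the reconstruction is approximate, which is why the source conditions the inversion on $p_{src}$.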

Step 4: Mask Generation. Given an inverted latent and pre-calculated features, we can guide the structure of the subsequent generation process. However, structural guidance alone cannot guarantee concept-wise editing of each target concept, and the generated images often contain mixed concepts. Therefore, we use masked guidance, in which each personalized generation model is applied only to the specific region that already contains the corresponding template object. To obtain the semantic mask regions, we leverage the Segment Anything Model [10]. To avoid manual seeding of the segmentation model, we incorporate a pre-trained text-conditional grounding model [15] that produces bounding boxes from text prompts. We obtain the box regions by providing single concept words such as ‘a dog’, ‘a cat’, etc. For $N$ different concepts, we extract concept-wise masks $M_{1},M_{2},\dots,M_{N}$, and set the unmasked region as the background mask $M_{bg}=(M_{1}\cup M_{2}\cup\dots\cup M_{N})^{c}$.
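
The background mask definition is a plain set complement, as the short sketch below shows. The masks are hand-made boolean arrays for illustration; in the actual pipeline they would come from the grounding and segmentation models.

```python
import numpy as np

def background_mask(masks):
    """M_bg = complement of the union of the concept masks."""
    union = np.zeros_like(masks[0], dtype=bool)
    for m in masks:
        union |= m.astype(bool)
    return ~union

# Toy 4x6 image with two non-overlapping concept regions.
M1 = np.zeros((4, 6), dtype=bool)
M1[1:3, 0:2] = True  # e.g. the region returned for "a dog"
M2 = np.zeros((4, 6), dtype=bool)
M2[1:3, 4:6] = True  # e.g. the region returned for "a cat"

M_bg = background_mask([M1, M2])
```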

We empirically discovered that directly using the densely annotated masks often yields deformed outputs. Therefore, instead of the dense mask, we use a dilated mask in which the mask region is expanded from the original area. To prevent confusion between overlapping regions of concepts, we keep the original dense mask only in such overlapped regions.
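
One possible reading of the dilation rule is sketched below: each dense mask is dilated, and wherever two or more dilated masks collide, every concept falls back to its original dense mask. The square structuring element, the `radius` parameter, and this exact fallback rule are assumptions; the text does not specify them.

```python
import numpy as np

def dilate(mask, radius=1):
    """Binary dilation with a (2r+1)x(2r+1) square element, via shifted ORs."""
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), radius, mode="constant", constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def dilate_with_overlap_fallback(dense_masks, radius=1):
    dilated = [dilate(m, radius) for m in dense_masks]
    overlap = np.sum(dilated, axis=0) > 1  # pixels claimed by 2+ dilated masks
    # Inside the overlap keep the dense mask, elsewhere use the dilated one.
    return [np.where(overlap, m.astype(bool), d) for m, d in zip(dense_masks, dilated)]

M1 = np.zeros((5, 7), dtype=bool)
M1[2, 1:3] = True
M2 = np.zeros((5, 7), dtype=bool)
M2[2, 4:6] = True

R1, R2 = dilate_with_overlap_fallback([M1, M2], radius=1)
```

In this toy case the two dilated masks collide in the middle column, so neither refined mask claims it and the regions stay disjoint.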

Step 5: Multi-Concept Fusion. We can now generate images with multiple concepts, as described in Figure 3(b). Since our goal is to generate images without any joint-training stage, we propose a novel sampling process that combines multiple single-concept personalized models in a unified sampling process. Starting from the inverted noisy latent $z_{T}$, we progressively denoise the latent. More specifically, we assume that there is a bank of concepts that already contains parameter sets for fine-tuned single-concept models. In practice, we select $N$ concepts for generation, whose weight parameters are $\theta_{1},\theta_{2},\dots,\theta_{N}$. We also pick one concept for background generation, with parameters $\theta_{bg}$. With the selected models, we start our multi-concept fusion sampling.

One naive approach is to mix the multiple score estimation outputs similar to compositional diffusion [14]. At each time step $t$, the single score estimation is represented as:

$$\epsilon_{fuse}=\sum_{i}^{N}\epsilon_{\theta_{i}}(z_{t},t,p_{+i})M_{i}+\epsilon_{\theta_{bg}}(z_{t},t,p_{+bg})M_{bg},$$

where $\epsilon_{\theta_{i}}(z_{t},t,p_{+i})$ is the model output from the $i$th concept, and $M_{i}$ is the corresponding mask region for each concept. However, we found that naively mixing the different models in score estimation shows limited performance as the concepts of generated outputs are not smoothly mixed.
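
As a reference point for the refinements that follow, the naive masked mixing of score estimates can be sketched as below. The constant arrays are placeholders for the outputs of the personalized models, and the latent shape is arbitrary.

```python
import numpy as np

def naive_score_fusion(eps_concepts, masks, eps_bg, mask_bg):
    """eps_fuse = sum_i eps_i * M_i + eps_bg * M_bg (masks broadcast over channels)."""
    fused = eps_bg * mask_bg
    for eps_i, M_i in zip(eps_concepts, masks):
        fused = fused + eps_i * M_i
    return fused

# Toy example: 4-channel 8x8 latent, two concepts on the left and right strips.
C, H, W = 4, 8, 8
M1 = np.zeros((H, W))
M1[:, :3] = 1.0
M2 = np.zeros((H, W))
M2[:, 5:] = 1.0
M_bg = 1.0 - np.clip(M1 + M2, 0.0, 1.0)

eps_1 = np.full((C, H, W), 1.0)    # placeholder for the concept-1 model output
eps_2 = np.full((C, H, W), 2.0)    # placeholder for the concept-2 model output
eps_bg = np.full((C, H, W), -1.0)  # placeholder for the background model output

eps_fuse = naive_score_fusion([eps_1, eps_2], [M1, M2], eps_bg, M_bg)
```

Each pixel takes its value from exactly one model, which makes the hard seams at mask boundaries easy to see.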

We address this problem by introducing multiple techniques for realistic concept fusion:
First, we inject the pre-calculated features $f_{t}^{l}$ into the U-Net models. Since the concept-aware parameters only affect cross-attention layers, they are independent of the saved features $f_{t}^{l}$, which are extracted from residual and self-attention layers. Therefore, we provide unified structural information throughout all sampling steps without deteriorating the representation of custom concepts.
Second, we found that using the same text condition input for all networks yields severe artifacts and results in concept leakage problems, i.e., the appearance of concepts is mixed indiscriminately. Therefore, we propose a concept-aware text conditioning strategy, in which each text condition input $p_{+i}$ is a sentence that includes only one concept-indicating modifier word. For example, if we combine the two concepts [c1] dog and [c2] cat with a [bg] mountain background, our prompt construction scheme is as follows. We start from a basic text prompt such as:

$$p_{base}=\text{``A dog and a cat playing with a ball, mountain background''}$$

Then we place the placeholder token in front of the corresponding concept word in each text condition, such that:

$$p_{+1}=\text{``A [c1] dog playing with a ball, mountain background''}$$
$$p_{+2}=\text{``A [c2] cat playing with a ball, mountain background''}$$
$$p_{+bg}=\text{``A dog and a cat playing with a ball, [bg] mountain background''}$$

With the differently constructed text conditions, we can sample the concept-specific image in the targeted regions.
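
The prompt construction scheme can be reproduced with a few lines of string handling. The helper below is hypothetical: its name, its arguments, and the way the prompt is split into subjects, action, and background are illustrative choices that happen to reproduce the example prompts above.

```python
def build_prompts(subjects, action, background, tokens, bg_token):
    """Return (p_base, [p_+1, ..., p_+N], p_+bg) for the given subjects."""
    joined = " and ".join(f"a {s}" for s in subjects)
    joined = joined[0].upper() + joined[1:]  # capitalise the first letter only
    p_base = f"{joined} {action}, {background}"
    # Each concept prompt mentions a single subject with its modifier token.
    p_concepts = [
        f"A {tok} {s} {action}, {background}" for tok, s in zip(tokens, subjects)
    ]
    # The background prompt keeps all subjects but tags only the background.
    p_bg = f"{joined} {action}, {bg_token} {background}"
    return p_base, p_concepts, p_bg

p_base, p_concepts, p_bg = build_prompts(
    subjects=["dog", "cat"],
    action="playing with a ball",
    background="mountain background",
    tokens=["[c1]", "[c2]"],
    bg_token="[bg]",
)
```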

Third, we propose to mix the different concepts in the feature space of cross-attention layers as shown in Fig. 3(b). With the $i$th concept weight parameter $\theta_{i}$ and concept-aware prompt $p_{+i}$, we can extract the output feature $h^{l,t}_{i}$ from the $l$th cross-attention layer at timestep $t$. For brevity, we drop $l,t$, as we use the features in all layers and timesteps. With the extracted features for each concept, we can calculate mixed features such that:

$$h_{fuse}=\sum_{i}^{N}h_{i}M_{i}+h_{bg}M_{bg}.$$

We also propose a concept-free suppression method to remove the concept-free features during the sampling process. Specifically, we calculate the cross-attention features $h_{base}$ from a concept-free (not fine-tuned) model $\epsilon_{\theta_{base}}$ with the basic text condition $p_{base}$, and extrapolate the initially fused features away from the concept-free features:

$$h_{fuse}=(1+\lambda)\Big[\sum_{i}^{N}h_{i}M_{i}+h_{bg}M_{bg}\Big]-\lambda h_{base}.$$
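
A small sketch of the masked feature fusion with concept-free suppression follows. The features are random placeholders for cross-attention outputs of shape (tokens, channels), the masks are assumed to be already resized and flattened to the layer's resolution, and the value of `lam` is hypothetical since this section does not state $\lambda$.

```python
import numpy as np

def fuse_features(h_concepts, masks, h_bg, mask_bg, h_base, lam):
    """h_fuse = (1 + lam) * [sum_i h_i * M_i + h_bg * M_bg] - lam * h_base."""
    mixed = h_bg * mask_bg
    for h_i, M_i in zip(h_concepts, masks):
        mixed = mixed + h_i * M_i
    return (1.0 + lam) * mixed - lam * h_base

rng = np.random.default_rng(0)
n_tokens, c = 16, 8  # h*w spatial tokens, c channels (illustrative)
M1 = np.zeros((n_tokens, 1))
M1[:6] = 1.0
M2 = np.zeros((n_tokens, 1))
M2[10:] = 1.0
M_bg = 1.0 - M1 - M2
h1, h2, h_bg, h_base = (rng.normal(size=(n_tokens, c)) for _ in range(4))

h_plain = fuse_features([h1, h2], [M1, M2], h_bg, M_bg, h_base, lam=0.0)
h_fuse = fuse_features([h1, h2], [M1, M2], h_bg, M_bg, h_base, lam=0.5)
```

With `lam=0` the expression reduces to the plain masked mixture; a positive `lam` pushes the fused features further from what the non-personalized model would produce.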

We then calculate the fused score estimation, such that:

$$\epsilon_{fuse}=\epsilon_{\theta}(z_{t},t;h_{fuse};f_{t}),$$

where $h_{fuse}$ denotes the fused features used in the cross-attention layers, and $f_{t}$ denotes the pre-calculated features used in the self-attention and residual layers.

In our model, the pre-calculated features $f_{t}$ influence only the structural aspects of the image, while the fused features $h_{fuse}$ are exclusively concerned with concept-wise semantic information. This clear distinction ensures there is no conflict between these two components. As a result, our approach effectively accomplishes two distinct objectives: maintaining the overall structure of the template image and simultaneously altering the semantics of the objects to align with custom concepts. This dual functionality allows for a nuanced and precise manipulation of images according to specific requirements.

It is widely known that using only the conditional score estimation does not produce good outputs. Therefore, we leverage classifier-free guidance [8] to extrapolate the output from the unconditional text condition $p_{\varnothing}=\varnothing$. In practice, we use the recent ‘negative’ prompt strategy instead of the unconditional text condition, so that the generated images will not contain the unwanted attributes described in the negative prompt $p_{neg}$. In our case, the negative-guidance score output is represented as:

$$\epsilon=\omega\cdot\epsilon_{fuse}+(1-\omega)\cdot\epsilon_{\theta_{base}}(z_{t},t,p_{neg};f_{t}).$$
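
The guidance rule is a linear extrapolation from the negative-prompt score toward the fused score, as the sketch below illustrates. The score arrays are random placeholders and the value of `omega` is hypothetical; the guidance scale is not given in this section.

```python
import numpy as np

def negative_guidance(eps_fuse, eps_neg, omega):
    """eps = omega * eps_fuse + (1 - omega) * eps_neg."""
    return omega * eps_fuse + (1.0 - omega) * eps_neg

rng = np.random.default_rng(0)
eps_fuse = rng.normal(size=(4, 8, 8))  # placeholder for the fused score
eps_neg = rng.normal(size=(4, 8, 8))   # placeholder for the base model's negative-prompt score
omega = 7.5                            # hypothetical guidance scale

eps = negative_guidance(eps_fuse, eps_neg, omega)
```

Rearranged, the same expression reads `eps_neg + omega * (eps_fuse - eps_neg)`, so `omega = 1` returns the fused score unchanged and larger values move further away from the negative prompt.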

Implementation Details

For the step 1 single-concept personalization, we adopted the official repository of Custom Diffusion [11]. We used the pre-trained Stable Diffusion V2.1 (SD2.1) as our starting point for fine-tuning as the model showed improved quality. For a fair comparison, we adopted SD2.1 for all of the baseline methods. For each concept, we fine-tuned the models for 500 steps using a learning rate of 1e-5. For step 2 template image generation, we used images generated from Stable Diffusion XL with 50 sampling steps at a higher resolution of 1024×1024, which takes 10 seconds per image. The source image for this step can also be a real image that contains multiple objects. For step 4 mask generation, we leveraged the pipelines from langSAM (https://github.com/luca-medeiros/lang-segment-anything). For steps 3 and 5, we followed the official source code of Plug-and-Play diffusion features [26]. In this stage, we also used SD2.1 as our generation backbone. We set the resolution of the generation process to 768×768 and used 50 sampling steps. The entire process (from step 1 to 5) takes about 60 seconds on a single RTX 3090 GPU (24 GB VRAM). More details on the sampling protocol are in the supplementary material.

Figure 4: Qualitative Evaluation of Multi-Concept Generation. We assess the quality of image generation by our method compared to baseline approaches, using prompts that incorporate every concept from a predefined concept bank (shown on the left). First row: our method successfully preserves the appearance of the target concepts while all baselines fail. Second row: here Mix-of-show is able to preserve the identity but struggles when the prompt includes a close interaction. Third row: all baseline approaches fail to generate the prompted action or to preserve the concept’s attributes; our model instead generates an image that follows the prompt while preserving the appearance of the concepts. Overall, our model generates concept-aware outputs without any concept mixing problems.

4 Experimental Results

In this section, we evaluate our multi-concept fusion approach. First, we present qualitative and quantitative results that highlight our method’s effectiveness in generating multiple concepts in challenging scenarios. We then discuss our ablation, which examines the impact of different design choices. Finally, we show how our method can also be applied to edit and personalize real images.

Baselines. We compare our approach with several methods for concept personalization. We include early approaches such as Custom Diffusion [11] and Textual Inversion [3]. Moreover, we include recent approaches such as Perfusion [25] and Mix-of-show [4]. These approaches use weight merging, in which an optimization process mixes multiple single-concept weights into a unified set of weights. Since the Mix-of-show model uses a region-based sampling approach, we manually set the different regions for each concept for a fair comparison.

Datasets. We use diverse data sources for both quantitative and qualitative analyses. For quantitative evaluation, we select 15 distinct concepts from the Custom Concept dataset, arranged into five unique combinations. These concepts encompass a wide range of categories, including animals, humans, natural scenes, and objects. For qualitative analysis, we extend the bank of concepts with 3 animated character concepts extracted from YouTube. The Custom Concept 101 dataset offers a wide variety of images, with each concept containing approximately 3 to 8 images. For the animated character concepts from the Blender Open Movie (https://www.youtube.com/watch?v=WhWc3b3KhnY&t=52s), we curated a collection of around 5 images per concept. The supplementary material showcases examples of all concepts used in our evaluations.

Evaluation metrics. Following [11], we assess our method against baseline approaches by measuring Text-alignment (Text-sim) and Image-alignment (Image-sim) using CLIP scores [19]. Text-alignment computes the cosine similarity between the CLIP embedding of the generated image and the CLIP embedding of the text prompt. To accurately reflect our model’s performance in generating multiple concepts, we adapt the standard Image-alignment metric: we compute the cosine similarity between visual embeddings from designated concept regions and the embeddings of the corresponding target concepts. We compute these metrics over 200 unique images generated by each model. We use 5 combinations of multiple concepts in which each combination includes more than 3 concepts. We use varied text prompts, from simple text such as “photo of dog and a cat standing, mountain background”, to complex interactions between the concepts like “photo of dog and a cat kissing, mountain background”. We report the average Text-alignment and Image-alignment scores computed over all the generated images.
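As a minimal sketch of how the two scores could be computed once embeddings are available, the snippet below works on precomputed embedding vectors given as plain arrays. The function names (`text_sim`, `image_sim`), the toy vectors, and the choice to average the per-concept similarities are illustrative assumptions rather than the evaluation code used for the paper; extracting the CLIP embeddings themselves is left out.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def text_sim(image_emb, text_emb):
    """Text-alignment: similarity of the generated image and its prompt."""
    return cosine(image_emb, text_emb)

def image_sim(region_embs, concept_embs):
    """Adapted Image-alignment: compare each concept region with its own
    target concept, then average over concepts (averaging is an assumption)."""
    scores = [cosine(r, c) for r, c in zip(region_embs, concept_embs)]
    return sum(scores) / len(scores)

# Toy 3-D vectors standing in for CLIP features (hypothetical values).
img = [1.0, 0.0, 0.0]
txt = [1.0, 1.0, 0.0]
regions = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
concepts = [[2.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

print(round(text_sim(img, txt), 4))
print(round(image_sim(regions, concepts), 4))
```

Scoring each region against its own concept, instead of the whole image against every concept, keeps one well-rendered subject from masking a missing or blended one.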

Figure 5: Towards More Complex Multi-Concept Generation. We compare our method against Mix-of-show at generating images with prompts involving four challenging concepts. Mix-of-show exhibits severe problems of concept missing. Our method, instead, can successfully generate realistic concept-aware images when using a larger number of concepts.

4.1 Multi-Concept Generation Results

Qualitative Evaluation.

We compare our method against the baselines in generating images from three-concept prompts. We include simple prompts such as “A photo of a [C1] cat and a [C2] woman standing with a [C3] lighthouse background.”. We also study the generation quality for prompts involving concept interactions, for instance, “A photo of a [C1] cat and a [C2] woman hugging with a [C3] lighthouse background.”. For a fair comparison, we pick the image with the largest CLIP score for each method.

Figure 4 summarizes our qualitative evaluation. Most baseline approaches [25, 11, 3] struggle to generate high-quality images, often failing to accurately capture the appearance of all target concepts and frequently mixing distinct features such as appearance, texture, or details between concepts. Mix-of-show [4] tends to generate realistic images for multi-concept prompts. However, we observe a common failure mode that mixes the concept’s appearance when the concept locations are close in space, e.g., when prompted to generate subjects that are “kissing”. In contrast, our method can successfully generate the custom concepts, even when prompted to generate interactions between these concepts, without mixing or missing concepts, therefore properly reflecting the given text prompts.

When composing more than 3 concepts, our method also outperforms Mix-of-show, as shown in Figure 5. Mix-of-show [4] requires weight mixing for multi-concept fusion, and the complexity of this weight optimization causes its generated images to deteriorate severely as more concepts are included.

| Method | CLIP Text sim ↑ | CLIP Image sim ↑ |
|---|---|---|
| Textual Inversion | 0.3423 | 0.7256 |
| Custom Diffusion | 0.3595 | 0.7875 |
| Perfusion | 0.3182 | 0.7563 |
| Mix-of-show | 0.3634 | 0.7984 |
| Concept Weaver (Ours) | 0.3804 | 0.8124 |

Table 1: Quantitative Evaluation of Multi-Concept Generation. Our model outperforms the baselines in both CLIP scores, indicating that our outputs have better text and concept alignment.

| Method | Text match ↑ | Concept match ↑ | Realism ↑ |
|---|---|---|---|
| Textual Inversion | 2.28 | 1.89 | 2.55 |
| Custom Diffusion | 2.73 | 2.11 | 2.64 |
| Perfusion | 2.22 | 1.84 | 2.70 |
| Mix-of-show | 3.44 | 3.39 | 3.78 |
| Concept Weaver (Ours) | 4.70 | 4.64 | 4.43 |

Table 2: Human Preference Study. We assess three different axes. Text match: evaluates how closely the images follow a given text prompt. Concept match: measures the quality of preserving the appearance and attributes of target concepts. Realism: captures the overall quality of the generated images. We use a 5-point scale, where 1 represents “strongly disagree” and 5 “strongly agree”, and report the average across all responses.

| Settings | CLIP Text sim ↑ | CLIP Image sim ↑ |
|---|---|---|
| (a) Only mask guidance | 0.3140 | 0.7544 |
| (b) w/o feature injection | 0.3489 | 0.7739 |
| (c) eps mix | 0.3677 | 0.8023 |
| (d) w/o concept-free suppression | 0.3727 | 0.7936 |
| Concept Weaver (Ours) | 0.3804 | 0.8124 |

Table 3: Ablation Study. Quantitative comparison on ablating components of our method. We validate that each of our design choices makes our model better at multi-concept generation.

Quantitative Evaluation. Table 1 reports the CLIP scores for our method and the baseline approaches. Our method achieves the highest text-similarity and image-similarity scores, which indicates that our generated outputs are better in both text semantic alignment and concept appearance preservation.

Human Preference Study. To further assess the perceptual quality of our generated images, we conducted a user study with 20 participants. We summarize the results in Table 2. The study was designed to capture detailed opinions along three different axes: 1) Alignment with the given text prompt (Text match), 2) Inclusion of all target concepts (Concept match), and 3) Overall quality and realism of the generated images (Realism). The participants were asked to score 20 images on each of these axes using a 5-point scale, where 1 represents “strongly disagree” and 5 “strongly agree”. More details about the protocol are given in the supplementary material. These results validate that our proposed method can generate perceptually better outputs than the baseline methods, as consistently indicated by a broad range of human evaluators.

Figure 6: Ablation Study. (a) Results with only using mask guidance. (b) Results without using feature injection strategy. (c) Results of direct mixing on score estimation output. (d) Results without using concept-free suppression approach. (e) Ours (full).

4.2 Ablation Study

We ablate our method and show a qualitative comparison between different settings in Figure 6. (a) When we only use mask guidance, similar to the approach of Mix-of-show, the output’s structures are severely deformed, and the image does not contain the proper concepts. (b) When we remove the feature injection, the output image again shows concept leakage and the quality is lowered. (c) When we use epsilon space mixing, the output image shows unwanted artifacts on the boundary area. (d) If we do not use the suppression method, the generated object does not fully reflect the concept appearance, especially for the plushie concept. We also show a quantitative comparison between the different settings in Table 3, following the same experimental protocol used in our quantitative comparison. The results validate our design choices: the full method achieves the highest correspondence with both the text condition and the target concepts.

4.3 Applications and Potential Extensions

Figure 7: Customizing Real Images. Our method can also edit real images to inject the appearance of target concepts.

Customizing Real Images. Since our sampling approach starts from initial template images, we can easily extend our method into real image editing by substituting the generated template images with real ones. As shown in Figure  7, our method can edit real images with multiple custom concepts. It accurately injects the appearance and attributes of the target concepts into the existing objects in the real image.

Extension to LoRA Fine-tuning. Instead of using Custom Diffusion fine-tuning in the single-concept personalization step, we can easily adapt our approach to the more efficient scheme of Low-Rank Adaptation (LoRA) fine-tuning. Different from the basic approach of fully fine-tuning the key and value weights $W^{k}, W^{v}$, we can use LoRA-based fine-tuning in which only $\Delta W$ is updated such that $W_{new}=W+\Delta W$. Figure 8 illustrates that our method can be easily extended to leverage the more efficient LoRA fine-tuning. We show more generated samples in the Supplementary Material.

Figure 8: Extension to LoRA Fine-tuning. Our method also supports a bank of concepts trained with efficient LoRA fine-tuning.

5 Conclusion

We introduced a novel framework to generate high-fidelity images which contain multiple custom concepts. Our proposed approach fuses multiple personalized single-concept models during the sampling stage without any additional optimization process. The experimental results showed that our method outperforms state-of-the-art customization methods in multiple axes. In general, our proposed method can generate a larger number of concepts together, including complex interactions between them. We also showed that our approach can be applied to customize real images and be easily extended to efficient LoRA fine-tuning.

Acknowledgements. This research was supported by the Field oriented Technology Development Project for Customs Administration under Grant NRF2021M3I1A1097938, and Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government(MSIT, Ministry of Science and ICT) (No. 2022-0-00984, Development of Artificial Intelligence Technology for Personalized Plug-and-Play Explanation and Verification of Explanation, No.2019-0-00075, Artificial Intelligence Graduate School Program(KAIST))

Supplementary Material

Appendix A Method Details

Details of Concept Bank Training. Given the model and example images of custom concepts, we can fine-tune components of the model to embed a single concept into the pre-trained model. Textual Inversion [3] has been widely adopted; however, its limited degrees of freedom prevent it from expressing the fine details of a custom concept. DreamBooth [22] instead fine-tunes all the parameters of the model, which makes it time-consuming to fine-tune for a large number of concepts. As we leverage the self-attention layer and residual block features as a source for structural preservation, we chose the framework of Custom Diffusion [11], which follows the score matching loss:

$$\mathbf{E}_{\epsilon,x,p,t}\left[\|\epsilon-\epsilon_{\theta}(x_{t},p,t)\|\right], \tag{1}$$

where $\epsilon_{\theta}$ is the denoising network and $\epsilon$ is noise sampled from a unit Gaussian; $t$ and $p$ denote the timestep and the text condition, respectively. With the text condition $p\in R^{s\times d}$ and the self-attention feature $f\in R^{(h\times w)\times c}$, the cross-attention layer computes $Q=W^{q}f$, $K=W^{k}p$, $V=W^{v}p$, and the attention output is represented as:

$$A(Q,K,V)=\mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right)V.$$

We only fine-tune the ‘key’ and the ‘value’ weight parameters, $W^{k}, W^{v}$, of the cross-attention layers. Also, we use modifier tokens [V*], which are placed ahead of the concept word (e.g., [V*] dog) and operate as a constraint to general concepts.
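The cross-attention computation described here can be sketched with NumPy as follows. This is only an illustration under stated assumptions: it uses a row-vector convention (`f @ Wq` rather than $W^{q}f$), scales by the key dimension `d_k`, and uses random toy shapes; the `trainable` dictionary is a hypothetical way of recording that only the key and value projections are updated.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(f, p, Wq, Wk, Wv):
    """f: (h*w, c) image features, p: (s, d) text token embeddings."""
    Q = f @ Wq                      # (h*w, d_k) queries from image features
    K = p @ Wk                      # (s, d_k) keys from text tokens
    V = p @ Wv                      # (s, d_k) values from text tokens
    d_k = K.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (h*w, s)
    return attn @ V, attn

rng = np.random.default_rng(0)
hw, c, s, d, d_k = 16, 8, 5, 12, 8   # toy sizes (assumptions)
f = rng.normal(size=(hw, c))
p = rng.normal(size=(s, d))
Wq = rng.normal(size=(c, d_k))
Wk = rng.normal(size=(d, d_k))
Wv = rng.normal(size=(d, d_k))

# Only the key/value projections are marked trainable, mirroring the text.
trainable = {"Wq": False, "Wk": True, "Wv": True}

out, attn = cross_attention(f, p, Wq, Wk, Wv)
print(out.shape, attn.shape)
```

Because the text condition enters only through `K` and `V`, updating `Wk` and `Wv` changes how prompt tokens are mapped into the image features while the query path stays untouched.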

Unlike the basic models of Custom Diffusion, our approach incorporates a robust augmentation strategy. This involves significantly varying the size and position of training images within the overall dataset. Such resizing and repositioning augmentations grant greater geometric freedom, or action expressiveness, to the generated outputs. Additionally, this method helps to minimize potential artifacts during the region-specific denoising phases, enhancing the overall quality and accuracy of the generated images.

We can also incorporate Low-Rank Adaptation (LoRA) into our framework. When using LoRA-based adaptation, we fine-tune low-rank updates on the query, key, and value weights of the cross-attention layers. More specifically, we only fine-tune the low-rank residuals $\Delta W^{q}, \Delta W^{k}, \Delta W^{v}$ to obtain the new weights $W^{q\text{-}new}=W^{q}+\Delta W^{q}$, $W^{k\text{-}new}=W^{k}+\Delta W^{k}$, $W^{v\text{-}new}=W^{v}+\Delta W^{v}$. In our case, we used rank $r=4$.
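A minimal sketch of such a rank-4 update is given below. It assumes the standard LoRA parameterization of [9], $\Delta W = BA$ with $B$ initialized to zero; the text does not spell out this factorization, and the matrix sizes and random values are purely illustrative.

```python
import numpy as np

def lora_delta(B, A):
    """Low-rank update ΔW = B @ A, with B: (out, r) and A: (r, in)."""
    return B @ A

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 64, 4     # r = 4 matches the rank quoted in the text

W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable factor
B = np.zeros((d_out, r))                    # trainable, zero init so ΔW = 0

W_new = W + lora_delta(B, A)
assert np.allclose(W_new, W)                # unchanged before any training

# Pretend training has updated B (hypothetical values).
B = rng.normal(scale=0.01, size=(d_out, r))
delta = lora_delta(B, A)
W_new = W + delta

full_params = W.size                 # parameters touched by full fine-tuning
lora_params = A.size + B.size        # parameters touched by LoRA
print(np.linalg.matrix_rank(delta), full_params, lora_params)
```

For this toy layer the low-rank factors hold 384 values against 2048 in the full matrix, which is the source of the efficiency gain mentioned above.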

Details of Template Image Generation. In the template image generation process, we use Stable Diffusion [21] version $\geq 2.0$, as earlier versions often fail to generate images that contain multiple objects.

More specifically, when we use Stable Diffusion v2.1, we optionally use a guided generation process with a multi-concept guidance prompt such as $p_{mc}=$ “photo of two animals in the same background”, along with the target prompt (e.g., $p_{tg}=$ “photo of a dog and a cat playing with a ball, mountain background”). At each generation step, we sum the score outputs from the two prompts: $\epsilon=\epsilon_{\theta}(z_{t},t,p_{tg})+\lambda\epsilon_{\theta}(z_{t},t,p_{mc})$. When we use Stable Diffusion XL (SDXL), we do not use the multi-concept guidance prompt. In practice, we recommend using SDXL for high fidelity.
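The summed score can be sketched as a thin wrapper around a noise predictor. In the snippet below, `toy_eps_model` is a hypothetical stand-in for the denoiser $\epsilon_{\theta}$, and the value of `lam` is a placeholder, since the text does not state the weight used for this guidance.

```python
import numpy as np

def guided_eps(eps_model, z_t, t, p_tg, p_mc, lam):
    """ε = ε_θ(z_t, t, p_tg) + λ · ε_θ(z_t, t, p_mc)."""
    return eps_model(z_t, t, p_tg) + lam * eps_model(z_t, t, p_mc)

def toy_eps_model(z_t, t, prompt):
    """Hypothetical denoiser: a prompt-dependent shift so that the two
    branches give different predictions."""
    shift = (len(prompt) % 7) / 10.0
    return z_t * 0.1 + shift

z_t = np.ones((4, 8, 8))   # toy latent
p_tg = "photo of a dog and a cat playing with a ball, mountain background"
p_mc = "photo of two animals in the same background"
lam = 0.5                  # placeholder weight (assumption)

eps = guided_eps(toy_eps_model, z_t, t=10, p_tg=p_tg, p_mc=p_mc, lam=lam)
print(eps.shape)
```

Setting `lam` to zero recovers ordinary sampling with the target prompt alone, so the guidance prompt acts as an optional extra term.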

Details of Inversion and Feature Extraction. From the source image $x_{src}$, we generate the noisy latent $z_{T}$ with the DDIM [24] forward process:

$$z_{t+1}=\sqrt{\frac{\alpha_{t+1}}{\alpha_{t}}}\,z_{t}+\left(\sqrt{\frac{1-\alpha_{t+1}}{\alpha_{t+1}}}-\sqrt{\frac{1-\alpha_{t}}{\alpha_{t}}}\right)\epsilon_{\theta}(z_{t},t,p_{src}),$$

where we deterministically get the next-step latent $z_{t+1}$. Here $\alpha_{t}:=\prod_{i=1}^{t}(1-\beta_{i})$, and $\beta_{t}$ is the variance schedule. From the inverted latent $z_{T}$, we can accurately reconstruct the source image using a reverse DDIM process [24]:

$$z_{t-1}=\sqrt{\frac{\alpha_{t-1}}{\alpha_{t}}}\,z_{t}+\left(\sqrt{\frac{1-\alpha_{t-1}}{\alpha_{t-1}}}-\sqrt{\frac{1-\alpha_{t}}{\alpha_{t}}}\right)\epsilon_{\theta}(z_{t},t,p_{src}).$$

During the reverse reconstruction process, we extract the features $f_{t}^{l}$ from the U-Net’s $l$-th layer at each timestep $t$.
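The inversion and reconstruction loop can be sketched as below. The step is written in the predict-$x_0$ form of the standard deterministic DDIM update, which is an assumption about how one would implement it rather than the authors' code. `toy_eps` is a hypothetical constant denoiser, chosen so that the round trip is exact; with a real network the prediction changes between steps and the reconstruction is only approximate. The cached value per timestep is a placeholder for the U-Net features.

```python
import numpy as np

def ddim_step(z, eps, alpha_from, alpha_to):
    """Deterministic DDIM move between two cumulative noise levels."""
    x0 = (z - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    return np.sqrt(alpha_to) * x0 + np.sqrt(1.0 - alpha_to) * eps

def toy_eps(z, t):
    """Hypothetical denoiser returning a constant prediction."""
    return np.full_like(z, 0.3)

betas = np.linspace(1e-4, 2e-2, 50)              # toy variance schedule
# alphas[t] = prod_{i<=t} (1 - beta_i), with alphas[0] = 1 for the clean latent
alphas = np.concatenate([[1.0], np.cumprod(1.0 - betas)])
T = len(betas)

z0 = np.random.default_rng(0).normal(size=(4, 8, 8))

# Forward process (inversion): z_0 -> z_T
z = z0.copy()
for t in range(T):
    z = ddim_step(z, toy_eps(z, t), alphas[t], alphas[t + 1])
z_T = z.copy()

# Reverse process (reconstruction): z_T -> z_0, caching per-step features
features = {}
for t in range(T, 0, -1):
    eps = toy_eps(z, t)
    features[t] = eps.mean()     # placeholder for the layer features f_t^l
    z = ddim_step(z, eps, alphas[t], alphas[t - 1])

print(np.abs(z - z0).max())
```

Caching features during the reverse pass means they are indexed by the same timesteps that the later concept-fusion sampling visits.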

Details of Implementation. Instead of using a densely annotated mask, we use a dilated mask in which the mask region is expanded from the original area. We use a filter size of 21×21 for the mask dilation. For real concepts, we use the dilated masks as they are. When generating images that contain unreal concepts such as animated characters, we found that rectangular masks (e.g., the second row of Fig. 11) give better results.
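The two mask variants can be illustrated with a small NumPy sketch. It assumes a square structuring element with an odd side length and interprets the rectangular mask as the bounding box of the object mask; both helper names are hypothetical, and an image-processing library would normally be used for the dilation.

```python
import numpy as np

def dilate_mask(mask, size=21):
    """Binary dilation with a size x size square structuring element."""
    mask = np.asarray(mask, dtype=bool)
    pad = size // 2
    padded = np.pad(mask, pad, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.zeros_like(mask)
    # OR together every shifted copy covered by the square window.
    for dy in range(size):
        for dx in range(size):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def box_mask(mask):
    """Rectangular mask covering the bounding box of a non-empty mask."""
    ys, xs = np.nonzero(mask)
    out = np.zeros_like(mask, dtype=bool)
    out[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = True
    return out

mask = np.zeros((64, 64), dtype=bool)
mask[30:34, 30:34] = True              # a toy 4 x 4 object

dilated = dilate_mask(mask, size=21)
print(mask.sum(), dilated.sum())
```

A 21×21 window grows the mask by 10 pixels on every side, which leaves room around the object for the concept's appearance to differ slightly from the template's silhouette.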

For self-attention and residual layer feature injection, we only apply the injection at early timesteps. If the total number of sampling timesteps is $T$, we apply self-attention injection for $t>0.6T$ and residual layer injection for $t>0.5T$. For concept-free suppression, we use a weight of $\lambda=0.3$.
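The injection schedule amounts to two threshold checks per timestep, sketched below. The function name and the choice of 50 sampling steps are assumptions for illustration; the thresholds are the ones quoted above, expressed as integer percentages to avoid floating-point edge cases.

```python
def injection_flags(t, T, self_attn_pct=60, residual_pct=50):
    """Decide which features to inject at sampling timestep t.

    Sampling runs from t = T down to t = 1, so the "early" timesteps
    are the large ones.
    """
    return {
        "self_attention": 100 * t > self_attn_pct * T,  # t > 0.6 T
        "residual": 100 * t > residual_pct * T,         # t > 0.5 T
    }

T = 50  # hypothetical number of sampling steps
schedule = {t: injection_flags(t, T) for t in range(T, 0, -1)}
n_attn = sum(flags["self_attention"] for flags in schedule.values())
n_res = sum(flags["residual"] for flags in schedule.values())
print(n_attn, n_res)
```

With 50 steps, self-attention features are injected for the first 20 steps and residual features for the first 25, after which sampling proceeds without structural guidance.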

In our generation pipeline, we can filter out unsatisfactory samples at the mask generation step. If we cannot obtain proper concept-wise object masks in a template image, we filter out the image and use another template. We automatically drop a sample if the overlapping region of two extracted masks is over 90 percent. Also, we randomly selected the outputs to show from those with CLIP text-image similarity scores higher than 0.3. For a fair comparison, we applied the same filtering protocol to the Mix-of-show baseline. For the earlier methods, we only applied the CLIP-based filtering, as these methods suffer from severe concept missing.
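A sketch of the two filters follows. The text does not define how the overlap percentage is normalized, so measuring the intersection relative to the smaller mask is an assumption, as is combining the mask check and the CLIP check in a single helper; the function names are hypothetical.

```python
import numpy as np

def overlap_ratio(mask_a, mask_b):
    """Intersection area divided by the smaller mask area (assumed definition)."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    smaller = min(a.sum(), b.sum())
    if smaller == 0:
        return 1.0  # treat an empty mask as a failed extraction
    return float((a & b).sum() / smaller)

def keep_sample(mask_a, mask_b, clip_text_sim, max_overlap=0.9, min_clip=0.3):
    """Keep a sample only if the masks are distinct and the CLIP score is high enough."""
    distinct = overlap_ratio(mask_a, mask_b) <= max_overlap
    return distinct and clip_text_sim > min_clip

a = np.zeros((32, 32), dtype=bool)
a[4:12, 4:12] = True
b = np.zeros((32, 32), dtype=bool)
b[16:28, 16:28] = True

print(keep_sample(a, b, clip_text_sim=0.35))  # separate objects, good score
print(keep_sample(a, a, clip_text_sim=0.35))  # the two masks coincide
print(keep_sample(a, b, clip_text_sim=0.25))  # low text-image similarity
```

Normalizing by the smaller mask flags the case where one extracted object lies almost entirely inside the other, which is the situation in which the two concepts cannot be separated.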

Appendix B Further Comparison

We further compare the generation process of our proposed method with that of Mix-of-show. As both methods rely on region-wise guidance for multi-concept generation, we illustrate the difference between the two methods in Fig. 11. Our proposed method starts from generated template images and the object-wise segmented masks. With those conditions, we can translate the template images into concept-aware outputs. Mix-of-show instead relies on rectangular layout boxes and applies concept-wise sampling within each box region.

As observed in the figure, the output objects from Mix-of-show only approximately follow the spatial conditions of the given box regions, as the method is much more sensitive to initial noise conditions. In our case, as we start from template images, the output concepts accurately follow the mask regions.

We show a comparison with more generated samples in Fig. 12 and Fig. 13. For a fair comparison, we show the outputs filtered with the protocols described in our implementation details. For Mix-of-show, the generated concepts are properly placed in some samples, but in many cases the concept is not properly applied. Also, when we generate objects with complex actions or interactions (e.g., ‘kissing’, ‘riding a boat’), the outputs from Mix-of-show often fail to reflect the text conditions or suffer from mixing of the two concepts. Considering that the Mix-of-show baseline requires additional optimization for combining concept weights, our method is superior in both generation quality and flexibility.

For a more detailed comparison of perceptual quality, we show detailed user study results in Table 4. We conducted the user study on three different parts: background, human face, and real concepts. To evaluate the generation quality, we asked the users to score their preference with more detailed questions: 1) Inclusion of target background or human face concepts (Concept Match), 2) Realism of generated background or human faces (Realism). We also asked the same questions while showing the generated images of the real concepts. The results show that our proposed method outperforms our main baseline, Mix-of-show, in all categories.

| Method | Background C. Match ↑ | Background Realism ↑ | Human Face C. Match ↑ | Human Face Realism ↑ | Real Concept C. Match ↑ | Real Concept Realism ↑ |
|---|---|---|---|---|---|---|
| Mix-of-show | 3.83 | 4.08 | 2.52 | 3.04 | 3.67 | 3.75 |
| Ours | 4.29 | 4.46 | 4.34 | 4.05 | 4.58 | 4.42 |

Table 4: Human Preference Study. We assess three different categories: Background, Human Face, and Real concepts. We collected answers from 12 different users, each assessing 20 images.

Appendix C More Qualitative Results

To further show qualitative results on animated concepts and on concepts in the same category, we show the outputs in Fig. 14. Our method can generate multi-concept outputs even with animated characters. In the third row, we show the outputs with two concepts that are within the same category. Even when we use custom concepts of the same class, we can generate multi-concept-aware results without concept mixing. In Fig. 15, we show more qualitative results using Low-Rank adaptation for single-concept customization.

To test multi-concept personalized generation on local regions, we show the results of fusing multiple concepts on a single subject (e.g., a human) in Fig. 9. The results show that our proposed method works not only for multiple separate objects, but also for the local components of a single object, which further demonstrates its robustness.

Figure 9: Composing custom concepts into single object. We showcase a successful generation of custom local concepts.

Appendix D Details of Evaluation

For the image-alignment score calculation, since our generated images contain multiple concepts, we cannot use whole-image similarity scores. Instead, we extract concept-wise images using a text-guided segmentation model. For example, when evaluating images that contain ‘[c1] dog’ and ‘[c2] cat’, we run a segmentation model with the text prompts ‘dog’ and ‘cat’ to obtain segmented masks. We then crop the rectangular region that contains each segmented mask from the image and calculate the cosine similarity between the image embedding vectors of the extracted images and those of the concept (training) images. As the baseline methods often fail to generate all concepts, for a fair comparison we do not calculate the scores when a generated image fails to contain all foreground concept objects.
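The cropping and skipping logic can be sketched as follows, with the segmentation masks assumed to be given as boolean arrays. The helper names and the toy image are hypothetical, and the segmentation model and the CLIP embedding step are omitted.

```python
import numpy as np

def crop_to_mask(image, mask):
    """Crop the tightest rectangle containing the mask; None if it is empty."""
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

def concept_crops(image, masks):
    """Return one crop per concept, or None when any concept is missing,
    so that such images are skipped when scoring."""
    crops = {name: crop_to_mask(image, m) for name, m in masks.items()}
    if any(c is None for c in crops.values()):
        return None
    return crops

image = np.arange(10 * 10 * 3).reshape(10, 10, 3)   # toy 10 x 10 RGB image
dog = np.zeros((10, 10), dtype=bool)
dog[1:4, 2:6] = True
cat = np.zeros((10, 10), dtype=bool)
cat[6:9, 5:9] = True
empty = np.zeros((10, 10), dtype=bool)

crops = concept_crops(image, {"dog": dog, "cat": cat})
print({name: crop.shape for name, crop in crops.items()})
print(concept_crops(image, {"dog": dog, "cat": empty}))
```

Each crop would then be embedded and compared with the training images of its own concept only.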

For human preference evaluation, we collected opinions from 20 participants aged 20 to 49. We constructed 2 different survey sets, each of which contains 10 generated images per model and 10 questions. We use the generated outputs from Textual Inversion, Custom Diffusion, Perfusion, Mix-of-show, and our method. Therefore, each survey set contains 50 generated images. We divided the participants into two groups and gave each group a different survey set. An example of the survey form is shown in Fig. 16.

Figure 10: Failure Cases. If we use extremely complex or unrealistic text conditions, our method shows degraded generation performance.

Appendix E Limitations and Societal Impacts

Limitations. Although our method performs well in multi-concept generation, it still has limitations. Given extremely difficult or unrealistic text conditions, our method still shows limited text-alignment performance, as in Fig. 10. Since this problem comes from the limited performance of the pre-trained Stable Diffusion, we expect it to be alleviated by using improved diffusion model backbones.

Societal Impact. Since our method can synthesize realistic images of custom concepts, it can be maliciously abused if privacy-sensitive concepts are used. To prevent this, there should be a proper filtering system that checks whether a training concept is free from ethical issues.

Figure 11: Detailed Generation Outputs. We show the detailed generation process of our method and the baseline method. Our proposed method uses a template image and concept-wise mask conditions to generate accurate multi-concept images. The baseline Mix-of-show uses layout information for multi-concept generation.
Figure 12: Further Comparison with Mix-of-show. We show the comparison results with the baseline of Mix-of-show. Our method successfully generated the target concepts following the given text conditions while the baseline method suffers from concept mixing or misalignment with text conditions.
Figure 13: Further Comparison with Mix-of-show. We show the comparison results with the baseline of Mix-of-show. Our method successfully generated the target concepts following the given text conditions while the baseline method suffers from concept mixing or misalignment with text conditions.
Figure 14: More Qualitative Results. We show more results including animated concepts (1st and 2nd rows) and two concepts within the same category (3rd row).
Figure 15: More Qualitative Results on Low-Rank Adaptation. We show more generated outputs from our method using Low-Rank adaptation based fine-tuning.
Figure 16: Human Evaluation Example. We show an example question for human preference evaluation.

References

  • Couairon et al. [2023] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. In The Eleventh International Conference on Learning Representations, 2023.
  • Esser et al. [2021] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
  • Gal et al. [2022] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  • Gu et al. [2023] Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. arXiv preprint arXiv:2305.18292, 2023.
  • Han et al. [2023a] Ligong Han, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris Metaxas, and Feng Yang. Svdiff: Compact parameter space for diffusion fine-tuning. arXiv preprint arXiv:2303.11305, 2023a.
  • Han et al. [2023b] Ligong Han, Song Wen, Qi Chen, Zhixing Zhang, Kunpeng Song, Mengwei Ren, Ruijiang Gao, Anastasis Stathopoulos, Xiaoxiao He, Yuxiao Chen, Di Liu, Qilong Zhangli, Jindong Jiang, Zhaoyang Xia, Akash Srivastava, and Dimitris Metaxas. Improving tuning-free real image editing with proximal guidance, 2023b.
  • Hertz et al. [2023] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-or. Prompt-to-prompt image editing with cross-attention control. In The Eleventh International Conference on Learning Representations, 2023.
  • Ho and Salimans [2022] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • Hu et al. [2021] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  • Kumari et al. [2023] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.
  • Li et al. [2023a] Dongxu Li, Junnan Li, and Steven C. H. Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing, 2023a.
  • Li et al. [2023b] Yuheng Li, Haotian Liu, Yangming Wen, and Yong Jae Lee. Generate anything anywhere in any scene, 2023b.
  • Liu et al. [2023a] Nan Liu, Shuang Li, Yilun Du, Antonio Torralba, and Joshua B. Tenenbaum. Compositional visual generation with composable diffusion models, 2023a.
  • Liu et al. [2023b] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023b.
  • Mou et al. [2023] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
  • Park et al. [2023] Geon Yeong Park, Jeongsol Kim, Beomsu Kim, Sang Wan Lee, and Jong Chul Ye. Energy-based cross attention for bayesian context update in text-to-image diffusion models, 2023.
  • Qin et al. [2023] Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, et al. Unicontrol: A unified diffusion model for controllable visual generation in the wild. arXiv preprint arXiv:2305.11147, 2023.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • Ruiz et al. [2023] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  • Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • Tewel et al. [2023] Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. Key-locked rank one editing for text-to-image personalization, 2023.
  • Tumanyan et al. [2023] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1921–1930, 2023.
  • Voynov et al. [2023] Andrey Voynov, Qinghao Chu, Daniel Cohen-Or, and Kfir Aberman. $p+$: Extended textual conditioning in text-to-image generation. arXiv preprint arXiv:2303.09522, 2023.
  • Yu et al. [2023] Lili Yu, Bowen Shi, Ramakanth Pasunuru, Benjamin Muller, Olga Golovneva, Tianlu Wang, Arun Babu, Binh Tang, Brian Karrer, Shelly Sheynin, et al. Scaling autoregressive multi-modal models: Pretraining and instruction tuning. arXiv preprint arXiv:2309.02591, 2023.
  • Zhang et al. [2017] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks, 2017.
  • Zhang et al. [2023] Yuxin Zhang, Nisha Huang, Fan Tang, Haibin Huang, Chongyang Ma, Weiming Dong, and Changsheng Xu. Inversion-based style transfer with diffusion models, 2023.