
License: arXiv.org perpetual non-exclusive license
arXiv:2403.10983v1 [cs.CV] 16 Mar 2024


1Sun Yat-sen University, 2Tencent AI Lab, 3International Digital Economy Academy, 4Nan**g University, 5Harbin Institute of Technology, Shenzhen, 6Shenzhen University, 7Tencent, 8The Hong Kong University of Science and Technology
Homepage: https://kongzhecn.github.io/omg-project/
Code: https://github.com/kongzhecn/OMG/

OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models

Zhe Kong 1, Yong Zhang 2, Tianyu Yang 3, Tao Wang 4, Kaihao Zhang 5, Bizhu Wu 6, Guanying Chen 1, Wei Liu 7, Wenhan Luo 1,8
Abstract

Personalization is an important topic in text-to-image generation, especially the challenging multi-concept personalization. Current multi-concept methods struggle with identity preservation, occlusion, and the harmony between foreground and background. In this work, we propose OMG, an occlusion-friendly personalized generation framework designed to seamlessly integrate multiple concepts within a single image. We propose a novel two-stage sampling solution. The first stage takes charge of layout generation and the collection of visual comprehension information for handling occlusions. The second stage utilizes the acquired visual comprehension information and the designed noise blending to integrate multiple concepts while considering occlusions. We also observe that the initiation denoising timestep for noise blending is the key to identity preservation and layout. Moreover, our method can be combined with various single-concept models, such as LoRA and InstantID, without additional tuning. In particular, LoRA models on civitai.com can be exploited directly. Extensive experiments demonstrate that OMG exhibits superior performance in multi-concept personalization.

Keywords: Image Generation · Multi-Concept Customization · Diffusion Model
Figure 1: We present OMG, an occlusion-friendly method for multi-concept personalization with strong identity preservation and harmonious illumination. The visual examples are generated by using LoRA models downloaded from civitai.com.

1 Introduction

Personalized text-to-image generation is a promising path to realize identity-consistent story visualization. Numerous methods have been proposed for single-concept personalization, such as DreamBooth [33], Textual Inversion [12] and LoRA [22], showcasing their efficacy in achieving high-quality results. While excelling in single-concept personalization, these methods encounter challenges related to identity degradation when tasked with generating a single image encompassing multiple concepts, as shown in Fig. 2 (a).

Several multi-concept personalization methods have been proposed [25, 41, 28, 15], but they still encounter identity degradation problems when generating multiple concepts. Mix-of-show [16] can generate multiple concepts with realistic identities, but it cannot handle occlusion between concepts. Specifically, Mix-of-show [16] adopts a regionally controllable sampling method that injects region prompts at each timestep through region-aware cross-attention. Where concept regions occlude one another, the final prediction for the occluded regions is determined by a straightforward linear addition of the cross-attention results from multiple local sample regions. This simplistic approach leads to inaccurate predictions within the occluded regions, resulting in layout conflicts and identity degradation, as shown in Fig. 2 (b). Besides, there is disharmony between the foreground and background, leading to unnatural illumination in the image. Additionally, the methods in [16, 25] aim at merging two concepts into one diffusion model, which is computationally inefficient.

To address the aforementioned issues, we propose OMG, an occlusion-friendly personalized image generation framework designed to seamlessly integrate multiple concepts within a single image. Unlike other customization methods, our two-stage approach employs latent-level and attention-level layout control to tackle occlusion issues during multi-concept customization. The first stage generates an image with a coherent layout based on user-provided text prompts, without considering personalization. During this stage, additional visual comprehension information, such as attention maps and concept masks, is acquired. In the second stage, concepts are injected into specific regions by leveraging the preserved visual comprehension information. As illustrated in Fig. 2 (a), simultaneously generating two concepts in one image results in significant identity degradation. To address this limitation, we propose a concept noise blending strategy to merge multiple noises from different single-concept models during sampling. In each timestep, each single-concept model controls the generation of only one specific region, effectively mitigating identity degradation during the multi-concept sampling process. Additionally, we find that the disharmony problem can be solved by controlling the initiation timestep of concept noise blending. Differing from Custom Diffusion [25] and Mix-of-show [16], which require additional training or model optimization to merge multiple concepts into one model, the proposed OMG method can generate an image with multiple concepts directly by utilizing multiple single-concept models derived from the community (e.g., civitai.com) in a plug-and-play manner, without additional tuning. It is therefore computationally efficient and avoids the time-consuming merging step. Extensive experiments and comparisons with other methods demonstrate its superiority.
Our contributions are summarized as follows:

  • We propose a novel two-stage framework for multi-concept customization. Our approach can generate an occlusion-friendly personalized image with strong identity preservation and harmonious illumination.

  • We propose a Concept Noise Blending strategy to merge multiple noises from different single-concept models at both latent and attention levels. It mitigates identity degradation of the multi-concept generation and can be easily combined with different personalization frameworks such as LoRA or InstantID in a tuning-free plug-and-play manner.

  • Extensive evaluations demonstrate the effectiveness of our proposed method.

Figure 2: Existing methods face identity degradation and occlusion problems. (a) Given two text prompts with identifiers, “A $[v1]$ man” and “A $[v2]$ woman”, we generate 100 images for the two concepts separately (separate generation) and calculate the Identity Alignment between generated images and reference images. Subsequently, we employ another text prompt, “A $[v1]$ man and a $[v2]$ woman”, to randomly generate 100 images containing both concepts simultaneously (simultaneous generation) and calculate Identity Alignment. We find that the simultaneous generation of two concepts leads to a decline in Identity Alignment, i.e., identity degradation. (b) Given spatial conditions with occlusion between concepts, Mix-of-show [16] cannot generate an intact image and suffers from identity degradation.

2 Related Work

Text-to-Image (T2I) Synthesis. Text-to-image synthesis involves the task of generating realistic and diverse images from text prompts. Recently, diffusion models [21, 39] have demonstrated remarkable progress, attributed to large-scale training datasets like Laion-400M [36] and Conceptual-12M [7]. Several text-to-image models, including SDXL [50], Imagen [35], and DALL·E 3 [5], have shown significant performance improvements.

Single-Concept Customization. Early image personalization approaches focus on expanding or fine-tuning the language vision dictionary of T2I diffusion models to associate new concepts with a limited set of subjects, achieved through the fine-tuning of pre-trained T2I models. Optimization-based methods either fine-tune the diffusion model itself [10, 18, 19, 33, 34, 38, 6] or learn special textual embeddings [1, 12, 42, 43, 30, 48, 49] to describe the target concepts. To reduce the number of trainable parameters, recent work has adopted Low-Rank Adaptation (LoRA) methods [22, 40] in concept customization. Moreover, studies [2, 8, 14, 29, 37, 51, 45, 13, 44, 27, 47, 9, 50, 37] have recently explored training additional modules for mapping concepts to textual representations while keeping the core pre-trained T2I models frozen. This significantly expedites the personalization process. For instance, in InstantID [44], an IdentityNet is designed to integrate facial images with textual prompts, successfully steering image generation in various styles using just a single facial image.

Multi-Concept Customization. Existing methods conduct joint training on multi-concept datasets with additional losses or extra optimization efforts to merge multiple models. Several approaches [3, 17, 28, 15] employ cross-attention maps to avoid the entanglement of multiple concepts. Custom Diffusion [25] proposes joint training or constrained optimization of multiple models. Notably, Mix-of-show [16] introduces gradient fusion to minimize identity loss during concept fusion, along with regionally controllable sampling to address attribute binding in multi-concept personalization. Modular Customization [31] disentangles customization concepts into orthogonal directions, streamlining the integration of multiple fine-tuned concepts while preserving the integrity of each concept. The method in [46] employs subject embeddings from an image encoder to enhance generic text conditioning in diffusion models. This augmentation enables personalized image generation without additional training when facing new concepts.

In contrast to the aforementioned methods, our approach diverges by obviating the need for extensive pre-training of additional network models or the optimization required for merging multiple models. Through a simple modification of the sampling process, our method seamlessly integrates multiple concepts into a single image using multiple models, thereby eliminating the necessity for model merging or additional tuning. Furthermore, our method exhibits robust generalization and can be effortlessly combined with various single-concept methods, such as LoRA [22] and InstantID [44], in a plug-and-play manner.

3 Method

Figure 3: Overview of the proposed OMG, which contains two stages during sampling. The first stage takes charge of layout generation and the collection of visual comprehension information for handling occlusions. Leveraging the acquired information, the identities of concepts are injected during multi-concept personalized denoising in the second stage, using the proposed latent-level and attention-level noise blending.

We propose a two-stage multi-concept customization framework to integrate multiple concepts into a single image. Unlike previous works, the proposed method can address identity degradation, occlusion, time-consuming fusion, and illumination disharmony problems. The overall framework of our proposed paradigm is illustrated in Fig. 3, which contains two stages during sampling.

3.1 Preliminary

The latent diffusion model [21, 39, 32] belongs to a class of generative models containing a diffusion process and a reverse process in the latent space. In the diffusion process, an image $x$ is first projected to the latent space by an encoder $\mathcal{E}$: $z_0 = \mathcal{E}(x)$. Then random Gaussian noise is gradually added to the data sample $z_0$ to generate the noisy sample $z_t$ with a predefined noise schedule $\alpha_t$ at timestep $t$: $q(z_t | z_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t} z_0, (1 - \bar{\alpha}_t) I)$, where $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. In the reverse process, a U-Net $\varepsilon_\theta$ is trained to directly perform denoising in the latent space. The overall training objective is defined as

$$\mathcal{L} = E_{z_0, \epsilon, t} \, \| \epsilon - \varepsilon_\theta(z_t, t, c) \|_2^2, \tag{1}$$

where $c$ is the embedding of the conditional text prompt and $z_t$ is a noisy sample of $z_0$ at timestep $t$.
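To make the forward process concrete, the following minimal Python sketch draws $z_t$ from $q(z_t | z_0)$ in closed form and evaluates the squared-error term of Eq. (1) for a given noise prediction. The linear beta schedule, the flat list standing in for a latent, and the helper names `cumulative_alphas`, `q_sample`, and `noise_loss` are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def cumulative_alphas(alphas):
    """Return [abar_1, ..., abar_T] with abar_t = prod_{i<=t} alpha_i."""
    out, prod = [], 1.0
    for a in alphas:
        prod *= a
        out.append(prod)
    return out


def q_sample(z0, abar_t, rng):
    """Draw z_t ~ N(sqrt(abar_t) * z0, (1 - abar_t) * I) for a flat latent z0."""
    scale, sigma = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [scale * z + sigma * e for z, e in zip(z0, eps)]
    return zt, eps


def noise_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise (one sample of Eq. (1))."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


# Hypothetical linear beta schedule; alpha_t = 1 - beta_t.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars = cumulative_alphas([1.0 - b for b in betas])

rng = random.Random(0)
z0 = [0.5, -1.0, 2.0]
zt, eps = q_sample(z0, alpha_bars[T // 2], rng)
```

In training, `eps_pred` would come from the U-Net $\varepsilon_\theta(z_t, t, c)$; here it is left as an input so the sketch stays self-contained.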

3.2 Stage 1: Visual Comprehension Information Preparation

Existing methods, such as Mix-of-show [16], encounter layout conflict challenges. As depicted in Fig. 2 (b), when the regions of two concepts occlude, [16] is incapable of generating an image with a coherent layout, resulting in identity degradation and compromise of the concept’s integrity. Given that cross-attention layers are effective in controlling the spatial layout and appearance [20], the modification of pixel-to-text interaction within these layers allows for preserving the content and spatial layout of the original image while adhering to the target prompt. By selectively modifying predefined regions in an image using a unique identifier, while maintaining the content and structure of other regions, we can effectively mitigate the challenge of concept occlusion. Hence, the first stage aims to acquire visual comprehension information for multi-concept customization.

Figure 4: Overview of the Multi-concept Personalized Denoising. This stage utilizes the acquired visual comprehension information and the designed concept noise blending method to integrate multiple concepts while considering occlusions.

As illustrated in Fig. 3 (a), a textual prompt $p$ describing multiple objects of an image is the input of a T2I model. It is imperative to emphasize that the text prompt $p$ exclusively contains the class name (e.g., “man” or “woman”), deliberately excluding the introduction of the unique identifier (e.g., “$[v]$ man” or “$[v]$ woman”) at this point. Consequently, a non-customized image $x_{ncus}$ with a coherent layout is generated through

$$x_{ncus} = T2I(p). \tag{2}$$

We employ the publicly available SDXL model as our T2I model. The denoising U-Net is composed of self-attention layers followed by cross-attention layers. In the denoising process, the fusion of embeddings from visual and text features occurs through cross-attention layers, generating cross-attention maps for each textual token in the U-Net. The cross-attention map $A$ is calculated as

$$A = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d}}\right). \tag{3}$$

Here, $Q$ represents a query matrix projection of intermediate features $\varphi(z_t)$, and $K$ is a key matrix projection of text tokens $\phi(p)$, obtained through two learnable linear projections $W_Q$ and $W_K$, respectively. $Q$ and $K$ are defined as

$$Q = W_Q \cdot \varphi(z_t), \quad K = W_K \cdot \phi(p). \tag{4}$$

At each denoising step $t$, following the input of $p$ to the T2I model, the cross-attention maps $A_t$, comprising $N$ attention layers with corresponding spatial attention maps $\{A_t^1, A_t^2, \cdots, A_t^N\}$, are acquired. It is imperative to retain all these obtained attention maps for identity injection in the second stage.
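As a rough illustration of Eqs. (3) and (4) and of the bookkeeping described above, the sketch below computes a cross-attention map from toy features and stores one map per (timestep, layer) pair. It uses plain Python lists and a row-vector convention (features times projection matrix); the helper names, the toy inputs, and the dictionary keyed by `(t, layer)` are assumptions made for illustration, not the authors' implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        mx = max(row)
        ex = [math.exp(v - mx) for v in row]
        total = sum(ex)
        out.append([e / total for e in ex])
    return out


def cross_attention_map(feats, tokens, w_q, w_k):
    """A = Softmax(Q K^T / sqrt(d)) with Q = feats @ W_Q and K = tokens @ W_K."""
    q = matmul(feats, w_q)    # (spatial positions, d)
    k = matmul(tokens, w_k)   # (text tokens, d)
    d = len(q[0])
    scores = matmul(q, transpose(k))
    return softmax_rows([[s / math.sqrt(d) for s in row] for row in scores])


# Toy inputs: 2 spatial positions, 3 text tokens, identity projections.
feats = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

# Store the map of every layer at every timestep for reuse in the second stage.
stored_maps = {}
for t in (2, 1):
    for layer in range(3):
        stored_maps[(t, layer)] = cross_attention_map(feats, tokens, identity, identity)
```

Each row of a stored map is a distribution over text tokens for one spatial position, which is what later allows the layout to be reproduced.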

To prepare for concept noise blending, it is necessary to locate the regions to be modified in $x_{ncus}$. Relying on the robust image understanding capabilities of visual comprehension models [24], we can derive concept masks $M$. By inputting both the generated image $x_{ncus}$ and the class names (e.g., “man” or “woman”) from $p$, concept masks $\{M_1, M_2, \cdots, M_k\}$ corresponding to the $k$ classes can be derived.

3.3 Stage 2: Multi-concept Personalized Denoising

Upon obtaining a non-customized image $x_{ncus}$ with acquired visual comprehension information, we inject the identity of concepts in the second stage. In previous works, such as [20], image editing is achieved by injecting the input text with an edit text prompt. Personalized multi-concept generation could adopt a similar approach by triggering concept generation through the identifiers in text prompts. However, it may face two drawbacks. First, making a text prompt capable of generating multiple concepts necessitates merging multiple single-concept models into one, as in [16, 31], which requires additional network optimization and is inherently time-consuming. Second, as illustrated in Fig. 2 (a), employing a single prompt for multi-concept generation often results in identity degradation. In contrast, we propose a Concept Noise Blending strategy to address the aforementioned issues. The overall architecture of the multi-concept personalized denoising is depicted in Fig. 4.

Figure 5: Effect of the initiation timestep for concept noise blending. The initiation timestep for concept noise blending influences both the image layout and illumination. When the initiation timestep is 0, there is no concept noise blending operation during sampling, resulting in the same generation result for both stages.

Concept Noise Blending. To mitigate the additional optimization costs associated with network merging, the proposed concept noise blending method directly leverages multiple single-concept models during inference, circumventing the need for network merging. Moreover, each single-concept model is solely responsible for generating a specific concept, effectively addressing the challenge of identity degradation.

During the multi-concept personalized denoising, the input global text prompt $p$ and the initial noise remain the same as in the first stage. In the second stage, the objective is to generate a customized image containing multiple concepts by leveraging the acquired visual comprehension information. Suppose we aim to generate an image $x_{cus}$ containing $k$ concepts $\{C^1, C^2, \cdots, C^k\}$. Let $T2I_c^i$ represent the $i$-th single-concept model designed to generate the concept $C^i$ through the concept text prompt $p^i$. The prompt $p^i$ encapsulates a special identifier that can be input to $T2I_c^i$ for generating concept $C^i$. At timestep $t$, given the text prompt $p^i$ of concept $i$, the corresponding predicted noise $C_{t-1}^i$ is obtained through

$$C_{t-1}^{i} = T2I_c^{i}(z_t, p^{i}, t). \tag{5}$$

Additionally, the $T2I$ model is the same as in the first stage. By inputting the global text prompt $p$ at timestep $t$, the corresponding global output $z'_{t-1}$ is obtained through the $T2I$ model with occlusion layout preservation. The generated $z'_{t-1}$ represents a non-customized noise. To inject the identity of the concept $C^i$ into $z'_{t-1}$, specific regions in $z'_{t-1}$ are overwritten with the corresponding concept noise $C_{t-1}^i$ based on the concept masks $M$ through

$$z_{t-1} = \left(1 - \bigcup_{i=1}^{k} M_i\right) * z'_{t-1} + \sum_{i=1}^{k} M_i * C_{t-1}^{i}, \tag{6}$$

where $M_i$ denotes the mask for concept $C^i$. Through noise-level concept blending, the identity of concepts can be injected into one noise at each timestep.
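The blending rule of Eq. (6) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: latents are flattened into plain lists, masks are binary and do not overlap, and the function name `blend_concept_noise` is hypothetical. It is not the authors' implementation.

```python
def blend_concept_noise(z_global, concept_noises, masks):
    """Blend per-concept predictions into the global prediction (Eq. (6)).

    z_global:       flat list, the non-customized output z'_{t-1}
    concept_noises: list of flat lists, one prediction C^i_{t-1} per concept
    masks:          list of flat binary lists M_i (assumed non-overlapping)
    """
    blended = []
    for idx, zg in enumerate(z_global):
        m_vals = [m[idx] for m in masks]
        union = 1 if any(m_vals) else 0          # union of all concept masks
        inside = sum(m * c[idx] for m, c in zip(m_vals, concept_noises))
        blended.append((1 - union) * zg + inside)
    return blended


# Toy example with four latent positions and two concepts.
z_global = [0.1, 0.2, 0.3, 0.4]
c1 = [1.0, 1.0, 1.0, 1.0]   # prediction of the first single-concept model
c2 = [2.0, 2.0, 2.0, 2.0]   # prediction of the second single-concept model
m1 = [1, 0, 0, 0]           # region of concept 1
m2 = [0, 0, 1, 1]           # region of concept 2

z_next = blend_concept_noise(z_global, [c1, c2], [m1, m2])
print(z_next)  # [1.0, 0.2, 2.0, 2.0]
```

Position 1 lies outside every mask and therefore keeps the global value, while the remaining positions take the value predicted by the model that owns the region.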

MultiDiffusion [4] similarly incorporates noise fusion during sampling, by binding together multiple diffusion generation processes with a shared set of parameters or constraints to generate high-quality and diverse images that adhere to user-provided controls. In contrast, the proposed Concept Noise Blending does not necessitate multiple crops. Instead, different regions are calculated by distinct models. Ultimately, the results from various regions are fused based on the concept mask, eliminating the need for additional optimization steps.

Figure 6: Comparison of OMG with other methods on the single-concept customization. In both character customization and object customization, OMG exhibits superior identity alignment with reference images when compared to other methods.

Occlusion Layout Preservation. The first stage yields a non-customized image $x_{ncus}$ with a coherent layout. In the second stage, although the global prompt and initial noise are identical to those in the first stage, the generated noises at each timestep are completely different due to Concept Noise Blending. We utilize the cross-attention maps $A$ stored in the first stage to keep the layout of the generated image consistent with $x_{ncus}$. This operation ensures the production of an occlusion-friendly multi-concept customized image.

In each timestep, we ensure that the layout is preserved in the generated image by modifying the cross-attention maps within the U-Net during the $T2I$ model sampling. For instance, at timestep $t$, $z_t$ is fed into the $T2I$ model alongside the global prompt $p$ and timestep $t$. Cross-attention maps play a crucial role in controlling the structure and geometry of an image. To maintain an occlusion-friendly layout, we overwrite the generated attention map in each timestep within the U-Net with the stored maps. This process can be formulated as

$$z'_{t-1} = T2I(z_t, p, t)\{A^{g}_{t} \leftarrow A_t\}, \tag{7}$$

where $A^{g}_{t}$ denotes the generated attention map in the second stage of the $T2I$ model and $A_t$ denotes the stored attention map from the first stage.
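The overwrite in Eq. (7) can be pictured with a toy single-head cross-attention layer that accepts an optional stored map. The layer below is a simplified stand-in written in plain Python; the function name `cross_attention`, the `stored_map` argument, and the toy inputs are assumptions for illustration and do not reflect how the actual U-Net layers are patched.

```python
import math


def softmax(row):
    mx = max(row)
    ex = [math.exp(v - mx) for v in row]
    total = sum(ex)
    return [e / total for e in ex]


def cross_attention(queries, keys, values, stored_map=None):
    """Toy single-head cross-attention.

    If stored_map is given, the freshly generated map A^g_t is discarded and
    the stored first-stage map A_t is used instead (A^g_t <- A_t).
    """
    d = len(keys[0])
    generated = [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d) for k_row in keys])
        for q_row in queries
    ]
    attn = stored_map if stored_map is not None else generated
    out = [
        [sum(a * v[j] for a, v in zip(row, values)) for j in range(len(values[0]))]
        for row in attn
    ]
    return out, attn


# Keys and values come from the same global prompt p in both stages.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

# Stage 1: compute and store the attention map.
out_stage1, a_t = cross_attention([[1.0, 0.0]], keys, values)

# Stage 2: the queries differ because the latent was changed by noise blending,
# but the stored map is reused, so the layout-defining attention is unchanged.
out_free, a_free = cross_attention([[0.0, 1.0]], keys, values)
out_kept, a_kept = cross_attention([[0.0, 1.0]], keys, values, stored_map=a_t)
```

In this toy setting `out_kept` matches `out_stage1` exactly, whereas `out_free` does not, which is the sense in which the stored maps pin the layout.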

3.4 Denoising Timestep of Concept Noise Blending

The initiation timestep for concept noise blending holds significant influence over both the image layout and illumination of the generated image. To elucidate the impact of different concept noise blending starting points, we present generated images at various timesteps in Fig. 5. Leveraging DDIM, the series comprises a total of 50 steps, with the leftmost image representing the outcome when Concept Noise Blending begins at step 50, indicating that concept noise blending operations are active throughout the entire sampling process. The rightmost image represents the result starting from step 0, indicating that no concept noise blending operations occur during the entire sampling process. Hence, when the concept noise blending operation starts at timestep 0, the generated image is the same as in stage one. The intermediary images illustrate the progressive steps from 50 to 0.

Commencing concept noise blending at an early step may introduce layout conflicts in the composition and shape of objects within the generated image. However, with the increase in concept noise blending steps, the content information becomes more coherent and stable, effectively preserving the identity of the object. As the concept noise blending starting step approaches 0, the identity of the character diminishes gradually, resulting in a synthesized image resembling that of the first stage. This highlights that the early stage of sampling governs the image layout, while the identity of concepts unfolds in later timesteps.

Moreover, we observe that the illumination disharmony between the foreground and background is notable in the earlier steps. With increasing timesteps, the illumination gradually becomes consistent, suggesting a potential association between illumination and image layout information.

4 Experiments

Figure 7: Comparison with InstantID [44] in single-concept customization. OMG emerges as the superior method by generating images with more natural colors. This showcases the prowess of OMG over InstantID in the context of single-concept customization.

Figure 8: Comparison of OMG with other methods on the same spatial condition on multi-concept customization. To make a fair comparison, all the comparison methods utilize the same spatial condition in each row. The proposed OMG can achieve the best performance in identity preservation in multi-concept customization.

Figure 9: Comparison of OMG with InstantID [44] in multi-concept customization. OMG stands out by generating images with enhanced realism, characterized by a more extensive and vibrant color spectrum.

Figure 10: Qualitative ablation study of OMG. (a) Generating images with layout preservation can preserve reasonable structure and enhance realism in the generated images. (b) Concept Noise Blending can generate images with a more coherent image layout and harmonious illumination. (c) The proposed OMG can achieve multi-concept customization with an increasing number of concepts.

Datasets. To evaluate OMG, we collect a dataset that encompasses 15 distinct concepts. This dataset comprises 7 real-world characters, 3 anime characters, and 5 real-world objects, all annotated automatically by BLIP-2 [26].

Experimental Setup. We implement OMG employing the SDXL model [50]. The multi-concept customization approach we propose can be seamlessly combined with various single-concept customization methods, such as LoRA [22] and InstantID [44]. For LoRA [22], we integrate the LoRA layer into the linear layer in all attention modules of the text encoder and UNet, with a rank of 256. We use the Adafactor optimizer with a constant learning rate for all experiments, setting the learning rate for the text encoder to $3\times10^{-5}$ and for the UNet to $3\times10^{-3}$. Single-concept fine-tuning requires approximately 2 hours on one A100 GPU. Regarding InstantID [44], we leverage the officially provided pre-trained Image Adapter model and IdentityNet model. We utilize the Antelopev2 model for face detection and face ID embedding extraction. When combining the proposed method with InstantID for multi-concept customization, only forward inference is needed during concept image generation, without any additional training.
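As a reminder of what attaching a LoRA layer to a linear layer means, the following sketch implements the standard low-rank update from [22] in NumPy. The class name, the layer width of 1024, the initialization, and the `alpha` scaling are illustrative assumptions; only the rank of 256 comes from the setup above, and this is not the training code used in the experiments.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W + scale * B @ A."""

    def __init__(self, weight, rank, alpha=None, seed=0):
        out_features, in_features = weight.shape
        rng = np.random.default_rng(seed)
        self.weight = weight                                       # frozen base weight
        self.A = rng.normal(scale=0.01, size=(rank, in_features))  # down-projection
        self.B = np.zeros((out_features, rank))                    # up-projection, zero init
        self.scale = (rank if alpha is None else alpha) / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.normal(size=(1024, 1024))   # hypothetical attention projection weight
layer = LoRALinear(W, rank=256)     # rank 256 as in the experimental setup
x = rng.normal(size=(2, 1024))
y = layer(x)                        # equals x @ W.T until B is trained
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `A` and `B` would receive gradient updates during fine-tuning.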

Evaluation Metrics. Following [16], we evaluate our method using Image Alignment, which measures the visual similarity of generated images with the target concept using similarity in the CLIP image feature space [23]. Additionally, we adopt Text Alignment, which measures the similarity of generated images with given prompts using text-image similarity in the CLIP feature space [23]. However, for face images, Image Alignment may not accurately evaluate the similarity between the generated face and the real face. To address this, we use Identity Alignment to further illustrate the identity-preserving capabilities by measuring the ArcFace score [11] at which the target human identity is detected in a set of generated images. Consequently, we adopt Text Alignment and Image Alignment for objects, and for characters, Text Alignment and Identity Alignment are employed to measure the performance of methods.
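All three metrics reduce to a similarity between feature vectors. The sketch below assumes the score is the mean pairwise cosine similarity between generated-image features and reference features (or prompt features for Text Alignment). The feature extractors themselves (CLIP, ArcFace) are replaced by random vectors here, and the function name `alignment_score` and the 512-dimensional feature size are hypothetical.

```python
import numpy as np

def normalize(features):
    # Scale each feature vector to unit length.
    return features / np.linalg.norm(features, axis=-1, keepdims=True)

def alignment_score(gen_features, ref_features):
    """Mean pairwise cosine similarity between two sets of feature vectors."""
    similarities = normalize(gen_features) @ normalize(ref_features).T
    return float(similarities.mean())

rng = np.random.default_rng(0)
gen_features = rng.normal(size=(50, 512))  # stand-in for features of generated images
ref_features = rng.normal(size=(5, 512))   # stand-in for features of reference images
score = alignment_score(gen_features, ref_features)
```

Swapping in image features, text features, or face embeddings for the two arguments gives the Image Alignment, Text Alignment, and Identity Alignment variants under this assumption.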

Table 1: Quantitative comparison on single- and multi-concept personalization. OMG achieves state-of-the-art performance in single-concept customization and achieves better identity preservation than other methods in multi-concept customization. Columns are grouped by concept type (Character, Object) and metric; each metric reports Single, Multiple, and Δ.

| Method | Char. Text: Single | Char. Text: Multiple | Char. Text: Δ | Char. Identity: Single | Char. Identity: Multiple | Char. Identity: Δ | Obj. Text: Single | Obj. Text: Multiple | Obj. Text: Δ | Obj. Image: Single | Obj. Image: Multiple | Obj. Image: Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DreamBooth [33] | 0.677 | 0.658 | -0.019 | 0.456 | 0.480 | 0.025 | 0.713 | 0.717 | 0.004 | 0.805 | 0.800 | -0.005 |
| Textual Inversion [12] | 0.673 | 0.673 | 0.000 | 0.292 | 0.294 | 0.002 | 0.693 | 0.697 | 0.004 | 0.784 | 0.781 | -0.003 |
| Custom Diffusion [25] | 0.629 | 0.704 | 0.075 | 0.370 | 0.322 | -0.048 | 0.695 | 0.755 | 0.060 | 0.840 | 0.778 | -0.061 |
| Mix-of-show [16] | 0.675 | 0.639 | -0.036 | 0.422 | 0.436 | 0.015 | 0.724 | 0.731 | 0.007 | 0.791 | 0.780 | -0.011 |
| OMG (Ours) | 0.693 | 0.696 | 0.003 | 0.514 | 0.510 | -0.004 | 0.730 | 0.762 | 0.032 | 0.842 | 0.810 | -0.033 |
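The Δ columns in Table 1 are consistent with Δ = Multiple − Single. The short check below uses the Character Identity Alignment values from the table. The 0.001 tolerance is an assumption that accounts for Δ presumably having been computed from unrounded scores.

```python
# (Single, Multiple, reported delta) for Character Identity Alignment in Table 1.
identity_rows = {
    "DreamBooth": (0.456, 0.480, 0.025),
    "Textual Inversion": (0.292, 0.294, 0.002),
    "Custom Diffusion": (0.370, 0.322, -0.048),
    "Mix-of-show": (0.422, 0.436, 0.015),
    "OMG": (0.514, 0.510, -0.004),
}

# Largest gap between (Multiple - Single) and the reported delta.
max_gap = max(
    abs((multiple - single) - reported)
    for single, multiple, reported in identity_rows.values()
)

# Method whose identity score changes least when moving to multiple concepts.
most_stable = min(identity_rows, key=lambda name: abs(identity_rows[name][2]))
```

A Δ close to zero means the score is nearly unchanged when moving from single-concept to multi-concept generation.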

4.1 Quantitative Comparison

We compare OMG with several concept customization methods, including DreamBooth [33], Textual Inversion [12], InstantID [44], Custom Diffusion [25], and Mix-of-show [16]. All the methods except InstantID are training-based customization requiring multiple reference images. In contrast, InstantID [44] achieves personalized generation with just one reference image.

Following Custom Diffusion [25], we use 20 text prompts and 50 samples per prompt for each concept, giving a total of 1000 generated images. For a fair comparison, all methods use DDPM sampling with 50 steps and the same classifier-free guidance setting. Our evaluation spans various categories of concepts, including characters and objects. We use a single-concept tuned model to assess the identity-preserving effect of our method through a set of prompts. The experimental results for both single-concept and multi-concept settings are detailed in Tab. 1.

For single-concept customization, we achieve the best results in Text Alignment, Image Alignment, and Identity Alignment for both characters and objects. We adopt LoRA for single-concept fine-tuning, which demonstrates the effectiveness of LoRA in capturing the identity of complex concepts. For multi-concept customization, the proposed method achieves the best alignment with the input images for objects. For characters, our method performs better on Identity Alignment than other methods, which demonstrates its advantage in identity preservation.

We also compare the proposed method with InstantID [44]. Notably, InstantID [44] achieves image customization with only a single reference image at inference, while ours leverages multiple reference images for fine-tuning. To ensure an equitable comparison, we give InstantID the same number of reference images as our approach and use the mean of their ID embeddings as the image prompt. Under this setting, our method achieves a Text Alignment score of 0.692 and an Identity Alignment score of 0.500. InstantID performs better, with a Text Alignment score of 0.698 and an Identity Alignment score of 0.534, benefiting from fine-tuning on ample facial data. It is notable that our method, in contrast, has not undergone fine-tuning on such extensive datasets.
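Averaging ID embeddings over several reference images can be sketched as below. The 512-dimensional embedding size, the function name `mean_id_embedding`, and the optional renormalization to unit length are assumptions; the text above only states that the mean of the ID embeddings is used as the image prompt.

```python
import numpy as np

def mean_id_embedding(embeddings, renormalize=True):
    """Average per-image ID embeddings into a single identity prompt vector."""
    mean = np.asarray(embeddings, dtype=float).mean(axis=0)
    if renormalize:
        # Assumption: rescale to unit length, as is common for face embeddings.
        mean = mean / np.linalg.norm(mean)
    return mean

rng = np.random.default_rng(0)
reference_embeddings = rng.normal(size=(6, 512))  # stand-in for face ID embeddings
id_prompt = mean_id_embedding(reference_embeddings)
```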

Table 2: Quantitative comparison on multi-concept personalization (Identity Alignment). OMG exhibits favorable generative effects on both characters simultaneously.

| Method | Character 1 | Character 2 | Average |
|---|---|---|---|
| DreamBooth [33] | 0.136 | 0.493 | 0.315 |
| Textual Inversion [12] | 0.023 | 0.228 | 0.126 |
| Custom Diffusion [25] | 0.098 | 0.252 | 0.175 |
| Mix-of-show [16] | 0.357 | 0.257 | 0.307 |
| OMG (Ours) | 0.382 | 0.478 | 0.430 |

The quantitative results for multi-concept personalization shown in Tab. 1 mainly measure how well multiple single-concept models are fused during multi-concept customization. They cannot reflect the quality of generating multiple concepts simultaneously. Therefore, to measure the generation quality when different methods generate multiple concepts in one image, we propose a new evaluation protocol. For a fair comparison, we use the same spatial condition to generate two characters simultaneously, one male and one female. We locate the region of each character in the image through visual comprehension, then calculate the Identity Alignment score of each character against its corresponding reference images separately. This protocol measures the effect of generating multiple concepts simultaneously more directly. The experimental results are shown in Tab. 2. Some comparison methods achieve good Identity Alignment for one character but perform poorly on the other. Our method achieves satisfactory results on both characters simultaneously and obtains the best average Identity Alignment. This demonstrates the effectiveness of our method in multi-concept generation.
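The per-character protocol can be sketched as follows, assuming each character's region is given as a binary mask and that Identity Alignment is a cosine similarity between an embedding of the masked region and a reference embedding. The toy embedding (a per-channel color sum), the function names, and the 4×4 image are hypothetical stand-ins for a face-recognition model and real generations.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def per_character_alignment(image, masks, references, embed):
    """Score each character region against its own reference, then average."""
    scores = []
    for mask, reference in zip(masks, references):
        region = image * mask[..., None]  # zero out pixels outside the region
        scores.append(cosine(embed(region), reference))
    return scores, sum(scores) / len(scores)

def toy_embed(region):
    # Hypothetical stand-in for a face embedding: summed color per channel.
    return region.reshape(-1, 3).sum(axis=0)

# Toy image: left half red (character 1), right half blue (character 2).
image = np.zeros((4, 4, 3))
image[:, :2, 0] = 1.0
image[:, 2:, 2] = 1.0
mask_left = np.zeros((4, 4), dtype=bool)
mask_left[:, :2] = True
mask_right = ~mask_left

references = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 1.0])]
scores, average = per_character_alignment(
    image, [mask_left, mask_right], references, toy_embed
)
```

Scoring each region separately exposes methods that reproduce one identity well but lose the other, which a single whole-image score would hide. The Average column in Tab. 2 is the mean of the two per-character scores; for OMG, (0.382 + 0.478) / 2 = 0.430.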

4.2 Qualitative Comparison

4.2.1 Single-Concept Results.

The efficacy of our method in preserving identity is demonstrated through a comparison of single-concept generation representing different identities. As previously mentioned, each concept undergoes individual fine-tuning. The experimental results are presented in Fig. 6. Each column corresponds to images sampled from the same model, representing two distinct concept identities. In both character customization and object customization, our method exhibits superior identity alignment with reference images when compared to other methods. The text prompts can be found in the supplement.

Fig. 7 illustrates the results of single-concept customization compared to InstantID [44]. Our method stands out by generating higher-quality images, underscoring its visual superiority over InstantID [44] in single-concept customization.

4.2.2 Multi-Concept Results.

We conduct a comprehensive comparison with other methods in multi-concept customization. Because Mix-of-show [16] requires additional spatial conditions, we apply identical spatial condition controls to all compared methods for a fair comparison. The experimental results are illustrated in Fig. 8. Mix-of-show [16] generates images with layout conflicts, leading to object loss and identity degradation. Notably, DreamBooth [33], Textual Inversion [12], and Custom Diffusion [25] exhibit limitations in generating images with realistic identity preservation. In contrast, our proposed method demonstrates robust identity preservation for each character in multi-concept generation, substantiating its efficacy in multi-concept customization.

Furthermore, we conduct a comparative analysis between the proposed method and InstantID [44]. To facilitate this comparison, we substitute the single-concept model in our approach with InstantID and juxtapose the two methods. The experimental findings are shown in Fig. 9. Our method produces images with enhanced realism and more natural faces. This substantiates the superior performance of our method in multi-concept customization.

4.3 Ablation Study

To assess the effectiveness of various components within OMG, we conduct an ablation study encompassing the following elements: Layout Preservation, Concept Noise Blending, and Different Numbers of Concepts.

Layout Preservation. We present the ablation results of layout preservation in Fig. 10 (a). The left image showcases the generated image in the first stage. The other two images illustrate the generated image with and without layout preservation, respectively. By replacing the attention maps generated in the second stage with those stored from the first stage, the layout of the image is well preserved. The inclusion of layout preservation contributes to the generation of a more reasonable structure, highlighting the effectiveness of layout preservation in enhancing the overall quality.

Concept Noise Blending. Subsequently, we compare different sampling strategies, specifically regionally controllable sampling [16] and the proposed concept noise blending. Given that regionally controllable sampling necessitates additional spatial conditions, we ensure a fair comparison by providing the same poses for both methods. Experimental outcomes are shown in Fig. 10 (b). With regionally controllable sampling, occluded regions of two concepts may lead to missing concepts or a disorderly image layout in the generated image. In contrast, concept noise blending remains effective when multiple concepts are occluded. Furthermore, our method yields images with more harmonious illumination between the foreground and the background, resulting in a more realistic portrayal.

Different Numbers of Concepts. We also assess the robustness by increasing the number of concepts. As depicted in Fig. 10 (c), we showcase the generation effects when the number of concepts varies from 1 to 5. Notably, even with an escalation in the number of concepts, our method consistently preserves the identity of each concept. This substantiates the efficacy of our method in generating a diverse array of concepts while maintaining identity integrity.

5 Conclusion

We introduce OMG, a personalized generation framework for handling occlusion challenges in the context of generating realistic images for multiple concepts. Leveraging an image editing framework, our method specifically addresses the occlusion problem prevalent in multi-concept generation. The proposed concept noise blending further mitigates identity degradation issues. Experimental results showcase OMG’s ability to successfully generate high-quality images even when concepts experience occlusion. Additionally, our method seamlessly integrates with various single-concept customization models without additional training, enhancing its versatility and practicality.

References

  • [1] Alaluf, Y., Richardson, E., Metzer, G., Cohen-Or, D.: A neural space-time representation for text-to-image personalization. ACM TOG 42(6), 1–10 (2023)
  • [2] Arar, M., Gal, R., Atzmon, Y., Chechik, G., Cohen-Or, D., Shamir, A., H. Bermano, A.: Domain-agnostic tuning-encoder for fast personalization of text-to-image models. In: SIGGRAPH Asia 2023 Conference Papers. pp. 1–10 (2023)
  • [3] Avrahami, O., Aberman, K., Fried, O., Cohen-Or, D., Lischinski, D.: Break-a-scene: Extracting multiple concepts from a single image. arXiv preprint arXiv:2305.16311 (2023)
  • [4] Bar-Tal, O., Yariv, L., Lipman, Y., Dekel, T.: Multidiffusion: Fusing diffusion paths for controlled image generation. arXiv preprint arXiv:2302.08113 (2023)
  • [5] Betker, J., Goh, G., Jing, L., Brooks, T., Wang, J., Li, L., Ouyang, L., Zhuang, J., Lee, J., Guo, Y., et al.: Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf 2(3) (2023)
  • [6] Chae, D., Park, N., Kim, J., Lee, K.: Instructbooth: Instruction-following personalized text-to-image generation. arXiv preprint arXiv:2312.03011 (2023)
  • [7] Changpinyo, S., Sharma, P., Ding, N., Soricut, R.: Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3558–3568 (2021)
  • [8] Chen, W., Hu, H., Li, Y., Rui, N., Jia, X., Chang, M.W., Cohen, W.W.: Subject-driven text-to-image generation via apprenticeship learning. arXiv preprint arXiv:2304.00186 (2023)
  • [9] Chen, X., Huang, L., Liu, Y., Shen, Y., Zhao, D., Zhao, H.: Anydoor: Zero-shot object-level image customization. arXiv preprint arXiv:2307.09481 (2023)
  • [10] Choi, J., Choi, Y., Kim, Y., Kim, J., Yoon, S.: Custom-edit: Text-guided image editing with customized diffusion models. arXiv preprint arXiv:2305.15779 (2023)
  • [11] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: CVPR. pp. 4690–4699 (2019)
  • [12] Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A.H., Chechik, G., Cohen-Or, D.: An image is worth one word: Personalizing text-to-image generation using textual inversion. ICLR (2022)
  • [13] Gal, R., Arar, M., Atzmon, Y., Bermano, A.H., Chechik, G., Cohen-Or, D.: Designing an encoder for fast personalization of text-to-image models. arXiv preprint arXiv:2302.12228 (2023)
  • [14] Gal, R., Arar, M., Atzmon, Y., Bermano, A.H., Chechik, G., Cohen-Or, D.: Encoder-based domain tuning for fast personalization of text-to-image models. ACM TOG 42(4), 1–13 (2023)
  • [15] Gong, Y., Pang, Y., Cun, X., Xia, M., Chen, H., Wang, L., Zhang, Y., Wang, X., Shan, Y., Yang, Y.: Talecrafter: Interactive story visualization with multiple characters. Siggraph Asia (2023)
  • [16] Gu, Y., Wang, X., Wu, J.Z., Shi, Y., Chen, Y., Fan, Z., Xiao, W., Zhao, R., Chang, S., Wu, W., et al.: Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. NIPS (2023)
  • [17] Han, L., Li, Y., Zhang, H., Milanfar, P., Metaxas, D., Yang, F.: Svdiff: Compact parameter space for diffusion fine-tuning. arXiv preprint arXiv:2303.11305 (2023)
  • [18] Hao, S., Han, K., Zhao, S., Wong, K.Y.K.: Vico: Detail-preserving visual condition for personalized text-to-image generation. arXiv preprint arXiv:2306.00971 (2023)
  • [19] He, X., Cao, Z., Kolkin, N., Yu, L., Rhodin, H., Kalarot, R.: A data perspective on enhanced identity preservation for diffusion personalization. arXiv preprint arXiv:2311.04315 (2023)
  • [20] Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)
  • [21] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. NeurIPS 33, 6840–6851 (2020)
  • [22] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. ICLR (2021)
  • [23] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. arXiv:2304.02643 (2023)
  • [24] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  • [25] Kumari, N., Zhang, B., Zhang, R., Shechtman, E., Zhu, J.Y.: Multi-concept customization of text-to-image diffusion. In: CVPR. pp. 1931–1941 (2023)
  • [26] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
  • [27] Li, Z., Cao, M., Wang, X., Qi, Z., Cheng, M.M., Shan, Y.: Photomaker: Customizing realistic human photos via stacked id embedding. arXiv preprint arXiv:2312.04461 (2023)
  • [28] Liu, Z., Zhang, Y., Shen, Y., Zheng, K., Zhu, K., Feng, R., Liu, Y., Zhao, D., Zhou, J., Cao, Y.: Cones 2: Customizable image synthesis with multiple subjects. arXiv preprint arXiv:2305.19327 (2023)
  • [29] Ma, Y., Yang, H., Wang, W., Fu, J., Liu, J.: Unified multi-modal latent diffusion for joint subject and text conditional image generation. arXiv preprint arXiv:2303.09319 (2023)
  • [30] Pang, L., Yin, J., Xie, H., Wang, Q., Li, Q., Mao, X.: Cross initialization for personalized text-to-image generation. arXiv preprint arXiv:2312.15905 (2023)
  • [31] Po, R., Yang, G., Aberman, K., Wetzstein, G.: Orthogonal adaptation for modular customization of diffusion models. arXiv preprint arXiv:2312.02432 (2023)
  • [32] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: CVPR. pp. 10684–10695 (2022)
  • [33] Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., Aberman, K.: Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In: CVPR. pp. 22500–22510 (2023)
  • [34] Ruiz, N., Li, Y., Jampani, V., Wei, W., Hou, T., Pritch, Y., Wadhwa, N., Rubinstein, M., Aberman, K.: Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models. arXiv preprint arXiv:2307.06949 (2023)
  • [35] Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gontijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion models with deep language understanding. NIPS 35, 36479–36494 (2022)
  • [36] Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., Komatsuzaki, A.: Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114 (2021)
  • [37] Shi, J., Xiong, W., Lin, Z., Jung, H.J.: Instantbooth: Personalized text-to-image generation without test-time finetuning. arXiv preprint arXiv:2304.03411 (2023)
  • [38] Smith, J.S., Hsu, Y.C., Zhang, L., Hua, T., Kira, Z., Shen, Y., Jin, H.: Continual diffusion: Continual customization of text-to-image diffusion with c-lora. arXiv preprint arXiv:2304.06027 (2023)
  • [39] Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. ICLR (2020)
  • [40] Tewel, Y., Gal, R., Chechik, G., Atzmon, Y.: Key-locked rank one editing for text-to-image personalization. In: ACM SIGGRAPH 2023 Conference Proceedings. pp. 1–11 (2023)
  • [41] Tunanyan, H., Xu, D., Navasardyan, S., Wang, Z., Shi, H.: Multi-concept t2i-zero: Tweaking only the text embeddings and nothing else. arXiv preprint arXiv:2310.07419 (2023)
  • [42] Vinker, Y., Voynov, A., Cohen-Or, D., Shamir, A.: Concept decomposition for visual exploration and inspiration. ACM TOG 42(6), 1–13 (2023)
  • [43] Voynov, A., Chu, Q., Cohen-Or, D., Aberman, K.: $p+$: Extended textual conditioning in text-to-image generation. arXiv preprint arXiv:2303.09522 (2023)
  • [44] Wang, Q., Bai, X., Wang, H., Qin, Z., Chen, A.: Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519 (2024)
  • [45] Wei, Y., Zhang, Y., Ji, Z., Bai, J., Zhang, L., Zuo, W.: Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. arXiv preprint arXiv:2302.13848 (2023)
  • [46] Xiao, G., Yin, T., Freeman, W.T., Durand, F., Han, S.: Fastcomposer: Tuning-free multi-subject image generation with localized attention. arXiv preprint arXiv:2305.10431 (2023)
  • [47] Yan, Y., Zhang, C., Wang, R., Zhou, Y., Zhang, G., Cheng, P., Yu, G., Fu, B.: Facestudio: Put your face everywhere in seconds. arXiv preprint arXiv:2312.02663 (2023)
  • [48] Zhang, X.L., Wei, X.Y., Wu, J.L., Zhang, T.Y., Zhang, Z.X., Lei, Z., Li, Q.: Compositional inversion for stable diffusion models. arXiv preprint arXiv:2312.08048 (2023)
  • [49] Zhao, R., Zhu, M., Dong, S., Wang, N., Gao, X.: Catversion: Concatenating embeddings for diffusion-based text-to-image personalization. arXiv preprint arXiv:2311.14631 (2023)
  • [50] Zhou, Y., Zhang, R., Gu, J., Sun, T.: Customization assistant for text-to-image generation. arXiv preprint arXiv:2312.03045 (2023)
  • [51] Zhou, Y., Zhang, R., Sun, T., Xu, J.: Enhancing detail preservation for customized text-to-image generation: A regularization-free approach. arXiv preprint arXiv:2305.13579 (2023)