Character-Adapter: Prompt-Guided Region Control for High-Fidelity Character Customization

Yuhang Ma, Wenting Xu, Jiji Tang, Qinfeng **, Rongsheng Zhang, Zeng Zhao, Changjie Fan, Zhipeng Hu
Fuxi AI Lab, NetEase Inc.

Abstract

Customized image generation, which seeks to synthesize images with consistent characters, is highly relevant to applications such as storytelling, portrait generation, and character design. However, previous approaches struggle to preserve characters with high fidelity because of inadequate feature extraction and concept confusion of reference characters. We therefore propose Character-Adapter, a plug-and-play framework that generates images preserving the details of reference characters. Character-Adapter employs prompt-guided segmentation to extract fine-grained regional features of reference characters, and dynamic region-level adapters to mitigate concept confusion. Extensive quantitative and qualitative experiments show that Character-Adapter achieves state-of-the-art consistent character generation, with an improvement of 24.8% over other methods. Our code will be released at https://github.com/Character-Adapter/Character-Adapter.

1 Introduction

Figure 1: Images generated by Character-Adapter. Character-Adapter can be seamlessly integrated with any preferred model, without extra training. This approach empowers the customization of concepts while preserving the high-fidelity appearance of given characters (without any quantitative limitations), encompassing attributes such as hairstyle, identity, attire, and others.

Driven by advances in text-to-image diffusion models, customized image generation, which aims to synthesize images with consistent characters, has become highly relevant to applications such as storytelling, portrait generation, and character design. Training-based methods [1, 2, 3, 4, 5, 6, 7], such as Dreambooth [1] and Custom Diffusion [5], are the most prevalent approach for generating high-fidelity characters. However, these methods have several notable limitations. First, they require customized data to fine-tune the models, which demands significant computational resources and long training times. Second, they may face a trade-off between character consistency and text-image alignment [3, 7, 6]. Third, fine-tuning for specific characters can compromise robustness, reducing the ability to generalize across a wide range of characters [8].

To improve generalizability, training-free methods [9, 10, 11, 12, 13] pursue consistent character generation by using adapter modules or optimizing the parameters of diffusion models to incorporate reference images. However, they suffer from two drawbacks: 1) they primarily focus on identity preservation (e.g., IP-Adapter [9], InstantID [10]), neglecting other aspects of the reference character such as attire and decorations [12, 14, 10]; 2) they struggle to capture precise semantic representations, as they primarily focus on the reference image [9, 10].

We attribute the inconsistent generation exhibited by these methods to inadequate image feature extraction and concept confusion of reference characters [7]. IP-Adapter [9] was recently introduced to embed reference image features. However, integrating the entire reference image into one adapter can lead to inadequate feature preservation. Furthermore, because the text encoder struggles to encapsulate compositional concepts, employing full tokens for multi-concept generation may lead to concept confusion [15]. To address these problems, we present Character-Adapter, a novel framework for consistent character generation. Specifically, it incorporates a prompt-guided segmentation module that localizes image regions based on text prompts, thereby facilitating adequate image feature extraction. We then introduce a dynamic region-level adapters module, comprising region-level adapters and attention dynamic fusion. The region-level adapters allow each adapter to concentrate on the corresponding region (e.g., attire, decorations) of the generated image, thereby mitigating concept fusion and promoting disentangled representations for each component of the generated image. In addition, attention dynamic fusion enables more accurate preservation of conditional image features while facilitating coherent generation between the character and the background regions.

Overall, Character-Adapter is a plug-and-play framework designed to generate highly consistent characters with intricate details. Its advantage of not necessitating further training allows it to be more versatile and practical in its applications. Several cases involving both single and multiple character generation are illustrated in Fig. 1. Our contributions can be summarized as follows:

  • We introduce Character-Adapter, a framework designed to ensure high-fidelity character generation. Our method achieves the state-of-the-art performance of consistent character generation, with an improvement of 24.8% compared with other methods.

  • We propose prompt-guided segmentation for regional localization of reference characters to facilitate comprehensive feature extraction, and employ dynamic region-level adapters to mitigate concept fusion, thereby preserving high-fidelity consistency with reference characters.

  • Character-Adapter is a versatile plug-and-play model that can be easily integrated into any backbone model and is compatible with other editing tools (such as ControlNet [16]) for both single and multiple character generation.

2 Related work

Text-to-image generative models. Diffusion models have achieved remarkable results in text-to-image generation in recent years [17, 18, 19, 20, 21, 22, 23]. Early works such as DALL-E2 [19] and Imagen [18] take the original images as the diffusion input, which requires enormous computational resources and training time. Latent diffusion models (LDMs) [24] instead compress images into a latent space through a pre-trained auto-encoder [25], rather than operating directly in the pixel space [18, 17]. However, general diffusion models rely solely on text prompts and lack the capability to generate consistent characters from image conditions.

Consistent character generation. Subject-driven image generation aims to generate customized images of a particular subject based on different text prompts. Most existing works adopt extensive fine-tuning for each subject [1, 2, 3, 4]. Dreambooth [1] maps the subject to a unique identifier, while Textual-Inversion [26] optimizes a word vector for a custom concept. Other works [5, 6, 7] focus on multi-subject image generation. Custom Diffusion [5] proposes to combine multiple concepts via closed-form constrained optimization. OMG [6] and Mix-of-Show [7] propose to optimize the fusion mode during training for multi-concept generation. However, these methods require additional training for every subject, which can be time-consuming in multi-subject generation scenarios. Recently, some methods have sought to enable subject-driven image generation without additional training [12, 27, 9, 11, 10, 8]. Most of them explore extended-attention mechanisms for maintaining identity consistency. IP-Adapter [9] and InstantID [10] introduce visual control by separating cross-attention layers for text features and image features. ConsiStory [8] enables training-free subject-level consistency across novel images via cross-frame attention. However, they fail to preserve detailed information owing to inadequate image feature extraction.

3 Methods

Figure 2: Framework of Character-Adapter. Step 1 involves obtaining the segmentation of the reference characters with given images and prompts through the prompt-guided segmentation module (Module a). Step 2 acquires attention maps of layout images generated solely from the given prompts via the same module. Step 3 illustrates the process of generating images with the given prompt and semantic regions through the dynamic region-level adapters module (Module b).

3.1 Preliminaries

Latent diffusion model. Latent diffusion models (LDMs) [28] consist of three components: a text encoder, a variational autoencoder (VAE) [25], and a U-Net. The U-Net aims to predict noise during the diffusion process, as follows:

$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\,\epsilon\sim\mathcal{N}(0,1),\,t}\left[\lVert\epsilon-\epsilon_{\theta}(z_{t},t)\rVert_{2}^{2}\right] \tag{1}$$

where $z_t$ is obtained from the encoder $\mathcal{E}$.

Cross-attention in LDMs. Cross-attention mechanisms are used to augment the underlying U-Net backbone, thereby transforming diffusion models into more flexible conditional image generators, which are effective for learning attention-based models for a variety of input modalities [29]. In the original Stable Diffusion, text prompts are encoded by CLIP [30] to obtain text features, which are fed as conditions into the U-Net backbone of the diffusion model. Specifically, the output of the cross-attention mechanism in Stable Diffusion is denoted as:

$$z' = \text{Attention}(Q,K,V) = \text{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V, \tag{2}$$

where $Q=zW_q$, $K=c_tW_k$, $V=c_tW_v$, $z$ denotes latent vectors obtained from the encoder $\mathcal{E}$, and $c_t$ denotes text features obtained from the text encoder (CLIP in Stable Diffusion).

Image Prompt Adapter. IP-Adapter proposes a decoupled cross-attention strategy to support conditional image generation by introducing an image cross-attention mechanism [9] analogous to the original cross-attention module in Stable Diffusion [28]. This mechanism seamlessly integrates image prompts with text prompts to guide the text-to-image generation process. Owing to the decoupled nature of the image and text cross-attention mechanisms, the proposed approach can be readily integrated into existing Stable Diffusion models without additional training. The decoupled cross-attention can be formulated as:

$$Z^{new} = \text{Attention}(Q,K,V) + \lambda\cdot\text{Attention}(Q,K',V'), \tag{3}$$

where $K'=c_iW'_k$, $V'=c_iW'_v$, $c_i$ represents the image features extracted from the CLIP image encoder, and $Z^{new}$ denotes the new latent representation obtained by conditioning on both image and text inputs. Notably, $W'_k$ and $W'_v$ are the only trainable parameters in IP-Adapter.
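As a minimal sketch of Eqs. 2 and 3, the following pure-Python code computes scaled dot-product attention and then adds a second, image-conditioned attention branch weighted by $\lambda$. The function names, the nested-list representation, and the default value of `lam` are illustrative assumptions and not the IP-Adapter implementation; the projections $W_q$, $W_k$, $W_v$, $W'_k$, and $W'_v$ are assumed to have been applied already.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Eq. 2: Softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def decoupled_cross_attention(q, k_text, v_text, k_img, v_img, lam=1.0):
    """Eq. 3: text attention plus lam times image attention (shared queries)."""
    text_out = attention(q, k_text, v_text)
    img_out = attention(q, k_img, v_img)
    return [[t + lam * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, img_out)]
```

Setting `lam=0.0` recovers the text-only output of Eq. 2, which is one way to check that the image branch is purely additive.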

3.2 Framework of Character-Adapter

In this section, we introduce Character-Adapter, a novel framework designed for high-fidelity consistent character generation. As illustrated in Fig. 2, we first segment a reference character into several parts using prompt-guided segmentation. Subsequently, we obtain the attention maps of the target image layout using the same approach. Finally, we utilize dynamic region-level adapters to achieve detailed and consistent character generation.

3.2.1 Prompt-guided segmentation

We hypothesize that directly passing the entire reference image into the image encoder leads to inadequate extraction of detailed image features of the given character [9], as illustrated by the examples shown in Fig. 4. To address this problem, we propose a prompt-guided segmentation module that decomposes the given character into separate regions, namely the face, upper body, and lower body. The ablation study presented in Sec. 4.4 supports this hypothesis and substantiates the efficacy of the proposed prompt-guided segmentation in facilitating comprehensive image feature extraction.

When given a user-input prompt $P$, for instance, “a boy standing in a library, wearing green jacket and blue pants”, we complete the prompt with detailed region descriptions:

$$P_C = [P_G; P_F; P_U; P_L] = W := [w_1; \ldots; w_g; w_{g+1} \ldots w_f \ldots w_l], \quad l>u>f>g, \tag{4}$$

where $P_G$ denotes the user-given prompt, $P_F$ indicates the prompt for the face region (“a boy”), $P_U$ specifies the prompt for the upper body (“green jacket”), and $P_L$ relates to the prompt for the lower body (“blue pants”). $W$ represents the sequence of words in the constructed prompt $P_C$.
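The prompt construction in Eq. 4 can be sketched as a concatenation that also records which word indices belong to each region. The helper `build_prompt`, the region names, and the whitespace tokenization are hypothetical simplifications; an actual pipeline would index tokens produced by the text encoder's tokenizer rather than whitespace-separated words.

```python
def build_prompt(global_prompt, region_prompts):
    """Concatenate P_G with the region prompts and record each region's word span.

    Spans are half-open index ranges [start, end) into the returned word list.
    """
    words = global_prompt.split()
    spans = {}
    for name, text in region_prompts.items():
        start = len(words)
        words.extend(text.split())
        spans[name] = (start, len(words))
    return words, spans


words, spans = build_prompt(
    "a boy standing in a library, wearing green jacket and blue pants",
    {"face": "a boy", "upper": "green jacket", "lower": "blue pants"},
)
```

The recorded spans are what later allows the attention maps of individual words to be grouped by region.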

Subsequently, we use $P_C$ to generate a layout image with Stable Diffusion, from which we derive the attention map corresponding to each word and the latent representation of the U-Net. Specifically, consider a 2D layout image latent $z_t$ at timestep $t$. The attention map for the $i$-th word $w_i$, in relation to the image latent at the $k$-th U-Net block, can be represented as follows:

$$A_{i,t}^{(k)} = F(w_i, z_t^{(k)}) = \mathrm{softmax}\left((W_q^{(k)}\cdot z_t^{(k)})(W_k^{(k)}\cdot TE(w_i))\right), \tag{5}$$

where $A$ denotes the attention map, $W_q$ and $W_k$ are projection matrices in the attention module, and $TE$ corresponds to a language model that encodes text prompts into text embeddings.

To obtain precise attention maps and masks, we integrate the attention maps of all $K$ layers to a uniform dimension through interpolation. Subsequently, we calculate the similarity $S_{i,t}$ as follows:

$$S_{i,t} = \sum_{k=1}^{K} \mathrm{Interplot}(A_{i,t}^{(k)}, w, h), \tag{6}$$

where $w$ and $h$ represent the width and height of the noised image’s latent, respectively.

Since each semantic region prompt comprises multiple words, such as “green jacket”, we aggregate the attention maps of these words, denoted by $S_{i,t}$, in order to establish the correlation between the region prompt and the corresponding semantic image region:

$$S_{r,t} = \max_{r_{begin}<i<r_{end}} \sum_{k\in[1,K]} \mathrm{Interplot}(A_{i,t}^{(k)}, w, h), \tag{7}$$

where $r_{begin}$ and $r_{end}$ represent the word indices corresponding to the beginning and end of the region prompt, respectively.
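A rough sketch of the aggregation in Eqs. 6 and 7 follows. It assumes nearest-neighbour resizing for the interpolation step and reads the maximum in Eq. 7 as an element-wise maximum over the words of a region; neither choice is specified in the text above, and all function names are illustrative.

```python
def resize_nearest(amap, w, h):
    """Resize a 2D map (list of rows) to h rows and w columns by nearest neighbour."""
    src_h, src_w = len(amap), len(amap[0])
    return [[amap[y * src_h // h][x * src_w // w] for x in range(w)]
            for y in range(h)]


def word_similarity(layer_maps, w, h):
    """Eq. 6: sum one word's attention maps over all K layers after resizing."""
    total = [[0.0] * w for _ in range(h)]
    for amap in layer_maps:
        resized = resize_nearest(amap, w, h)
        for y in range(h):
            for x in range(w):
                total[y][x] += resized[y][x]
    return total


def region_similarity(per_word_layer_maps, w, h):
    """Eq. 7: element-wise maximum over the words of one region prompt."""
    sims = [word_similarity(maps, w, h) for maps in per_word_layer_maps]
    return [[max(s[y][x] for s in sims) for x in range(w)] for y in range(h)]
```

Here `per_word_layer_maps` holds, for each word of the region prompt, a list of that word's attention maps across layers, which may have different resolutions.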

Similarly, given a reference character, we follow Eq. 5 and utilize attention maps to extract specific adapter condition regions. First, we add noise to the latent representation of the reference image $z_0^{(R)}$ over $t$ steps as follows:

$$z_t^{(R)} = \sqrt{\overline{\alpha}_t}\, z_0^{(R)} + \sqrt{1-\overline{\alpha}_t}\, \epsilon_t. \tag{8}$$
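The forward noising step in Eq. 8 can be illustrated with a few lines of Python. This is a sketch over a flat list of latent values; the function name `add_noise` and the explicit random generator argument are assumptions made for reproducibility, and the value of $\overline{\alpha}_t$ would come from the noise schedule of the diffusion model in use.

```python
import math
import random


def add_noise(z0, alpha_bar_t, rng):
    """Eq. 8: z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1)."""
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * z + noise_scale * rng.gauss(0.0, 1.0) for z in z0]


# alpha_bar_t close to 1 keeps the reference latent nearly intact,
# while alpha_bar_t close to 0 replaces it almost entirely with noise.
noisy = add_noise([0.3, -1.2, 0.8], alpha_bar_t=0.5, rng=random.Random(0))
```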

Following Eq. 5, the corresponding attention map $S_{r,t}^{R}$ between the reference image $R$ and the prompt $P_C$ is obtained. We then use the attention maps between the prompt and the reference image to achieve a more precise and fine-grained segmentation of the reference image:

$$y_{begin}^{r} = \mathop{\arg\min}\limits_{y}\left(S_r^{R}(x,y) > \gamma_2\right), \quad x_{begin}^{r} = 0, \tag{9}$$
$$y_{end}^{r} = \mathop{\arg\max}\limits_{y}\left(S_r^{R}(x,y) > \gamma_2\right), \quad x_{end}^{r} = width. \tag{10}$$

We ultimately employ $CR_r(x,y)$, defined within the ranges $x\in[x_{begin}^{r}, x_{end}^{r}]$ and $y\in[y_{begin}^{r}, y_{end}^{r}]$, as the semantic region of the reference image, segmented through prompt guidance.
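One way to read Eqs. 9 and 10 is that the region spans the full image width and runs from the first to the last row in which any value of the similarity map exceeds $\gamma_2$. The sketch below follows that reading; the function names are hypothetical, and it assumes the similarity map and the image share the same row resolution.

```python
def region_rows(sim_map, gamma2):
    """Return (y_begin, y_end) of rows with any value above gamma2, or None if no row qualifies."""
    rows = [y for y, row in enumerate(sim_map) if any(v > gamma2 for v in row)]
    if not rows:
        return None
    return rows[0], rows[-1]


def crop_region(image_rows, sim_map, gamma2):
    """Keep the full-width band of image rows selected by the thresholded map."""
    bounds = region_rows(sim_map, gamma2)
    if bounds is None:
        return []
    y_begin, y_end = bounds
    return image_rows[y_begin:y_end + 1]
```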

3.2.2 Dynamic region-level adapters

Another limitation of existing consistent character generation methods is concept fusion, which we posit stems from the text encoder’s inability to adequately encapsulate compositional concepts. Employing full tokens for multi-concept generation may lead to concept confusion [15]. This diminishes the guidance of these features toward the corresponding regions of the latent vectors, as substantiated in Sec. 4.4. To mitigate this limitation, we propose a dynamic region-level adapters module to alleviate concept fusion during the inference process.

Region-level adapters. Initially, we employ mask-based multi-adapters to integrate the semantic guidance from different regions. The mask at position $(x,y)$ of each region can be obtained as follows:

$$M_r(x,y) = \mathrm{Binary}(S_r(x,y)) = \begin{cases} 1, & S_r(x,y) > \gamma_1, \\ 0, & S_r(x,y) \le \gamma_1, \end{cases} \quad x\in[1,w],\ y\in[1,h], \tag{11}$$

where $S_r(x,y)$ is the attention map of a specific region of the layout image and $\gamma_1$ is the threshold.

These adapters represent different semantic regions of a reference image; for example, the face region is used to guide the character’s appearance and hairstyle. A natural approach is to fuse the different semantics using mask guidance. The masked attention map then reads:

$$A_t(x,y) = \sum_{r}\left(A_{r,t}^{(CR)}(x,y)\cdot M_r(x,y)\right) + \prod\limits_{r}\left(1-M_r(x,y)\right)\cdot A_{P,t}(x,y), \tag{12}$$

where $A_{r,t}^{(CR)}$ refers to the attention map between the reference image region $CR_r$ and the prompt $P_C$ using Eq. 7, $M_r(x,y)$ refers to the masks obtained from the layout image as shown in Eq. 11, and $A_{P,t}(x,y)$ is the attention map derived solely from the prompt condition.
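The hard-mask fusion of Eqs. 11 and 12 can be sketched as follows. For readability each location holds a single scalar, whereas in practice each location would hold a feature vector; the function names are illustrative assumptions.

```python
def binary_mask(sim_map, gamma1):
    """Eq. 11: 1 where the region similarity exceeds gamma1, otherwise 0."""
    return [[1 if v > gamma1 else 0 for v in row] for row in sim_map]


def fuse_with_masks(region_outputs, masks, prompt_output):
    """Eq. 12: masked sum of region outputs; the prompt-only output fills unmasked locations."""
    h, w = len(prompt_output), len(prompt_output[0])
    fused = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            background = 1  # product over regions of (1 - M_r)
            for out, mask in zip(region_outputs, masks):
                fused[y][x] += out[y][x] * mask[y][x]
                background *= 1 - mask[y][x]
            fused[y][x] += background * prompt_output[y][x]
    return fused
```

Locations covered by no region mask fall back to the prompt-only term, which is how the background stays under text control in this formulation.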

Attention dynamic fusion. Mask-based region-level adapters face the challenge of obtaining region masks that facilitate generation adhering to the layout while remaining faithful to the reference image. We observe that a timestep of $T=650$ yields masks whose shapes strike a balance between the reference and layout masks (Fig. 3). However, the coarse nature of these hard-label masks adversely impacts the generation of well-defined boundaries, as evidenced in Sec. 4.4. To address this, we propose an attention dynamic fusion module, characterized as a soft-label mask, to integrate the region-level adapters and produce coherent image generation through $A_t(x,y)$, defined as follows:

$$A_t(x,y) = \sum_{r}\left(A_{r,t}^{(CR)}(x,y)\cdot \mathrm{softmax}(S_r(x,y))\right) + \mathrm{softmax}\left(1-\sum_{r}S_r(x,y)\right)\cdot A_{P,t}(x,y). \tag{13}$$
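The soft-label fusion of Eq. 13 can be sketched under one explicit assumption: the softmax is taken, at each location, jointly over the region similarities $S_r(x,y)$ and the background term $1-\sum_r S_r(x,y)$, so that the weights sum to one. The equation does not state the normalization axis, so this is one plausible reading rather than the authors' implementation; the function name and the scalar-per-location layout are also illustrative.

```python
import math


def soft_fuse(region_outputs, sims, prompt_output):
    """Blend region outputs and the prompt-only output with per-location softmax weights."""
    h, w = len(prompt_output), len(prompt_output[0])
    fused = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # One logit per region plus a final background logit (assumed normalization).
            logits = [s[y][x] for s in sims]
            logits.append(1.0 - sum(logits))
            peak = max(logits)
            exps = [math.exp(v - peak) for v in logits]
            total = sum(exps)
            weights = [e / total for e in exps]
            value = weights[-1] * prompt_output[y][x]
            for weight, out in zip(weights, region_outputs):
                value += weight * out[y][x]
            fused[y][x] = value
    return fused
```

Compared with the binary masks of Eq. 11, the weights here vary smoothly with the similarity values, so region boundaries are blended rather than cut.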
Figure 3: Visualization of attention maps between the upper body region prompt and image latent. As the timestep increases, the attention map more accurately reflects the area of the character clothing.

3.2.3 Multi-character consistency

In addition to achieving remarkable performance in terms of single-character consistency, Character-Adapter also demonstrates remarkable performance for multi-character generation. Similar to single-character generation, we utilize prompt-guided segmentation to automatically localize multiple reference characters. The prompt includes a global prompt and regional prompts for N𝑁Nitalic_N character is noted as:

$$P_{C}=[P_{G};P_{F}{}_{1};P_{U}{}_{1};P_{L}{}_{1};\ldots;P_{F}{}_{N};P_{U}{}_{N};P_{L}{}_{N}]. \tag{14}$$
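
The ordering in Eq. 14 can be made concrete with a small sketch that assembles the composite prompt from a global prompt and per-character regional prompts. The class and function names are hypothetical, and reading the subscripts $F$, $U$, $L$ as face, upper body, and lower body is an assumption based on the region names used elsewhere in the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CharacterPrompts:
    """Regional prompts for one character (field meanings are assumed)."""
    face: str        # P_F
    upper_body: str  # P_U
    lower_body: str  # P_L

def build_composite_prompt(global_prompt: str,
                           characters: List[CharacterPrompts]) -> List[str]:
    """Return [P_G, P_F1, P_U1, P_L1, ..., P_FN, P_UN, P_LN]."""
    parts = [global_prompt]
    for c in characters:
        parts.extend([c.face, c.upper_body, c.lower_body])
    return parts
```

For $N$ characters the list holds $1+3N$ entries, so the regional prompts of character $i$ sit at positions $3i-2$ to $3i$ (counting the global prompt as position 0).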

The process of multi-character consistency generation is described in Algorithm 1.

Algorithm 1 Character-Adapter for single- and multi-character consistency generation
Input: prompt $P_{C}$, $N$ reference character images $R_{i}$, $i\in[1,N]$, a diffusion network $D_{\theta}$
1:  use prompt $P_{C}$ to obtain attention map $A_{i,t}^{(k)}$ via Eq. 5 and correlation map $S_{r,t}$ via Eq. 7
2:  for i in [1, N] do
3:     add noise and perform diffusion process to get correlation map $S_{r,t}^{R^{(i)}}$ via Eq. 5 and Eq. 7
4:     get region adapter inputs $CR_{r}^{(i)}$ via Eq. 9 and Eq. 10
5:  end for
6:  fuse the adapters to get attention map $A_{t}$ via Eq. 13
Output: target image $T_{I}=D_{\theta}(A_{t})$

4 Experiments

4.1 Experimental setup

Evaluation datasets. We collect a dataset comprising 50 distinct entities, which includes 25 real-world characters and 25 anime characters for single-character evaluation. For multi-character evaluation, we randomly combine two characters from the real-world and anime characters, creating a total of 50 samples. All user-given texts are obtained from the MSCOCO [31] dataset.

Implementation details. Character-Adapter is a plug-and-play framework that can be transplanted to any diffusion backbone. We conduct all experiments on SD v1.5 [28] with a classifier-free guidance scale of 7 and 20-step Euler A sampling. The mask thresholds $\gamma_{1}$ and $\gamma_{2}$ are both set to 0.8. More details are provided in Sec. A.1.

4.2 Quantitative comparison

We compare Character-Adapter with previous SOTA works; quantitative results are reported in Table 1. Character-Adapter achieves state-of-the-art performance in zero-shot character consistency generation, with an improvement of 24.8% on the CLIP-I and DINO scores compared to previous state-of-the-art methods in both single- and multi-character generation. In comparison with fine-tuning-based approaches, Character-Adapter achieves comparable character consistency while offering over 70 times improvement in computational efficiency, as it requires no additional training. Moreover, Character-Adapter achieves the best performance in terms of text-image alignment, with an improvement of 3.5%.

Figure 4: Visual comparison of Character-Adapter against other subject-driven methods. Our approach ensures high-fidelity consistency while maintaining text-image alignment.
Table 1: Quantitative results (%) of Character-Adapter with other methods. Evaluations are conducted for both single and multiple character generation. The best results are highlighted in bold.
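| Methods | Models | Time (s) | CLIP-T (%) ↑ Single | CLIP-T (%) ↑ Multi | CLIP-I (%) ↑ Single | CLIP-I (%) ↑ Multi | DINO-I (%) ↑ Single | DINO-I (%) ↑ Multi |
|---|---|---|---|---|---|---|---|---|
| Finetune-based | Textual Inversion [26] | 220 | 26.2 | 24.0 | 82.8 | 82.1 | 52.3 | 42.8 |
| Finetune-based | LoRA [3] | 1050 | 27.3 | 25.4 | 83.5 | 83.2 | 61.2 | 62.4 |
| Finetune-based | Custom Diffusion [5] | 510 | 28.6 | 24.8 | 83.9 | 83.6 | 67.8 | 67.2 |
| Finetune-based | Mix-of-Show [7] | 1200 | 29.9 | 27.4 | 84.2 | 84.1 | 68.2 | 65.9 |
| Finetune-based | OMG [6] | 1200 | 29.7 | 28.6 | 84.6 | 83.6 | 67.1 | 67.8 |
| Training-free | Reference-only [32] | 7.6 | 26.8 | 22.8 | 78.4 | 73.6 | 42.1 | 40.8 |
| Training-free | IP-Adapter [9] | 5.2 | 29.8 | 26.2 | 83.6 | 82.4 | 59.8 | 58.0 |
| Training-free | InstantID [10] | 5.8 | 27.4 | 22.1 | 75.8 | 74.2 | 46.7 | 45.9 |
| Training-free | FastComposer [12] | 7.8 | 26.4 | 21.1 | 72.4 | 67.2 | 42.7 | 41.9 |
| Training-free | T2I-Adapter [33] | 7.6 | 24.2 | 21.8 | 63.8 | 62.3 | 43.4 | 40.9 |
| Training-free | BLIP-Diffusion [13] | 4.5 | 29.7 | 27.4 | 84.0 | 82.7 | 60.2 | 53.2 |
| Training-free | Character-Adapter (Ours) | 7.2 | 30.4 | 30.2 | 84.8 | 84.6 | 68.1 | 67.8 |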
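
The efficiency gap can be read directly off the Time column of Table 1. The short calculation below divides each fine-tuning method's time by the 7.2 s reported for Character-Adapter; the dictionary restates the table values, and the variable names are illustrative.

```python
# Times in seconds, copied from the Time (s) column of Table 1
finetune_time_s = {
    "Textual Inversion": 220,
    "LoRA": 1050,
    "Custom Diffusion": 510,
    "Mix-of-Show": 1200,
    "OMG": 1200,
}
ours_time_s = 7.2  # Character-Adapter (Ours)

# Ratio of each fine-tuning method's time to Character-Adapter's time
speedup = {name: t / ours_time_s for name, t in finetune_time_s.items()}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.1f}x")
```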

4.3 Qualitative comparison

As shown in the multi-character generation example (Fig. 4, sixth row), methods like T2I-Adapter, InstantID, Reference-only, and FastComposer fail to preserve the reference characters’ attire. IP-Adapter encounters concept fusion and detail omission issues, attributable to its use of an image-level adapter, which leads to inadequate feature extraction and representation. While fine-tuning-based models like LoRA align with text prompts, they lack details of reference characters and cause concept fusion in multi-character generation. In contrast, Character-Adapter generates high-fidelity multi-character images with intricate details while preserving text-image alignment. Overall, the qualitative results substantiate Character-Adapter’s significant improvements in high-fidelity character generation. Additional results are provided in Sec. A.5.

4.4 Ablation study

Figure 5: Visualization of the ablation study. Each component is removed individually to demonstrate its effectiveness; (d) shows the results obtained with the full Character-Adapter.
Table 2: Quantitative ablation result (%) of Character-Adapter. Each component is individually removed to evaluate its necessity and contribution to the overall performance.
| Methods | Adapter level | Guidance fusion | CLIP-T ↑ | CLIP-I ↑ | DINO-I ↑ |
|---|---|---|---|---|---|
| w/o Prompt-guided segmentation | Region | Dynamic | 26.9 | 77.9 | 47.2 |
| w/o Region-level adapter | Image | Dynamic | 28.4 | 79.8 | 53.9 |
| w/o Attention dynamic fusion | Region | Mask | 27.4 | 78.3 | 62.2 |
| Character-Adapter (full) | Region | Dynamic | 30.8 | 86.9 | 68.3 |

Prompt-guided segmentation. We conduct an experiment where the character is equally segmented into three parts, and compare the results to verify the importance of prompt-guided segmentation. Fig. 5 (a) exhibits a significant imbalance in the proportions of the characters, in contrast to Fig. 5 (d). The results indicate that prompt-guided segmentation facilitates comprehensive image feature extraction and provides fine-grained guidance for character generation.

Region-level adapters. The experimental results in Table 2 show a significant drop in the CLIP-I score (from 86.9 to 79.8) when using an image-level adapter. A comparison between Fig. 5 (b) and Fig. 5 (d) demonstrates that the proposed region-level adapters effectively mitigate concept fusion issues, thereby improving character consistency.

Attention dynamic fusion. Fig. 5 (c) shows the result without attention dynamic fusion, revealing a dramatic change in the character’s hairstyle compared to Fig. 5 (d). This is because the hard-label mask only preserves details of certain regions of the reference character and is influenced by the layout image. The results in Table 2 also corroborate the effectiveness of the proposed attention dynamic fusion module.

4.5 User study

We conducted a user study with 20 experts to evaluate Character-Adapter against previous methods. Each expert assessed 50 pairwise comparisons. As shown in Table 3, the results indicate that Character-Adapter achieves a win rate significantly exceeding 50% for both text alignment and character consistency.

Table 3: User study results (%) illustrating the comparison between Character-Adapter and SOTA methods, revealing a notable preference among participants for Character-Adapter in terms of both textual alignment and character consistency.
| Methods | Models (Character-Adapter vs. *) | Text Alignment Win (%) | Text Alignment Lose (%) | Character Consistency Win (%) | Character Consistency Lose (%) |
|---|---|---|---|---|---|
| Finetuning | LoRA | 92.4 | 7.6 | 90.6 | 9.4 |
| Plug-and-play | IP-Adapter | 87.5 | 12.5 | 70.8 | 29.2 |
| Plug-and-play | T2I-Adapter | 90.2 | 9.8 | 91.7 | 8.3 |
| Plug-and-play | InstantID | 94.2 | 5.8 | 100 | 0.0 |
| Plug-and-play | Reference-only | 100 | 0.0 | 99.2 | 0.8 |
| Plug-and-play | BLIP-Diffusion | 100 | 0.0 | 100 | 0.0 |
| Plug-and-play | FastComposer | 98.2 | 1.8 | 100 | 0.0 |
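
To make the notion of a win rate exceeding 50% concrete, the sketch below computes a win rate from pairwise votes and an exact one-sided sign test against the null hypothesis of no preference. The paper does not state which statistical test, if any, was applied, so the sign test is an assumption made for illustration, and the vote counts in the usage lines are hypothetical.

```python
from math import comb

def win_rate(wins: int, total: int) -> float:
    """Percentage of pairwise comparisons won."""
    return 100.0 * wins / total

def sign_test_p_value(wins: int, total: int) -> float:
    """One-sided exact binomial p-value for H0: win probability = 0.5.

    Returns P(X >= wins) for X ~ Binomial(total, 0.5).
    """
    tail = sum(comb(total, k) for k in range(wins, total + 1))
    return tail / 2 ** total

# Hypothetical example: one method wins 35 of 50 comparisons
rate = win_rate(35, 50)
p = sign_test_p_value(35, 50)
print(f"win rate = {rate:.1f}%, one-sided p = {p:.4f}")
```

A small p-value indicates that the observed preference is unlikely under a 50/50 split; with votes pooled across raters, the independence assumption behind the test holds only approximately.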

4.6 Extended application

Additional controls. Fig. 6 demonstrates that Character-Adapter can be successfully integrated with ControlNet, enabling pose conditioning to guide the consistent character generation.

Inpainting-based personalization. Our method can also be combined with inpainting to seamlessly insert consistent characters into any fixed scene, which holds great potential for generating comics and graphic novels. Fig. 6 shows that integrated with inpainting, our proposed Character-Adapter can replace a selected region with specific characters, while maintaining a fixed background naturally.

Consistent generation of non-human characters. Because Character-Adapter preserves character consistency, it can potentially be extended to animals and objects. Fig. 6 demonstrates that Character-Adapter achieves excellent results on animal consistency.

Figure 6: Visualization of Character-Adapter’s versatility and compatibility. (a) Combination with Pose Control. (b) Inpainting with a reference image. (c) Generation with animals (other types).

5 Conclusion

In this work, we investigate the limitations of existing consistent character generation methods and propose Character-Adapter, which enables the synthesis of high-fidelity images that preserve the intricate details and identities of reference characters. The key innovations in Character-Adapter are the prompt-guided segmentation module, which enables fine-grained extraction of regional features from reference characters, and the dynamic region-level adapters module, which mitigates concept fusion issues. Notably, Character-Adapter can be seamlessly integrated into any backbone model and is compatible with other editing tools for both single- and multi-character generation.

References

  • [1] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023.
  • [2] Miao Hua, Jiawei Liu, Fei Ding, Wei Liu, Jie Wu, and Qian He. Dreamtuner: Single image is enough for subject-driven generation. arXiv preprint arXiv:2312.13691, 2023.
  • [3] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
  • [4] Yuxiang Wei, Yabo Zhang, Zhilong Ji, **feng Bai, Lei Zhang, and Wangmeng Zuo. Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15943–15953, 2023.
  • [5] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.
  • [6] Zhe Kong, Yong Zhang, Tianyu Yang, Tao Wang, Kaihao Zhang, Bizhu Wu, Guanying Chen, Wei Liu, and Wenhan Luo. Omg: Occlusion-friendly personalized multi-concept generation in diffusion models. arXiv preprint arXiv:2403.10983, 2024.
  • [7] Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
  • [8] Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free consistent text-to-image generation. arXiv preprint arXiv:2402.03286, 2024.
  • [9] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
  • [10] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and Anthony Chen. Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024.
  • [11] Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. Photomaker: Customizing realistic human photos via stacked id embedding. arXiv preprint arXiv:2312.04461, 2023.
  • [12] Guangxuan Xiao, Tianwei Yin, William T Freeman, Frédo Durand, and Song Han. Fastcomposer: Tuning-free multi-subject image generation with localized attention. arXiv preprint arXiv:2305.10431, 2023.
  • [13] Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Processing Systems, 36, 2024.
  • [14] Ziheng Wu, Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Xing Shi, and Jun Huang. Easyphoto: Your smart ai photo generator, 2023.
  • [15] Omer Dahary, Or Patashnik, Kfir Aberman, and Daniel Cohen-Or. Be yourself: Bounded attention for multi-subject text-to-image generation. arXiv preprint arXiv:2403.16990, 2024.
  • [16] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
  • [17] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
  • [18] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
  • [19] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • [20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.
  • [21] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • [22] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
  • [23] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. Pmlr, 2021.
  • [24] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
  • [25] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
  • [26] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion, 2022.
  • [27] Yuxuan Zhang, Jiaming Liu, Yiren Song, Rui Wang, Hao Tang, **peng Yu, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. arXiv preprint arXiv:2312.16272, 2023.
  • [28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  • [29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
  • [30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
  • [31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
  • [32] Mikubill. reference-only. sd-webui-controlnet, https://github.com/Mikubill/sd-webui-controlnet, GitHub repository, 2023.
  • [33] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models, 2023.
  • [34] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.

Appendix A Appendix

We provide supplementary details regarding Character-Adapter in this section. Implementation details are given in Section A.1. Limitations and future work are discussed in Sections A.2 and A.3, respectively. Potential societal impacts of our work are examined in Section A.4. Additional results are presented in Section A.5.

A.1 Implementation details

A.1.1 Inference setup

We employ Realistic Vision V4 (https://huggingface.co/SG161222/Realistic_Vision_V4.0_noVAE) to generate photo-realistic human portraits and animals, and Animesh (https://huggingface.co/redstonehero/animesh_prunedv21) for anime character generation. All comparison models use 20-step Euler A sampling, with classifier-free guidance set to 7.0. The inference resolution is set to 768 x 768. We run all experiments on an A30 GPU.

A.1.2 Evaluation metrics

We first employ CLIP ViT-L/14 (https://huggingface.co/openai/clip-vit-large-patch14) to evaluate the similarity between the generated images and the given text prompts (CLIP-T). Subsequently, we utilize the image encoder of the CLIP model to evaluate the correlation between the generated consistent images and the reference images (CLIP-I). Additionally, we employ the DINO score [34] to evaluate image alignment, as DINO is better suited for subject representation (DINO-I). Finally, we conduct human evaluations to further assess the performance of different approaches in terms of text and image alignment.
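
Scores of this kind are commonly computed as cosine similarities between embedding vectors. The sketch below assumes that convention and assumes the embeddings have already been extracted by the respective encoder (CLIP text and image encoders for CLIP-T, the CLIP image encoder for CLIP-I, DINO features for DINO-I); scaling to percent matches the units of Table 1. The function names are illustrative and the exact aggregation used in the paper is not specified.

```python
import numpy as np

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in percent."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(100.0 * np.dot(a, b))

def mean_pairwise_score(gen_embs: np.ndarray, ref_embs: np.ndarray) -> float:
    """Average cosine score over all (generated, reference) pairs.

    gen_embs: (G, D) embeddings of generated images
    ref_embs: (R, D) embeddings of reference images or text prompts
    """
    g = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    return float(100.0 * (g @ r.T).mean())
```

The same two functions cover all three metrics; only the encoder that produces the embeddings changes.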

A.2 Limitations and discussion

While our method provides a plug-and-play framework for generating consistent and detailed characters with high robustness, several limitations warrant consideration. Firstly, in scenarios involving extremely complex clothing patterns, our model may not fully preserve the original details. Secondly, due to the inherent limitations in the semantic capabilities of Stable Diffusion models, there exists a possibility of inaccurate target localization in the attention maps, leading to misalignment of image details. We leave the exploration of these limitations as future work.

A.3 Future work

For future work, an intriguing direction would be to investigate methods for obtaining more accurate attention maps for different image regions based on prompts, thereby mitigating semantic confusion. This could be achieved by enhancing the semantic understanding of the diffusion model. Furthermore, as Character-Adapter supports identity preservation for both facial and attire features, it holds promise for applications in narrative storytelling and video generation.

A.4 Societal impacts

The proposed method aims to provide an effective and flexible tool for high-fidelity customized character image generation. However, there exists a potential risk of misuse, wherein individuals may generate fake celebrity images, thereby misleading the public. This concern is not unique to our approach but rather a common consideration among all subject-driven image generation methods. One potential solution is to employ a safety checker akin to the NSFW filter (https://huggingface.co/runwayml/stable-diffusion-v1-5), which is a classification module that estimates whether generated images could be considered offensive or harmful, thereby preventing the generation of controversial content and the abuse of celebrity images. Such measures would mitigate the potential misuse of our method while preserving its intended functionality.

A.5 Additional results

In Fig. 7, we present additional results showcasing the capabilities of Character-Adapter. The proposed framework demonstrates high-fidelity generation of consistent characters and objects across diverse scene contexts. Furthermore, we illustrate that Character-Adapter excels in scenarios involving multiple characters, maintaining high-fidelity of each character throughout the generated images. These results further validate the efficacy and versatility of our proposed approach.

Figure 7: Additional qualitative results.