Character-Adapter: Prompt-Guided Region Control for High-Fidelity Character Customization

Yuhang Ma
Fuxi AI Lab, NetEase Inc.
Wenting Xu
Fuxi AI Lab, NetEase Inc.
Jiji Tang
Fuxi AI Lab, NetEase Inc.
Qinfeng **
Fuxi AI Lab, NetEase Inc.
Rongsheng Zhang
Fuxi AI Lab, NetEase Inc.
Zeng Zhao
Fuxi AI Lab, NetEase Inc.
Changjie Fan
Fuxi AI Lab, NetEase Inc.
Zhipeng Hu
Fuxi AI Lab, NetEase Inc.

Abstract

Customized image generation, which seeks to synthesize images with consistent characters, holds significant relevance for applications such as storytelling, portrait generation, and character design. However, previous approaches struggle to preserve characters with high-fidelity consistency due to inadequate feature extraction and concept confusion of reference characters. We therefore propose Character-Adapter, a plug-and-play framework designed to generate images that preserve the details of reference characters, ensuring high-fidelity consistency. Character-Adapter employs prompt-guided segmentation to capture fine-grained regional features of reference characters and dynamic region-level adapters to mitigate concept confusion. Extensive experiments are conducted to validate the effectiveness of Character-Adapter. Both quantitative and qualitative results demonstrate that Character-Adapter achieves state-of-the-art performance in consistent character generation, with an improvement of 24.8% compared with other methods. Our code will be released at https://github.com/Character-Adapter/Character-Adapter.

1 Introduction

Figure 1: Images generated by Character-Adapter. Character-Adapter can be seamlessly integrated with any preferred model, without extra training. This approach empowers the customization of concepts while preserving the high-fidelity appearance of given characters (without any quantitative limitations), encompassing attributes such as hairstyle, identity, attire, and others.

Driven by advances in text-to-image diffusion models, customized image generation, which aims to synthesize images with consistent characters, holds significant relevance for applications such as storytelling, portrait generation, and character design. In particular, training-based methods [1, 2, 3, 4, 5, 6, 7] are the most prevalent approach for generating high-fidelity characters (e.g., Dreambooth [1], Custom Diffusion [5]). However, these training-based methods exhibit several notable limitations. First, they require customized data to fine-tune the models, which demands significant computational resources and prolonged training. Second, they may encounter trade-offs between character consistency and text-image alignment [3, 7, 6]. Furthermore, fine-tuning for specific characters can compromise robustness, sacrificing the model’s ability to generalize across a wide range of characters [8].

To improve generalizability, training-free methods [9, 10, 11, 12, 13] pursue consistent character generation by utilizing adapter modules or optimizing the parameters of diffusion models to incorporate reference images. However, they suffer from several drawbacks: 1) they primarily focus on identity preservation (e.g., IP-Adapter [9], InstantID [10]), neglecting other aspects of the reference character, such as attire and decorations [12, 14, 10]; 2) they struggle to capture precise semantic representations, as they primarily focus on the reference image [9, 10].

We attribute the inconsistent generation exhibited by these methods to inadequate image feature extraction and concept confusion of reference characters [7]. Recently, IP-Adapter [9] was introduced to embed reference image features. However, integrating the entire reference image into one adapter can lead to inadequate feature preservation. Furthermore, given that the text encoder struggles to encapsulate compositional concepts, employing full tokens for multi-concept generation may lead to concept confusion [15]. To address these problems, we present Character-Adapter, a novel framework designed to facilitate consistent character generation. Specifically, it incorporates a prompt-guided segmentation module that localizes image regions based on text prompts, thereby facilitating adequate image feature extraction. Subsequently, we introduce a dynamic region-level adapters module, comprising region-level adapters and attention dynamic fusion. The region-level adapters allow each adapter to concentrate on the corresponding region (e.g., attire, decorations) of the generated image, thereby mitigating concept fusion and promoting disentangled representations for each component of the generated image. Additionally, attention dynamic fusion is introduced to enable more accurate preservation of conditional image features while facilitating coherent generation between the character and the background regions.

Overall, Character-Adapter is a plug-and-play framework designed to generate highly consistent characters with intricate details. Its advantage of not necessitating further training allows it to be more versatile and practical in its applications. Several cases involving both single and multiple character generation are illustrated in Fig. 1. Our contributions can be summarized as follows:

• We introduce Character-Adapter, a framework designed to ensure high-fidelity character generation. Our method achieves the state-of-the-art performance of consistent character generation, with an improvement of 24.8% compared with other methods.

• We propose prompt-guided segmentation for regional localization of reference characters to facilitate comprehensive feature extraction, and employ dynamic region-level adapters to mitigate concept fusion, thereby preserving high-fidelity consistency with reference characters.

• Character-Adapter is a versatile plug-and-play model that can be easily integrated into any backbone model or compatible with other editing tools (such as ControlNet [16]) for both single and multiple character generation.

2 Related work

Text-to-image generative models. Diffusion models have achieved remarkable results in text-to-image generation in recent years [17, 18, 19, 20, 21, 22, 23]. Early works such as DALL-E2 [19] and Imagen [18] use original images as the diffusion input, which requires enormous computational resources and training time. Latent diffusion models (LDMs) [24] were introduced to compress images into a latent space through a pre-trained auto-encoder [25], instead of operating directly in the pixel space [18, 17]. However, general diffusion models rely solely on text prompts, lacking the capability to generate consistent characters with image conditions.

Consistent character generation. Subject-driven image generation aims to generate customized images of a particular subject based on different text prompts. Most existing works adopt extensive fine-tuning for each subject [1, 2, 3, 4]. Dreambooth [1] maps the subject to a unique identifier, while Textual-Inversion [26] optimizes a word vector for a custom concept. Moreover, some works [5, 6, 7] focus on multi-subject image generation. Custom Diffusion [5] proposes to combine multiple concepts via closed-form constrained optimization. OMG [6] and Mix-of-Show [7] propose to optimize the fusion mode during training for multi-concept generation. However, these methods necessitate additional training for all subjects, which can be time-consuming in multi-subject generation scenarios. Recently, some methods strive to enable subject-driven image generation without additional training [12, 27, 9, 11, 10, 8]. Most of them explore extended-attention mechanisms for maintaining identity consistency. IP-Adapter [9] and InstantID [10] introduce visual control by separating cross-attention layers for text features and image features. ConsiStory [8] enables training-free subject-level consistency across novel images via cross-frame attention. However, they fail to preserve detailed information owing to inadequate image feature extraction.

3 Methods

3.1 Preliminaries

Latent diffusion model. Latent diffusion models (LDMs) [28] consist of three components: a text encoder, a variational autoencoder (VAE) [25], and a U-Net. The U-Net aims to predict noise during the diffusion process, as follows:

L_{LDM}:=\mathbb{E}_{\mathcal{E}(x),\epsilon\sim\mathcal{N}(0,1),t}\left[\parallel\epsilon-\epsilon_{\theta}(z_{t},t)\parallel_{2}^{2}\right]

(1)

where $z_{t}$ is obtained from the encoder $\mathcal{E}$.

Cross-attention in LDMs. Cross-attention mechanisms are used to augment the underlying U-Net backbone, thereby transforming diffusion models into more flexible conditional image generators, which are effective for learning attention-based models for a variety of input modalities [29]. In the original Stable Diffusion, text prompts are encoded by CLIP [30] to obtain text features, which are fed as conditions into the U-Net backbone of the diffusion model. Specifically, the output of the cross-attention mechanism in Stable Diffusion is denoted as:

{z}^{\prime}=\text{Attention}({Q},{K},{V})=\text{Softmax}\left(\frac{{Q}{K}^{\top}}{\sqrt{d}}\right){V},

(2)

where $Q=zW_{q}$, $K=c_{t}W_{k}$, $V=c_{t}W_{v}$, $z$ denotes the latent vectors obtained from the encoder $\mathcal{E}$, and $c_{t}$ denotes the text features obtained from the text encoder (CLIP in Stable Diffusion).
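To make Eq. 2 concrete, the following is a minimal pure-Python sketch of scaled dot-product cross-attention on toy matrices. The helper names (`softmax`, `matmul`, `transpose`, `cross_attention`), the identity projection matrices, and the toy shapes are illustrative assumptions, not part of Stable Diffusion.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def cross_attention(z, c_t, w_q, w_k, w_v):
    """Eq. 2: Softmax(Q K^T / sqrt(d)) V with Q = z W_q, K = c_t W_k, V = c_t W_v."""
    q, k, v = matmul(z, w_q), matmul(c_t, w_k), matmul(c_t, w_v)
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy setup (hypothetical): 2 latent positions, 3 text tokens, feature width 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
z = [[1.0, 0.0], [0.0, 1.0]]
c_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z_prime, attn = cross_attention(z, c_t, identity, identity, identity)
```

Each row of `attn` holds one latent position's weights over the text tokens and sums to one; these per-token weights are the quantities that attention-map analyses typically inspect.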

Image Prompt Adapter. IP-Adapter proposes a decoupled cross-attention strategy to support conditional image generation by introducing an image cross-attention mechanism [9] analogous to the original cross-attention module in Stable Diffusion [28]. This mechanism seamlessly integrates image prompts with text prompts to guide the text-to-image generation process. Owing to the decoupled nature of the image and text cross-attention mechanisms, the proposed approach can be readily integrated into existing Stable Diffusion models without additional training. The decoupled cross-attention can be formulated as:

{Z}^{new}=\text{Attention}({Q},{K},{V})+\lambda\cdot\text{Attention}({Q},{K}^{\prime},{V}^{\prime}),

(3)

where $K^{\prime}=c_{i}{W}^{\prime}_{k}$, $V^{\prime}=c_{i}{W}^{\prime}_{v}$, $c_{i}$ represents the image features extracted from the CLIP image encoder, and $Z^{new}$ denotes the new latent representation obtained by conditioning on both image and text inputs. Notably, ${W}^{\prime}_{k}$ and ${W}^{\prime}_{v}$ are the only trainable parameters in IP-Adapter.
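The decoupled form of Eq. 3 can be sketched as two attention calls that share the same query and are blended by $\lambda$. This is a toy illustration under assumed inputs: the function names, the already-projected keys and values, and the numbers are hypothetical and do not reproduce IP-Adapter's actual code.

```python
import math

def attention(q, k, v):
    """Scaled dot-product attention for matrices stored as lists of rows."""
    d = len(q[0])
    out = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d) for k_row in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                    for j in range(len(v[0]))])
    return out

def decoupled_cross_attention(q, k_text, v_text, k_img, v_img, lam):
    """Eq. 3: text attention plus lam times image attention, sharing the query Q."""
    text_out = attention(q, k_text, v_text)
    img_out = attention(q, k_img, v_img)
    return [[t + lam * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, img_out)]

# Toy inputs (hypothetical): one latent position, two text tokens, one image token.
q = [[1.0, 0.0]]
k_text = [[1.0, 0.0], [0.0, 1.0]]
v_text = [[1.0, 0.0], [0.0, 1.0]]
k_img = [[0.0, 0.0]]
v_img = [[2.0, 4.0]]
text_only = decoupled_cross_attention(q, k_text, v_text, k_img, v_img, 0.0)
blended = decoupled_cross_attention(q, k_text, v_text, k_img, v_img, 0.5)
```

Setting `lam` to zero recovers the text-only output, which illustrates why the image branch can be added to an existing model without disturbing its text conditioning.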

3.2 Framework of Character-Adapter

In this section, we introduce Character-Adapter, a novel framework designed for high-fidelity consistent character generation. As illustrated in Fig. 2, we first segment a reference character into several parts using prompt-guided segmentation. Subsequently, we obtain the attention maps of the target image layout using the same approach. Finally, we utilize dynamic region-level adapters to achieve detailed and consistent character generation.

3.2.1 Prompt-guided segmentation

We hypothesize that directly passing the entire reference image into the image encoder leads to inadequate extraction of detailed image features pertaining to the given character [9], as evidenced by the examples shown in Fig. 4. To address this problem, we propose a prompt-guided segmentation module that decomposes the given character into separate regions, namely the face, upper body, and lower body. The ablation study presented in Sec. 4.4 supports this hypothesis and substantiates the efficacy of the proposed prompt-guided segmentation in facilitating comprehensive image feature extraction.

When given a user-input prompt $P$ , for instance, “a boy standing in a library, wearing green jacket and blue pants”, we complete the prompt with detailed region descriptions:

P_{C}=[P_{G};P_{F};P_{U};P_{L}]=W:=[w_{1};\dots;w_{g};w_{g+1};\dots;w_{f};\dots;w_{u};\dots;w_{l}],\quad l>u>f>g,

(4)

where $P_{G}$ denotes the user-given prompt, $P_{F}$ indicates the prompt for the face region (“a boy”), $P_{U}$ specifies the prompt for the upper body (“green jacket”), and $P_{L}$ relates to the prompt for the lower body (“blue pants”). $W$ represents the sequence of words in the constructed prompt $P_{C}$, and $g$, $f$, $u$, and $l$ are the indices of the last word of $P_{G}$, $P_{F}$, $P_{U}$, and $P_{L}$, respectively.
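The prompt composition above can be sketched as a concatenation that also records where each regional prompt sits in the word sequence. The function `compose_prompt` and the whitespace tokenization are assumptions made for illustration; a real text encoder indexes sub-word tokens rather than whole words.

```python
def compose_prompt(global_prompt, face, upper, lower):
    """Concatenate [P_G; P_F; P_U; P_L] and record each part's word-index span.

    Spans are half-open [begin, end) over whitespace-split words, which is a
    simplification of how a tokenizer would index the prompt.
    """
    words, spans = [], {}
    for name, text in (("G", global_prompt), ("F", face), ("U", upper), ("L", lower)):
        begin = len(words)
        words.extend(text.split())
        spans[name] = (begin, len(words))
    return words, spans

words, spans = compose_prompt(
    "a boy standing in a library, wearing green jacket and blue pants",
    "a boy",
    "green jacket",
    "blue pants",
)
```

The recorded spans are what later lets the attention maps of individual words be grouped back into face, upper-body, and lower-body regions.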

Subsequently, we use $P_{C}$ to generate a layout image with Stable Diffusion, from which we derive the attention map corresponding to each word and the latent representation of the U-Net. Specifically, consider a 2D layout image latent, denoted as $z_{t}$, at timestep $t$. The attention map for the $i$-th word $w_{i}$, in relation to the image latent at the $k$-th U-Net block, can be represented as follows:

A_{i,t}^{(k)}=F(w_{i},z_{t}^{(k)})=softmax((W_{q}^{(k)}\cdot z_{t}^{(k)})(W_{k}^{(k)}\cdot TE(w_{i}))),

(5)

where $A$ denotes the attention map, $W_{q}$ and $W_{k}$ are projection matrices in the attention module, and $TE$ corresponds to a language model that encodes text prompts into text embeddings.
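As a rough illustration of Eq. 5, the sketch below extracts the spatial attention map of a single word from query-key scores. It assumes the softmax is taken over the words of the prompt at each latent position, as in standard cross-attention, and that the projections $W_{q}$ and $W_{k}$ have already been applied; the function name and toy values are hypothetical.

```python
import math

def word_attention_map(queries, keys, word_index, h, w):
    """Return the h-by-w attention map of one word.

    queries: h*w projected latent vectors (W_q z), in row-major order.
    keys: one projected text embedding (W_k TE(w_j)) per word of the prompt.
    """
    flat = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        # Assumption: normalize over words, then keep the requested word's share.
        flat.append(exps[word_index] / sum(exps))
    return [flat[y * w:(y + 1) * w] for y in range(h)]

# Toy 2x2 latent with two words (hypothetical values).
queries = [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0], [2.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
map_word0 = word_attention_map(queries, keys, 0, 2, 2)
map_word1 = word_attention_map(queries, keys, 1, 2, 2)
```

In this toy case the first word dominates the diagonal positions and the second word the off-diagonal ones, which is the kind of spatial separation the segmentation step relies on.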

To obtain precise attention maps and masks, we integrate the attention maps of all $K$ layers to a uniform dimension through interpolation. Subsequently, we calculate the similarity $S_{i,t}$ as follows:

S_{i,t}=\sum_{k=1}^{K}Interplot(A_{i,t}^{(k)},w,h),

(6)

where $w$ and $h$ represent the width and height of the noised image’s latent, respectively.

Since each semantic region prompt comprises multiple words, such as “green jacket”, we aggregate each attention map of these words, denoted by $S_{i,t}$ , in order to establish the correlation between the region prompt and the corresponding semantic image region:

S_{r,t}=\mathop{\max}\limits_{r_{begin}<i<r_{end}}\sum_{k\in[1,K]}Interplot(A_{i,t}^{(k)},w,h),

(7)

where $r_{begin}$ and $r_{end}$ represent the word indices corresponding to the beginning and end of the region prompt, respectively.
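Eqs. 6 and 7 can be sketched as a resize-and-sum over layers followed by a maximum over the words of a region prompt. The sketch assumes nearest-neighbour interpolation and an element-wise maximum, neither of which is specified in the text, and all function names are hypothetical.

```python
def resize_nearest(a, h, w):
    """Nearest-neighbour resize of a 2D map (list of rows) to h rows by w columns."""
    src_h, src_w = len(a), len(a[0])
    return [[a[y * src_h // h][x * src_w // w] for x in range(w)] for y in range(h)]

def word_similarity(layer_maps, h, w):
    """Eq. 6: resize each layer's attention map to (h, w) and sum over layers."""
    total = [[0.0] * w for _ in range(h)]
    for a in layer_maps:
        resized = resize_nearest(a, h, w)
        for y in range(h):
            for x in range(w):
                total[y][x] += resized[y][x]
    return total

def region_similarity(per_word_layer_maps, h, w):
    """Eq. 7: element-wise maximum of the per-word maps inside one region prompt."""
    maps = [word_similarity(m, h, w) for m in per_word_layer_maps]
    return [[max(m[y][x] for m in maps) for x in range(w)] for y in range(h)]

# Toy maps for the two words of "green jacket", each with a 1x1 and a 2x2 layer.
green = [[[0.25]], [[0.125, 0.0], [0.0, 0.5]]]
jacket = [[[0.125]], [[0.5, 0.0], [0.0, 0.0]]]
s_upper = region_similarity([green, jacket], 2, 2)
```

Taking the maximum rather than the mean keeps a position in the region whenever any single word of the region prompt attends to it strongly.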

Similarly, given a reference character, we follow Eq. 5 and utilize attention maps to extract specific adapter condition regions. First, we add noise to the latent representation of the reference image $z_{0}^{(R)}$ over $t$ steps as follows:

z_{t}^{(R)}=\sqrt{\overline{\alpha}_{t}}z_{0}^{(R)}+\sqrt{1-\overline{\alpha}_{t}}\epsilon_{t}.

(8)
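A minimal sketch of the noising step in Eq. 8 on a flattened toy latent follows. The function name, the seeded generator, and the sample values of $\overline{\alpha}_{t}$ are illustrative assumptions; in practice the schedule comes from the diffusion model's noise scheduler.

```python
import math
import random

def add_noise(z0, alpha_bar_t, rng):
    """Eq. 8: z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, 1)."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in z0]

rng = random.Random(0)              # seeded for reproducibility
z0 = [0.5, -1.0, 2.0]               # flattened toy latent (hypothetical values)
z_clean = add_noise(z0, 1.0, rng)   # alpha_bar = 1 leaves the latent unchanged
z_noisy = add_noise(z0, 0.3, rng)   # a smaller alpha_bar mixes in more noise
```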

Following Eq. 5, the corresponding attention map $S_{r,t}^{R}$ between the reference image $R$ and the prompt $P_{C}$ is obtained. We then use attention maps between the prompt and reference image to achieve a more precise and fine-grained segmentation of the reference image:

y_{begin}^{r}=\mathop{\arg\min}\limits_{y}(S_{r}^{R}(x,y)>\gamma_{2}),\quad x_{begin}^{r}=0,

(9)

y_{end}^{r}=\mathop{\arg\max}\limits_{y}(S_{r}^{R}(x,y)>\gamma_{2}),\quad x_{end}^{r}=width,

(10)

We ultimately employ $CR_{r}(x,y)$, defined within the ranges $x\in[x_{begin}^{r},x_{end}^{r}]$ and $y\in[y_{begin}^{r},y_{end}^{r}]$, as the semantic region of the reference image, segmented through prompt guidance.
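One reading of Eqs. 9 and 10 is that the region keeps the full image width and spans the first through last row containing a value above $\gamma_{2}$. The sketch below implements that reading; it assumes the image and the similarity map share the same number of rows, and the function names are hypothetical.

```python
def region_rows(s, gamma):
    """Return (y_begin, y_end): first and last row holding a value above gamma, or None."""
    rows = [y for y, row in enumerate(s) if any(v > gamma for v in row)]
    return (rows[0], rows[-1]) if rows else None

def crop_region(image, s, gamma):
    """Keep the full width and only rows y_begin..y_end (inclusive) of the image."""
    bounds = region_rows(s, gamma)
    if bounds is None:
        return []
    y_begin, y_end = bounds
    return [row[:] for row in image[y_begin:y_end + 1]]

# Toy 4x3 similarity map: only rows 1 and 2 exceed the threshold.
s_ref = [
    [0.1, 0.2, 0.1],
    [0.1, 0.9, 0.1],
    [0.85, 0.3, 0.2],
    [0.0, 0.1, 0.2],
]
image = [["r0"] * 3, ["r1"] * 3, ["r2"] * 3, ["r3"] * 3]
upper_body = crop_region(image, s_ref, 0.8)
```

Cropping horizontal bands fits the face, upper-body, and lower-body decomposition, since these regions are stacked vertically in a standing character.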

3.2.2 Dynamic region-level adapters

Another limitation of existing consistent character generation methods is concept fusion, which we posit stems from the text encoder’s inability to adequately encapsulate compositional concepts. Employing full tokens for multi-concept generation may lead to concept confusion [15]. This diminishes the guidance of these features toward the corresponding regions of the latent vectors, as substantiated in Sec. 4.4. To mitigate this limitation, we propose a dynamic region-level adapters module to alleviate concept fusion during the inference process.

Region-level adapters. Initially, we employ mask-based multi-adapters to integrate the semantic guidance from different regions. The mask at position $(x,y)$ of each region can be obtained as follows:

M_{r}(x,y)=Binary(S_{r}(x,y))=\begin{cases}1, & S_{r}(x,y)>\gamma_{1},\\ 0, & S_{r}(x,y)\leq\gamma_{1},\end{cases}\quad x\in[1,w],\ y\in[1,h],

(11)

where $S_{r}(x,y)$ is the attention map of specific region of the layout image. $\gamma_{1}$ is the threshold.

These adapters represent different semantic regions of a reference image; for example, the face region is used to guide the character’s appearance and hairstyle. A natural approach is to fuse the different semantics using mask guidance, so that the masked attention map reads:

A_{t}(x,y)=\sum_{r}(A_{r,t}^{(CR)}(x,y)\cdot M_{r}(x,y))+\prod\limits_{r}(1-M_{r}(x,y))\cdot A_{P,t}(x,y),

(12)

where $A_{r,t}^{(CR)}$ refers to the attention map between the reference image region $CR_{r}$ and the prompt $P_{C}$ using Eq. 7, $M_{r}(x,y)$ refers to the masks obtained from the layout image as shown in Eq. 11, $A_{P,t}(x,y)$ is the attention map derived solely from the prompt condition.
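The hard-mask fusion of Eqs. 11 and 12 can be sketched on small 2D maps as follows. The function names and the toy values are assumptions for illustration; the sketch treats each attention map as a scalar field, whereas real attention outputs carry a feature dimension.

```python
def binary_mask(s, gamma):
    """Eq. 11: 1 where S_r(x, y) > gamma, otherwise 0."""
    return [[1 if v > gamma else 0 for v in row] for row in s]

def masked_fusion(region_attn, masks, prompt_attn):
    """Eq. 12: sum_r A_r * M_r plus prod_r (1 - M_r) * A_P, evaluated per position."""
    h, w = len(prompt_attn), len(prompt_attn[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            background = 1  # stays 1 only where no region mask is active
            for a, m in zip(region_attn, masks):
                out[y][x] += a[y][x] * m[y][x]
                background *= 1 - m[y][x]
            out[y][x] += background * prompt_attn[y][x]
    return out

# Toy 1x3 example with a face region and an upper-body region (hypothetical values).
s_face = [[0.9, 0.1, 0.2]]
s_upper = [[0.1, 0.95, 0.3]]
masks = [binary_mask(s_face, 0.8), binary_mask(s_upper, 0.8)]
region_attn = [[[5.0, 5.0, 5.0]], [[7.0, 7.0, 7.0]]]
prompt_attn = [[1.0, 1.0, 1.0]]
fused = masked_fusion(region_attn, masks, prompt_attn)
```

Positions claimed by a region take that region's adapter output, and the remaining positions fall back to the prompt-only attention.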

Attention dynamic fusion. Mask-based regional-level adapters face the challenge of obtaining region masks that facilitate generation adhering to the layout while remaining faithful to the reference image. We observe that an inference time of $T=650$ yields masks whose shapes strike a balance between the reference and layout masks (Fig. 3). However, the coarse nature of these hard-label masks adversely impacts the generation of well-defined boundaries, as evidenced in Sec. 4.4. To address this, we propose an attention dynamic fusion module, characterized as a soft-label mask, to integrate the region-level adapters and produce coherent image generation through $A_{t}(x,y)$ , defined as follows:

A_{t}(x,y)=\sum_{r}\left(A_{r,t}^{(CR)}(x,y)\cdot softmax(S_{r}(x,y))\right)+softmax(1-\sum_{r}S_{r}(x,y))\cdot A_{P,t}(x,y).

(13)
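Eq. 13 does not state the axis over which the softmax is taken. The sketch below adopts one plausible reading as an explicit assumption: at each position, a single softmax is computed jointly over the region logits $S_{r}(x,y)$ and the background logit $1-\sum_{r}S_{r}(x,y)$, so that the fusion weights sum to one. The function name and toy values are hypothetical.

```python
import math

def soft_fusion(region_attn, region_sim, prompt_attn):
    """Soft-label fusion in the spirit of Eq. 13 (softmax axis is an assumption)."""
    h, w = len(prompt_attn), len(prompt_attn[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            logits = [s[y][x] for s in region_sim]
            logits.append(1.0 - sum(logits))  # background logit: 1 - sum_r S_r
            peak = max(logits)
            exps = [math.exp(v - peak) for v in logits]
            total = sum(exps)
            weights = [e / total for e in exps]
            sources = [a[y][x] for a in region_attn] + [prompt_attn[y][x]]
            out[y][x] = sum(wt * src for wt, src in zip(weights, sources))
    return out

# Toy 1x1 example: one region with similarity 1.0, so the background logit is 0.0.
blend = soft_fusion([[[1.0]]], [[[1.0]]], [[0.0]])
```

Unlike the hard mask of Eq. 12, every position here receives a convex combination of the region and prompt branches, which is what softens the boundaries between regions.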

3.2.3 Multi-character consistency

In addition to its strong single-character consistency, Character-Adapter also performs well in multi-character generation. Similar to single-character generation, we utilize prompt-guided segmentation to automatically localize multiple reference characters. The prompt, which includes a global prompt and regional prompts for $N$ characters, is denoted as:

P_{C}=[P_{G};P_{F_{1}};P_{U_{1}};P_{L_{1}};\dots;P_{F_{N}};P_{U_{N}};P_{L_{N}}].

(14)

The process of multi-character consistency generation is described in Algorithm 1.

Algorithm 1 Character-Adapter for single and multi-characters consistency generation

Input: prompt $P_{C}$, $N$ reference character images $R_{i},i\in[1,N]$, a diffusion network $D_{\theta}$
1: use prompt $P_{C}$ to obtain attention map $A_{i,t}^{(k)}$ via Eq. 5 and correlation map $S_{r,t}$ via Eq. 7
2: for i in [1, N] do
3:     add noise and perform diffusion process to get correlation map $S_{r,t}^{R^{(i)}}$ via Eq. 5 and Eq. 7
4:     get region adapter inputs $CR_{r}^{(i)}$ via Eq. 9 and Eq. 10
5: end for
6: fuse the adapters to get attention map $A_{t}$ via Eq. 13
Output: target image $T_{I}=D_{\theta}(A_{t})$

4 Experiments

4.1 Experimental setup

Evaluation datasets. We collect a dataset comprising 50 distinct entities, which includes 25 real-world characters and 25 anime characters for single-character evaluation. For multi-character evaluation, we randomly combine two characters drawn from the real-world and anime characters, creating a total of 50 samples. All user-given texts are obtained from the MSCOCO [31] dataset.

Implementation details. Character-Adapter is a plug-and-play framework and can be transplanted to any diffusion backbone. We conduct all experiments on SD v1.5 [28] with a classifier-free guidance scale of 7 and 20-step Euler A sampling. The mask thresholds $\gamma_{1}$ and $\gamma_{2}$ are both set to 0.8. More details are provided in Sec. A.1.

4.2 Quantitative comparison

We conduct experiments on Character-Adapter and previous SOTA works; quantitative results are reported in Table 1. Character-Adapter achieves state-of-the-art performance in zero-shot character consistency generation, with an improvement of 24.8% on the CLIP-I and DINO scores compared to previous state-of-the-art methods in both single and multi character generation. In comparison with fine-tuning-based approaches, Character-Adapter can achieve comparable character consistency while offering over 70 times improvement in computational efficiency, as it requires no additional training. Moreover, Character-Adapter achieves the best performance in terms of text-image alignment with an improvement of 3.5%.

Table 1: Quantitative results (%) of Character-Adapter with other methods. Evaluations are conducted for both single and multiple character generation. The best results are highlighted in bold.

Methods	Models	Time(s)	CLIP-T(%) $\uparrow$ Single	CLIP-T(%) $\uparrow$ Multi	CLIP-I(%) $\uparrow$ Single	CLIP-I(%) $\uparrow$ Multi	DINO-I(%) $\uparrow$ Single	DINO-I(%) $\uparrow$ Multi
Finetune-based	Textual Inversion [26]	220	26.2	24.0	82.8	82.1	52.3	42.8
	LoRA [3]	1050	27.3	25.4	83.5	83.2	61.2	62.4
	Custom Diffusion [5]	510	28.6	24.8	83.9	83.6	67.8	67.2
	Mix-of-Show [7]	1200	29.9	27.4	84.2	84.1	68.2	65.9
	OMG [6]	1200	29.7	28.6	84.6	83.6	67.1	67.8
Training-free	Reference only[32]	7.6	26.8	22.8	78.4	73.6	42.1	40.8
	IP-Adapter [9]	5.2	29.8	26.2	83.6	82.4	59.8	58.0
	InstantID [10]	5.8	27.4	22.1	75.8	74.2	46.7	45.9
	FastComposer [12]	7.8	26.4	21.1	72.4	67.2	42.7	41.9
	T2I-Adapter [33]	7.6	24.2	21.8	63.8	62.3	43.4	40.9
	BLIP-Diffusion [13]	4.5	29.7	27.4	84.0	82.7	60.2	53.2
	Character-Adapter (Ours)	7.2	30.4	30.2	84.8	84.6	68.1	67.8
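As a quick arithmetic check of the efficiency comparison, the sketch below divides each fine-tuning baseline's entry in the Time(s) column of Table 1 by Character-Adapter's 7.2 s. It assumes the column measures a comparable per-character cost for every method, which the table does not spell out; the variable names are illustrative.

```python
# Time(s) column of Table 1 for the fine-tuning-based baselines.
finetune_seconds = {
    "Textual Inversion": 220,
    "LoRA": 1050,
    "Custom Diffusion": 510,
    "Mix-of-Show": 1200,
    "OMG": 1200,
}
character_adapter_seconds = 7.2  # Character-Adapter (Ours) in Table 1

# Ratio of each baseline's time to Character-Adapter's time.
speedups = {name: t / character_adapter_seconds for name, t in finetune_seconds.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

Under this reading, the ratio depends on the baseline: roughly 31x for Textual Inversion, about 71x for Custom Diffusion, and about 167x for Mix-of-Show and OMG.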

4.3 Qualitative comparison

As shown in the multi-character generation example (Fig. 4 6th row), methods like T2I-Adapter, InstantID, Reference-only, and FastComposer fail to preserve the reference characters’ attire. IP-Adapter encounters concept fusion and detail omission issues, attributable to using an image-level adapter, leading to inadequate feature extraction and representation. While fine-tuning-based models like LoRA align with text prompts, they lack details of reference characters and cause concept fusion in multi-character generation. In contrast, Character-Adapter generates high-fidelity multi-character images with intricate details while preserving text-image alignment. Overall, the qualitative results substantiate Character-Adapter’s significant improvements in high-fidelity character generation. Additional results are provided in Sec. A.5.

4.4 Ablation study

Table 2: Quantitative ablation result (%) of Character-Adapter. Each component is individually removed to evaluate its necessity and contribution to the overall performance.

Methods	Adapter level	Guidance	Fusion	CLIP-T $\uparrow$	CLIP-I $\uparrow$	DINO-I $\uparrow$
w/o Prompt-guided segmentation	Region	✗	Dynamic	26.9	77.9	47.2
w/o Region-level adapter	Image	✗	Dynamic	28.4	79.8	53.9
w/o Attention dynamic fusion	Region	✓	Mask	27.4	78.3	62.2
Character-Adapter (full)	Region	✓	Dynamic	30.8	86.9	68.3

Prompt-guided segmentation. We conduct an experiment where the character is equally segmented into three parts, and compare the results to verify the importance of prompt-guided segmentation. Fig. 5 (a) exhibits a significant imbalance in the proportions of the characters, in contrast to Fig. 5 (d). The results indicate that prompt-guided segmentation facilitates comprehensive image feature extraction and provides fine-grained guidance for character generation.

Region-level adapters. The experimental results in Table 2 show a significant drop in the CLIP-I score when using an image-level adapter. A comparison between Fig. 5 (b) and Fig. 5 (d) demonstrates that the proposed region-level adapters effectively mitigate concept fusion issues, thereby improving character consistency.

Attention dynamic fusion. Fig. 5 (c) shows the result without attention dynamic fusion, revealing a dramatic change in the character’s hairstyle compared to Fig. 5 (d). This is because the solid mask only preserves details of certain regions of the reference character and is influenced by the layout image. The results in Table 2 also corroborate the effectiveness of the proposed attention dynamic fusion module.

4.5 User study

We conducted a user study with 20 experts to evaluate Character-Adapter against previous methods. Each expert assessed 50 pairwise comparisons. As shown in Table 3, the results indicate that Character-Adapter achieves a win rate significantly exceeding 50% for both text alignment and character consistency.

Table 3: User study results(%) illustrating the comparison between Character-Adapter and SOTA methods, revealing a notable preference among participants for Character-Adapter in terms of both textual alignment and character consistency.

Methods	Models (Character-Adapter vs. *)	Text Alignment Win(%)	Text Alignment Lose(%)	Character Consistency Win(%)	Character Consistency Lose(%)
Finetuning	LoRA	92.4	7.6	90.6	9.4
Plug-and-play	IP-Adapter	87.5	12.5	70.8	29.2
	T2I-Adapter	90.2	9.8	91.7	8.3
	InstantID	94.2	5.8	100	0.0
	Reference-only	100	0.0	99.2	0.8
	BLIP-Diffusion	100	0.0	100	0.0
	FastComposer	98.2	1.8	100	0.0

4.6 Extended application

Additional controls. Fig. 6 demonstrates that Character-Adapter can be successfully integrated with ControlNet, enabling pose conditioning to guide the consistent character generation.

Inpainting-based personalization. Our method can also be combined with inpainting to seamlessly insert consistent characters into any fixed scene, which holds great potential for generating comics and graphic novels. Fig. 6 shows that integrated with inpainting, our proposed Character-Adapter can replace a selected region with specific characters, while maintaining a fixed background naturally.

Consistent generation of non-human characters. As Character-Adapter supports character consistency, it has the potential to extend the ability to animals and objects. In Fig. 6, we have demonstrated that Character-Adapter can achieve excellent results in the task of animal consistency.

5 Conclusion

In this work, we investigate the limitations of existing consistent character generation methods and propose Character-Adapter, which enables the synthesis of high-fidelity images that preserve the intricate details and identities of reference characters. The key innovations in Character-Adapter are the prompt-guided segmentation module, which enables fine-grained extraction of regional features from reference characters, and the dynamic region-level adapters module, which mitigates concept fusion issues. Notably, Character-Adapter can be seamlessly integrated into any backbone model and is compatible with other editing tools for both single and multiple character generation.

References

[1] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation, 2023.
[2] Miao Hua, Jiawei Liu, Fei Ding, Wei Liu, Jie Wu, and Qian He. Dreamtuner: Single image is enough for subject-driven generation. arXiv preprint arXiv:2312.13691, 2023.
[3] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
[4] Yuxiang Wei, Yabo Zhang, Zhilong Ji, **feng Bai, Lei Zhang, and Wangmeng Zuo. Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15943–15953, 2023.
[5] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.
[6] Zhe Kong, Yong Zhang, Tianyu Yang, Tao Wang, Kaihao Zhang, Bizhu Wu, Guanying Chen, Wei Liu, and Wenhan Luo. Omg: Occlusion-friendly personalized multi-concept generation in diffusion models. arXiv preprint arXiv:2403.10983, 2024.
[7] Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, et al. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. Advances in Neural Information Processing Systems, 36, 2024.
[8] Yoad Tewel, Omri Kaduri, Rinon Gal, Yoni Kasten, Lior Wolf, Gal Chechik, and Yuval Atzmon. Training-free consistent text-to-image generation. arXiv preprint arXiv:2402.03286, 2024.
[9] Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
[10] Qixun Wang, Xu Bai, Haofan Wang, Zekui Qin, and Anthony Chen. Instantid: Zero-shot identity-preserving generation in seconds. arXiv preprint arXiv:2401.07519, 2024.
[11] Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, and Ying Shan. Photomaker: Customizing realistic human photos via stacked id embedding. arXiv preprint arXiv:2312.04461, 2023.
[12] Guangxuan Xiao, Tianwei Yin, William T Freeman, Frédo Durand, and Song Han. Fastcomposer: Tuning-free multi-subject image generation with localized attention. arXiv preprint arXiv:2305.10431, 2023.
[13] Dongxu Li, Junnan Li, and Steven Hoi. Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing. Advances in Neural Information Processing Systems, 36, 2024.
[14] Ziheng Wu, Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Xing Shi, and Jun Huang. Easyphoto: Your smart ai photo generator, 2023.
[15] Omer Dahary, Or Patashnik, Kfir Aberman, and Daniel Cohen-Or. Be yourself: Bounded attention for multi-subject text-to-image generation. arXiv preprint arXiv:2403.16990, 2024.
[16] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023.
[17] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. In ICML, 2022.
[18] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[19] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
[20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.
[21] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[22] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Qinsheng Zhang, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, et al. ediff-i: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
[23] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference on machine learning, pages 8821–8831. PMLR, 2021.
[24] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, June 2022.
[25] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
[26] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion, 2022.
[27] Yuxuan Zhang, Jiaming Liu, Yiren Song, Rui Wang, Hao Tang, **peng Yu, Huaxia Li, Xu Tang, Yao Hu, Han Pan, et al. Ssr-encoder: Encoding selective subject representation for subject-driven generation. arXiv preprint arXiv:2312.16272, 2023.
[28] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.
[30] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
[31] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
[32] Mikubill. reference-only. sd-webui-controlnet, https://github.com/Mikubill/sd-webui-controlnet, GitHub repository, 2023.
[33] Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models, 2023.
[34] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.

Appendix A Appendix

This appendix provides supplementary details on Character-Adapter. Implementation details are given in Section A.1. Limitations and future work are discussed in Sections A.2 and A.3, respectively. Potential societal impacts of our work are examined in Section A.4, and additional results are presented in Section A.5.

A.1 Implementation details

A.1.1 Inference setup

We employ Realistic Vision V4 (https://huggingface.co/SG161222/Realistic_Vision_V4.0_noVAE) to generate photo-realistic human portraits and animals, and Animesh (https://huggingface.co/redstonehero/animesh_prunedv21) for anime character generation. All compared models use 20-step Euler A sampling, with the classifier-free guidance scale set to 7.0. Images are generated at a resolution of 768 × 768. All experiments are run on an A30 GPU.
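The settings above can be collected into a small configuration object. The following is a minimal sketch, not the authors' code: the names `InferenceConfig`, `sampling_kwargs`, and `load_pipeline` are hypothetical, and the loader assumes the Hugging Face `diffusers` library and that the listed checkpoints load as standard Stable Diffusion pipelines. It configures the base model only; the Character-Adapter modules themselves are not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceConfig:
    """Sampling settings reported in A.1.1 (field names are illustrative)."""
    model_id: str
    num_inference_steps: int = 20  # 20-step sampling
    guidance_scale: float = 7.0    # classifier-free guidance scale
    height: int = 768              # output resolution 768 x 768
    width: int = 768


# Base checkpoints named in the text.
REALISTIC = InferenceConfig("SG161222/Realistic_Vision_V4.0_noVAE")
ANIME = InferenceConfig("redstonehero/animesh_prunedv21")


def sampling_kwargs(cfg: InferenceConfig) -> dict:
    """Keyword arguments for a diffusers-style pipeline call."""
    return {
        "num_inference_steps": cfg.num_inference_steps,
        "guidance_scale": cfg.guidance_scale,
        "height": cfg.height,
        "width": cfg.width,
    }


def load_pipeline(cfg: InferenceConfig):
    """Load the base model with an Euler Ancestral sampler (requires diffusers)."""
    # Imported lazily so the configuration above is usable without diffusers.
    from diffusers import EulerAncestralDiscreteScheduler, StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(cfg.model_id)
    # Assumption: "Euler A" corresponds to the Euler Ancestral scheduler.
    pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
    return pipe
```

Under these assumptions, a base-model image could be produced with `load_pipeline(REALISTIC)(prompt, **sampling_kwargs(REALISTIC)).images[0]`, where `prompt` is a text prompt of the user's choosing.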

A.1.2 Evaluation metrics

We first employ CLIP ViT-L/14 (https://huggingface.co/openai/clip-vit-large-patch14) to evaluate the similarity between the generated images and the given text prompts (CLIP-T). We then use the image encoder of the same CLIP model to evaluate the similarity between the generated images and the reference images (CLIP-I). We additionally employ the DINO score [34] to evaluate image alignment, as DINO is better suited for subject representation (DINO-I). Finally, we conduct human evaluations to further assess the performance of different approaches in terms of text and image alignment.
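Scores of this kind are commonly computed as the cosine similarity between embedding vectors. The sketch below illustrates that computation under stated assumptions: the embeddings are taken to be precomputed by the CLIP or DINO encoders, the helper names `cosine_similarity` and `mean_similarity` are hypothetical, and averaging over all generated/reference pairs is one plausible aggregation that the text does not specify.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / norm


def mean_similarity(generated: Sequence[Vector], references: Sequence[Vector]) -> float:
    """Average cosine similarity over every generated/reference pair.

    With image embeddings on both sides this plays the role of CLIP-I or
    DINO-I; with text embeddings as `references` it plays the role of CLIP-T.
    """
    scores = [cosine_similarity(g, r) for g in generated for r in references]
    if not scores:
        raise ValueError("need at least one generated and one reference embedding")
    return sum(scores) / len(scores)
```

A higher value indicates closer alignment; identical directions score 1 and orthogonal embeddings score 0.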

A.2 Limitations and discussion

While our method provides a plug-and-play framework for generating consistent and detailed characters with high robustness, several limitations warrant consideration. Firstly, in scenarios involving extremely complex clothing patterns, our model may not fully preserve the original details. Secondly, due to the inherent limitations in the semantic capabilities of Stable Diffusion models, there exists a possibility of inaccurate target localization in the attention maps, leading to misalignment of image details. We leave the exploration of these limitations as future work.

A.3 Future work

For future work, an intriguing direction would be to investigate methods for obtaining more accurate attention maps for different image regions based on prompts, thereby mitigating semantic confusion. This could be achieved by enhancing the semantic understanding of the diffusion model. Furthermore, as Character-Adapter supports identity preservation for both facial and attire features, it holds promise for applications in narrative storytelling and video generation.

A.4 Societal impacts

The proposed method aims to provide an effective and flexible tool for high-fidelity customized character image generation. However, there is a potential risk of misuse: individuals may generate fake celebrity images and thereby mislead the public. This concern is not unique to our approach but is common to all subject-driven image generation methods. One potential solution is to employ a safety checker akin to the NSFW filter (https://huggingface.co/runwayml/stable-diffusion-v1-5), a classification module that estimates whether generated images could be considered offensive or harmful, thereby preventing the generation of controversial content and the abuse of celebrity images. Such measures would mitigate the potential misuse of our method while preserving its intended functionality.

A.5 Additional results

In Fig. 7, we present additional results showcasing the capabilities of Character-Adapter. The proposed framework generates consistent characters and objects with high fidelity across diverse scene contexts. Furthermore, Character-Adapter handles scenarios involving multiple characters, maintaining high fidelity for each character throughout the generated images. These results further validate the efficacy and versatility of our proposed approach.