License: arXiv.org perpetual non-exclusive license
arXiv:2312.15905v1 [cs.CV] 26 Dec 2023

Cross Initialization for Personalized Text-to-Image Generation

Lianyu Pang$^{1}$, Jian Yin$^{1,2}$, Haoran ** Wang$^{4}$, Qing Li$^{5}$, Xudong Mao$^{1,2,*}$
$^{1}$Sun Yat-sen University  $^{2}$Guangdong Key Laboratory of Big Data Analysis and Processing
$^{3}$Lingnan University  $^{4}$East China Normal University  $^{5}$The Hong Kong Polytechnic University
Abstract

Recently, there has been a surge in face personalization techniques, benefiting from the advanced capabilities of pretrained text-to-image diffusion models. Among these, a notable method is Textual Inversion, which generates personalized images by inverting given images into textual embeddings. However, methods based on Textual Inversion still struggle with balancing the trade-off between reconstruction quality and editability. In this study, we examine this issue through the lens of initialization. Upon closely examining traditional initialization methods, we identified a significant disparity between the initial and learned embeddings in terms of both scale and orientation. The scale of the learned embedding can be up to 100 times greater than that of the initial embedding. Such a significant change in the embedding could increase the risk of overfitting, thereby compromising the editability. Driven by this observation, we introduce a novel initialization method, termed Cross Initialization, that significantly narrows the gap between the initial and learned embeddings. This method not only improves both reconstruction and editability but also reduces the optimization steps from 5,000 to 320. Furthermore, we apply a regularization term to keep the learned embedding close to the initial embedding. We show that when combined with Cross Initialization, this regularization term can effectively improve editability. We provide comprehensive empirical evidence to demonstrate the superior performance of our method compared to the baseline methods. Notably, in our experiments, Cross Initialization is the only method that successfully edits an individual’s facial expression. Additionally, a fast version of our method allows for capturing an input image in roughly 26 seconds, while surpassing the baseline methods in terms of both reconstruction and editability. Code will be available at https://github.com/lyuPang/CrossInitialization.

Figure 1: Personalization results of our method using a single input image. Our method enables a variety of novel personalized face generations with high visual fidelity, such as facial expression editing, interaction with other individuals, and stylization. Moreover, it significantly speeds up the personalization process by reducing the optimization steps from 5,000 to 320.
$^{*}$Corresponding author ([email protected]).

1 Introduction

Recent advancements in large-scale diffusion models [48, 51, 43] have significantly advanced the field of text-to-image generation, paving the way for a variety of generative tasks [17, 23, 7]. Text-to-image personalization [17], when provided with several images of a target concept, enables users to produce personalized images in novel contexts or styles. This personalization is achieved either by inverting the target concept into the textual embedding space [17, 64, 2] or by fine-tuning the pretrained diffusion model [49, 29]. Among these, Textual Inversion [17] is one notable method that learns the target concept by inverting given images into textual embeddings.

Figure 2: Scale (left) and orientation (right) of the textual embedding $v_{*}$, as initialized by the traditional method. The term $E(v_{*})$ represents the output vector of the text encoder, and $v_{\text{init}}$ represents the initial state of the embedding. After optimization, both the scale and orientation of $v_{*}$ undergo substantial alterations, aligning more closely with $E(v_{*})$.

Face personalization [69, 18, 68] focuses on the personalized generation of a particular individual. An effective face personalization model should be able to synthesize the individual in novel scenes or styles based on text prompts while preserving the individual’s unique identity. However, many existing methods are prone to overfitting and often struggle to generate images that align with the prompt while accurately capturing the individual’s identity.

In this work, we investigate the overfitting problem in Textual Inversion [17] through the lens of initialization. Traditional methods typically initialize the textual embedding with a super-category token (e.g., “face” or “person”) [17, 64, 2]. However, after optimization, this approach often leads to significant deviations from the initial embedding in both scale and orientation, as depicted in Fig. 2. Such drastic changes may increase the risk of overfitting and compromise the editability of the embedding.

To address this issue, our approach aims to minimize the disparity between the initial and learned embeddings. Our method is inspired by two main observations. Firstly, after optimization, the learned embedding tends to align with the output of the CLIP [40] text encoder in terms of both scale and orientation, as illustrated in Fig. 2. Secondly, using the text encoder’s output as its input typically produces an image nearly identical to the original, as shown in Fig. 3. Drawing from these insights, we introduce Cross Initialization, a method where the textual embedding is initialized with the text encoder’s output, as depicted in Fig. 4. This approach effectively narrows the gap between the initial and learned embeddings, facilitating more effective optimizations compared to traditional methods. Our results demonstrate that Cross Initialization not only enhances reconstruction quality and editability but also significantly speeds up the personalization process.

To further improve editability, we incorporate a regularization term designed to keep the learned embedding close to its initial state throughout the optimization process. In Textual Inversion, the effectiveness of this regularization is often limited due to the substantial disparity between the initial and learned embeddings. In contrast, when used in conjunction with Cross Initialization, this regularization strategy becomes significantly more effective. This improvement is primarily attributed to the reduced gap between the initial and learned embeddings facilitated by Cross Initialization.

We demonstrate the superior performance of Cross Initialization compared to the baseline methods through both qualitative and quantitative evaluations. Our method enables a variety of novel personalized face generations with high visual fidelity. Notably, in our experiments, Cross Initialization is the only method capable of editing an individual’s facial expression. Furthermore, a fast version of our method allows for capturing an input image in roughly 26 seconds, while surpassing the baseline methods in terms of both reconstruction and editability.

[Figure 3 image grid. Columns: Apple, House, Giraffe, Face. Rows: conditioning $c(v)=E(v)$ (top) and $c(v)=E(E(v))$ (bottom).]
Figure 3: Top row: Images generated using standard textual embeddings as input for the text encoder, for instance, $v_{\text{apple}}$. Bottom row: Images generated using the output of the text encoder as its input, for instance, $E(v_{\text{apple}})$. Here, $c(v)$ denotes the conditioning vector in diffusion models. The images produced by $v$ and $E(v)$ are remarkably similar.

2 Related Works

Text-to-Image Synthesis.

Text-to-image synthesis is the task of generating realistic and diverse images from natural language descriptions. Various deep generative models have been widely explored for this task, such as GANs [44, 52], VAEs [42, 15], and Autoregressive Models [67, 43]. Recently, diffusion models [48, 57, 24] have demonstrated remarkable capabilities in generating high-fidelity images aligned with textual prompts [43, 36, 48, 51, 7].

Figure 4: Comparison of Textual Inversion Initialization and Cross Initialization techniques. Textual Inversion [17] (left) initializes the textual embedding $v_{*}$ with a super-category token (e.g., “face”). Cross Initialization (right) begins by obtaining the output vector from the text encoder $E(v_{*})$, which is subsequently used to initialize the embedding. This approach reduces the disparity between the initial and learned embeddings.

Inversion.

Image inversion involves reconstructing an image by mapping it into the latent space of a pretrained generator. This process can be accomplished either through direct optimization of the latent code [1, 19, 71] or by employing an encoder network to map the image into a latent space [39, 45, 6, 37, 59, 65, 70]. Image inversion has been applied to various image manipulation tasks [53, 38, 19]. In the context of diffusion models, image inversion aims to identify an initial noise latent code that can be denoised back to the input image [43, 14, 35]. This inverted noise latent code is then leveraged for text-guided image manipulation, as explored in recent studies [23, 12, 28, 30, 60].

Personalization.

Personalization adapts pretrained generative models to capture new concepts depicted in several given images. In the realm of text-to-image diffusion models, this allows for the creation of personalized images guided by text prompts. Techniques for this task include optimizing textual embeddings to learn new concepts [11, 17, 64, 2, 16, 62], fine-tuning diffusion models for concept acquisition [22, 50, 10, 21, 11, 58, 49, 29, 56, 4], and training encoders for mapping new concepts to textual representations [3, 18, 55, 69, 33, 9, 26]. These methods facilitate applications like image editing [28, 61] and personalized 3D generation [31, 34, 41, 46]. Particularly, some studies [69, 68, 18, 66, 8, 20, 25] focus on the personalized generation of individual human images. However, existing methods often face the overfitting problem, hindering the creation of text-aligned personalized images. Our work addresses this challenge by examining the overfitting problem through the lens of initialization. Our approach enables more efficient learning of new concepts, leading to faster personalized face generation with improved identity preservation and enhanced editability.

3 Preliminaries

Latent Diffusion Models.

We implement our method on the publicly available Stable Diffusion (SD) model, a Latent Diffusion Model (LDM) [48] for text-to-image synthesis. This model is composed of an encoder, $\mathcal{E}$, which maps an image $x$ to a latent code $z=\mathcal{E}(x)$, and a decoder, $\mathcal{D}$, which reconstructs the image from this code, $\mathcal{D}(\mathcal{E}(x))\approx x$. A Denoising Diffusion Probabilistic Model (DDPM) [24] is trained to generate latent codes within the latent space of a pretrained autoencoder. For text-to-image generation, the model is conditioned on a vector $c(y)$ derived from a text prompt $y$. The training objective of LDM is defined by:

$$\mathcal{L}_{\text{diffusion}}=\mathbb{E}_{z\sim\mathcal{E}(x),\,y,\,\varepsilon\sim\mathcal{N}(0,1),\,t}\left[\left\|\varepsilon-\varepsilon_{\theta}\left(z_{t},t,c(y)\right)\right\|_{2}^{2}\right]. \tag{1}$$

Given the timestep $t$, the noised latent $z_{t}$, and the conditioning vector $c(y)$, the denoising network $\varepsilon_{\theta}$ aims to remove the noise that was added to the original latent code $z_{0}$.
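
To make Eq. 1 concrete, the following minimal sketch evaluates a single Monte Carlo sample of the loss on a toy latent. It assumes the standard DDPM forward process $z_t=\sqrt{\bar{\alpha}_t}\,z_0+\sqrt{1-\bar{\alpha}_t}\,\varepsilon$, which the text above does not spell out. The function names, the noise level, and the two stand-in denoisers are hypothetical and are not part of Stable Diffusion.

```python
import math
import random

def add_noise(z0, eps, alpha_bar_t):
    """Assumed DDPM forward process: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * z + b * e for z, e in zip(z0, eps)]

def diffusion_loss(eps, eps_pred):
    """One sample of Eq. 1: squared L2 distance between true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))

rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(8)]    # toy latent code z_0
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]   # noise drawn from N(0, 1)
alpha_bar_t = 0.5                               # hypothetical noise level at timestep t
z_t = add_noise(z0, eps, alpha_bar_t)

# Two stand-in "denoisers": an oracle that knows z_0, and one that always predicts zero.
a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
eps_oracle = [(zt - a * z) / b for zt, z in zip(z_t, z0)]
eps_zero = [0.0] * len(eps)

loss_oracle = diffusion_loss(eps, eps_oracle)   # essentially zero
loss_zero = diffusion_loss(eps, eps_zero)       # equals the squared norm of eps
```

A trained $\varepsilon_{\theta}$ sits between these two extremes; training averages this quantity over images, prompts, noise samples, and timesteps.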

Text Embeddings.

Given a text prompt $y$, the sentence is first tokenized into several tokens. Each token is then mapped to a textual embedding $v_{i}$ using a predefined embedding lookup. Subsequently, these textual embeddings are passed through a pretrained CLIP text encoder $E$, which outputs a series of vectors that constitute the conditioning vector $c(y)=[E(v_{1}),\dots,E(v_{n})]$. For a textual embedding $v_{i}\in\mathbb{R}^{1024}$, its corresponding output of the text encoder is denoted by $E(v_{i})\in\mathbb{R}^{1024}$. Note that in the SD v2.1 model, the dimensionality of both $v_{i}$ and $E(v_{i})$ is $1024$.
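
The three stages described above (tokenize, look up, encode) can be sketched as follows. This is a toy illustration only: the whitespace tokenizer, the 4-dimensional lookup table, and `toy_text_encoder` are hypothetical stand-ins, and details of the real CLIP encoder such as positional embeddings and special tokens are omitted.

```python
import math

def layer_norm(v, eps=1e-5):
    """LayerNorm without learned gain and bias (a simplification)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def toy_text_encoder(embeddings):
    """Hypothetical stand-in for E: normalize each token, then mix in the sequence mean."""
    normed = [layer_norm(v) for v in embeddings]
    dim = len(normed[0])
    context = [sum(v[j] for v in normed) / len(normed) for j in range(dim)]
    return [[0.5 * x + 0.5 * c for x, c in zip(v, context)] for v in normed]

# Toy embedding lookup with 4-dimensional vectors (1024-dimensional in SD v2.1).
embedding_lookup = {
    "a": [0.1, -0.2, 0.3, 0.0],
    "photo": [0.5, 0.1, -0.4, 0.2],
    "of": [-0.1, 0.0, 0.2, -0.3],
    "face": [0.3, -0.5, 0.1, 0.4],
}

def conditioning(prompt):
    tokens = prompt.lower().split()                 # 1) tokenize the prompt y
    embeds = [embedding_lookup[t] for t in tokens]  # 2) embedding lookup gives v_1..v_n
    return toy_text_encoder(embeds)                 # 3) encoder output gives c(y)

c = conditioning("a photo of face")
```

The point to notice is that the encoder's input and output vectors share one dimensionality, which is what later allows an output vector to be fed back in as an input embedding.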

Textual Inversion.

Textual Inversion [17] is a technique that captures novel concepts from a few example images. It is achieved by injecting new concepts into the pretrained diffusion models. Specifically, Textual Inversion introduces a new token $S_{*}$ and its corresponding textual embedding $v_{*}$, representing the new concept. To learn the new concept, Textual Inversion fixes the LDM and optimizes only $v_{*}$, minimizing the objective of LDM given in Eq. 1. The optimization objective is defined by:

$$v_{*}=\arg\min_{v}\mathbb{E}_{z,y,\varepsilon,t}\left[\left\|\varepsilon-\varepsilon_{\theta}\left(z_{t},t,c(y,v)\right)\right\|_{2}^{2}\right], \tag{2}$$

where $c(y,v)$ is the conditioning vector obtained from the prompt $y$ and the textual embedding $v$.
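
The structure of Eq. 2, a frozen model with a single trainable embedding, can be illustrated with a deliberately small example. Here a fixed matrix `W` plays the role of the frozen network (text encoder plus denoiser), and plain gradient descent updates only `v`. All values are hypothetical, and the real procedure resamples images, noise, and timesteps at every step rather than using one fixed target.

```python
def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Frozen stand-in for the pretrained model: maps the embedding v to a noise prediction.
W = [[1.0, 0.2, 0.0],
     [0.0, 1.0, 0.3],
     [0.1, 0.0, 1.0]]
eps = [0.5, -1.0, 0.25]   # target noise (fixed here; resampled in practice)
v = [0.0, 0.0, 0.0]       # the only trainable quantity
lr = 0.1                  # hypothetical learning rate

initial_loss = squared_error(eps, matvec(W, v))
for _ in range(200):
    residual = [p - e for p, e in zip(matvec(W, v), eps)]
    # Gradient of ||eps - W v||^2 with respect to v only: 2 * W^T (W v - eps).
    grad = [2.0 * sum(W[i][j] * residual[i] for i in range(3)) for j in range(3)]
    v = [x - lr * g for x, g in zip(v, grad)]   # W is never updated
final_loss = squared_error(eps, matvec(W, v))
```

Because the model weights stay fixed, everything the method learns about the new concept has to be stored in `v`, which is why the starting point of `v` matters.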

4 Method

Our method is based on the Textual Inversion technique, in which the textual embedding is typically initialized with a super-category token (e.g., “face”). In this section, we analyze how Textual Inversion suffers from a severe overfitting problem through the lens of initialization, as detailed in Sec. 4.1. To address this issue, we propose a novel initialization method, named Cross Initialization, as described in Sec. 4.2. This method facilitates more efficient optimizations, enhancing both reconstruction and editability. To further improve editability, we introduce a regularization term in Sec. 4.3.

[Figure 5 image grid. Columns: Input, “A sand sculpture of $S_{*}$”, “$S_{*}$ Funko Pop”.]
Figure 5: Images generated by Textual Inversion. This method fails to place the given individual in new styles, primarily due to its tendency to overfit the input image.


4.1 Analysis

In Fig. 5, we show several examples generated by Textual Inversion. This method fails to place the person in new styles and generates images similar to the input image, indicating a severe overfitting problem. In this section, we delve into this overfitting problem in Textual Inversion from the perspective of initialization. Existing methods based on Textual Inversion typically initialize the textual embedding with a super-category token [17, 64, 2]. However, our experiments consistently show that, after optimization, the learned embedding becomes significantly different from its initial state, both in scale and orientation. Figs. 2 and 6 show several examples where the scale of the learned embedding can be up to 100 times greater than that of the initial embedding. Such drastic changes in the embedding may increase the risk of overfitting and degrade the editability of the embedding.
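
As a rough illustration of the two quantities compared here, the sketch below measures scale as the L2 norm and orientation as the cosine similarity between two vectors. These metric choices are an assumption, since the text does not state exactly how scale and orientation are computed, and the two vectors are made-up values chosen to give a 100-fold scale gap.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (l2_norm(u) * l2_norm(v))

# Hypothetical embeddings: a small-norm initial vector and a large-norm learned one.
v_init = [0.01, -0.02, 0.02]
v_learned = [1.0, 2.0, -2.0]

scale_ratio = l2_norm(v_learned) / l2_norm(v_init)   # how much the scale grew
orientation = cosine_similarity(v_init, v_learned)   # 1.0 would mean same direction
```

In this toy case the norm grows by a factor of about 100 and the cosine similarity is negative, so the learned vector differs from its starting point in both respects.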

Given that the learned embedding significantly differs from the initial embedding of a coarse descriptor, a question arises: How does the learned embedding manage to produce images that accurately represent the given concept? To investigate this, we examine the outputs of the intermediate layers in the text encoder. The text encoder comprises several self-attention blocks [54], with a LayerNorm layer [5] preceding the input of each sub-block. We observe that the LayerNorm layer normalizes the scale of the embedding, while the self-attention layer modifies its orientation. LABEL:fig:embedding_in_encoder illustrates this process: each sub-block progressively alters the scale and orientation of the embedding, and ultimately the output vectors of the initial and learned embeddings exhibit a similarity in both scale and orientation.
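
The scale-normalizing effect of LayerNorm can be checked numerically. The sketch below uses a LayerNorm without the learned gain and bias (a simplifying assumption) and applies it to a vector and to the same vector scaled by 100; the input values are arbitrary.

```python
import math

def layer_norm(v, eps=1e-5):
    """LayerNorm without the learned gain and bias (a simplifying assumption)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

v_small = [0.5, -1.2, 0.3, 2.0, -0.7, 0.1, 1.1, -1.5]
v_large = [100.0 * x for x in v_small]   # same direction, 100 times the scale

out_small = layer_norm(v_small)
out_large = layer_norm(v_large)
norm_small = l2_norm(out_small)          # approximately sqrt(8)
norm_large = l2_norm(out_large)          # approximately sqrt(8) as well
```

Both outputs have a norm close to $\sqrt{d}$ with $d=8$, so a 100-fold difference in input scale is largely erased after normalization. This is consistent with the observation that embeddings of very different scales can still yield similar encoder outputs.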

To mitigate the overfitting issue in Textual Inversion, this analysis motivates us to seek an initial embedding that can be close to the learned embedding.

Figure 6: More examples illustrating that, after optimization, the textual embedding $v_{*}$ experiences significant changes in both scale (left) and orientation (right). Here, $v_{\text{init}}$ denotes the embedding’s initial state, and $v_{\text{learned}}$ denotes the embedding’s final state.

4.2 Cross Initialization

Based on the analysis in Sec. 4.1, our goal is to design an initial embedding that meets two criteria: 1) it is close to the learned embedding, and 2) it roughly captures the target concept. Our method is inspired by two key observations. First, as shown in Fig. 2, the learned embedding becomes similar to the output of the text encoder after optimization. Second, when we use the text encoder’s output as its input, the diffusion model produces an image nearly identical to the original, as shown in Fig. 3. The reason for these two phenomena is that the LayerNorm and self-attention layers in the text encoder gradually alter the scale and orientation of the embedding, making it converge to a specific vector, as discussed in Sec. 4.1. Based on these insights, we propose initializing the textual embedding with the output of the text encoder, a method we term Cross Initialization, as depicted in Fig. 4.

Formally, given a single face image, we first set the textual embedding to the mean of 691 well-known names’ embeddings, denoted as $\bar{v}_{691}$. The computation of $\bar{v}_{691}$ is elaborated in the following subsection. Subsequently, we feed $\bar{v}_{691}$ into the text encoder $E$, obtaining the output vector $E(\bar{v}_{691})$. We then initialize the textual embedding $v_{\text{init}}$ with this output vector:

$$v_{\text{init}}=E(\bar{v}_{691}). \tag{3}$$

Finally, we optimize the textual embedding by minimizing the LDM loss given in Eq. 2.
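
A minimal sketch of Eq. 3 follows. The function `toy_text_encoder` is a hypothetical stand-in for the CLIP text encoder (a LayerNorm followed by a fixed mixing step as a crude proxy for attention), and `v_bar` holds made-up values in place of $\bar{v}_{691}$. The sketch shows only the mechanics of the initialization, not the authors' implementation.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def toy_text_encoder(v):
    """Hypothetical stand-in for E: LayerNorm, then a residual mix with a
    cyclically shifted copy (a crude proxy for attention)."""
    h = layer_norm(v)
    shifted = h[1:] + h[:1]
    return [x + 0.5 * s for x, s in zip(h, shifted)]

# Made-up mean name embedding; real embeddings are 1024-dimensional in SD v2.1.
v_bar = [0.02, -0.01, 0.03, 0.00, -0.02, 0.01]

# Eq. 3: the encoder output is reused as the initial *input* embedding.
v_init = toy_text_encoder(v_bar)

norm_before = l2_norm(v_bar)    # scale of the traditional starting point
norm_after = l2_norm(v_init)    # scale of the Cross Initialization starting point
```

The step is possible only because the encoder's input and output share the same dimensionality. In this toy setting the initial embedding already has a much larger norm than `v_bar`, mirroring the scale gap that traditional initialization has to close during optimization.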

The aforementioned two observations ensure that the initial embedding $E(\bar{v}_{691})$ is close to the learned embedding, while also roughly representing the target concept. As shown in Fig. 7, using Cross Initialization, the learned embedding retains proximity to its initial state throughout the optimization process. This facilitates more efficient optimizations, leading to more identity-preserved, prompt-aligned, and faster face personalization.

Mean Textual Embedding.

We follow [68] to construct the mean textual embedding $\bar{v}_{691}$. A total of 691 well-known names are used to form an embedding set $C=\{v_{1},\dots,v_{m}\}$, where $m=691$ and each textual embedding $v_{i}$ is obtained from the pre-defined embedding lookup. The mean textual embedding is calculated as $\bar{v}_{691}=\frac{1}{m}\sum_{i=1}^{m}v_{i}$.
Moreover, we represent each name with two tokens (i.e., the first and last names), resulting in the final mean textual embedding as $\bar{v}_{691}=[\bar{v}_{691}^{f},\bar{v}_{691}^{l}]$, where $\bar{v}_{691}^{f}$ and $\bar{v}_{691}^{l}$ are calculated using the embedding sets of the first and last names, respectively.
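
The averaging step can be written in a few lines. The sketch below uses three made-up 3-dimensional embeddings per set instead of 691 real name embeddings, so the numbers are purely illustrative.

```python
def mean_embedding(embeddings):
    """Element-wise mean of a list of equal-length vectors."""
    m = len(embeddings)
    dim = len(embeddings[0])
    return [sum(v[j] for v in embeddings) / m for j in range(dim)]

# Hypothetical embeddings for m = 3 names (the paper uses m = 691 names taken
# from the model's pre-defined embedding lookup).
first_name_embeddings = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 4.0, 1.0]]
last_name_embeddings = [[0.0, 1.0, 1.0], [2.0, 1.0, 3.0], [4.0, 1.0, 2.0]]

v_bar_f = mean_embedding(first_name_embeddings)   # mean over first-name tokens
v_bar_l = mean_embedding(last_name_embeddings)    # mean over last-name tokens
v_bar = [v_bar_f, v_bar_l]                        # two-token mean embedding
```

Keeping the first-name and last-name means as two separate vectors matches the two-token representation described above.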

Comparison with Directly Optimizing $E(v)$.

In Cross Initialization, we set the text encoder’s output as its input, i.e., $v_{\text{init}}=E(\bar{v})$, and optimize the input vector $v_{\text{init}}$. An alternative method is to directly optimize the output vector $E(\bar{v})$. However, this approach eliminates the interaction between the new concept and other prompt tokens, as the new concept is not passed through the text encoder along with the other prompt tokens, leading to poor editability. This issue is also indicated in [2]. In contrast, Cross Initialization optimizes the input vector, thereby preserving the ability to create new compositions for the new concept.

4.3 Regularization

As illustrated in Sec. 4.2, the initial embedding is constructed using the mean center of embeddings from 691 well-known names. We assume that the region around this central embedding represents the subspace corresponding to the concept of the individual. High editability is expected when the learned embedding lies close to this subspace. Therefore, we introduce a regularization term to keep the learned embedding close to the central embedding throughout the optimization process. Specifically, we minimize the L2 distance between them, defined as:

$$\mathcal{L}_{\text{reg}}=\left\|v-v_{\text{init}}\right\|_{2}^{2}. \tag{4}$$

Overall, our final optimization objective is defined as:

$$v_{*}=\arg\min_{v}\mathcal{L}_{\text{diffusion}}+\lambda\mathcal{L}_{\text{reg}}. \tag{5}$$

Note that this regularization approach, also investigated in [17], faces challenges when applied in Textual Inversion. This is primarily due to the significant disparity between the initial and learned embeddings, as well as the coarseness of the super-category token. These factors limit the effectiveness of this regularization approach.
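
The effect of the regularizer in Eqs. 4 and 5 can be seen on a toy problem. In the sketch below, the diffusion loss is replaced by a hypothetical quadratic surrogate $\|v-\text{target}\|_2^2$ so that the result can be verified by hand; the value $\lambda=1$, the learning rate, and the vectors are illustrative assumptions, not settings from the paper.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def optimize(v_init, target, lam, lr=0.1, steps=200):
    """Gradient descent on ||v - target||^2 + lam * ||v - v_init||^2.
    The first term is a toy surrogate for the diffusion loss."""
    v = list(v_init)
    for _ in range(steps):
        grad = [2.0 * (x - t) + 2.0 * lam * (x - x0)
                for x, t, x0 in zip(v, target, v_init)]
        v = [x - lr * g for x, g in zip(v, grad)]
    return v

v_init = [1.0, 0.0]    # initial embedding
target = [3.0, 4.0]    # minimizer of the surrogate loss alone
lam = 1.0              # hypothetical regularization weight

v_plain = optimize(v_init, target, lam=0.0)   # converges to target
v_reg = optimize(v_init, target, lam=lam)     # converges to (target + lam * v_init) / (1 + lam)

drift_plain = sq_dist(v_plain, v_init)
drift_reg = sq_dist(v_reg, v_init)
```

With $\lambda=1$ the solution lands halfway between `v_init` and `target`, so the squared drift from the initial embedding drops from about 20 to about 5. When the initial embedding is already near a good solution, such a pull costs little in reconstruction; when the two are far apart, it forces a much harsher trade-off.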

Figure 7: Scale (left) and orientation (right) of the textual embedding $v_{*}$, as initialized by Cross Initialization. Here, $E(v_{*})$ represents the output vector of the text encoder, and $v_{\text{init}}$ represents the initial state of the embedding. In contrast to the examples in Fig. 2, Cross Initialization maintains the learned embedding close to the initial state in terms of both scale and orientation.

[Figure 8 image grid. Columns: Real Sample, Textual Inversion, DreamBooth, NeTI, Celeb Basis, Ours. Row prompts: “$S_{*}$ with a puzzled expression”; “$S_{*}$ with an angry expression”; “$S_{*}$ and Elon Musk are eating bread in front of the Eiffel Tower”; “$S_{*}$ as Captain Marvel”; “$S_{*}$ latte art”.]
Figure 8: Qualitative comparisons. Given a single input image, we present four images generated by each method using identical random seeds. Our approach demonstrates superior performance in identity preservation and editability. Notably, Cross Initialization is the only method that successfully edits an individual’s facial expression.
[Figure 9 image grid. Each row shows a real sample followed by five generated images.
Row 1 prompts: “A photo of $S_{*}$ typing a paper on a laptop”; “$S_{*}$ holding up his accepted paper”; “$S_{*}$ wearing yellow jacket and driving a motorbike”; “$S_{*}$ presenting a poster at a conference”; “A photo of $S_{*}$ graduating after finishing his PhD”.
Row 2 prompts: “$S_{*}$ with a sad expression”; “$S_{*}$ with a terrified expression”; “$S_{*}$ shakes hands with Elon Musk in news conference”; “$S_{*}$ and Barack Obama cooking together in a kitchen”; “$S_{*}$ and Anne Hathaway enjoy a delicate candlelight dinner”.
Row 3 prompts: “A sand sculpture of $S_{*}$”; “Greek sculpture of $S_{*}$”; “$S_{*}$ Funko Pop”; “Manga drawing of $S_{*}$”; “Pointillism painting of $S_{*}$”.]
Figure 9: Examples of personalized text-to-image generation obtained with Cross Initialization.

5 Experiments

In this section, we first present the implementation details of our method. Subsequently, we demonstrate its effectiveness by conducting a comparative analysis with four state-of-the-art personalization methods, focusing on aspects such as identity preservation, editability, and optimization time.

5.1 Implementation and Evaluation Setup

Implementation.

We use the publicly available Stable Diffusion v2.1 [48] as our base model. Images are generated at a resolution of $512 \times 512$. The hyper-parameter $\lambda$ is set to $10^{-5}$ for all experiments. Each experiment takes a single image as input and runs on a single A800 GPU, with a batch size of 8 and a learning rate of 0.005. All results are obtained using 320 optimization steps.
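
For reference, the settings above can be collected in one place. The following is a minimal sketch rather than the authors' code: the class and field names are hypothetical, and only the numeric values come from the text. The fast variant is the one described under Personalization Time in Sec. 5.2; keeping its remaining fields unchanged is an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PersonalizationConfig:
    """Hyper-parameters quoted in Sec. 5.1 (field names are illustrative)."""
    base_model: str = "Stable Diffusion v2.1"
    resolution: int = 512         # images are generated at 512 x 512
    reg_weight: float = 1e-5      # lambda, weight of the regularization term
    batch_size: int = 8
    learning_rate: float = 0.005
    max_steps: int = 320


DEFAULT = PersonalizationConfig()

# "Ours-fast" changes the learning rate and the number of steps.
# Assumption: every other field stays at its default value.
FAST = replace(DEFAULT, learning_rate=0.08, max_steps=25)

if __name__ == "__main__":
    for name, cfg in (("default", DEFAULT), ("fast", FAST)):
        print(name, cfg)
```

A frozen dataclass is used so that a configuration cannot be modified by accident between runs.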

Evaluation Setup.

We evaluate each method using images from the CelebA-HQ test set [32, 27]. The prompts are primarily sourced from [68] and [18]. We compare our method with four state-of-the-art personalization methods: Textual Inversion [17], DreamBooth [49], NeTI [2], and Celeb Basis [68]. The implementation details of the baselines are presented in Appendix A. All methods are implemented for one-shot personalization. For quantitative evaluation, each method is evaluated on the first 200 images from the CelebA-HQ test set using two metrics: identity similarity and prompt similarity. For identity similarity, ArcFace [13], a pretrained face recognition model, is used to measure identity preservation in the generated images. Prompt similarity is measured by computing the CLIP score between generated images and text prompts. We exclude the stylization prompts from the identity similarity assessment, as ArcFace is trained on real images.
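
Both metrics compare embeddings from a pretrained model. The sketch below assumes that each score is a cosine similarity between embedding vectors (ArcFace embeddings for identity, CLIP image and text embeddings for prompts). This is the common convention, but the text does not spell it out. The function names are hypothetical, and the embeddings are assumed to have been extracted already.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def mean_similarity(reference, candidates):
    """Average similarity of one reference embedding to many candidates.

    For identity similarity the reference would be the input-face embedding;
    for prompt similarity it would be the text embedding of the prompt.
    """
    if not candidates:
        raise ValueError("need at least one candidate embedding")
    return sum(cosine_similarity(reference, c) for c in candidates) / len(candidates)
```

Averaging over all generated images for a prompt, and then over prompts and identities, would give one number per method of the kind reported in Tab. 1.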

5.2 Results

Qualitative Evaluation.

In Fig. 8, we present a visual comparison of personalized generation using four types of prompts: expression editing, background modification, individual interaction, and artistic style. Textual Inversion exhibits an overfitting problem, failing to compose the given individual in novel scenes. DreamBooth struggles to reconstruct the individual for complex editing prompts such as background modification and artistic style. It tends to disregard the new concept and generate images based solely on the remaining prompt tokens. In contrast, NeTI generates images based solely on the new concept without incorporating the other prompt tokens, indicating a severe overfitting problem. Both Celeb Basis and our method are capable of generating novel compositions of personalized concepts. Compared to Celeb Basis, our method shows superior identity preservation and excels in editing the individual’s expression. For all prompts, Cross Initialization achieves high-fidelity reconstruction of the individual’s identity while providing superior editability. Notably, it is the only method that successfully edits an individual’s facial expression. Fig. 9 shows more results with different prompts from our method. Additional qualitative results can be found in Appendices D and E. We also provide results on synthetic facial images in Appendix F.

Quantitative Evaluation.

We quantitatively evaluate our approach in two aspects: 1) identity similarity between the generated and input images, and 2) prompt similarity between the generated image and the given text prompt. All methods are evaluated over 20 text prompts; see Appendix B for the full list. These prompts cover expression editing (e.g., “$S_*$ with a sad expression”), background modification (e.g., “$S_*$ on the beach”), individual interaction (e.g., “$S_*$ shakes hands with Anne Hathaway in news conference”), and artistic style (e.g., “$S_*$ latte art”). For each prompt, we generate 32 images using the same random seed for all methods.
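
The size of this evaluation can be made concrete with a short sketch. It assumes that every one of the 200 test identities is paired with every prompt, and that the 32 images per prompt come from a fixed list of seeds shared by all methods. Both are readings of the text rather than confirmed details, and the function names are hypothetical.

```python
from itertools import product

NUM_IDENTITIES = 200     # first 200 images of the CelebA-HQ test set
NUM_PROMPTS = 20         # prompts listed in Appendix B
IMAGES_PER_PROMPT = 32   # images generated per prompt


def generation_jobs(identities, prompts, seeds):
    """Yield one (identity, prompt, seed) triple per image to generate.

    Reusing the same job list for every method keeps the comparison paired:
    each method sees the same identity, prompt, and seed.
    """
    yield from product(identities, prompts, seeds)


def total_images(num_identities=NUM_IDENTITIES,
                 num_prompts=NUM_PROMPTS,
                 images_per_prompt=IMAGES_PER_PROMPT):
    """Number of images a single method generates under these assumptions."""
    return num_identities * num_prompts * images_per_prompt


if __name__ == "__main__":
    print(total_images())
```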

The results are shown in Tab. 1. DreamBooth excels in prompt similarity but ranks lowest in identity similarity. This is consistent with the qualitative observations, where DreamBooth often overlooks the new concept, focusing solely on the other prompt tokens. In contrast, NeTI achieves the highest identity similarity scores but ranks lowest in prompt similarity, as NeTI tends to overfit the input image. Besides these two extreme cases, our method demonstrates superior performance in both identity and prompt similarity metrics.

Table 1: Quantitative comparisons. “Identity” denotes the identity similarity between the generated and input images. “Prompt” denotes the prompt similarity between the generated image and the given text prompt. “Time” denotes the average personalization time in seconds.
| Methods | Identity ↑ | Prompt ↑ | Time (s) ↓ |
| --- | --- | --- | --- |
| Textual Inversion [17] | 0.2115 | 0.2498 | 6331 |
| DreamBooth [49] | 0.2053 | 0.3015 | 623 |
| NeTI [2] | 0.3789 | 0.2325 | 1527 |
| Celeb Basis [68] | 0.2070 | 0.2683 | 140 |
| Ours-fast | 0.2225 | 0.2800 | 26 |
| Ours | 0.2517 | 0.2859 | 346 |
Table 2: User study results. We asked the participants to select the image that better preserves the identity and matches the prompt.
| Baselines | Prefer Baseline | Prefer Ours |
| --- | --- | --- |
| Textual Inversion [17] | 22.0% | 78.0% |
| DreamBooth [49] | 9.3% | 90.7% |
| NeTI [2] | 24.7% | 75.3% |
| Celeb Basis [68] | 26.7% | 73.3% |

Personalization Time.

The average personalization time of each method is reported in Tab. 1. Compared to Textual Inversion, our method reduces the optimization time from 106 minutes to 6 minutes. Additionally, we develop a fast version of our method, denoted as “Ours-fast”, with a learning rate of 0.08. This fast version learns the new concept in merely 25 optimization steps, taking only 26 seconds. As demonstrated in Tab. 1, it achieves the quickest personalization while surpassing Celeb Basis and Textual Inversion in both identity similarity and prompt similarity. The visual results of this fast version are presented in Appendix C.
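
The minute figures quoted above follow directly from the times in Tab. 1. The sketch below reproduces the conversion and derives the implied speed-ups. The dictionary simply restates the table values, and the helper names are hypothetical.

```python
# Average personalization time in seconds, as reported in Tab. 1.
TIME_SECONDS = {
    "Textual Inversion": 6331,
    "DreamBooth": 623,
    "NeTI": 1527,
    "Celeb Basis": 140,
    "Ours-fast": 26,
    "Ours": 346,
}


def minutes(method):
    """Personalization time of a method in minutes."""
    return TIME_SECONDS[method] / 60


def speedup(baseline, method):
    """How many times faster `method` is than `baseline`."""
    return TIME_SECONDS[baseline] / TIME_SECONDS[method]


if __name__ == "__main__":
    print(f"Textual Inversion: {minutes('Textual Inversion'):.1f} min")
    print(f"Ours: {minutes('Ours'):.1f} min")
    print(f"Ours vs. Textual Inversion: {speedup('Textual Inversion', 'Ours'):.1f}x")
    print(f"Ours-fast vs. Celeb Basis: {speedup('Celeb Basis', 'Ours-fast'):.1f}x")
```

Dividing the reported times by the step counts (320 steps for the full method, 25 for the fast version) gives roughly one second per optimization step in both cases.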

User Study.

We also evaluate our method from a human perspective by conducting a user study. We randomly selected one prompt from the prompt set and one image from the CelebA-HQ test set. These were used to generate personalized images for each method. In each question of the study, participants were presented with the input image and text prompt, as well as two generated images: one from our method and another from the baseline method. Participants were asked to select the image that better preserves the identity and matches the prompt. In total, we collected 600 responses from 30 participants, as shown in Tab. 2. The results show a clear preference for our method.
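
As a sanity check on Tab. 2, the reported percentages can be converted back into vote counts. The sketch assumes that the 600 responses are split evenly across the four baselines (150 comparisons each). The text does not state this, but the rounded percentages are consistent with it.

```python
TOTAL_RESPONSES = 600
NUM_PARTICIPANTS = 30

# Percentage of responses preferring the baseline, as reported in Tab. 2.
PREFER_BASELINE = {
    "Textual Inversion": 22.0,
    "DreamBooth": 9.3,
    "NeTI": 24.7,
    "Celeb Basis": 26.7,
}

# Each participant answers the same number of questions.
RESPONSES_PER_PARTICIPANT = TOTAL_RESPONSES // NUM_PARTICIPANTS

# Assumption: responses are divided equally among the four baselines.
RESPONSES_PER_BASELINE = TOTAL_RESPONSES // len(PREFER_BASELINE)


def implied_votes(percent, n=RESPONSES_PER_BASELINE):
    """Whole number of votes closest to `percent` of `n` responses."""
    return round(percent / 100 * n)


if __name__ == "__main__":
    for name, pct in PREFER_BASELINE.items():
        votes = implied_votes(pct)
        print(f"{name}: {votes}/{RESPONSES_PER_BASELINE} prefer the baseline")
```

Under this assumption, each implied count reproduces the reported percentage when rounded to one decimal place.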

5.3 Ablation Study

We conduct an ablation study by separately removing each sub-module from our method. Specifically, we remove the following sub-modules one at a time: 1) Cross Initialization, 2) mean textual embedding, and 3) the regularization term. In Fig. 10, we present a visual comparison of the personalized images generated by each variant. The results indicate that all sub-modules are crucial for achieving identity-preserved and prompt-aligned personalized face generation. Specifically, the model without Cross Initialization produces results similar to those of Textual Inversion. This variant tends to generate images focusing either solely on the given concept or exclusively on the other prompt tokens. The models without mean textual embedding or the regularization term show degraded editability, struggling to create scenes consistent with the prompt. More ablation study results are provided in Appendix G.


[Image grid omitted. Columns: Input, w/o CI, w/o Mean, w/o Reg, Full.]

Figure 10: Ablation study. The prompt is “$S_*$ plays the LEGO toys”. We compare the models trained without Cross Initialization (w/o CI), without mean textual embedding (w/o Mean), and without regularization (w/o Reg). As can be seen, all sub-modules are essential for achieving identity-preserved and prompt-aligned personalized face generation.

6 Conclusions and Future Work

We introduced a new initialization method for personalized text-to-image generation. We identified a significant disparity between the initial and learned embeddings in Textual Inversion, which often leads to an overfitting problem. Our approach, “Cross Initialization”, addresses this issue by initializing the textual embedding with the output of the text encoder. Cross Initialization enables more identity-preserved, prompt-aligned, and faster face personalization. In this work, we mainly examined the performance of Cross Initialization on human faces. For general concepts, we found that Cross Initialization is not as effective as it is for human faces. In future work, we plan to further investigate the applicability of Cross Initialization to a broader range of concepts.

References

  • Abdal et al. [2019] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2stylegan: How to embed images into the stylegan latent space? In ICCV, pages 4432–4441, 2019.
  • Alaluf et al. [2023] Yuval Alaluf, Elad Richardson, Gal Metzer, and Daniel Cohen-Or. A neural space-time representation for text-to-image personalization. arXiv preprint arXiv:2305.15391, 2023.
  • Arar et al. [2023] Moab Arar, Rinon Gal, Yuval Atzmon, Gal Chechik, Daniel Cohen-Or, Ariel Shamir, and Amit H. Bermano. Domain-agnostic tuning-encoder for fast personalization of text-to-image models. arXiv preprint arXiv:2307.06925, 2023.
  • Avrahami et al. [2023] Omri Avrahami, Kfir Aberman, Ohad Fried, Daniel Cohen-Or, and Dani Lischinski. Break-a-scene: Extracting multiple concepts from a single image. arXiv preprint arXiv:2305.16311, 2023.
  • Ba et al. [2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • Bai et al. [2022] Qingyan Bai, Yinghao Xu, Jiapeng Zhu, Weihao Xia, Yujiu Yang, and Yujun Shen. High-fidelity gan inversion with padding space. In ECCV, pages 36–53. Springer, 2022.
  • Balaji et al. [2022] Yogesh Balaji, Seungjun Nah, Xun Huang, Arash Vahdat, Jiaming Song, Karsten Kreis, Miika Aittala, Timo Aila, Samuli Laine, Bryan Catanzaro, et al. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022.
  • Chen et al. [2023a] Li Chen, Mengyi Zhao, Yiheng Liu, Mingxu Ding, Yangyang Song, Shizun Wang, Xu Wang, Hao Yang, Jing Liu, Kang Du, and Min Zheng. Photoverse: Tuning-free image customization with text-to-image diffusion models. arXiv preprint arXiv:2309.05793, 2023a.
  • Chen et al. [2023b] Wenhu Chen, Hexiang Hu, Yandong Li, Nataniel Ruiz, Xuhui Jia, Ming-Wei Chang, and William W Cohen. Subject-driven text-to-image generation via apprenticeship learning. arXiv preprint arXiv:2304.00186, 2023b.
  • Choi et al. [2023] Jooyoung Choi, Yunjey Choi, Yunji Kim, Junho Kim, and Sungroh Yoon. Custom-edit: Text-guided image editing with customized diffusion models. arXiv preprint arXiv:2305.15779, 2023.
  • Cohen et al. [2022] Niv Cohen, Rinon Gal, Eli A Meirom, Gal Chechik, and Yuval Atzmon. “this is my unicorn, fluffy”: Personalizing frozen vision-language representations. In ECCV, pages 558–577. Springer, 2022.
  • Couairon et al. [2022] Guillaume Couairon, Jakob Verbeek, Holger Schwenk, and Matthieu Cord. Diffedit: Diffusion-based semantic image editing with mask guidance. arXiv preprint arXiv:2210.11427, 2022.
  • Deng et al. [2019] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In CVPR, pages 4690–4699, 2019.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. In NeurIPS, pages 8780–8794, 2021.
  • Ding et al. [2021] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. In NeurIPS, pages 19822–19835, 2021.
  • Dong et al. [2022] Ziyi Dong, Pengxu Wei, and Liang Lin. Dreamartist: Towards controllable one-shot text-to-image generation via contrastive prompt-tuning. arXiv preprint arXiv:2211.11337, 2022.
  • Gal et al. [2022] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
  • Gal et al. [2023] Rinon Gal, Moab Arar, Yuval Atzmon, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Encoder-based domain tuning for fast personalization of text-to-image models. TOG, 42(4):1–13, 2023.
  • Gu et al. [2020] Jinjin Gu, Yujun Shen, and Bolei Zhou. Image processing using multi-code gan prior. In CVPR, pages 3012–3021, 2020.
  • Gu et al. [2023] Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, Yixiao Ge, Ying Shan, and Mike Zheng Shou. Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models. arXiv preprint arXiv:2305.18292, 2023.
  • Hao et al. [2023] Shaozhe Hao, Kai Han, Shihao Zhao, and Kwan-Yee K. Wong. Vico: Detail-preserving visual condition for personalized text-to-image generation. arXiv preprint arXiv:2306.00971, 2023.
  • He et al. [2023] Xingzhe He, Zhiwen Cao, Nicholas Kolkin, Lantao Yu, Helge Rhodin, and Ratheesh Kalarot. A data perspective on enhanced identity preservation for diffusion personalization. arXiv preprint arXiv:2311.04315, 2023.
  • Hertz et al. [2022] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In NeurIPS, pages 6840–6851, 2020.
  • Hyung et al. [2023] Junha Hyung, Jaeyo Shin, and Jaegul Choo. Magicapture: High-resolution multi-concept portrait customization. arXiv preprint arXiv:2309.06895, 2023.
  • Jia et al. [2023] Xuhui Jia, Yang Zhao, Kelvin CK Chan, Yandong Li, Han Zhang, Boqing Gong, Tingbo Hou, Huisheng Wang, and Yu-Chuan Su. Taming encoder for zero fine-tuning image customization with text-to-image diffusion models. arXiv preprint arXiv:2304.02642, 2023.
  • Karras et al. [2018] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2018.
  • Kawar et al. [2023] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. In CVPR, pages 6007–6017, 2023.
  • Kumari et al. [2023] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In CVPR, pages 1931–1941, 2023.
  • Liew et al. [2022] Jun Hao Liew, Hanshu Yan, Daquan Zhou, and Jiashi Feng. Magicmix: Semantic mixing with diffusion models. arXiv preprint arXiv:2210.16056, 2022.
  • Lin et al. [2023] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In CVPR, pages 300–309, 2023.
  • Liu et al. [2015] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, pages 3730–3738, 2015.
  • Ma et al. [2023] Yiyang Ma, Huan Yang, Wenjing Wang, Jianlong Fu, and Jiaying Liu. Unified multi-modal latent diffusion for joint subject and text conditional image generation. arXiv preprint arXiv:2303.09319, 2023.
  • Metzer et al. [2023] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-nerf for shape-guided generation of 3d shapes and textures. In CVPR, pages 12663–12673, 2023.
  • Mokady et al. [2022] Ron Mokady, Amir Hertz, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Null-text inversion for editing real images using guided diffusion models. arXiv preprint arXiv:2211.09794, 2022.
  • Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • Parmar et al. [2022] Gaurav Parmar, Yijun Li, Jingwan Lu, Richard Zhang, Jun-Yan Zhu, and Krishna Kumar Singh. Spatially-adaptive multilayer selection for gan inversion and editing. In CVPR, pages 11399–11409, 2022.
  • Patashnik et al. [2021] Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. In ICCV, pages 2085–2094, 2021.
  • Pidhorskyi et al. [2020] Stanislav Pidhorskyi, Donald A Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In CVPR, pages 14104–14113, 2020.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
  • Raj et al. [2023] Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman, Michael Rubinstein, Jonathan Barron, et al. Dreambooth3d: Subject-driven text-to-3d generation. arXiv preprint arXiv:2303.13508, 2023.
  • Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, pages 8821–8831, 2021.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Reed et al. [2016] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, pages 1060–1069, 2016.
  • Richardson et al. [2021] Elad Richardson, Yuval Alaluf, Or Patashnik, Yotam Nitzan, Yaniv Azar, Stav Shapiro, and Daniel Cohen-Or. Encoding in style: a stylegan encoder for image-to-image translation. In CVPR, pages 2287–2296, 2021.
  • Richardson et al. [2023] Elad Richardson, Gal Metzer, Yuval Alaluf, Raja Giryes, and Daniel Cohen-Or. Texture: Text-guided texturing of 3d shapes. arXiv preprint arXiv:2302.01721, 2023.
  • Rippel et al. [2014] Oren Rippel, Michael Gelbart, and Ryan Adams. Learning ordered representations with nested dropout. In ICML, pages 1746–1754, 2014.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  • Ruiz et al. [2023a] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, pages 22500–22510, 2023a.
  • Ruiz et al. [2023b] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Wei Wei, Tingbo Hou, Yael Pritch, Neal Wadhwa, Michael Rubinstein, and Kfir Aberman. Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models. arXiv preprint arXiv:2307.06949, 2023b.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In NeurIPS, pages 36479–36494, 2022.
  • Sauer et al. [2023] Axel Sauer, Tero Karras, Samuli Laine, Andreas Geiger, and Timo Aila. Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis. arXiv preprint arXiv:2301.09515, 2023.
  • Shamsian et al. [2021] Aviv Shamsian, Aviv Navon, Ethan Fetaya, and Gal Chechik. Personalized federated learning using hypernetworks. In ICML, pages 9489–9502, 2021.
  • Shaw et al. [2018] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
  • Shi et al. [2023] Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instantbooth: Personalized text-to-image generation without test-time finetuning. arXiv preprint arXiv:2304.03411, 2023.
  • Smith et al. [2023] James Seale Smith, Yen-Chang Hsu, Lingyu Zhang, Ting Hua, Zsolt Kira, Yilin Shen, and Hongxia Jin. Continual diffusion: Continual customization of text-to-image diffusion with c-lora. arXiv preprint arXiv:2304.06027, 2023.
  • Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • Tewel et al. [2023] Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. Key-locked rank one editing for text-to-image personalization. In SIGGRAPH, 2023.
  • Tov et al. [2021] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for stylegan image manipulation. TOG, 40(4):1–14, 2021.
  • Tumanyan et al. [2022] Narek Tumanyan, Michal Geyer, Shai Bagon, and Tali Dekel. Plug-and-play diffusion features for text-driven image-to-image translation. arXiv preprint arXiv:2211.12572, 2022.
  • Valevski et al. [2022] Dani Valevski, Matan Kalman, Yossi Matias, and Yaniv Leviathan. Unitune: Text-driven image editing by fine tuning an image generation model on a single image. arXiv preprint arXiv:2210.09477, 2022.
  • Vinker et al. [2023] Yael Vinker, Andrey Voynov, Daniel Cohen-Or, and Ariel Shamir. Concept decomposition for visual exploration and inspiration. arXiv preprint arXiv:2305.18203, 2023.
  • von Platen et al. [2022] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022.
  • Voynov et al. [2023] Andrey Voynov, Qinghao Chu, Daniel Cohen-Or, and Kfir Aberman. $p+$: Extended textual conditioning in text-to-image generation. arXiv preprint arXiv:2303.09522, 2023.
  • Wang et al. [2022] Tengfei Wang, Yong Zhang, Yanbo Fan, Jue Wang, and Qifeng Chen. High-fidelity gan inversion for image attribute editing. In CVPR, pages 11379–11388, 2022.
  • Wu et al. [2023] Zijie Wu, Chaohui Yu, Zhen Zhu, Fan Wang, and Xiang Bai. Singleinsert: Inserting new concepts from a single image into text-to-image models for flexible editing. arXiv preprint arXiv:2310.08094, 2023.
  • Yu et al. [2022] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
  • Yuan et al. [2023] Ge Yuan, Xiaodong Cun, Yong Zhang, Maomao Li, Chenyang Qi, Xintao Wang, Ying Shan, and Huicheng Zheng. Inserting anybody in diffusion models via celeb basis. In NeurIPS, 2023.
  • Zhou et al. [2023] Yufan Zhou, Ruiyi Zhang, Tong Sun, and Jinhui Xu. Enhancing detail preservation for customized text-to-image generation: A regularization-free approach. arXiv preprint arXiv:2305.13579, 2023.
  • Zhu et al. [2020a] Jiapeng Zhu, Yujun Shen, Deli Zhao, and Bolei Zhou. In-domain gan inversion for real image editing. In ECCV, pages 592–608, 2020a.
  • Zhu et al. [2020b] Peihao Zhu, Rameen Abdal, Yipeng Qin, John Femiani, and Peter Wonka. Improved stylegan embedding: Where are the good latents? arXiv preprint arXiv:2012.09036, 2020b.

Appendix A Implementation Details of Baselines

We compare our method with four baseline methods: Textual Inversion [17], DreamBooth [49], NeTI [2], and Celeb Basis [68]. For Textual Inversion, we use the diffusers implementation [63] with Stable Diffusion v2.1 as the base model. The textual embeddings are initialized with the embeddings of “human face”. We perform 5,000 optimization steps using a learning rate of 5e-3 and a batch size of 8. For DreamBooth, we also use the diffusers implementation and tune the U-Net with prior preservation loss. We perform 800 fine-tuning steps using a learning rate of 2e-6 and a batch size of 1. For NeTI and Celeb Basis, we use their official implementations and follow the official hyperparameters described in their papers. Moreover, we apply the textual bypass and Nested Dropout [47] techniques for NeTI.
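
To compare optimization budgets, the step counts and batch sizes stated here and in Sec. 5.1 can be multiplied out. This is a small sketch with hypothetical names. NeTI and Celeb Basis are omitted because their hyperparameters are not given in this text.

```python
# Training settings quoted in Sec. 5.1 (ours) and Appendix A (baselines).
SETTINGS = {
    "Textual Inversion": {"steps": 5000, "batch_size": 8, "lr": 5e-3},
    "DreamBooth": {"steps": 800, "batch_size": 1, "lr": 2e-6},
    "Ours": {"steps": 320, "batch_size": 8, "lr": 5e-3},
}


def samples_processed(method):
    """Total training samples seen: optimization steps times batch size."""
    cfg = SETTINGS[method]
    return cfg["steps"] * cfg["batch_size"]


if __name__ == "__main__":
    for name in SETTINGS:
        print(f"{name}: {samples_processed(name)} samples")
```

With the same batch size and learning rate, the Textual Inversion baseline runs 5,000 steps against our 320, or about 15.6 times as many updates.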

Table 3: The 20 prompts used in the quantitative evaluation.
          a photo of a $S^{*}$ person
          a $S^{*}$ person with a sad expression
          a $S^{*}$ person with a happy expression
          a $S^{*}$ person with a puzzled expression
          a $S^{*}$ person with an angry expression
          a $S^{*}$ person plays the LEGO toys
          a $S^{*}$ person on the beach
          a $S^{*}$ person piloting a fighter jet
          a $S^{*}$ person wearing the sweater, a backpack and camping stove, outdoors, RAW, ultra high res
          a $S^{*}$ person wearing a scifi spacesuit in space
          a $S^{*}$ person and Anne Hathaway are baking a birthday cake
          a $S^{*}$ person and Anne Hathaway taking a relaxing hike in the mountains
          a $S^{*}$ person and Anne Hathaway sit on a sofa
          a $S^{*}$ person and Anne Hathaway enjoying a day at an amusement park
          a $S^{*}$ person shakes hands with Anne Hathaway in news conference
          cubism painting of a $S^{*}$ person
          fauvism painting of a $S^{*}$ person
          cave mural depicting a $S^{*}$ person
          pointillism painting of a $S^{*}$ person
          a $S^{*}$ person latte art

Appendix B Text Prompts

In Tab. 3, we list all 20 text prompts used in the quantitative evaluation. These prompts cover a range of modifications, including expression editing, background modification, individual interaction, and artistic style.
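
The prompts share a common template in which a placeholder stands for the learned concept. The sketch below shows how such templates might be filled in and how the stylization prompts could be filtered out for the identity metric, as described in Sec. 5.1. Only an excerpt of Tab. 3 is included. Assigning each prompt to a category is an assumption of this example, and the placeholder string and function names are hypothetical.

```python
# Excerpt of the prompts in Tab. 3, grouped by the four categories named above.
# The grouping is inferred for illustration and is not given in the table.
PROMPT_TEMPLATES = {
    "expression editing": [
        "a {} person with a sad expression",
        "a {} person with an angry expression",
    ],
    "background modification": [
        "a {} person on the beach",
        "a {} person wearing a scifi spacesuit in space",
    ],
    "individual interaction": [
        "a {} person and Anne Hathaway sit on a sofa",
    ],
    "artistic style": [
        "cubism painting of a {} person",
        "a {} person latte art",
    ],
}


def fill(template, placeholder="S*"):
    """Insert the placeholder token for the personalized concept."""
    return template.format(placeholder)


def prompts_for_metric(metric, placeholder="S*"):
    """Return filled prompts; stylization is skipped for the identity metric."""
    prompts = []
    for category, templates in PROMPT_TEMPLATES.items():
        if metric == "identity" and category == "artistic style":
            continue  # ArcFace is trained on real images
        prompts.extend(fill(t, placeholder) for t in templates)
    return prompts


if __name__ == "__main__":
    print(prompts_for_metric("identity"))
```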

Appendix C Results for Our Fast Version Method

As illustrated in Sec. 5.2, we developed a fast version of our method with a learning rate of 0.08. This fast version enables learning of the new concept in 25 optimization steps, taking only 26 seconds. In Figs. 11 and 12, we provide qualitative results of applying this fast version to a variety of prompts. The results demonstrate that our fast version allows for high-quality personalized face generation within a remarkably short training time.

Appendix D Additional Qualitative Comparisons

In Fig. 13, we provide additional qualitative comparisons to the baseline methods on a wide range of prompts.

Appendix E Additional Qualitative Results

In Fig. 14 and Fig. 15, we provide additional qualitative results obtained by our method on a diverse set of prompts.

Appendix F Results on Synthetic Facial Images

Besides evaluating on real facial images, we also evaluate our method on synthetic facial images generated by StyleGAN. The results are shown in Fig. 16. As can be seen, our method achieves high-quality personalized face generation on synthetic facial images.

Appendix G Additional Ablation Study Results

As illustrated in Sec. 5.3, our ablation study involves the individual removal of the following sub-modules: 1) Cross Initialization, 2) mean textual embedding, and 3) the regularization term. Additional ablation study results for each variant are presented in Fig. 17.

[Image grid omitted. Each row shows a real sample followed by five generated images for the prompts listed.]
Row 1: “$S_*$ latte art”; “Colorful graffiti of $S_*$”; “Manga drawing of $S_*$”; “Pencil drawing of $S_*$”; “A sand sculpture of $S_*$”.
Row 2: “$S_*$ wears a sunglass and a life jacket on a boat”; “$S_*$ is driving a car”; “$S_*$ piloting a fighter jet”; “$S_*$ wears a suit in space”; “$S_*$ swims in the ocean”.
Row 3: “$S_*$ with an admiring expression”; “$S_*$ with a depressed expression”; “$S_*$ with an ecstatic expression”; “$S_*$ with a puzzled expression”; “$S_*$ with a terrified expression”.
Row 4: “$S_*$ and Bill Gates go to a technology exhibition together”; “$S_*$ and Elon Musk go to an art exhibition together”; “$S_*$ is standing with Jeff Bezos on a street”; “$S_*$ and Keanu Reeves sit in the park”; “$S_*$ and Sergey Brin sit on a sofa”.
Row 5: “$S_*$ is surveying an underground cave”; “$S_*$ is having a haircut in a classic, retro-styled barbershop”; “$S_*$ is hiking in a dense, lush rainforest”; “$S_*$ is crossing the marathon finish line”; “A vibrant, large-scale chalk art of $S_*$ on a sidewalk”.
Figure 11: Images generated by our fast version method with a learning rate of 0.08. Results are obtained after 25 optimization steps, taking only 26 seconds.
[Figure 12 image grid: five rows, each showing a real sample followed by five generated images, one per prompt.]
Row 1: “$S_*$ is living in an abandoned building”; “$S_*$ is fine-tuning a handmade violin in a workshop”; “$S_*$ race car driver is gearing up in the pit lane before a race”; “$S_*$ shakes hands with Elon Musk in a news conference”; “colorful graffiti of $S_*$”
Row 2: “Banksy art of $S_*$”; “Cubism painting of $S_*$”; “Fauvism painting of $S_*$”; “$S_*$ Funko pop”; “Watercolor painting of $S_*$”
Row 3: “$S_*$ holding up his accepted paper”; “$S_*$ wears a chefs hat in the kitchen”; “$S_*$ buckled in his seat on a plane”; “A photo of $S_*$ graduating after finishing his PhD”; “$S_*$ as a cowboy sitting on hay”
Row 4: “$S_*$ is feeding giraffes in a sunny open zoo enclosure”; “$S_*$ and Bill Gates go to a technology exhibition together”; “$S_*$ and Keanu Reeves on a boat”; “$S_*$ as Black Widow”; “$S_*$ as White Queen”
Row 5: “$S_*$ is coding in a cozy home office”; “$S_*$ is paddling on a crystal-clear alpine lake”; “$S_*$ is repairing a vintage bike in a garage”; “$S_*$ is writing a novel in a home library”; “$S_*$ as Ziggy Stardust”
Figure 12: Images generated by our fast version method with a learning rate of 0.08. Results are obtained after 25 optimization steps, taking only 26 seconds.

[Figure 13 image grid. Columns: Real Sample, Textual Inversion, DreamBooth, NeTI, Celeb Basis, Ours. One block of rows per prompt.]
Prompt 1: “$S_*$ and Anne Hathaway sit on a sofa”
Prompt 2: “$S_*$ wears a sunglass on a boat”
Prompt 3: “$S_*$ wearing a scifi spacesuit in space”
Prompt 4: “$S_*$ as Captain America”
Prompt 5: “cave mural depicting $S_*$”
Prompt 6: “$S_*$ latte art”
Prompt 7: “A detailed oil painting of $S_*$ wearing a royal gown in a palace”

Figure 13: Additional qualitative comparisons. Given a single input image, we present four images generated by each method using identical random seeds. Our approach demonstrates superior performance in identity preservation and editability.
[Figure 14 image grid: five rows, each showing a real sample followed by five generated images, one per prompt.]
Row 1: “A highly detailed digital art of $S_*$ mage casting a fire ball”; “$S_*$ is wearing a magician hat and a blue coat in a garden”; “$S_*$ wearing a casual plain white shirt surfing in the ocean”; “$S_*$ is wearing a brown sports jacket and a hat, holding a whip in his hand”; “Greek sculpture of $S_*$”
Row 2: “$S_*$ and Steve Jobs cooking together in a kitchen”; “$S_*$ and Leonardo DiCaprio sit on a sofa”; “$S_*$ and Michael Jackson enjoy a delicate candlelight dinner”; “$S_*$ and Robert Downey enjoying a day at an amusement park”; “$S_*$ and Mark Zuckerberg are reading a book together”
Row 3: “$S_*$ wears a chefs hat in the kitchen”; “$S_*$ is wearing the sweater outdoors”; “$S_*$ is looking out of a window on a rainy night”; “$S_*$ dressed in a blue suit is cooking a gourmet meal”; “$S_*$ is carrying vegetables in vegetable market”
Row 4: “$S_*$ as a knight in plate armor”; “$S_*$ in assassins creed”; “$S_*$ in a comic book”; “Ice sculpture of $S_*$”; “$S_*$ stained glass window”
Row 5: “$S_*$ portrait as an asia old warrior chief”; “$S_*$ is riding a dragon”; “$S_*$ is riding a horse”; “$S_*$ as a priest in blue robes, national geographic”; “A concert poster of $S_*$”
Figure 14: Additional examples of personalized text-to-image generation obtained with Cross Initialization.
[Figure 15 image grid: five rows, each showing a real sample followed by five generated images, one per prompt.]
Row 1: “$S_*$ with a happy expression”; “$S_*$ with a terrified expression”; “$S_*$ with a depressed expression”; “$S_*$ with an amazed expression”; “$S_*$ with a confused expression”
Row 2: “$S_*$ is riding a bicycle wearing a shirt and a scarf”; “$S_*$ wears a suit on a soccer field”; “$S_*$ is sitting on a sofa holding a cat”; “$S_*$ and Keanu Reeves dressed as knights holding a wooden board”; “$S_*$ as a Witcher”
Row 3: “A photo of $S_*$ wearing a beret holding a sign in front of the Eiffel Tower”; “$S_*$ is playing guitar in a lively urban setting”; “$S_*$ sitting in a hammock with sunglasses on”; “$S_*$ as a Jedi”; “An oil painting of $S_*$ dressed as a musketeer in an old French town”
Row 4: “$S_*$ in a serene studio writing elegant script”; “$S_*$ yoga instructor leading a class at dawn with the sun in the background”; “$S_*$ as an amazon warrior”; “$S_*$ in the style of stefan kostic and david la chapelle”; “A highly detailed digital art of $S_*$ mage standing on clouds”
Row 5: “$S_*$ cooking at a night market”; “A dslr photo of $S_*$ painting in a sunlit studio”; “Renaissance-style portrait of $S_*$ astronaut in space detailed starry background reflective helmet”; “A colorful mural of $S_*$ on an urban street wall”; “Pop Art painting of a modern smartphone with classic art pieces of $S_*$ appearing on the screen”
Figure 15: Additional examples of personalized text-to-image generation obtained with Cross Initialization.
[Figure 16 image grid: two rows, each showing a synthetic sample followed by five generated images, one per prompt.]
Row 1: “$S_*$ with an ecstatic expression”; “$S_*$ wears a sunglass on a boat”; “$S_*$ wears a suit in space ship”; “$S_*$ as Hawkeye”; “Marble sculpture of $S_*$”
Row 2: “$S_*$ as a cowboy sitting on hay”; “$S_*$ is driving a car”; “$S_*$ stands in the rain holding an umbrella”; “$S_*$ and Jeff Bezos taking a relaxing hike in the mountains”; “3d modeling of $S_*$”
Figure 16: Additional results on synthetic facial images generated by StyleGAN, where the input images are sourced from [68].


[Figure 17 image grid. Columns: Real Sample & Prompt, w/o CI, w/o Mean, w/o Reg, Full. One block of rows per prompt.]
Prompt 1: “$S_*$ latte art”
Prompt 2: “$S_*$ wearing a scifi spacesuit in space”
Prompt 3: “$S_*$ piloting a fighter jet”

Figure 17: Additional ablation study. We compare the models trained without Cross Initialization (w/o CI), without mean textual embedding (w/o Mean), and without regularization (w/o Reg). As can be seen, all sub-modules are essential for achieving identity-preserved and prompt-aligned personalized face generation.