AlignIT: Enhancing Prompt Alignment in Customization of Text-to-Image Models

Aishwarya Agarwal, Srikrishna Karanam, and Balaji Vasan Srinivasan
Adobe Research, Bengaluru India
{aishagar,skaranam,balsrini}@adobe.com
Abstract

We consider the problem of customizing text-to-image diffusion models with user-supplied reference images. Given new prompts, the existing methods can capture the key concept from the reference images but fail to align the generated image with the prompt. In this work, we seek to address this key issue by proposing new methods that can easily be used in conjunction with existing customization methods that optimize the embeddings/weights at various intermediate stages of the text encoding process.

The first contribution of this paper is a dissection of the various stages of the text encoding process leading up to the conditioning vector for text-to-image models. We take a holistic view of existing customization methods and notice that the key and value outputs from this process differ substantially from those of their corresponding baseline (non-customized) models (e.g., baseline stable diffusion). While this difference does not impact the concept being customized, it leads to other parts of the generated image not being aligned with the prompt (see first row in Fig 1). Further, we also observe that these keys and values allow independent control over various aspects of the final generation, enabling semantic manipulation of the output. Taken together, the features spanning these keys and values serve as the basis for our next contribution, where we fix the aforementioned issues with existing methods. We propose a new post-processing algorithm, AlignIT, that infuses the keys and values for the concept of interest while ensuring the keys and values for all other tokens in the input prompt are unchanged.

Our proposed method can be plugged directly into existing customization methods, leading to a substantial performance improvement in the alignment of the final result with the input prompt while retaining the customization quality. We conduct extensive experiments across different customization methods and a wide variety of reference images and show consistent improvements both qualitatively and quantitatively.

Figure 1: We propose a new algorithm, AlignIT, that can be used on top of any already-trained customization model to drastically improve the alignment of generated images with the text prompt directly at inference time, without requiring any retraining.

1 Introduction

We consider the problem of customizing the outputs of text-to-image diffusion models using concepts depicted in user-supplied reference images with a particular focus on improving the quality of alignment between input text prompts and the images generated using such customized models.

Building on top of the dramatic progress in text-to-image synthesis with text-guided diffusion models [9, 14, 17, 18], there has been much recent work in customizing these models with user-supplied reference images [5, 10, 7, 22]. Most of these methods embed knowledge from these reference images in the textual feature space as part of the text encoding process leading up to the conditioning vector used to condition the diffusion model. For instance, [5, 7] introduced a new token into the text vocabulary (e.g. <sks>) and optimized its embedding to represent the custom concept of the reference image. On the other hand, CatVersion [22] proposed to optimize a select set of layers in the encoder model that produces the text feature vector, whereas Custom Diffusion [10] tuned the parameters corresponding to the key and value weights. Across all methods, during inference, this embedded information of the custom concept is utilized as part of the text encoding process and new samples are synthesized.

From a usability perspective, it is crucial that customized text-to-image diffusion models are capable of generating images with custom concepts in novel scenes as described in the prompt. As noted in prior work [5], there are two key associated aspects. First, the model should be able to faithfully replicate the concept from the reference images (e.g., the cat from reference images should show up when that token is used in new prompts). Next, when using the custom token in a new prompt, the generated image must faithfully align the remaining parts of the scene with the prompt.

Figure 2: Editability-reconstruction tradeoff in baselines.

To understand this clearly, consider the example shown in Figure 2. The first row shows images generated with the prompt "a cat playing with a ball in garden" using baseline stable diffusion. As can be seen from the outputs, the generated cat faithfully follows all aspects of the prompt (e.g., playing with ball, in a garden), suggesting the baseline model’s good alignment between the image and the input prompt. The other three rows show images generated using Textual Inversion [5], CatVersion [22] and Custom Diffusion [10] respectively, each model customized for the cat reference images with baseline stable diffusion. The results of these methods do not show the same level of alignment with the input prompt as the baseline results in the first row, suggesting a tradeoff between the reconstruction of the custom concept and the ability to edit this concept when used in novel generations. For example, in the second row, while the custom cat shows up in the first image, we do not see the ball. In the second image, while some aspects of the garden start showing up, this comes at the cost of deterioration in the custom cat. These results, part of a feature of such customization methods called the reconstruction-editability tradeoff in [5, 20, 23], show that these models are unable to allow control and modification of custom concepts as part of novel generations, limiting their practical usability.

To address the aforementioned limitations of existing customization techniques, this paper proposes a new algorithm, AlignIT, that can be easily plugged into these methods, immediately improving their performance. This comprises several contributions. First, we begin with a careful analysis of the text encoding process (from input prompt to output keys and values used for cross attention) involved in all these customization methods. While these methods adopt different strategies to optimize the text embedding, a common observation across all of them is that the keys and values they produce for a certain input prompt differ substantially from those of their corresponding baseline model. This difference, while helping produce custom concepts in the final result, also leads to other parts of the image not following the input prompt, causing the misalignment between the output and the input text. Next, we also notice that these keys and values enable semantic manipulation and control over different aspects of the input prompt.

Figure 3: Control enabled by keys and values in cross-attention layers

For instance, in Figure 3, by replacing the keys and values (in every cross attention layer) of the dog with a zebra, we can generate a zebra even with a dog as the input prompt in the first column. These two aspects (keys and values being different when compared to baseline stable diffusion and them being semantically manipulatable) are critical observations that motivate our next contribution, AlignIT, which when used with existing customization methods addresses their limitations.

Our key insight for AlignIT is to ensure only the keys and values corresponding to the custom concept of interest are modified in the text encoding process while keeping the keys and values for all other tokens in the input prompt unchanged from those in the custom model’s corresponding baseline version (e.g., stable diffusion).

As noted above, this is motivated by our observation that the customized model is able to reconstruct the custom concept using prompts like "a photo of a <sks>" (see Figure 2 bottom). This suggests that the keys and values of the learned embedding during customization have all the information about the concept from the reference images, and we just have to ensure that the keys and values for other parts of the prompt (which also change with existing customization methods) do not change. Given a model trained with one of the above customization methods, we achieve this with a test-time-only adaptation by using the keys and values of the object from those computed with the custom model. Since our method involves the test-time manipulation of keys and values, it can easily be used in conjunction with any customization approach that optimizes text embedding during its training process. We conduct extensive experiments using the CustomConcept101 dataset and demonstrate that our approach substantially improves the customization capabilities of three different existing methods. In Figure 1, we show some sample results. In the first row, with AlignIT, we are able to improve the quality of the textual inversion model (e.g., in the first column, baseline textual inversion does not generate a laptop, which our method is able to correct). Similarly, in the second row, we show improved results with Custom Diffusion (e.g., in the first column, our method not only depicts badminton but also faithfully reconstructs the cat).

To summarize, our key contributions in this work are:

  • We dissect various stages of the text encoding process and discuss reasons why existing customization methods fail to generate images fully aligned with text prompt (as shown in Figure 2). We notice that key and value outputs from the text encoding process differ substantially from their corresponding non-customized models (e.g., baseline stable diffusion).

  • We demonstrate that the keys and values allow independent control over various aspects specified as part of the text prompt, enabling semantic manipulation of the generated image as shown in Figure 3.

  • We propose a novel algorithm, AlignIT, that utilises the identified properties of the keys and values and substantially improves the alignment of generated images with the prompt. AlignIT is a training-free algorithm that can be plugged into an already-trained customization model to improve its performance directly during inference.

2 Related Work

With remarkable advancements in text-to-image synthesis with text-guided diffusion models [9, 14, 17, 18], there has been much recent work in customizing these models with user-supplied reference images [5, 2, 7, 10, 22, 19, 8, 16, 6, 11].

Many customization methods embed knowledge from these reference images in the textual feature space as part of the text encoding process leading up to the conditioning vector used to condition the diffusion model. [5, 7] introduce a new token into the text vocabulary and optimize its embedding to represent the custom concept of interest (while keeping diffusion model weights fixed), which can then be used to synthesize novel customized variations of the reference image. CatVersion [22] finetunes the weights of the attention layers in the text encoder. CustomDiffusion [10] finetunes only the key and value weights in the cross-attention layers of the UNet to invert the concept of interest into a rare token. Han et al. [8] use singular value decomposition to finetune the singular value matrix of the diffusion model backbone, significantly reducing the number of parameters needed for learning the target concepts. Perfusion [19] also updates the key and value weights while introducing a gated Rank-one Model Editing [11] to make it easier to combine multiple concepts.

Although these existing methods are able to reconstruct the custom concept of interest, they often struggle to generate images that align fully with the text prompt. Prior works on aligning image and text have tackled this through attention-map re-weighting [4, 12, 21] and latent optimization [3, 1, 13], but none of these methods addresses the alignment issue of customization methods; they instead aim to enhance the base models in generating text-aligned images. In this work, we seek to address this key issue by proposing new methods that can easily be used in conjunction with existing customization methods that optimize the embeddings/weights at various intermediate stages of the text encoding process.

3 Approach

As noted in Section 1, existing customization techniques can represent the custom concept of interest but when used to generate this concept in new scenarios, the outputs do not accurately align with the user’s intent/input prompt.

One key observation from Section 1 was the keys and values these techniques produce differ substantially from those produced by the corresponding baseline model, e.g., stable diffusion. To understand this better and how it informs our proposed method, we first begin with a brief summary of the text encoding process in text-to-image diffusion models, followed by a discussion on why existing methods fail and our proposed solution to address these issues.

3.1 Text encoding process

Given a text prompt (e.g., "a photo of <sks>"), there are several steps involved in producing the conditioning vector that feeds into the cross-attention layers of the noise prediction model (see Figure 4):
Stage 1. Each word/subword in the input prompt is tokenized, and each token’s embedding is retrieved from a precomputed database. This constitutes the first stage of the encoding process. Customization techniques such as [5, 7] use this stage to optimize the model and infuse custom-concept knowledge. For instance, using a placeholder string (e.g., <sks> in Figure 4), [5] optimizes a reconstruction loss objective and learns a new embedding (e.g., $v_d$ for <sks>) for this placeholder string, essentially augmenting the existing vocabulary with this new information. During inference, given a new prompt with the <sks> string, this new vocabulary is used to compute the conditioning vector and generate the output.
Stage 2. Given the per-token embeddings from stage 1, the next step is to compute the final text encoding $C$ using the diffusion model’s text encoder (e.g., CLIP). Unlike [5] or [7], CatVersion [22] uses the last three attention layers within this text encoder module to embed knowledge of the custom concept (i.e., they modify the weights of these three attention layers).
Stage 3. The vector $C$ from stage 2 is then input to all cross-attention layers of the diffusion model’s noise prediction module (e.g., UNet [15] in stable diffusion). Each cross-attention layer’s key and value matrices, $W_k$ and $W_v$ respectively, are used to project $C$ into output keys ($K$) and values ($V$). Different from [5, 22], another line of customization methods [10] tunes these $W_k$ and $W_v$ matrices to embed the custom-concept knowledge.
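
The three stages can be sketched end to end with toy components. This is only an illustrative sketch under stated assumptions: the vocabulary, the dimensions, the random weights, and the single causally masked self-attention layer standing in for the text encoder are all hypothetical and are not taken from Stable Diffusion or CLIP.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and sizes (not taken from any real model).
VOCAB = {"a": 0, "photo": 1, "of": 2, "<sks>": 3}
D_EMB, D_ATTN = 8, 4


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


# Stage 1: per-token embedding lookup. Methods in the style of [5, 7]
# would learn the row that belongs to the placeholder token.
embedding_table = rng.normal(size=(len(VOCAB), D_EMB))


def stage1_embed(prompt):
    ids = [VOCAB[w] for w in prompt.split()]  # word-level split: a simplification
    return embedding_table[ids]               # shape (n, D_EMB)


# Stage 2: text encoder producing C. One causally masked self-attention
# layer stands in for the full text transformer (an assumption).
Wq_txt, Wk_txt, Wv_txt = (rng.normal(scale=0.3, size=(D_EMB, D_EMB)) for _ in range(3))


def stage2_encode(E):
    n = E.shape[0]
    scores = (E @ Wq_txt) @ (E @ Wk_txt).T / np.sqrt(D_EMB)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True for later positions
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ (E @ Wv_txt)     # C, shape (n, D_EMB)


# Stage 3: one cross-attention layer projects C into keys and values.
# Methods in the style of [10] would tune W_k and W_v.
W_k = rng.normal(size=(D_EMB, D_ATTN))
W_v = rng.normal(size=(D_EMB, D_ATTN))


def stage3_project(C):
    return C @ W_k, C @ W_v


E = stage1_embed("a photo of <sks>")
C = stage2_encode(E)
K, V = stage3_project(C)
print(K.shape, V.shape)  # (4, 4) (4, 4): one key row and one value row per token
```

Each customization family discussed above changes a different object in this sketch (a row of `embedding_table`, the text-encoder weights, or `W_k`/`W_v`), but in every case the change reaches the image only through `K` and `V`.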

With our proposed method, discussed below, we seek to improve the performance of existing customization methods that optimize embeddings or weights during any of the stages in the text encoding process described above.

Figure 4: Various stages of the text encoding process.
Figure 5: Cross-attention maps to demonstrate that baselines undesirably impact keys/values of tokens other than the concept of interest too.

3.2 Why do existing customization methods fail?

As shown in the discussion above, existing customization methods embed the custom knowledge from reference images at one of the three stages of the text encoding process. During inference, this is used (either learned embeddings as in [5] or text encoder attention weights [22] or cross-attention weight matrices [10]) to generate the keys and values given the input prompt, which then are used to perform denoising. Since any of these three types of optimization eventually leads to the output keys ($K$) and values ($V$) from stage 3 (in green in Figure 4), these $K$ and $V$ matrices are the only factors that influence/control the impact of the input text prompt on the final generated image. This means, in the context of existing customization methods, the quality of output depends on how well the information gets propagated from the input prompt to the final $K$ and $V$ matrices from stage 3. As shown in Figure 2, while this helps these methods reconstruct the custom concept of interest (e.g., the cat), they are unable to accurately generate other parts of the scene described in the input prompt.
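
For reference, the standard scaled dot-product form of cross-attention makes this dependence explicit. The sketch below assumes that the queries $Q$ are obtained from the image (latent) features through a projection $W_q$; the symbols $\phi(z_t)$ for those features, $W_q$, and $d$ for the key dimensionality are introduced here only for illustration and are not defined in the text above.

```latex
Q = \phi(z_t)\, W_q, \qquad K = C\, W_k, \qquad V = C\, W_v,
\qquad
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V
```

Under this form the prompt enters each cross-attention layer only through $K$ and $V$, and row $i$ of $K$ and $V$ corresponds to the $i$-th token position of the prompt.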

To understand why this is the case, let us first begin with their corresponding baseline model (pretrained stable diffusion). Consider the first row in Figure 5, where we see an image generated by this baseline model for the prompt "a cat playing with a ball in garden". The attention maps show that all the key attributes in the prompt (cat, ball and garden) are well represented, suggesting that the $K$ and $V$ outputs from stage 3 for each of the tokens (e.g. cat, ball etc.) have all the required information properly propagated from the input text prompt via the three encoding stages of Figure 4.

Next, consider the result with textual inversion [5] in row 2 in Figure 5. Here, one can note that while the custom concept’s attention map (corresponding to <sks>) is well highlighted, the attention maps for the ball and garden tokens differ substantially from what the baseline model produced, resulting in an undesirable output. This difference is because the custom concept’s optimized embedding $v_{\text{<sks>}}$ (note that the corresponding baseline model’s embedding for this token is $v_d$ in Figure 4) ends up impacting the keys and values of tokens other than <sks> as well. This happens since the $v_{\text{<sks>}}$ vector (along with $v_a$, $v_b$, and $v_c$) produces an input to the text transformer in Figure 4 that is different from the baseline (since $v_d$ is now different from $v_{\text{<sks>}}$), resulting in a $C$ that is different from the baseline. A similar phenomenon can be noticed with CatVersion and custom diffusion methods as well (see rows 3 and 4 in Figure 5). Whereas CatVersion optimizes the text transformer directly, custom diffusion modifies the $W_k$ and $W_v$ matrices of Figure 4. These modifications end up impacting all the tokens, leading to either the custom cat not showing up or the ball/garden missing from the output.
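
The leakage described above can be reproduced in a toy setting. The following is a minimal sketch, assuming a single self-attention layer with a causal mask as a stand-in for the text transformer; the sizes, random weights, and token position are hypothetical. Only one token's input embedding is changed, yet the encoder output rows of the tokens that follow it change as well.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8   # hypothetical: 6 tokens, 8-dimensional embeddings
idx = 1       # 0-based position of the custom token in the prompt


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


# Toy text-encoder weights (assumed, not from any real model).
Wq, Wk, Wv = (rng.normal(scale=0.3, size=(d, d)) for _ in range(3))


def encode(E):
    """One causally masked self-attention layer standing in for the text encoder."""
    scores = (E @ Wq) @ (E @ Wk).T / np.sqrt(d)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True for later positions
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ (E @ Wv)


E_base = rng.normal(size=(n, d))
E_custom = E_base.copy()
E_custom[idx] = rng.normal(size=d)  # only the custom token's embedding differs

C_base, C_custom = encode(E_base), encode(E_custom)

# Per-token change in the encoder output caused by that single edit.
row_shift = np.linalg.norm(C_custom - C_base, axis=1)
for pos, shift in enumerate(row_shift):
    print(pos, "changed" if shift > 1e-6 else "unchanged")
```

In this sketch the rows before the custom token stay the same only because of the assumed causal mask; without it every row would shift. Either way, tokens that come after the custom token (such as ball and garden in the example prompt) receive altered encodings and therefore altered keys and values.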

The aforementioned observations motivate our proposed method, AlignIT, that allows more controlled infusion of custom knowledge into the model while not impacting the keys and values of other tokens in the input prompt, leading to both the retention of the custom concept as well as better alignment of the final generation with the input prompt.

3.3 AlignIT

Based on the observations from the previous section, the key insight for AlignIT is that the keys and values for any customization method should differ (when compared to the baseline model, e.g., stable diffusion) only for the custom token. This way, we can ensure both reconstruction of the custom concept and also adhere to the input prompt as closely as possible. This idea leads to our proposed method which we show can be used with any of the existing customization methods discussed previously and improve their performance both qualitatively and quantitatively.

Before discussing the details of AlignIT, we first explain how these keys and values allow for manipulation and independent control over various aspects of the input text prompt. This will then lead to the main ideas of our method. Let us begin with an example. Consider a prompt $p$, "a jumping dog" (see third column in Figure 6). The first row in the third column shows the result with the baseline stable diffusion model. To generate the image in the second row, we follow the steps below:

  • We keep the input prompt $p$ ("a jumping dog") unchanged and compute $K_p \in \mathcal{R}^{n \times d}$ and $V_p \in \mathcal{R}^{n \times d}$ for each cross-attention layer ($n$ is the number of tokens and $d$ is the feature dimensionality of the layer) as part of Stage 3 in Figure 4.

  • Given a new concept $o$ (cat here) with which we want to replace/edit the main object in $p$ (i.e., dog), we first construct a dummy prompt $p'$ such that $p'$ has the token $o$ at the same index $i$ ($3$ in this case) that the main object occupies in the input prompt $p$. We pad the remaining token slots with placeholders (e.g. *) that have no significance and do not impact the generation.

  • We next use the dummy prompt $p'$ and compute the keys $K_{p'}$ and values $V_{p'}$ for all the cross-attention layers. Since the only token carrying significance in prompt $p'$ is our concept of interest $o$ at index $i$, $K_{p'}[i]$ and $V_{p'}[i]$ have all the knowledge required for capturing concept $o$ in the final generation.

  • Before feeding the $K_p$ and $V_p$ computed in the first step into the subsequent denoising steps, we modify them as follows (in each timestep):

    $$K_p[i] = K_{p'}[i], \qquad V_p[i] = V_{p'}[i] \tag{1}$$

    In other words, we copy the keys and values of this new concept (cat) into the location that previously had dog’s keys and values, while keeping all other keys and values unchanged. This way we end up generating images that follow all aspects of the original prompt $p$ (e.g. jumping) while replacing the semantic concept (dog here) with the concept of interest (cat in this case). This is indeed the case in the third column/second row of Figure 6, where the dog is replaced by a cat while retaining other semantics (jumping) from the original image.
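
The row replacement in Equation (1) can be sketched for a single cross-attention layer as follows. This is a toy illustration under stated assumptions: the key and value matrices are random stand-ins rather than outputs of a real model, the helper names `swap_token_kv` and `cross_attention` are hypothetical, and indices are 0-based in the code whereas the text counts tokens from 1.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def cross_attention(Q, K, V):
    """Scaled dot-product attention of image queries over text keys and values."""
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V


def swap_token_kv(K_p, V_p, K_dummy, V_dummy, i):
    """Equation (1): copy row i from the dummy-prompt keys/values, keep all other rows."""
    K_new, V_new = K_p.copy(), V_p.copy()
    K_new[i] = K_dummy[i]
    V_new[i] = V_dummy[i]
    return K_new, V_new


rng = np.random.default_rng(0)
n, d = 3, 4  # hypothetical: 3 tokens ("a jumping dog"), 4-dimensional keys

# Random stand-ins for one layer's keys and values of p and of the dummy prompt p'.
K_p, V_p = rng.normal(size=(n, d)), rng.normal(size=(n, d))
K_dummy, V_dummy = rng.normal(size=(n, d)), rng.normal(size=(n, d))

i = 2  # 0-based position of "dog"; index 3 in the 1-based counting used in the text
K_new, V_new = swap_token_kv(K_p, V_p, K_dummy, V_dummy, i)

Q = rng.normal(size=(5, d))  # toy image queries for 5 spatial positions
out = cross_attention(Q, K_new, V_new)
print(out.shape)  # (5, 4)
```

Copying into fresh arrays rather than editing `K_p` and `V_p` in place keeps the original keys and values available, which is convenient when the same replacement has to be repeated at every timestep.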

Figure 6: Semantic manipulations offered by keys and values.

We show more examples in the other columns in Figure 6. For instance, in the first column, we replace the keys and values of dog with those of a cat, resulting in a cat showing up instead of the dog. We show another set of results in Figure 7 that have more semantic interactions between the two text prompts. All the images in the first row are generated using the prompt $p$, "banana in a white plate". For the second image, we modify keys and values as $K_p[4] = K_{p'}[4]$, $V_p[4] = V_{p'}[4]$, and hence end up with a banana in a black plate instead (since $p'$ has black at index $4$). Similarly, when we modify the keys and values for the second token in $p$, we see cucumber instead of banana in white plate, as in the third image in the first row. These experiments clearly demonstrate that by manipulating the per-token keys and values, it is possible to control various aspects of the final generation.

Figure 7: Key-value interactions between two text prompts.

The experiments and discussion above form the basis for our proposed algorithm AlignIT for customization (presented in Algorithm 1). Suppose we are given a set of reference images for which we want to customize the baseline text-to-image model (e.g., the cat images in the first row of Figure 8), and assume the existence of a model trained with an existing customization method, e.g., textual inversion [5]. Now, given a target prompt (e.g. "a <sks> dancing in front of times square") for which we seek to generate an image with the custom concept, we first replace <sks> with a word representing a suitable class belonging to the tokenizer vocabulary to obtain $p$ ("a cat dancing in front of times square" in this case). We next use the concept of interest (<sks> here) to construct the dummy prompt $p'$ (‘* <sks>’ in this example) as noted above, and compute the keys $K_{p'}$ and values $V_{p'}$ using the prompt $p'$ with the already available customization model (e.g., textual inversion). These are then used to modify the keys $K_p$ and values $V_p$ while generating the image using the original prompt of interest $p$ as $K_p[i] = K_{p'}[i]$, $V_p[i] = V_{p'}[i]$ ($i = 2$ in this example). Note that while generation with prompt $p$ happens with the baseline stable diffusion model, the knowledge from the reference images (in the form of keys $K_{p'}$ and values $V_{p'}$) comes from the customization model, which is assumed to be already trained for these images.
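
The construction of the two prompts from the target prompt can be sketched with plain string handling. The helper `build_prompts` is hypothetical, and it assumes that whitespace-separated words map one-to-one to tokens, which real subword tokenizers (and their special start tokens) do not guarantee. It follows the rule of Algorithm 1 and pads every slot other than the custom token, whereas the example in the text shows the shorter form ‘* <sks>’.

```python
def build_prompts(target_prompt, custom_token, class_word, pad="*"):
    """Return (p, p_dummy, i) for a target prompt that contains the custom token.

    p       : target prompt with the custom token replaced by its class word
    p_dummy : target prompt with every other word replaced by a placeholder
    i       : 1-based word position of the custom token, as counted in the text
    """
    words = target_prompt.split()
    pos = words.index(custom_token)  # raises ValueError if the token is absent
    p = " ".join(class_word if w == custom_token else w for w in words)
    p_dummy = " ".join(w if w == custom_token else pad for w in words)
    return p, p_dummy, pos + 1


p, p_dummy, i = build_prompts(
    "a <sks> dancing in front of times square", "<sks>", "cat"
)
print(p)        # a cat dancing in front of times square
print(p_dummy)  # * <sks> * * * * * *
print(i)        # 2
```

Keeping the custom token at the same position in both prompts is what allows row $i$ of $K_{p'}$ and $V_{p'}$ to be copied directly into row $i$ of $K_p$ and $V_p$.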

Finally, since these keys ($K$) and values ($V$) can be computed from any customization method, and are solely responsible for conditioning the image generation process, our proposed method described above can be used on top of any already-trained customization model to improve its performance, which we demonstrate next.

Algorithm 1 : AlignIT

Input: Target prompt $p_t$ having the custom concept token <sks> (belonging to object class $o$) at index $i$
Parameter: Base SD, optimized embeddings/weights from the customized model
Output: Image aligned with prompt $p_t$

1:  $p \leftarrow p_t$ (Replace <sks> at index $i$ with object class $o$)
2:  $p' \leftarrow p_t$ (Replace all except <sks> with *)
3:  Compute the customized model’s $K_{p'}$ and $V_{p'}$ for all cross-attention layers with prompt $p'$
4:  During generation with Base SD using prompt $p$, in each timestep, do $K_p[i] = K_{p'}[i]$, $V_p[i] = V_{p'}[i]$
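
Steps 3 and 4 of Algorithm 1 can be sketched as a toy inference loop. Everything below is an assumption made for illustration: the text encodings are random stand-ins, the customized weights are a small perturbation of the base weights (loosely mimicking a method that tunes $W_k$ and $W_v$), and the update of the latent is a placeholder rather than a real denoising step. The sketch only shows where the customized keys and values are computed and where they are injected.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_txt, d = 8, 16, 8     # hypothetical: tokens, text width, attention width
n_layers, n_steps = 3, 4   # hypothetical: cross-attention layers, timesteps
i = 1                      # 0-based index of the custom token (index 2 in the text)


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def cross_attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V


# Random stand-ins for the text encodings: C_p from the base model for prompt p,
# C_dummy from the customized model for the dummy prompt p'.
C_p = rng.normal(size=(n, d_txt))
C_dummy = rng.normal(size=(n, d_txt))

# Per-layer (W_k, W_v) pairs for the base model and the customized model.
base_W = [(rng.normal(size=(d_txt, d)), rng.normal(size=(d_txt, d)))
          for _ in range(n_layers)]
custom_W = [(Wk + 0.1 * rng.normal(size=Wk.shape), Wv + 0.1 * rng.normal(size=Wv.shape))
            for Wk, Wv in base_W]

# Step 3: the customized model's keys and values for p', computed once per layer.
custom_kv = [(C_dummy @ Wk, C_dummy @ Wv) for Wk, Wv in custom_W]


def aligned_kv(layer):
    """Base-model keys/values for p, with row i taken from the customized model."""
    Wk, Wv = base_W[layer]
    K, V = C_p @ Wk, C_p @ Wv
    K[i] = custom_kv[layer][0][i]
    V[i] = custom_kv[layer][1][i]
    return K, V


# Step 4: the injection is applied in every layer at every timestep.
x = rng.normal(size=(5, d))  # toy latent with 5 spatial positions
for t in range(n_steps):
    for layer in range(n_layers):
        K, V = aligned_kv(layer)
        x = x + 0.1 * cross_attention(x, K, V)  # placeholder update, not real denoising

print(x.shape)  # (5, 8)
```

Because the keys and values depend only on the prompts and the weights, `custom_kv` can be cached before generation starts, so the extra cost at inference time is one additional text-encoding pass plus a row copy per layer.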
Figure 8: Qualitative comparison with AlignIT plugged into baselines for customized text-to-image generation.
Figure 9: Additional Qualitative Comparison Results with AlignIT plugged into baselines for customized text-to-image generation.

4 Results

We conduct extensive qualitative and quantitative experiments using the CustomConcept101 dataset [10] and demonstrate improved customization performance with AlignIT across three different customization methods: Textual Inversion [5], Custom Diffusion [10] and CatVersion [22].

Qualitative Results We begin by discussing our generation outputs. In Figure 8, we compare the results of AlignIT when used in conjunction with Textual Inversion [5], Custom Diffusion [10] and CatVersion [22], and one can note that it clearly improves the outputs of these methods. For instance, in the first row, Textual Inversion [5] fails to reconstruct the custom tortoise in the first column, while also missing aspects like swimming in a pool (second column) and the blue color (third column). Similarly, in the second row, CatVersion [22] fails to follow the text prompts fully and misses concepts like the jacket, guitar, and laptop. By using AlignIT, these deficiencies can be alleviated, leading to clearly better-quality results (see the +AlignIT row in each case).

Table 1: CLIP-based comparisons to quantify AlignIT efficacy.
| Method | Text Alignment | Image Alignment | Overall |
|---|---|---|---|
| Textual Inversion | 0.67 | 0.83 | 0.75 |
| Textual Inversion + AlignIT | 0.78 (+16.4%) | 0.84 (+1.2%) | 0.81 (+8.0%) |
| CatVersion | 0.73 | 0.82 | 0.77 |
| CatVersion + AlignIT | 0.79 (+8.2%) | 0.84 (+2.4%) | 0.81 (+5.2%) |
| Custom Diffusion | 0.77 | 0.82 | 0.79 |
| Custom Diffusion + AlignIT | 0.81 (+5.2%) | 0.83 (+1.2%) | 0.82 (+3.8%) |
Table 2: Results from a user survey with 24 respondents.
| Method | Text Alignment | Image Alignment | Overall |
|---|---|---|---|
| Textual Inversion | 4.3% | 12.4% | 7.3% |
| Textual Inversion + AlignIT | 95.7% | 87.6% | 91.6% |
| Custom Diffusion | 7.1% | 11.8% | 9.15% |
| Custom Diffusion + AlignIT | 92.9% | 88.2% | 90.5% |

Quantitative Results For customization methods, it is important to evaluate both the ability to replicate the custom concept and the ability to modify the custom concept using textual prompts. We follow the existing protocol [22, 10] and quantify performance using CLIP-based similarity scores. The CustomConcept101 dataset has a set of 20 curated prompts for each concept. As in prior work [10], we generate 50 images with randomly selected seeds for each prompt, giving us 1K generated images for each concept. We measure the CLIP text alignment score by computing the average similarity between the text prompt and the generated images for each prompt and concept, thereby evaluating the ability to modify the custom concept using textual prompts. We next follow CatVersion [22] and adjust the CLIP image alignment score to better focus on the similarity between the concept of interest and the corresponding reference images, which evaluates the concept reconstruction quality. We do this by computing masks for the concept of interest in the generated images and measuring similarities after discarding the pixels that do not belong to the concept of interest. We also report the geometric mean of the image and text alignment scores as an estimate of the overall performance. Table 1 summarizes these results: the higher CLIP similarities obtained with our method indicate an improved customization effect compared to the baselines. One can note that AlignIT improves the CLIP text alignment scores by 5.2% to 16.4% while maintaining high image alignment.
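The score aggregation described above can be illustrated with a short Python sketch. It is a sketch under assumptions rather than the evaluation code used in the paper: the function names are hypothetical, embeddings are plain lists of floats assumed to have been produced by a CLIP encoder elsewhere, and the masking step for image alignment is omitted. The final lines plug in the rounded Textual Inversion entries from Table 1.

```python
import math
from statistics import mean
from typing import Iterable, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def text_alignment(text_emb: Sequence[float],
                   image_embs: Iterable[Sequence[float]]) -> float:
    """Average similarity between one prompt embedding and its generated images."""
    return mean(cosine(text_emb, e) for e in image_embs)


def overall(text_score: float, image_score: float) -> float:
    """Geometric mean of the text and image alignment scores."""
    return math.sqrt(text_score * image_score)


def relative_gain(baseline: float, improved: float) -> float:
    """Percentage improvement of `improved` over `baseline`."""
    return 100.0 * (improved - baseline) / baseline


# Rounded Table 1 entries for Textual Inversion, without and with AlignIT.
baseline_overall = overall(0.67, 0.83)   # about 0.75
text_gain = relative_gain(0.67, 0.78)    # about 16.4 percent
```

Because the geometric mean is pulled toward the smaller of its two inputs, a method cannot reach a high overall score by excelling at only one of text alignment or image alignment.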

User Study Finally, we conduct a user study with generated images and evaluate the mean preference for AlignIT when plugged into two baselines, Textual Inversion and Custom Diffusion. Each participant answers a set of 20 questions. Each question shows a pair of images, one generated by a baseline and the other by the same baseline with AlignIT applied on top, and asks the participant to select either the image that best aligns with the prompt or the image that best reconstructs the concept of interest given a reference image. From Table 2, one can clearly note that users prefer the results with AlignIT applied on top of the baselines in terms of text-guided alignment, reconstruction fidelity and overall customization effect.

5 Conclusion

In this work, we noticed that existing customization techniques fail to generate images that fully align with the user's intent. We first discussed the reasons behind this failure. We demonstrated that, during inference, existing methods undesirably affect the keys and values for tokens other than the custom concept of interest, thereby leading to misaligned images. To address these issues, we proposed an algorithm called AlignIT that can be plugged into any of these existing customization methods and fixes these issues directly at test time. We conducted extensive qualitative experiments on the CustomConcept101 dataset and demonstrated that the images generated after plugging AlignIT into the existing baselines are substantially more aligned with the input prompts, while also retaining the reconstruction quality of the concept of interest. Further, we quantified our improvements with existing protocols and a user survey, both of which clearly showed the efficacy of AlignIT.

References

  • [1] Aishwarya Agarwal, Srikrishna Karanam, KJ Joseph, Apoorv Saxena, Koustava Goswami, and Balaji Vasan Srinivasan. A-star: Test-time attention segregation and retention for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
  • [2] Aishwarya Agarwal, Srikrishna Karanam, Tripti Shukla, and Balaji Vasan Srinivasan. An image is worth multiple words: Multi-attribute inversion for constrained text-to-image synthesis. arXiv preprint arXiv:2311.11919, 2023.
  • [3] Hila Chefer, Yuval Alaluf, Yael Vinker, Lior Wolf, and Daniel Cohen-Or. Attend-and-excite: Attention-based semantic guidance for text-to-image diffusion models. ACM Transactions on Graphics (TOG), 42(4):1–10, 2023.
  • [4] Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. arXiv preprint arXiv:2212.05032, 2022.
  • [5] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. In Proceedings of the International Conference on Learning Representations, 2023.
  • [6] Rinon Gal, Moab Arar, Yuval Atzmon, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. Encoder-based domain tuning for fast personalization of text-to-image models. ACM Transactions on Graphics (TOG), 42(4):1–13, 2023.
  • [7] Inhwa Han, Serin Yang, Taesung Kwon, and Jong Chul Ye. Highly personalized text embedding for image manipulation by stable diffusion. arXiv preprint arXiv:2303.08767, 2023.
  • [8] Ligong Han, Yinxiao Li, Han Zhang, Peyman Milanfar, Dimitris Metaxas, and Feng Yang. Svdiff: Compact parameter space for diffusion fine-tuning. arXiv preprint arXiv:2303.11305, 2023.
  • [9] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • [10] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1931–1941, 2023.
  • [11] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.
  • [12] Quynh Phung, Songwei Ge, and Jia-Bin Huang. Grounded text-to-image synthesis with attention refocusing. arXiv preprint arXiv:2306.05427, 2023.
  • [13] Royi Rassin, Eran Hirsch, Daniel Glickman, Shauli Ravfogel, Yoav Goldberg, and Gal Chechik. Linguistic binding in diffusion models: Enhancing attribute correspondence through attention map alignment. arXiv preprint arXiv:2306.08877, 2023.
  • [14] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • [15] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  • [16] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22500–22510, 2023.
  • [17] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
  • [18] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • [19] Yoad Tewel, Rinon Gal, Gal Chechik, and Yuval Atzmon. Key-locked rank one editing for text-to-image personalization. In ACM SIGGRAPH 2023 Conference Proceedings, pages 1–11, 2023.
  • [20] Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. Designing an encoder for stylegan image manipulation. ACM Transactions on Graphics (TOG), 40(4):1–14, 2021.
  • [21] Qiucheng Wu, Yujian Liu, Handong Zhao, Trung Bui, Zhe Lin, Yang Zhang, and Shiyu Chang. Harnessing the spatial-temporal attention of diffusion models for high-fidelity text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7766–7776, 2023.
  • [22] Ruoyu Zhao, Mingrui Zhu, Shiyin Dong, Nannan Wang, and Xinbo Gao. Catversion: Concatenating embeddings for diffusion-based text-to-image personalization. arXiv preprint arXiv:2311.14631, 2023.
  • [23] Peihao Zhu, Rameen Abdal, Yipeng Qin, John Femiani, and Peter Wonka. Improved stylegan embedding: Where are the good latents? arXiv preprint arXiv:2012.09036, 2020.