HIVE: Harnessing Human Feedback for Instructional Visual Editing

Shu Zhang*1, Xinyi Yang*1, Yihao Feng*1, Can Qin1, Chia-Chih Chen1, Ning Yu1, Zeyuan Chen1,
Huan Wang1, Silvio Savarese1,2, Stefano Ermon2, Caiming Xiong1, Ran Xu1
1Salesforce AI Research, 2Stanford University
Abstract

Incorporating human feedback has been shown to be crucial to align text generated by large language models to human preferences. We hypothesize that state-of-the-art instructional image editing models, where outputs are generated based on an input image and an editing instruction, could similarly benefit from human feedback, as their outputs may not adhere to the correct instructions and preferences of users. In this paper, we present a novel framework to harness human feedback for instructional visual editing (HIVE). Specifically, we collect human feedback on the edited images and learn a reward function to capture the underlying user preferences. We then introduce scalable diffusion model fine-tuning methods that can incorporate human preferences based on the estimated reward.

In addition, to mitigate the bias caused by limited data, we contribute a new 1.1M training dataset, a 3.6K reward dataset for reward learning, and a 1K evaluation dataset to boost the performance of instructional image editing. We conduct extensive quantitative and qualitative experiments, showing that HIVE is favored over previous state-of-the-art instructional image editing approaches by a large margin.

*Denotes equal contribution. Primary contact: [email protected].
Our project page: https://shugerdou.github.io/hive/.

Figure 1: We show four groups of representative results. In each triplet, from left to right are: the original image, InstructPix2Pix [7] using our data (IP2P-Ours), and HIVE. We observe that HIVE leads to more acceptable results than the model without human feedback. For instance, in the left two examples, IP2P-Ours understands the editing instruction “remove” and “change to blue” individually, but fails to understand the corresponding objects. Human feedback resolves this ambiguity, as shown in other examples as well.

1 Introduction

Figure 2: Overall architecture of HIVE. The first step is to train a baseline HIVE without human feedback. In the second step, we collect human feedback to rank variant outputs for each image-instruction pair, and train a reward model to learn the rewards. In the third step, we fine-tune diffusion models by integrating the estimated rewards.

State-of-the-art (SOTA) text-to-image generative models have shown impressive performance in terms of both image quality and alignment between output images and captions [1, 46, 44]. Thanks to the impressive generation abilities of these models, instructional image editing has emerged as one of the most promising application scenarios for content generation [7]. Different from traditional image editing [3, 16, 55, 30], where both the input caption and the edited caption are needed, instructional image editing only requires human-readable instructions. For instance, classic image editing approaches require an input caption “a dog is playing a ball” and an edited caption “a cat is playing a ball”. In contrast, instructional image editing only needs an editing instruction such as “change the dog to a cat”. This experience mimics how humans naturally perform image editing.

Instructional image editing was first proposed in InstructPix2Pix [7], which fine-tunes a pre-trained stable diffusion [46] by curating a triplet of the original image, instruction, and edited image, with the help of GPT-3 [8] and Prompt-to-Prompt image editing [16]. Though achieving promising results, the training data generation process of InstructPix2Pix lacks explicit alignment between editing instructions and edited images.

Consequently, the modified images may only align to a certain extent with the editing instructions, as shown in the second column of Fig. 4. Furthermore, since these editing instructions are provided by human users, it’s crucial that the final edited images accurately reflect the users’ true intentions and preferences. Typically, humans prefer to make selective changes to the original images, which are usually not factored into the training data or objectives of InstructPix2Pix [7]. Considering this observation and the recent successes of ChatGPT [35], we propose to refine the stable diffusion process with human feedback. This adjustment aims to ensure that the edited images more closely correspond to editing instructions provided by humans.

For large language models (LLMs) such as InstructGPT [35, 37], we often first learn a reward function to reflect what humans care about or prefer on the generated text output, and then leverage reinforcement learning (RL) algorithms such as proximal policy optimization (PPO) [50] to fine-tune the models. This process is often referred to as reinforcement learning with human feedback (RLHF). Leveraging RLHF to fine-tune diffusion-based generative models, however, remains challenging. Applying on-policy algorithms (e.g., PPO) to maximize rewards during the fine-tuning process can be prohibitively expensive due to the hundreds or thousands of denoising steps required for each sampled image. Moreover, even with fast sampling methods [52, 57, 21, 31], it is still challenging to back-propagate the gradient signal to the parameters of the U-Net. (Footnote 1: We present a rigorous discussion of this difficulty in Appendix C.1.)

To address the technical issues described above, we propose Harnessing Human Feedback for Instructional Visual Editing (HIVE), which allows us to fine-tune diffusion-based generative models with human feedback. As shown in Fig. 2, HIVE consists of three steps:

1)  We perform instructional supervised fine-tuning on the dataset that combines our newly collected 1.1M training data and the data from InstructPix2Pix. We collect the 1.1M training data because we observed failure cases and suspect that grounding the visual components of the image to the instruction is still a challenging problem.

2)  For each input image and editing instruction pair, we ask human annotators to rank variant outputs of the fine-tuned model from step 1, which gives us a reward learning dataset. Using the collected dataset, we then train a reward model (RM) that reflects human preferences.

3)  We estimate the reward for each training sample used in step 1, and integrate the reward to perform human feedback diffusion model fine-tuning using our proposed objectives presented in Sec. 3.4.

Our main contributions are summarized as follows:

•  To tackle the technical challenge of fine-tuning diffusion models using human feedback, we introduce two scalable fine-tuning approaches in Sec. 3.4, which are computationally efficient and offer similar costs compared with supervised fine-tuning. Moreover, we empirically show that human feedback is an essential component to boost the performance of instructional image editing models.

•  To explore the fundamental ability of instructional editing, we create a new dataset for HIVE including three sub-datasets: a new 1.1M training dataset, a 3.6K reward dataset for reward learning, and a 1K evaluation dataset.

•  To increase the diversity of the data for training, we introduce cycle consistency augmentation based on the inversion of editing instruction. Our dataset has been enriched with one pair of data for bi-directional editing.

2 Related Work

Text-To-Image Generation.   Text-to-image generative models have achieved tremendous success in the past decade. Generative adversarial nets (GANs) [15] are one of the fundamental methods that dominated the early-stage works [45, 61, 59]. Recently, diffusion models [51, 17, 53, 52] have achieved state-of-the-art text-to-image generation performance [12, 34, 43, 44, 47, 60, 46, 29]. As a result, instead of training a text-to-image model from scratch, our work focuses on fine-tuning an existing stable diffusion model [46] by leveraging additional human feedback.

Image Editing.   Similarly, diffusion-model-based image editing methods, e.g., SDEdit [32], BlendedDiffusion [3], BlendedLatentDiffusion [2], DiffusionClip [22], EDICT [55], or MagicMix [30], have garnered significant attention in recent years. To leverage a pre-trained image-text representation (e.g., CLIP [42], BLIP [28]) and text-to-image diffusion based pre-trained models [44, 47, 46], most existing works focus on text-based localized editing [5, 33, 16]. Prompt-to-Prompt [16] edits the cross-attention layer in Imagen and stable diffusion to control the similarity of image and text prompt. ControlNet [62] and UniControl [41] adopt controllable conditions to control image editing. Recently, InstructPix2Pix [7] tackles the problem via a different approach, requiring only a human-readable editing instruction to perform image editing. Our work follows the same direction as InstructPix2Pix [7] and leverages human feedback to address the misalignment between editing instructions and resulting edited images.

Learning with Human Feedback.   Incorporating human feedback into the learning process can be a highly effective way to enhance performance across various tasks such as fine-tuning LLMs [35, 4, 48, 37, 54], robotic simulation [10, 18], and computer vision [40], to name a few. Many existing works leverage PPO [50] to align to human feedback; however, on-policy RL algorithms are not suitable for diffusion-based model fine-tuning (see more discussion in Appendix C.1).

Simultaneously, several concurrent works [56, 58, 25] study the text-to-image generation problem using human feedback. ReFL [58] investigates how to back-propagate the reward signal to a random later denoising step in the diffusion process, while [56] explores how to design a fine-grained human preference score to improve the generation quality. [25] leverages human feedback to align text-to-image generation by naively treating the reward as a weight for maximum likelihood training. Different from the above works, our work tackles the problem of instructional image editing, where there is little or even no ground truth data for the alignment between human-readable editing instructions and edited images. In addition, conditioning on both the input image and the instruction makes training harder than in standard text-to-image tasks, which in turn makes human feedback more valuable.

3 Methodology

In this section, we introduce the new datasets we collected in Sec. 3.1, and explain the three major steps of HIVE in the rest of the section. Concretely, we introduce the instructional supervised training in Sec. 3.2, and describe how to train a reward model to score edited images in Sec. 3.3, then present two scalable fine-tuning methods to align diffusion models with human feedback in Sec. 3.4.

3.1 Dataset

Instructional Edit Training Dataset.   We follow the same method as [7] to generate the training dataset. We collect 1K images and their corresponding captions. We ask three annotators to write three instructions and corresponding edited captions based on the collected input captions. Therefore, we obtain 9K prompt triplets: input caption, instruction, and edited caption. We use these triplets to fine-tune GPT-3 [8] with OpenAI API v0.25.0 [36]. We use the fine-tuned GPT-3 to generate five instructions and edited captions per input image-caption pair in Laion-Aesthetics V2 [49]. We observe that the captions from Laion are not always visually descriptive, so we use BLIP [28] to generate more diverse types of image captions. Stable diffusion based Prompt-to-Prompt [16] is then adopted to generate paired images. In addition, we design a cycle-consistent augmentation method (Sec. 3.2.1) to generate additional training data. We generate 1.17M training triplets in total. Combining these with the 281K training data from [7], we obtain 1.45M training image pairs along with instructions.

Reward Fine-tuning Dataset.   We collect 3.6K image-instruction pairs for the task of reward fine-tuning. Among them, 1.6K image-instruction pairs are manually collected, and the rest are from Laion-Aesthetics V2 with GPT-3 generated instructions. We use this dataset to ask annotators to rank various model outputs.

Evaluation Dataset.   We use two evaluation datasets: the test dataset in [7] for quantitative evaluation and a new 1K dataset collected for the user study. The quantitative evaluation dataset is generated following the same method as the training dataset, which means that the dataset does not contain real images. Our collected 1K dataset contains 200 real images, and each image is annotated with five human-written instructions. More details of annotation tooling, guidelines, and analysis are in Appendix A.

3.2 Instructional Supervised Training

We follow the instructional fine-tuning method in [7] with two major upgrades on dataset curation (Sec. 3.1) and cycle consistency augmentation (Sec. 3.2.1). A pre-trained stable diffusion model [46] is adopted as the backbone architecture. In instructional supervised training, the stable diffusion model has two conditions $c=[c_I, c_E]$, where $c_E$ is the editing instruction, and $c_I$ is the latent space of the original input image. In the training process, a pre-trained auto-encoder [23] with encoder $\mathcal{E}$ and decoder $\mathcal{D}$ is used to convert between edited image $\tilde{\boldsymbol{x}}$ and its latent representation $z=\mathcal{E}(\tilde{\boldsymbol{x}})$. The diffusion process is composed of an equally weighted sequence of denoising autoencoders $\epsilon_\theta(z_t, t, c)$, $t=1,\cdots,T$, which are trained to predict a denoised variant of their input $z_t$, a noisy version of $z$. The objective of instructional supervised training is:

$$L=\mathbb{E}_{\mathcal{E}(\tilde{\boldsymbol{x}}),\,c,\,\epsilon\sim\mathcal{N}(0,1),\,t}\Big[\|\epsilon-\epsilon_\theta(z_t,t,c)\|_2^2\Big]\,.$$
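As a minimal single-sample sketch of this objective, the snippet below noises a latent and scores a noise prediction. The forward rule $z_t=\sqrt{\bar\alpha_t}\,z+\sqrt{1-\bar\alpha_t}\,\epsilon$ is the standard DDPM form and is an assumption here, since the text does not spell out the noise schedule. The names `add_noise` and `noise_prediction_loss` are hypothetical, and plain Python lists stand in for latent tensors.

```python
import math


def add_noise(z, eps, alpha_bar_t):
    """Return the noisy latent z_t (assumed DDPM forward process)."""
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [signal * zi + noise * ei for zi, ei in zip(z, eps)]


def noise_prediction_loss(eps, eps_pred):
    """Squared L2 distance between the true and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))
```

In training, `eps_pred` would come from the conditional denoiser $\epsilon_\theta(z_t,t,c)$, and the loss would be averaged over samples, noise draws, and time steps.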

3.2.1 Cycle Consistency Augmentation

Cycle consistency is a powerful technique that has been widely applied in image-to-image generation [63, 19]. It involves coupling and inverting bi-directional mappings of two variables $X$ and $Y$, $G: X\rightarrow Y$ and $F: Y\rightarrow X$, such that $F(G(X))\approx X$ and vice versa. This approach has been shown to enhance generative mapping in both directions.

While InstructPix2Pix [7] considers instructional image editing as a single-direction mapping, we propose adding cycle consistency. Our approach involves a forward-pass editing step, $F: x\stackrel{inst}{\longrightarrow}\tilde{\boldsymbol{x}}$. We then introduce instruction reversion to enable a reverse-pass mapping, $R: \tilde{\boldsymbol{x}}\stackrel{\sim inst}{\longrightarrow}x$. In this way, we can close the loop of image editing as $x\stackrel{inst}{\longrightarrow}\tilde{\boldsymbol{x}}\stackrel{\sim inst}{\longrightarrow}x$, e.g., “add a dog” to “remove the dog”.

To ensure the effectiveness of this technique, we need to separate invertible and non-invertible instructions in the dataset. We devised a rule-based method that combines part-of-speech tagging and template matching. We found that most instructions adhere to a particular structure, with the verb appearing at the start, followed by objects and prepositions. Thus, we grammatically tagged all instructions using the Natural Language Toolkit (NLTK, https://www.nltk.org/). We identified all invertible verbs and pairing verbs, and also analyzed the semantics of the objects and the prepositions used. By summarizing invertible instructions in predefined templates, we matched the desired instructions. Our analysis revealed that 29.1% of the instructions in the dataset were invertible. We augmented this data to create more comprehensive training data, which facilitated cycle consistency. For more information, see Appendix B.1.
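The idea of instruction reversion can be illustrated with a deliberately small sketch. It is not the rule set described above: a single regular expression stands in for the NLTK tagging and the predefined templates, and the verb pair, the function names `invert_instruction` and `augment`, and the file names are all hypothetical.

```python
import re

# Hypothetical template: an invertible verb, an article, then the object.
_TEMPLATE = re.compile(r"(add|remove)\s+(a|an|the)\s+(.+)")


def invert_instruction(instruction):
    """Return the reversed instruction, or None if it is not invertible."""
    match = _TEMPLATE.fullmatch(instruction.strip().lower())
    if match is None:
        return None
    verb, _article, obj = match.groups()
    if verb == "add":
        return f"remove the {obj}"
    article = "an" if obj[0] in "aeiou" else "a"
    return f"add {article} {obj}"


def augment(triplets):
    """Append a reversed (edited, ~instruction, original) triplet when possible."""
    augmented = list(triplets)
    for original, instruction, edited in triplets:
        reverse = invert_instruction(instruction)
        if reverse is not None:
            augmented.append((edited, reverse, original))
    return augmented


print(augment([("x.png", "add a dog", "x_edit.png")]))
```

Instructions that match no template, such as style changes, are left as single-direction examples.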

3.3 Human Feedback Reward Learning

The second step of HIVE is to learn a reward function $\mathcal{R}_\phi(\tilde{\boldsymbol{x}}, c)$, which takes the original input image, the text instruction condition $c=[c_I, c_E]$, and the edited image $\tilde{\boldsymbol{x}}$ that is generated by the fine-tuned stable diffusion as input, and outputs a scalar that reflects human preference.

Unlike InstructGPT, which only takes text as input, our reward model $\mathcal{R}_\phi(\tilde{\boldsymbol{x}}, c)$ needs to measure the alignment between instructions and the edited images. To address the challenge, we present a reward model architecture in Fig. 3, which leverages pre-trained vision-language models such as BLIP [28]. More specifically, the reward model employs an image-grounded text encoder as the multi-modal encoder to take the joint image embedding and the text instruction as input and produce a multi-modal embedding. A linear layer is then applied to the multi-modal embedding to map it to a scalar value. More details are in Appendix B.2.

With the specifically designed network architecture, we train the reward function $\mathcal{R}_\phi(\tilde{\boldsymbol{x}}, c)$ with our collected reward fine-tuning dataset $\mathcal{D}_{\mathrm{human}}$ introduced in Sec. 3.1. For each input image $c_I$ and instruction $c_E$ pair, we have $K$ edited images $\{\tilde{\boldsymbol{x}}_k\}_{k=1}^{K}$ ranked by human annotators, and denote the human preference of edited image $\tilde{\boldsymbol{x}}_i$ over $\tilde{\boldsymbol{x}}_j$ by $\tilde{\boldsymbol{x}}_i\succ\tilde{\boldsymbol{x}}_j$. Then we can follow the Bradley-Terry model of preferences [6, 37] to define the pairwise loss function:

$$\ell_{\mathrm{RM}}(\phi):=-\sum_{\tilde{\boldsymbol{x}}_i\succ\tilde{\boldsymbol{x}}_j}\log\left[\frac{\exp(\mathcal{R}_\phi(\tilde{\boldsymbol{x}}_i,c))}{\sum_{k=i,j}\exp(\mathcal{R}_\phi(\tilde{\boldsymbol{x}}_k,c))}\right]\,,$$

where $(i,j)\in[1\ldots K]$ and we can get $\binom{K}{2}$ pairs of comparison for each condition $c$. Similar to [37], we put all the $\binom{K}{2}$ pairs for each condition $c$ in a single batch to learn the reward functions. We provide a detailed reward model training discussion in Appendix B.2.
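The pairwise loss for one condition can be sketched as follows, assuming the reward scores of the $K$ edited images are already computed and listed from most to least preferred. The function names are hypothetical. The sketch uses the identity $-\log\frac{e^{r_i}}{e^{r_i}+e^{r_j}}=\log(1+e^{r_j-r_i})$ so that each pair reduces to a numerically stable softplus.

```python
import math
from itertools import combinations


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pairwise_reward_loss(rewards_best_first):
    """Bradley-Terry loss summed over all pairs of one ranked list.

    rewards_best_first[k] is the reward model score of the k-th ranked
    edited image, with index 0 being the most preferred.
    """
    loss = 0.0
    # combinations keeps list order, so r_i always belongs to the preferred image.
    for r_i, r_j in combinations(rewards_best_first, 2):
        loss += softplus(r_j - r_i)
    return loss
```

A list of $K$ scores yields $\binom{K}{2}$ terms, which matches putting all comparisons for one condition in the same batch. Scores that agree with the ranking give a smaller loss than scores that contradict it.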

Figure 3: Model architecture for reward $R(\tilde{\boldsymbol{x}}, c)$. Here the reward model evaluates human preference for an edited image of a hand selecting an orange compared to the original input image of the hand selecting an apple. The input to the reward model includes both images and a text instruction. The output is a score indicating the degree of preference for the edited image based on the input image and instruction.

3.4 Human Feedback based Model Fine-tuning

With the learned reward function $\mathcal{R}_\phi(\tilde{\boldsymbol{x}}, c)$, the next step is to improve the instructional supervised training model by reward maximization. As a result, we can obtain an instructional diffusion model that aligns with human preferences.

The RL fine-tuning techniques we present are built upon recent offline RL techniques [27, 38, 9, 20]. With an input image and editing instruction condition $c=[c_I, c_E]$, we define the edited image data distribution generated by the instructional supervised diffusion model as $p(\tilde{\boldsymbol{x}}|c)$, and the edited image data distribution generated by the current diffusion model we want to optimize as $\rho(\tilde{\boldsymbol{x}}|c)$. Then, under the pessimistic principle of offline RL, we can optimize $\rho$ with the following objective:

$$J(\rho):=\max_{\rho}\mathbb{E}_{c}\Big[\mathbb{E}_{\tilde{\boldsymbol{x}}\sim\rho(\cdot|c)}[\mathcal{R}_\phi(\tilde{\boldsymbol{x}},c)]-\eta\,\mathrm{KL}\big(\rho(\tilde{\boldsymbol{x}}|c)\,\|\,p(\tilde{\boldsymbol{x}}|c)\big)\Big]\,, \tag{1}$$

where $\eta$ is a hyper-parameter. The first term in Eq. (1) is the standard reward maximization in RL. The second term is a regularization to stabilize learning, which is a widely used technique in offline RL [24], and is also adopted for PPO fine-tuning of InstructGPT (a.k.a. “PPO-ptx”) [37].

To avoid using sampling-based methods to optimize $\rho$, we can differentiate $J(\rho)$ w.r.t. $\rho(\tilde{\boldsymbol{x}}|c)$ and solve for the optimal $\rho^*(\tilde{\boldsymbol{x}}|c)$, resulting in the following expression for the optimal solution of Eq. (1):

$$\rho^*(\tilde{\boldsymbol{x}}|c)\propto p(\tilde{\boldsymbol{x}}|c)\exp\left(\mathcal{R}_\phi(\tilde{\boldsymbol{x}},c)/\eta\right)\,, \tag{2}$$

or $\rho^*(\tilde{\boldsymbol{x}}|c)=\frac{1}{Z(c)}p(\tilde{\boldsymbol{x}}|c)\exp\left(\mathcal{R}_\phi(\tilde{\boldsymbol{x}},c)/\eta\right)$, with $Z(c)=\int p(\tilde{\boldsymbol{x}}|c)\exp\left(\mathcal{R}_\phi(\tilde{\boldsymbol{x}},c)/\eta\right)d\tilde{\boldsymbol{x}}$ being the partition function. A detailed derivation is in Appendix C.2.
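The closed form can be checked numerically on a toy problem. The sketch below assumes a discrete space of three candidate edits for a single condition, with made-up probabilities and rewards, so the integral in the partition function becomes a sum. All names and numbers are illustrative. Substituting the optimal distribution into the KL-regularized objective gives the value $\eta\log Z(c)$, which the code confirms.

```python
import math


def optimal_distribution(p, rewards, eta):
    """Reward-tilted distribution rho* and its partition function Z."""
    weights = [p_k * math.exp(r_k / eta) for p_k, r_k in zip(p, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z


def objective(rho, p, rewards, eta):
    """Expected reward minus eta * KL(rho || p) for one condition."""
    expected_reward = sum(q * r for q, r in zip(rho, rewards))
    kl = sum(q * math.log(q / p_k) for q, p_k in zip(rho, p) if q > 0.0)
    return expected_reward - eta * kl


# Toy values, chosen only for illustration.
p = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
eta = 0.5

rho_star, z = optimal_distribution(p, rewards, eta)
print(objective(rho_star, p, rewards, eta), eta * math.log(z))
print(objective(p, p, rewards, eta))
```

The first line prints two matching values, and the second prints the smaller objective obtained by leaving the supervised model unchanged.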

Weighted Reward Loss.   The optimal target distribution $\rho^*(\tilde{\boldsymbol{x}}|c)$ in Eq. (2) can be viewed as an exponential reward-weighted distribution for $p(\tilde{\boldsymbol{x}}|c)$. Moreover, we have already obtained the empirical edited image data drawn from $p(\tilde{\boldsymbol{x}}|c)$ when constructing the instructional editing dataset, and we can view the exponential reward-weighted edited image $\tilde{\boldsymbol{x}}$ from the instructional editing dataset as an empirical approximation of samples drawn from $\rho^*(\tilde{\boldsymbol{x}}|c)$. Formally, we can fine-tune a diffusion model so that it generates data from $\rho^*(\tilde{\boldsymbol{x}}|c)$, resulting in the weighted reward loss:

$$\ell_{\mathrm{WR}}(\theta):=\mathbb{E}_{\mathcal{E}(\tilde{\boldsymbol{x}}),\,c,\,\epsilon\sim\mathcal{N}(0,1),\,t}\left[\omega(\tilde{\boldsymbol{x}},c)\cdot\left\|\epsilon-\epsilon_\theta(z_t,t,c)\right\|_2^2\right]\,,$$

with $\omega(\tilde{\boldsymbol{x}},c)=\exp\left(\mathcal{R}_\phi(\tilde{\boldsymbol{x}},c)/\eta\right)$ being the exponential reward weight for edited image $\tilde{\boldsymbol{x}}$ and condition $c$. Different from RL literature [39, 38] using exponential reward or advantage weights to learn a policy function, our weighted reward loss is derived for fine-tuning stable diffusion.
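A batch-level sketch of the weighted reward loss is given below. It assumes the per-sample denoising losses and the reward model scores have already been computed, and the function name `weighted_reward_loss` is hypothetical. The weights are left unnormalized, as in the formula.

```python
import math


def weighted_reward_loss(per_sample_losses, rewards, eta):
    """Batch average of exp(R / eta) times the per-sample denoising loss."""
    weighted = [
        math.exp(reward / eta) * loss
        for loss, reward in zip(per_sample_losses, rewards)
    ]
    return sum(weighted) / len(weighted)
```

Because the weights depend only on the stored training samples and their estimated rewards, no image has to be sampled from the model during fine-tuning, which keeps the cost close to that of supervised fine-tuning.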

Condition Reward Loss.   We can also leverage the control-as-inference perspective of RL [26] to transform Eq. (2) to a conditional reward expression, thus we can directly view the reward as a conditional label to fine-tune diffusion models. Similar to [26], we introduce a new binary variable Rsuperscript𝑅R^{*}italic_R start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT indicating whether human prefers the edited image or not, where R=1superscript𝑅1R^{*}=1italic_R start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = 1 denotes that human prefers the edited image, and R=0superscript𝑅0R^{*}=0italic_R start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = 0 denotes that human does not prefer, thus we have p(R=1|𝒙~,c)exp(ϕ(𝒙~,c))proportional-to𝑝superscript𝑅conditional1~𝒙𝑐subscriptitalic-ϕ~𝒙𝑐p(R^{*}=1~{}|~{}\tilde{\boldsymbol{x}},c)\propto\exp\left(\mathcal{R}_{\phi}(% \tilde{\boldsymbol{x}},c)\right)italic_p ( italic_R start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT = 1 | over~ start_ARG bold_italic_x end_ARG , italic_c ) ∝ roman_exp ( caligraphic_R start_POSTSUBSCRIPT italic_ϕ end_POSTSUBSCRIPT ( over~ start_ARG bold_italic_x end_ARG , italic_c ) ). Together with Eq. (2), and applying Bayes rules gives us the following derivation:

$$
\begin{aligned}
p(\tilde{\boldsymbol{x}}|c)\exp\left(\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)/\eta\right) &:= p(\tilde{\boldsymbol{x}}|c)\left(p(R^{*}=1\,|\,\tilde{\boldsymbol{x}},c)\right)^{1/\eta} \\
&= p(\tilde{\boldsymbol{x}}|c)\left(\frac{p(\tilde{\boldsymbol{x}}\,|\,R^{*}=1,c)\,p(R^{*}=1\,|\,c)}{p(\tilde{\boldsymbol{x}}|c)}\right)^{1/\eta} \\
&\propto p(\tilde{\boldsymbol{x}}|c)^{1-1/\eta}\,p(\tilde{\boldsymbol{x}}\,|\,R^{*}=1,c)^{1/\eta}\,,
\end{aligned}
$$

where we drop $p(R^{*}=1\,|\,c)$ since it is constant w.r.t. $\tilde{\boldsymbol{x}}$. We can now view the reward for each edited image as an additional condition. Defining the new condition $\tilde{c}=[c_{I},c_{E},c_{R}]$, with $c_{R}$ as the reward label, we can fine-tune the diffusion model with the condition reward loss:

$$\ell_{\mathrm{CR}}(\theta)=\mathbb{E}_{\mathcal{E}(x),\tilde{c},\epsilon\sim\mathcal{N}(0,1),t}\Big[\|\epsilon-\epsilon_{\theta}(z_{t},t,\tilde{c})\|_{2}^{2}\Big]\,.$$
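The Bayes-rule derivation that precedes the condition reward loss can be checked numerically on a toy discrete distribution. The sketch below is only a sanity check under assumed values (three candidate edits, made-up probabilities and rewards, and an arbitrary proportionality constant `scale`); it confirms that reward-tilting $p(\tilde{\boldsymbol{x}}|c)$ and interpolating between $p(\tilde{\boldsymbol{x}}|c)$ and $p(\tilde{\boldsymbol{x}}|R^{*}=1,c)$ agree after normalization.

```python
import math

def normalize(ws):
    s = sum(ws)
    return [w / s for w in ws]

eta = 2.0
p_x = [0.5, 0.3, 0.2]        # toy p(x~ | c) over three candidate edits
reward = [0.1, 1.0, -0.5]    # toy R_phi(x~, c)

# p(R*=1 | x~, c) is proportional to exp(R); the shared scale cancels after normalization.
scale = 0.2
p_r_given_x = [scale * math.exp(r) for r in reward]

# Left-hand side: p(x~|c) * exp(R / eta), normalized.
lhs = normalize([p * math.exp(r / eta) for p, r in zip(p_x, reward)])

# Bayes' rule: p(x~ | R*=1, c) = p(R*=1 | x~, c) p(x~|c) / p(R*=1 | c).
p_r = sum(pr * p for pr, p in zip(p_r_given_x, p_x))
p_x_given_r = [pr * p / p_r for pr, p in zip(p_r_given_x, p_x)]

# Right-hand side: p(x~|c)^(1 - 1/eta) * p(x~ | R*=1, c)^(1/eta), normalized.
rhs = normalize([p ** (1 - 1 / eta) * q ** (1 / eta) for p, q in zip(p_x, p_x_given_r)])

max_gap = max(abs(a - b) for a, b in zip(lhs, rhs))
```

Setting `eta = 1.0` in this sketch makes the right-hand side reduce to $p(\tilde{\boldsymbol{x}}|R^{*}=1,c)$ alone, which is the case where conditioning on the preferred label fully replaces the reward tilt.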
Figure 4: Comparisons between IP2P-Official (InstructPix2Pix official model), IP2P-Ours (InstructPix2Pix using our data) and HIVE. HIVE can boost performance by understanding the instruction correctly.

We quantize the reward into five categories, based on the quantiles of the empirical reward distribution of the training dataset, and convert the reward value into a text prompt. For instance, if the reward value of a training pair lies in the bottom 20% of the reward distribution of the dataset, then we convert the reward value into the text prompt condition $c_{R}:=$ “The image quality is one out of five”. At inference time, to generate edited images, we fix the text prompt as $c_{R}:=$ “The image quality is five out of five”, indicating that we want the generated edited images to have the highest reward. We empirically find that this technique improves the stability of fine-tuning.
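The quantization step can be sketched as follows. The helper names, the way quantile edges are picked from the sorted rewards, and the toy reward values are assumptions for illustration; only the five-level prompt wording comes from the text above.

```python
import bisect

LEVELS = ["one", "two", "three", "four", "five"]

def quantile_edges(train_rewards, n_bins=5):
    """Empirical cut points that split the sorted training rewards into n_bins groups."""
    ordered = sorted(train_rewards)
    n = len(ordered)
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]

def reward_to_prompt(reward, edges):
    """Map a scalar reward to its quantile bucket and render it as the text condition c_R."""
    bucket = bisect.bisect_right(edges, reward)  # 0 (lowest 20%) .. 4 (highest 20%)
    return f"The image quality is {LEVELS[bucket]} out of five"

train_rewards = [i / 10 for i in range(10)]  # hypothetical rewards 0.0, 0.1, ..., 0.9
edges = quantile_edges(train_rewards)
low_prompt = reward_to_prompt(0.05, edges)

# At inference the prompt is fixed to the top bucket, regardless of any reward value.
inference_prompt = f"The image quality is {LEVELS[-1]} out of five"
```

Because the buckets are defined by quantiles rather than fixed thresholds, each of the five prompts is seen roughly equally often during fine-tuning in this sketch.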

4 Experiments

This section presents the experimental results and ablation studies of HIVE’s technical choices, demonstrating the effectiveness of our method. We adopt the default guidance scale parameters in InstructPix2Pix for a fair comparison. In our experiments, we found that the condition reward loss performs slightly better than the weighted reward loss, and we therefore present our results based on the condition reward loss. The detailed comparisons can be found in Sec. 4.2 and Appendix D.3.

Figure 5: Comparisons between IP2P-Official, IP2P-Ours, and HIVE. The plot shows the trade-off between consistency with the input image and consistency with the edit. Higher is better. For all methods, we adopt the same parameters as those in [7].

We evaluate our method using two datasets: a synthetic evaluation dataset with 15,652 image pairs from [7] and a self-collected 1K evaluation dataset with real image-instruction pairs. For the synthetic dataset, we follow InstructPix2Pix’s quantitative evaluation metric and plot the trade-offs between CLIP image similarity and directional CLIP similarity [14]. For the 1K dataset, we conduct a user study in which, for each instruction, the images generated by competing methods are reviewed and voted on by three human annotators, and the winner is determined by majority vote.
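As a rough sketch of the two quantities plotted for the synthetic dataset, the snippet below computes an image similarity and a directional similarity from embedding vectors. It assumes the commonly used formulation in which directional similarity is the cosine between the change in image embeddings and the change in caption embeddings; the 3-d vectors stand in for CLIP embeddings and all names are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clip_image_similarity(img_in, img_out):
    """Consistency with the input image: cosine of the two image embeddings."""
    return cosine(img_in, img_out)

def directional_clip_similarity(img_in, img_out, txt_in, txt_out):
    """Consistency with the edit: does the image change align with the caption change?"""
    d_img = [b - a for a, b in zip(img_in, img_out)]
    d_txt = [b - a for a, b in zip(txt_in, txt_out)]
    return cosine(d_img, d_txt)

# Toy 3-d embeddings standing in for CLIP outputs (assumed values).
img_in, img_out = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]
txt_in, txt_out = [0.0, 0.0, 1.0], [0.0, 2.0, 1.0]
direction = directional_clip_similarity(img_in, img_out, txt_in, txt_out)
image_sim = clip_image_similarity(img_in, img_out)
```

The two scores typically pull against each other: a stronger edit raises the directional score while lowering similarity to the input image, which is why they are reported as a trade-off curve.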

(a) IP2P-Official vs. IP2P-Ours (b) IP2P-Ours vs. HIVE
Figure 6: User study comparing (a) IP2P-Official vs. IP2P-Ours and (b) IP2P-Ours vs. HIVE. IP2P-Ours obtains 30% more votes than IP2P-Official. HIVE obtains 25% more votes than IP2P-Ours.

4.1 Baseline Comparisons

We perform experiments with the same setup as InstructPix2Pix, where Stable Diffusion (SD) v1.5 is adopted. We compare three models: the official InstructPix2Pix model (IP2P-Official), InstructPix2Pix trained on our data (IP2P-Ours, which is the same as HIVE without human feedback), and HIVE. We report the quantitative results on the synthetic evaluation dataset in Fig. 5. We observe that IP2P-Ours improves notably over IP2P-Official (blue curve vs. green curve). Moreover, human feedback further boosts the performance of HIVE over IP2P-Ours by a large margin (red curve vs. blue curve). In other words, at the same directional similarity value, HIVE obtains better image consistency than InstructPix2Pix.

To test the effectiveness of HIVE on real-world images, we report the user study results on the 1K evaluation dataset. We use “Tie” to indicate that users consider the results equally good or equally bad. As shown in Fig. 6(a), IP2P-Ours gets around 30% more votes than IP2P-Official. This result is consistent with the results on the synthetic dataset. We also report the user study outcome between HIVE and IP2P-Ours in Fig. 6(b). The user study leads to conclusions similar to those of the consistency plot: HIVE gets around 25% more votes than IP2P-Ours.

In Fig. 4, we present representative edits that demonstrate the effectiveness of HIVE. The results show that while using more data can partially improve instruction following without human feedback, the reward model leads to better alignment between the instruction and the edited image. For example, in the second row, IP2P-Ours generates a door-like object, but with the guidance of human feedback, the generated door matches human perception better. In the fourth row, an example taken from the failure cases in [7], HIVE can locate the tie and change its color correctly.

Additionally, our visual analysis of the results (Fig. 7) indicates that HIVE tends to preserve the parts of the original image that are not instructed to be edited, while IP2P-Ours leads to excessive image editing more often. For instance, in the first example of Fig. 7, HIVE blends a pond naturally into the original image. The two InstructPix2Pix models fulfill the same instruction but, at the same time, alter the uninstructed part of the original background.

Figure 7: Human feedback tends to help HIVE avoid unwanted excessive image modifications.

4.2 Ablation Study

Weighted Reward and Condition Reward Loss.   We perform a user study on HIVE with each of these two losses. As shown in Fig. 8, the two losses obtain similar human preferences on the evaluation dataset. More comparisons are in Appendix D.

Figure 8: User study of pairwise comparison between (a) HIVE with weighted reward loss and (b) HIVE with condition reward loss. The human preferences are close to each other.

Cycle Consistency.   We analyze the impact of cycle consistency, which is introduced in Sec. 3.2.1. The top five augmentations in the cycle consistency are shown in Fig. 9(a). We perform evaluation on both the synthetic dataset and the 1K evaluation dataset. The user study in Fig. 9(b) shows that the cycle consistency augmentation improves the performance of HIVE by a notable margin.

Figure 9: Cycle consistency analysis. (a) Top five augmentations. (b) User study of cycle consistency.

Success Rate on Verbs.   Five verbs account for around 85% of all verbs in the evaluation instructions; details can be found in Appendix A. We compare HIVE with IP2P-Ours on these five verbs and report the success rate of both methods. As shown in Fig. 10, HIVE improves the most on “add”, from 23.5% to 28.7%.

Figure 10: Success rate of IP2P-Ours and HIVE on the top five verbs. Left: IP2P-Ours. Right: HIVE.

Other Baselines.   To test the effectiveness of HIVE, we experiment with two additional baselines. In Fig. 11(a), we upgrade the Stable Diffusion backbone from v1.5 to v2.1. We observe that the upgraded backbone slightly improves the results. In Fig. 11(b), we directly use the reward scalar instead of the reward prompt as the condition for training, and condition on the highest reward scalar when generating the image. We use a user study to compare this model (named HIVE-Reward) with HIVE. HIVE obtains 25.8% more votes than the baseline model conditioned on the reward score. This is mainly because directly conditioning on the highest reward might cause overfitting.

(a) HIVE with SD v1.5 and v2.1 (b) HIVE-Reward vs. HIVE
Figure 11: User study of pairwise comparison between (a) HIVE with SD v1.5 and v2.1 and (b) HIVE conditioning on reward score and HIVE. The human preferences are very close to each other.

Failure Cases and Limitations.   We summarize representative failure cases in Fig. 12. First, some instructions cannot be understood. In the upper left example in Fig. 12, the prompt “zoom in” and similar instructions rarely succeed. We believe the root cause is that the current training data generation method fails to generate image pairs for this type of instruction. Second, counting and spatial reasoning are common failure cases (see the upper right example in Fig. 12). We find that instructions containing “one”, “two”, or “on the right” can lead to many undesired results. Third, object understanding is sometimes wrong. In the bottom left example, the red color is applied to the wrong object. This is a common error in HIVE, where the object to be edited is wrongly recognized.

We find some other limitations as well. One limitation of HIVE is that it brings no benefit in cases where all outputs of the model without human feedback are wrong in the same way. In such cases, user preferences cannot always improve the results. We believe that improving the data as well as the base model is an important future step. Another limitation is that, compared to Prompt-to-Prompt [16], which is used to generate our training data, HIVE sometimes leads to unstructured changes in the image. We think that this is due to the limitation of the current training data. Instructed editing can involve more diverse and ambiguous scenarios than traditional image editing problems. Using GPT-3 to finetune prompts to generate the training data is limited by the model and the labeled data. More ablation studies are in Appendix D.

Figure 12: Failure examples.

5 Conclusion and Discussion

In this paper, we introduce a novel framework called HIVE that enables instructional image editing with human feedback. Our framework integrates human feedback, which is quantified as reward values, into the diffusion model fine-tuning process. We design two variants of the approach, and both of them improve performance over previous state-of-the-art instructional image editing methods. Our work demonstrates that instructional image editing with human feedback is a viable approach to align image generation with human preference, thus unlocking new opportunities and the potential to scale up model capabilities towards more powerful applications such as conversational image editing. While our method demonstrates impressive performance, we have also identified failure scenarios, as discussed in Sec. 4.2. In addition, it is possible that our trained model inherits bias and suffers from harmful content from pre-trained foundation models such as Stable Diffusion, GPT-3 and BLIP. These limitations should be considered when interpreting our results, and we expect red teaming with human feedback to mitigate some of the risks in future work.

References

  • Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
  • Avrahami et al. [2022a] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022a.
  • Avrahami et al. [2022b] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022b.
  • Bai et al. [2022] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  • Bar-Tal et al. [2022] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European Conference on Computer Vision, pages 707–723. Springer, 2022.
  • Bradley and Terry [1952] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • Brooks et al. [2022] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • Chen et al. [2021] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021.
  • Christiano et al. [2017] Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. NeurIPS, 2017.
  • Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. NeurIPS, 34:8780–8794, 2021.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Gal et al. [2022] Rinon Gal, Or Patashnik, Haggai Maron, Amit Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics, 41(4):1–13, 2022.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NeurIPS, 2014.
  • Hertz et al. [2022] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.
  • Ibarz et al. [2018] Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in atari. NeurIPS, 2018.
  • Isola et al. [2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • Janner et al. [2021] Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. In Advances in Neural Information Processing Systems, 2021.
  • Karras et al. [2022] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022.
  • Kim et al. [2022] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2426–2435, 2022.
  • Kingma and Welling [2013] Diederik Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kumar et al. [2020] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
  • Lee et al. [2023] Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
  • Levine [2018] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
  • Levine et al. [2020] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022.
  • Li et al. [2023] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. arXiv preprint arXiv:2301.07093, 2023.
  • Liew et al. [2022] Jun Hao Liew, Hanshu Yan, Daquan Zhou, and Jiashi Feng. Magicmix: Semantic mixing with diffusion models. arXiv preprint arXiv:2210.16056, 2022.
  • Lu et al. [2022] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
  • Meng et al. [2021] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
  • Meng et al. [2022] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
  • Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
  • OpenAI [a] OpenAI. Chatgpt. https://openai.com/blog/chatgpt/, a.
  • OpenAI [b] OpenAI. Openaiapi. https://platform.openai.com/docs/guides/fine-tuning, b.
  • Ouyang et al. [2022] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
  • Peng et al. [2019] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
  • Peters et al. [2010] Jan Peters, Katharina Mulling, and Yasemin Altun. Relative entropy policy search. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1607–1612, 2010.
  • Pinto et al. [2023] Andre Susano Pinto, Alexander Kolesnikov, Yuge Shi, Lucas Beyer, and Xiaohua Zhai. Tuning computer vision models with task rewards. arXiv preprint arXiv:2302.08242, 2023.
  • Qin et al. [2023] Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, et al. Unicontrol: A unified diffusion model for controllable visual generation in the wild. In NeurIPS, 2023.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
  • Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, pages 8821–8831. PMLR, 2021.
  • Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Reed et al. [2016] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text-to-image synthesis. In ICML, 2016.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
  • Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
  • Scheurer et al. [2022] Jeremy Scheurer, Jon Ander Campos, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. Training language models with language feedback. arXiv preprint arXiv:2204.14146, 2022.
  • Schuhmann et al. [2021] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2111.02114, 2021.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of International Conference on Machine Learning, pages 2256–2265, 2015.
  • Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
  • Stiennon et al. [2020] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. NeurIPS, 2020.
  • Wallace et al. [2022] Bram Wallace, Akash Gokul, and Nikhil Naik. Edict: Exact diffusion inversion via coupled transformations. arXiv preprint arXiv:2211.12446, 2022.
  • Wu et al. [2023] Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Better aligning text-to-image models with human preference. arXiv preprint arXiv:2303.14420, 2023.
  • Xiao et al. [2021] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion gans. arXiv preprint arXiv:2112.07804, 2021.
  • Xu et al. [2023] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977, 2023.
  • Xu et al. [2018] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
  • Yu et al. [2022] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
  • Zhang et al. [2017] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In IEEE International Conference on Computer Vision (ICCV), 2023.
  • Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

Appendix

Appendix A Data Collection and User Study

For evaluation, we collect real-world images with instructions using Amazon Mechanical Turk (Mturk, https://www.mturk.com). We randomly collect 200 real-world images. We then ask Mturk annotators to write five instructions for each image, and encourage them to use their imagination and diversify the instruction types. We encourage annotators not to limit themselves to keeping the image realistic. For example, annotators can write “add a horse in the sky”. A screenshot of the interface is shown in Fig. 13. We analyze the top five verbs and nouns in the evaluation dataset. Fig. 15(a) shows that the verbs “add”, “change”, “make”, “remove” and “put” make up around 85% of all verbs, which means that the editing instruction verbs have a long-tail distribution. In contrast, the distribution of nouns in Fig. 15(b) is close to uniform, where the top five nouns represent only around 20% of all nouns.
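A verb analysis of this kind can be sketched with a simple frequency count. The snippet assumes, purely for illustration, that the verb is the first word of each instruction; the source does not say how verbs were extracted, and the example instructions other than “add a horse in the sky” are made up.

```python
from collections import Counter

def top_k_share(instructions, k=5):
    """Return the k most frequent leading verbs and their share of all instructions.

    Assumption: the verb is the first word of each instruction.
    """
    verbs = Counter(text.split()[0].lower() for text in instructions if text.strip())
    top = verbs.most_common(k)
    share = sum(count for _, count in top) / sum(verbs.values())
    return top, share

instructions = [
    "add a horse in the sky",
    "Add a pond",
    "change the tie to blue",
    "remove the car",
    "zoom in",
]
top, share = top_k_share(instructions, k=2)
```

A share close to 1 for a small `k` indicates the long-tail verb distribution described above, while a small share indicates a near-uniform distribution, as reported for nouns.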

Figure 13: Mturk writing editing instructions interface: write five instructions per image.

In the user studies, we use Mturk to ask annotators to evaluate edited images. A screenshot of the interface is shown in Fig. 14. The annotators are provided with the original image, two edited images, and the editing instruction. They are asked to select the better edited image. A third option indicates that the edited images are equally good or equally bad. We ask three annotators to label each data sample, and use the majority vote to determine the result. We shuffle the order of the edited images to avoid a bias towards the left or the right image.
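The aggregation of three annotator labels per sample can be sketched as below. The label names and the handling of a three-way split (recorded as “Tie”) are assumptions, since the text does not specify what happens when no option reaches a majority.

```python
from collections import Counter

def majority_vote(votes):
    """Return the option chosen by at least two of the three annotators.

    votes: three labels from {"A", "B", "Tie"} (hypothetical label names).
    Assumption: a three-way split with no majority is recorded as "Tie".
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else "Tie"

def tally(samples):
    """Aggregate per-sample winners into a count per option."""
    return Counter(majority_vote(v) for v in samples)

samples = [
    ("A", "A", "B"),
    ("B", "Tie", "B"),
    ("A", "B", "Tie"),    # no majority
    ("Tie", "Tie", "A"),
]
counts = tally(samples)
```

In a real study the labels "A" and "B" would be mapped back to the underlying methods after undoing the left/right shuffle for each sample.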

Figure 14: Mturk labeling interface: select the better edited image.
Figure 15: Top five verbs and nouns in the evaluation dataset. (a) Top five verbs. (b) Top five nouns.

Appendix B Implementation Details

B.1 Instructional Supervised Training

We use pre-trained Stable Diffusion models as the initial checkpoint for instructional supervised training. We train HIVE on 40GB NVIDIA A100 GPUs for 500 epochs, using a learning rate of $10^{-4}$ and an image size of 256. At inference time, we use 512 as the default image resolution.

B.2 Human Feedback Rewards Learning

Figure 16: Overall architecture of HIVE. Different from Fig. 2, in the third step, we use weighted reward loss instead of condition reward loss to fine-tune the diffusion model.

As shown in Fig. 3, the reward model takes in an input image $c_{I}$, a text instruction $c_{E}$, and an edited image $\tilde{x}$, and outputs a scalar value. Inspired by recent work on vision-language models, especially BLIP [28], we employ a visual transformer [13] as our image encoder and an image-grounded text encoder as the multimodal encoder for images and text. Finally, we place a linear layer on top of the image-grounded text encoder to map the multimodal embedding to a scalar value.

(1) Visual transformer. We encode both the input image $c_{I}$ and the edited image $\tilde{x}$ with the same visual transformer. We then obtain the joint image embedding by concatenating the two image embeddings $vit(c_{I})$ and $vit(\tilde{x})$.

(2) Image-grounded text encoder. The image-grounded text encoder is a multimodal encoder that inserts one additional cross-attention layer between the self-attention layer and the feed-forward network in each transformer block of BERT [11]. The additional cross-attention layer incorporates visual information into the text model. The output embedding of the image-grounded text encoder is used as the multimodal representation of the ($c_{I}$, $c_{E}$, $\tilde{x}$) triplet.

We gather a dataset comprising 3,634 images for the purpose of ranking. For each image, we generate five variant edited images, and ask an annotator to rank images from best to worst. Additionally, we ask annotators to indicate if any of the following scenarios apply: (1) all edited images are edited but none of them follow the instruction; (2) all edited images are visually the same as the original image; (3) all images are edited beyond the scope of instruction; (4) edited images have harmful content containing sex, violence, porn, etc; and (5) all edited images look similar to each other. We compare training reward models by filtering some/all of these options.

We note that a considerable portion of the collected data falls under at least one of the aforementioned categories, indicating that even for humans, ranking these images is challenging. As a result, we only use the data that does not include any non-rankable options in the reward model training. From a pool of 1,412 images, we select 1,285 for the training set and use the remaining images for the validation set. The reward model is trained on a dataset of comparisons between multiple model outputs on the same input. Each comparison sample contains an input image, an instruction, five edited versions of the image, and the corresponding rankings. We divide the dataset into training and validation sets based on the distribution of the corresponding instructions.
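As an illustration of how one ranked sample can be turned into training signal, the sketch below expands a best-to-worst ranking of five edited images into all ordered pairs and scores them with a Bradley-Terry style pairwise loss. The pairwise loss is an assumption made here for illustration (the actual objective is the one defined in Sec. 3.3), and the image ids and scores are hypothetical.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_ids):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first element of each pair
    # is always ranked above the second.
    return list(combinations(ranked_ids, 2))


def pairwise_ranking_loss(scores, ranked_ids):
    """Mean of -log(sigmoid(r_better - r_worse)) over all pairs (assumed loss).

    scores: dict mapping image id -> scalar reward predicted by the model.
    ranked_ids: image ids ordered from best to worst by the annotator.
    """
    pairs = ranking_to_pairs(ranked_ids)
    total = 0.0
    for better, worse in pairs:
        margin = scores[better] - scores[worse]
        # Numerically stable softplus(-margin) == -log(sigmoid(margin)).
        total += math.log1p(math.exp(-abs(margin))) + max(-margin, 0.0)
    return total / len(pairs)


# Hypothetical annotator ranking of five edited images, best first.
ranking = ["e3", "e1", "e5", "e2", "e4"]
agree = {"e3": 2.0, "e1": 1.0, "e5": 0.0, "e2": -1.0, "e4": -2.0}
disagree = {k: -v for k, v in agree.items()}

print(len(ranking_to_pairs(ranking)))           # number of comparisons per sample
print(pairwise_ranking_loss(agree, ranking))    # scores consistent with the ranking
print(pairwise_ranking_loss(disagree, ranking)) # scores that invert the ranking
```

Under this assumed loss, five ranked edits yield ten comparisons per sample, and scores that agree with the annotator's order give a lower loss than scores that invert it.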

We apply the method in Sec. 3.3 to the reward data to develop a reward model. We initialize the reward model from the pre-trained BLIP, which was trained on paired images and captions using three objectives: image-text contrastive learning, image-text matching, and masked language modeling. There is a domain gap between BLIP's pre-training data and our reward data: the captions in BLIP's data describe a single image, whereas the instructions in our data refer to the difference between image pairs. Nevertheless, we hypothesize that leveraging the alignment between text and image learned by BLIP could enhance the reward model's ability to comprehend the relationship between the instruction and the image pairs.

The reward model is trained using 4 A100 GPUs for 10 epochs, employing a learning rate of $10^{-4}$ and weight decay of 0.05. The image encoder's and multimodal encoder's last layer outputs are utilized as image and multimodal representations, respectively. The encoders' final layer is the only fine-tuned component.

We use the trained reward model to generate a reward score on our training data. We perform two experiments. The first experiment takes the exponential rewards as weights and fine-tunes the diffusion model with weighted reward loss as described in Sec. 3.4. See Fig. 16 for the visualization of the method. The second experiment transforms the rewards to text prompts and fine-tunes the diffusion model with the condition reward loss as described in Sec. 3.4. The method is introduced in Fig. 2. We compare those two experiment settings, and results can be found in Sec. D.3.
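The two ways of using the reward scores can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure of Sec. 3.4: the temperature `eta`, the mean-one normalisation of the weights, the five quantisation levels, and the wording of the appended prompt are all hypothetical choices.

```python
import math


def exponential_weights(rewards, eta=1.0):
    """Per-sample weights proportional to exp(r / eta), normalised to mean 1.

    The mean-one normalisation is an assumption made for this sketch.
    """
    m = max(rewards)  # subtracting the max avoids overflow; it cancels below
    raw = [math.exp((r - m) / eta) for r in rewards]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def reward_to_prompt(instruction, reward, lo, hi, levels=5):
    """Append a quantised quality tag to the instruction (hypothetical wording)."""
    frac = 0.0 if hi == lo else (reward - lo) / (hi - lo)
    frac = min(max(frac, 0.0), 1.0)              # clamp out-of-range rewards
    level = min(levels, int(frac * levels) + 1)  # integer level in 1..levels
    return f"{instruction}. The image quality is {level} out of {levels}."


rewards = [-1.2, 0.3, 2.1]  # hypothetical reward-model outputs
weights = exponential_weights(rewards, eta=1.0)
prompts = [
    reward_to_prompt("make the sky blue", r, min(rewards), max(rewards))
    for r in rewards
]
print(weights)
print(prompts)
```

In the first setting each weight would multiply the per-sample denoising loss; in the second, the augmented prompt would replace the original instruction as the text condition.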

Appendix C Reward Maximization for Diffusion-Based Generative Models

C.1 Discussion on On-Policy based Reward Maximization for Diffusion Models

Directly adapting on-policy RL methods to the current training pipeline might be computationally expensive, but this does not mean that sampling-based approaches are infeasible for diffusion models. We consider developing more scalable sampling-based methods as future work.

We start the sampling methods derivation with the following objective:

$$J(\theta) := \max_{\pi_\theta} \mathbb{E}_{c \sim p_c}\Big[\mathbb{E}_{\tilde{\boldsymbol{x}} \sim \pi_\theta(\cdot|c)}\left[\mathcal{R}_\phi(\tilde{\boldsymbol{x}}, c)\right] - \eta\,\mathrm{KL}\left[p_{\mathcal{D}}(\tilde{\boldsymbol{x}}|c)\,\|\,\pi_\theta(\tilde{\boldsymbol{x}}|c)\right]\Big]\,, \tag{3}$$

where $p_c(c)\,p_{\mathcal{D}}(\tilde{\boldsymbol{x}}|c)$ is the joint distribution of the condition and edited image pair, and $\pi_\theta$ denotes the policy, i.e., the diffusion model we want to optimize. Note that $p_{\mathcal{D}}(\tilde{\boldsymbol{x}}|c)$ and $\pi_\theta(\tilde{\boldsymbol{x}}|c)$ are swapped compared with the objective in Eq. (1). The second term in Eq. (3) is the KL minimization formulation of maximum likelihood estimation, which is equivalent to the loss of diffusion models. We represent the policy $\pi_\theta$ via the reverse process of a conditional diffusion model:

$$\pi_\theta(\tilde{\boldsymbol{x}}|c) := p_\theta(\tilde{\boldsymbol{x}}^{0:T}|c) = p_0(\tilde{\boldsymbol{x}}^{T}) \prod_{t=1}^{T} p_\theta(\tilde{\boldsymbol{x}}^{t-1}|\tilde{\boldsymbol{x}}^{t}; c)\,,$$

where $p_0(\tilde{\boldsymbol{x}}^{T}) := \mathcal{N}(\tilde{\boldsymbol{x}}^{T}; \mathbf{0}, \mathbf{I})$, and $p_\theta(\tilde{\boldsymbol{x}}^{t-1}|\tilde{\boldsymbol{x}}^{t}; c) := \mathcal{N}(\tilde{\boldsymbol{x}}^{t-1}; \mu_\theta(\tilde{\boldsymbol{x}}^{t}, c, t), \sigma_t^2 \mathbf{I})$ is a Gaussian distribution whose parameters are defined by the score function $\epsilon_\theta$ and the step sizes of the noise schedule. We can therefore obtain an edited image sample $\tilde{\boldsymbol{x}}^{0}$ by running a reverse diffusion chain:

$$\tilde{\boldsymbol{x}}^{t-1}\,|\,\tilde{\boldsymbol{x}}^{t} = \frac{1}{\sqrt{\alpha_t}}\left(\tilde{\boldsymbol{x}}^{t} - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(\tilde{\boldsymbol{x}}^{t}, c, t)\right) + \sigma_t \boldsymbol{z}_t\,, \quad \boldsymbol{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\,, \quad \text{for } t = T, \ldots, 1\,,$$

and $\tilde{\boldsymbol{x}}^{T} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
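The reverse chain above can be sketched in a few lines for a scalar toy "image". This is an illustrative sketch only: the linear beta schedule, the choice $\sigma_t = \sqrt{\beta_t}$ with no noise at the last step, and `toy_eps_model` (a stand-in for the learned $\epsilon_\theta$, not a trained network) are all assumptions.

```python
import math
import random


def reverse_chain(eps_model, c, betas, rng):
    """Run the chain x^T -> x^0 for a scalar sample; returns [x^T, ..., x^0]."""
    T = len(betas)
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)          # cumulative product of alpha_1..alpha_t

    x = rng.gauss(0.0, 1.0)              # x^T ~ N(0, 1)
    trajectory = [x]
    for t in range(T, 0, -1):            # t = T, ..., 1
        a_t, ab_t = alphas[t - 1], alpha_bars[t - 1]
        # Assumed variance choice: sigma_t^2 = beta_t, and no noise at t = 1.
        sigma_t = math.sqrt(betas[t - 1]) if t > 1 else 0.0
        z = rng.gauss(0.0, 1.0)
        mean = (x - (1.0 - a_t) / math.sqrt(1.0 - ab_t) * eps_model(x, c, t)) / math.sqrt(a_t)
        x = mean + sigma_t * z
        trajectory.append(x)
    return trajectory


def toy_eps_model(x, c, t):
    """Hypothetical stand-in for epsilon_theta(x^t, c, t)."""
    return 0.1 * x + 0.0 * c


# Hypothetical linear schedule with 50 steps.
betas = [1e-4 + (0.02 - 1e-4) * i / 49 for i in range(50)]
traj = reverse_chain(toy_eps_model, c=0.0, betas=betas, rng=random.Random(0))
print(len(traj), traj[-1])
```

Each call evaluates the noise predictor once per step, which is why the cost of obtaining a single sample $\tilde{\boldsymbol{x}}^{0}$ grows with the number of diffusion steps.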

As a result, the reverse diffusion process can be viewed as a black-box function defined by $\epsilon_\theta$ and the noises $\boldsymbol{\epsilon} := (\boldsymbol{z}_T, \ldots, \boldsymbol{z}_1, \tilde{\boldsymbol{x}}^{T})$, which we can view as a network with shared parameters and injected noise. For each layer, the parameter is the score function $\epsilon_\theta$. Define the network as

$$\tilde{\boldsymbol{x}}^{0} := f(c, \boldsymbol{\epsilon}; \theta)\,, \quad \boldsymbol{\epsilon} \sim p_{\mathrm{noise}}(\cdot)\,, \; c \sim p_c(\cdot)\,,$$

where we can rewrite the first term as

$$\mathbb{E}_{c \sim \mathcal{D},\, \boldsymbol{\epsilon} \sim p_{\mathrm{noise}}(\cdot)}\left[\mathcal{R}_\phi(f(c, \boldsymbol{\epsilon}; \theta), c)\right]\,,$$

and we can optimize the parameter $\theta$ with the path gradient if $\mathcal{R}_{\cdot}$ is differentiable. Similarly, if we want to optimize the first term via PPO, the main technical difficulty is to estimate $\nabla_\theta \log \pi_\theta(\tilde{\boldsymbol{x}}|c)$, which can be done with the following derivation:

$$\nabla_\theta \log \pi_\theta(\tilde{\boldsymbol{x}}|c) = \nabla_\theta \log p_\theta(\tilde{\boldsymbol{x}}^{0:T}|c) = \sum_{t=1}^{T} \nabla_\theta \log p_\theta(\tilde{\boldsymbol{x}}^{t-1}|\tilde{\boldsymbol{x}}^{t}; c)\,.$$

Note that both the end-to-end path gradient method and PPO require sampling the reverse chain from $\tilde{\boldsymbol{x}}^{T}$ to $\tilde{\boldsymbol{x}}^{0}$, so we can estimate $\nabla_\theta \log \pi_\theta(\tilde{\boldsymbol{x}}|c)$ using the empirical samples $\tilde{\boldsymbol{x}}^{0:T}$.
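To make the per-step terms of this gradient concrete, the following worked step expands one summand. It assumes, for illustration only, that $\sigma_t$ is fixed rather than learned, so that only the mean $\mu_\theta$ depends on $\theta$. The prior term $\log p_0(\tilde{\boldsymbol{x}}^{T})$ drops out because it does not depend on $\theta$.

```latex
\begin{align*}
\log p_\theta(\tilde{\boldsymbol{x}}^{0:T}|c)
  &= \log p_0(\tilde{\boldsymbol{x}}^{T})
   + \sum_{t=1}^{T} \log p_\theta(\tilde{\boldsymbol{x}}^{t-1}|\tilde{\boldsymbol{x}}^{t}; c)\,, \\
\log p_\theta(\tilde{\boldsymbol{x}}^{t-1}|\tilde{\boldsymbol{x}}^{t}; c)
  &= -\frac{1}{2\sigma_t^2}
     \left\| \tilde{\boldsymbol{x}}^{t-1} - \mu_\theta(\tilde{\boldsymbol{x}}^{t}, c, t) \right\|^2
   + \mathrm{const}\,, \\
\nabla_\theta \log p_\theta(\tilde{\boldsymbol{x}}^{t-1}|\tilde{\boldsymbol{x}}^{t}; c)
  &= \frac{1}{\sigma_t^2}
     \left( \frac{\partial \mu_\theta(\tilde{\boldsymbol{x}}^{t}, c, t)}{\partial \theta} \right)^{\!\top}
     \left( \tilde{\boldsymbol{x}}^{t-1} - \mu_\theta(\tilde{\boldsymbol{x}}^{t}, c, t) \right)\,.
\end{align*}
```

Each of the $T$ summands therefore needs one backward pass through the noise-prediction network, evaluated at a sampled pair $(\tilde{\boldsymbol{x}}^{t-1}, \tilde{\boldsymbol{x}}^{t})$.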

For both of the above methods, performing one policy gradient update requires running the whole reverse chain to obtain an edited image sample $\tilde{\boldsymbol{x}}^{0}$, from which the parameter gradient of the first term is estimated. As a result, the computational cost exceeds the supervised fine-tuning cost by a factor equal to the number of diffusion steps. Fine-tuning the Stable Diffusion model currently takes us more than two days, so for a standard LDM, where the number of steps is 1000, training cannot finish within an acceptable time. Even with fast sampling methods such as DDIM or variance-preserving (VP) noise scaling, the number of diffusion steps is still more than 5 or 10. Further, we have not seen any previous work that uses such noise scaling to fine-tune Stable Diffusion. We therefore think that naive sampling-based methods carry a high risk of failing to match the performance of our current offline RL based approaches.
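A back-of-the-envelope version of this argument is sketched below. It assumes that training time scales linearly with the number of reverse steps per update and takes two days as the supervised baseline (a lower bound, since the text says "more than two days"); the function name is hypothetical.

```python
def estimated_days(supervised_days, diffusion_steps):
    """Linear-scaling estimate: one full reverse chain per gradient update."""
    return supervised_days * diffusion_steps


for steps in (1000, 50, 10):
    days = estimated_days(2.0, steps)
    print(f"{steps:>4} steps: about {days:,.0f} days ({days / 365:.1f} years)")
```

Under these assumptions, 1000 steps correspond to roughly 2000 days of training, and even a 10-step sampler would multiply the two-day baseline tenfold.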

C.2 Derivation for Eq. (2)

Taking a functional view of Eq. (2) and differentiating $J(\rho)$ w.r.t. $\rho$, we get

$$\frac{\partial J(\rho)}{\partial \rho} = \mathcal{R}_\phi(\tilde{\boldsymbol{x}}, c) - \eta\left(\log \rho(\tilde{\boldsymbol{x}}|c) + 1 - \log p(\tilde{\boldsymbol{x}}|c)\right)\,.$$

Setting $\frac{\partial J(\rho)}{\partial \rho} = 0$ gives us

$$\begin{aligned}
\log \rho(\tilde{\boldsymbol{x}}|c) &= \frac{1}{\eta}\mathcal{R}_\phi(\tilde{\boldsymbol{x}}, c) + \log p(\tilde{\boldsymbol{x}}|c) - 1\,, \\
\rho(\tilde{\boldsymbol{x}}|c) &\propto p(\tilde{\boldsymbol{x}}|c)\exp\left(\mathcal{R}_\phi(\tilde{\boldsymbol{x}}, c)/\eta\right)\,.
\end{aligned}$$

Thus we obtain the optimal $\rho^{*}(\tilde{\boldsymbol{x}}|c)$.
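The closed form can be checked numerically on a small discrete example. The sketch below assumes that Eq. (2) has the form $J(\rho) = \mathbb{E}_{\rho}[\mathcal{R}_\phi] - \eta\,\mathrm{KL}[\rho\,\|\,p]$, which is the objective whose derivative is given above; the three-outcome distribution `p`, the rewards, and `eta` are hypothetical values. The additive constant $-1$ is absorbed by the normalising constant `z`.

```python
import math


def objective(rho, p, rewards, eta):
    """J(rho) = E_rho[R] - eta * KL(rho || p) on a finite outcome set."""
    exp_reward = sum(r_i * q for r_i, q in zip(rewards, rho))
    kl = sum(q * math.log(q / p_i) for q, p_i in zip(rho, p) if q > 0.0)
    return exp_reward - eta * kl


def optimal_rho(p, rewards, eta):
    """Closed form rho* proportional to p * exp(R / eta), normalised by z."""
    unnorm = [p_i * math.exp(r_i / eta) for p_i, r_i in zip(p, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm], z


p = [0.5, 0.3, 0.2]        # hypothetical data distribution p(x|c)
rewards = [1.0, 0.0, 2.0]  # hypothetical rewards R(x, c)
eta = 0.5

rho_star, z = optimal_rho(p, rewards, eta)
candidates = [p, [1 / 3, 1 / 3, 1 / 3], [0.1, 0.1, 0.8], [0.05, 0.05, 0.9]]

print(objective(rho_star, p, rewards, eta), eta * math.log(z))
print([objective(q, p, rewards, eta) for q in candidates])
```

At the optimum the objective equals $\eta \log Z$, where $Z$ is the normalising constant, and none of the alternative distributions attains a higher value.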

Appendix D Additional Ablation Study

D.1 SD v1.5 and v2.1

In Sec. 4.2, we upgrade the Stable Diffusion backbone from v1.5 to v2.1, where the OpenCLIP text encoder [49] replaces the CLIP text encoder [42]. In this section, we show the quantitative consistency plot on the synthetic evaluation dataset in Fig. 17(a), which leads to conclusions similar to those of the user study in Fig. 11(a). We also compare IP2P-Ours with SD v1.5 and v2.1. Interestingly, when we train IP2P-Ours with SD v2.1, its improvement over SD v1.5, shown in Fig. 17(b), is larger than the corresponding improvement for HIVE in Fig. 17(a).

Figure 17: HIVE and IP2P-Ours with SD v1.5 and v2.1. Panel titles (left to right): IP2P-Ours with SD v1.5 and v2.1; InstructPix2Pix with SD v1.5 and v2.1.

D.2 Model Adaptation

We demonstrate that HIVE can use a reward model trained on a backbone different from the one used in Step 3. We use data generated by SD v1.5 to train the reward model, and perform the remaining steps with SD v2.1. We report user study results in Fig. 18. We observe that users vote similarly for the reward models trained on the two SD backbones. In other words, the reward model is able to adapt from one backbone to another.

Figure 18: SD v1.5 trained vs. SD v2.1 trained reward model

D.3 Weighted Reward and Conditional Reward Losses

We compare the weighted reward loss and conditional reward loss on the synthetic evaluation dataset. As shown in Fig. 19, the performances of these two losses are close to each other, while the conditional reward loss is slightly better. Therefore we adopt the conditional reward loss in all our experiments.

Figure 19: HIVE with weighted reward loss and conditional reward loss.

D.4 Training with Less Data

We analyze the effect of the training data size. We compare HIVE with SD v1.5 at four training dataset size ratios: 100%, 50%, 30%, and 10%. As shown in Fig. 20, a drastic reduction of the dataset, e.g., to 10% of the data, weakens the ability to perform large image edits. On the other hand, a moderate reduction, e.g., to 50% of the data, results in similar yet slightly worse performance.

Figure 20: HIVE with different training data size.

D.5 Subcategory Analysis

We classify the edits into the following sub-categories: changing the global style, adjusting attributes of the main object, adding/removing objects, manipulating objects, and other challenging cases such as zooming and camera view changes. We use ChatGPT (https://chat.openai.com/) to determine which sub-category each instruction belongs to. The numbers of instructions in each sub-category are as follows: changing global style (133), adjusting attributes of the main object (134), adding/removing objects (508), manipulating objects (219), and others (6). We analyze user study results for each sub-category. Fig. 21 shows that the largest improvement comes from the sub-categories "Add/remove objects" and "Manipulate objects".
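As a quick sanity check on these counts, the short sketch below totals them and computes each sub-category's share. The dictionary keys are shortened labels chosen here; the total would match the 1K evaluation dataset mentioned in the abstract.

```python
counts = {
    "change global style": 133,
    "adjust attributes": 134,
    "add/remove objects": 508,
    "manipulate objects": 219,
    "others": 6,
}

total = sum(counts.values())
shares = {name: 100.0 * n / total for name, n in counts.items()}

print(total)
for name, pct in shares.items():
    print(f"{name:<22} {pct:5.1f}%")
```

Add/remove instructions make up about half of the set, so improvements in that sub-category carry the most weight in the aggregate user study.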

Figure 21: Subcategory analysis between IP2P and HIVE. Panel titles: Change global style; Adjust attributes; Add/remove; Manipulate objects; Others.

D.6 Additional Visualized Results

We illustrate additional visualized results in Figs. 22, 23, 24, 25, and 26, where each row illustrates three instructional editing examples.

Figure 22: Additional editing results.
Figure 23: Additional editing results.
Figure 24: Additional editing results.
Figure 25: Additional editing results.
Figure 26: Additional editing results.