HIVE: Harnessing Human Feedback for Instructional Visual Editing

Shu Zhang*¹, Xinyi Yang*¹, Yihao Feng*¹, Can Qin¹, Chia-Chih Chen¹, Ning Yu¹, Zeyuan Chen¹,
Huan Wang¹, Silvio Savarese¹,², Stefano Ermon², Caiming Xiong¹, Ran Xu¹
¹Salesforce AI Research, ²Stanford University

Abstract

Incorporating human feedback has been shown to be crucial to align text generated by large language models to human preferences. We hypothesize that state-of-the-art instructional image editing models, where outputs are generated based on an input image and an editing instruction, could similarly benefit from human feedback, as their outputs may not adhere to the correct instructions and preferences of users. In this paper, we present a novel framework to harness human feedback for instructional visual editing (HIVE). Specifically, we collect human feedback on the edited images and learn a reward function to capture the underlying user preferences. We then introduce scalable diffusion model fine-tuning methods that can incorporate human preferences based on the estimated reward.

In addition, to mitigate the bias caused by limited data, we contribute a new 1.1M training dataset, a 3.6K reward dataset for reward learning, and a 1K evaluation dataset to boost the performance of instructional image editing. We conduct extensive quantitative and qualitative experiments, showing that HIVE is favored over previous state-of-the-art instructional image editing approaches by a large margin.

*Denotes equal contribution. Primary contact: [email protected].
Our project page: https://shugerdou.github.io/hive/.

Figure 1: We show four groups of representative results. In each triplet, from left to right are: the original image, InstructPix2Pix [7] using our data (IP2P-Ours), and HIVE. We observe that HIVE leads to more acceptable results than the model without human feedback. For instance, in the left two examples, IP2P-Ours understands the editing instruction “remove” and “change to blue” individually, but fails to understand the corresponding objects. Human feedback resolves this ambiguity, as shown in other examples as well.

1 Introduction

Figure 2: Overall architecture of HIVE. The first step is to train a baseline HIVE without human feedback. In the second step, we collect human feedback to rank variant outputs for each image-instruction pair, and train a reward model to learn the rewards. In the third step, we fine-tune diffusion models by integrating the estimated rewards.

State-of-the-art (SOTA) text-to-image generative models have shown impressive performance in terms of both image quality and alignment between output images and captions [1, 46, 44]. Thanks to the impressive generation abilities of these models, instructional image editing has emerged as one of the most promising application scenarios for content generation [7]. Different from traditional image editing [3, 16, 55, 30], where both the input caption and the edited caption are needed, instructional image editing only requires human-readable instructions. For instance, classic image editing approaches require an input caption “a dog is playing a ball” and an edited caption “a cat is playing a ball”. In contrast, instructional image editing only needs an editing instruction such as “change the dog to a cat”. This experience mimics how humans naturally perform image editing.

Instructional image editing was first proposed in InstructPix2Pix [7], which fine-tunes a pre-trained stable diffusion [46] by curating a triplet of the original image, instruction, and edited image, with the help of GPT-3 [8] and Prompt-to-Prompt image editing [16]. Though achieving promising results, the training data generation process of InstructPix2Pix lacks explicit alignment between editing instructions and edited images.

Consequently, the modified images may only align to a certain extent with the editing instructions, as shown in the second column of Fig. 4. Furthermore, since these editing instructions are provided by human users, it’s crucial that the final edited images accurately reflect the users’ true intentions and preferences. Typically, humans prefer to make selective changes to the original images, which are usually not factored into the training data or objectives of InstructPix2Pix [7]. Considering this observation and the recent successes of ChatGPT [35], we propose to refine the stable diffusion process with human feedback. This adjustment aims to ensure that the edited images more closely correspond to editing instructions provided by humans.

For large language models (LLMs) such as InstructGPT [35, 37], we often first learn a reward function to reflect what humans care about or prefer in the generated text output, and then leverage reinforcement learning (RL) algorithms such as proximal policy optimization (PPO) [50] to fine-tune the models. This process is often referred to as reinforcement learning with human feedback (RLHF). Leveraging RLHF to fine-tune diffusion-based generative models, however, remains challenging. Applying on-policy algorithms (e.g., PPO) to maximize rewards during the fine-tuning process can be prohibitively expensive due to the hundreds or thousands of denoising steps required for each sampled image. Moreover, even with fast sampling methods [52, 57, 21, 31], it is still challenging to back-propagate the gradient signal to the parameters of the U-Net. (We present a rigorous discussion of this difficulty in Appendix C.1.)

To address the technical issues described above, we propose Harnessing Human Feedback for Instructional Visual Editing (HIVE), which allows us to fine-tune diffusion-based generative models with human feedback. As shown in Fig. 2, HIVE consists of three steps:

1) We perform instructional supervised fine-tuning on a dataset that combines our newly collected 1.1M training data with the data from InstructPix2Pix. We collect the additional 1.1M training data because, after observing failure cases, we suspect that grounding the visual components of an image to the instruction remains a challenging problem.

2) For each input image and editing instruction pair, we ask human annotators to rank variant outputs of the fine-tuned model from step 1, which gives us a reward learning dataset. Using the collected dataset, we then train a reward model (RM) that reflects human preferences.

3) We estimate the reward for each training sample used in step 1, and integrate the reward to perform human feedback diffusion model fine-tuning using our proposed objectives presented in Sec. 3.4.

Our main contributions are summarized as follows:

$\bullet$ To tackle the technical challenge of fine-tuning diffusion models using human feedback, we introduce two scalable fine-tuning approaches in Sec. 3.4, which are computationally efficient and incur costs similar to those of supervised fine-tuning. Moreover, we empirically show that human feedback is an essential component to boost the performance of instructional image editing models.

$\bullet$ To explore the fundamental ability of instructional editing, we create a new dataset for HIVE including three sub-datasets: a new 1.1M training dataset, a 3.6K reward dataset for rewards learning, and a 1K evaluation dataset.

$\bullet$ To increase the diversity of the data for training, we introduce cycle consistency augmentation based on the inversion of editing instruction. Our dataset has been enriched with one pair of data for bi-directional editing.

2 Related Work

Text-To-Image Generation. Text-to-image generative models have achieved tremendous success in the past decade. Generative adversarial nets (GANs) [15] are one of the fundamental methods that dominated the early-stage works [45, 61, 59]. Recently, diffusion models [51, 17, 53, 52] have achieved state-of-the-art text-to-image generation performance [12, 34, 43, 44, 47, 60, 46, 29]. As a result, instead of training a text-to-image model from scratch, our work focuses on fine-tuning an existing stable diffusion model [46] by leveraging additional human feedback.

Image Editing. Similarly, diffusion-model-based image editing methods, e.g., SDEdit [32], BlendedDiffusion [3], BlendedLatentDiffusion [2], DiffusionClip [22], EDICT [55], or MagicMix [30], have garnered significant attention in recent years. To leverage a pre-trained image-text representation (e.g., CLIP [42], BLIP [28]) and text-to-image diffusion-based pre-trained models [44, 47, 46], most existing works focus on text-based localized editing [5, 33, 16]. Prompt-to-Prompt [16] edits the cross-attention layer in Imagen and stable diffusion to control the similarity of image and text prompt. ControlNet [62] and UniControl [41] adopt controllable conditions to guide image editing. Recently, InstructPix2Pix [7] tackles the problem via a different approach, requiring only a human-readable editing instruction to perform image editing. Our work follows the same direction as InstructPix2Pix [7] and leverages human feedback to address the misalignment between editing instructions and the resulting edited images.

Learning with Human Feedback. Incorporating human feedback into the learning process can be a highly effective way to enhance performance across various tasks such as fine-tuning LLMs [35, 4, 48, 37, 54], robotic simulation [10, 18], and computer vision [40], to name a few. Many existing works leverage PPO [50] to align to human feedback; however, on-policy RL algorithms are not suitable for diffusion-based model fine-tuning (see more discussion in Appendix C.1).

Simultaneously, several concurrent works [56, 58, 25] study the text-to-image generation problem using human feedback. ReFL [58] investigates how to back-propagate the reward signal to a randomly chosen later denoising step in the diffusion process, while [56] explores how to design a fine-grained human preference score to improve the generation quality. [25] leverages human feedback to align text-to-image generation, naively viewing the reward as weights for maximum likelihood training. Different from the above works, our work tackles the problem of instructional image editing, where there is little or even no ground truth data for the alignment between human-readable editing instructions and edited images. In addition, conditioning on both the input image and the instruction makes training harder than in standard text-to-image tasks, which in turn makes human feedback more valuable.

3 Methodology

In this section, we introduce the new datasets we collected in Sec. 3.1, and explain the three major steps of HIVE in the rest of the section. Concretely, we introduce the instructional supervised training in Sec. 3.2, and describe how to train a reward model to score edited images in Sec. 3.3, then present two scalable fine-tuning methods to align diffusion models with human feedback in Sec. 3.4.

3.1 Dataset

Instructional Edit Training Dataset. We follow the same method as [7] to generate the training dataset. We collect 1K images and their corresponding captions. We ask three annotators to each write three instructions and corresponding edited captions based on the collected input captions. We thereby obtain 9K prompt triplets: input caption, instruction, and edited caption. We fine-tune GPT-3 [8] on these triplets using the OpenAI API v0.25.0 [36]. We use the fine-tuned GPT-3 to generate five instructions and edited captions per input image-caption pair in Laion-Aesthetics V2 [49]. We observe that the captions from Laion are not always visually descriptive, so we use BLIP [28] to generate more diverse types of image captions. Stable-diffusion-based Prompt-to-Prompt [16] is then adopted to generate paired images. In addition, we design a cycle-consistent augmentation method (Sec. 3.2.1) to generate additional training data. We generate 1.17M training triplets in total. Combining these with the 281K training data from [7], we obtain 1.45M training image pairs along with instructions.

Reward Fine-tuning Dataset. We collect 3.6K image-instruction pairs for the task of reward fine-tuning. Among them, 1.6K image-instruction pairs are manually collected, and the rest are from Laion-Aesthetics V2 with GPT-3 generated instructions. We use this dataset to ask annotators to rank various model outputs.

Evaluation Dataset. We use two evaluation datasets: the test dataset in [7] for quantitative evaluation and a new 1K dataset collected for the user study. The quantitative evaluation dataset is generated following the same method as the training dataset, which means that the dataset does not contain real images. Our collected 1K dataset contains 200 real images, and each image is annotated with five human-written instructions. More details of annotation tooling, guidelines, and analysis are in Appendix A.

3.2 Instructional Supervised Training

We follow the instructional fine-tuning method in [7] with two major upgrades on dataset curation (Sec. 3.1) and cycle consistency augmentation (Sec. 3.2.1). A pre-trained stable diffusion model [46] is adopted as the backbone architecture. In instructional supervised training, the stable diffusion model has two conditions $c=\left[c_{I},c_{E}\right]$, where $c_{E}$ is the editing instruction, and $c_{I}$ is the latent representation of the original input image. In the training process, a pre-trained auto-encoder [23] with encoder $\mathcal{E}$ and decoder $\mathcal{D}$ is used to convert between the edited image $\tilde{\boldsymbol{x}}$ and its latent representation $z=\mathcal{E}(\tilde{\boldsymbol{x}})$. The diffusion process is composed of an equally weighted sequence of denoising autoencoders $\epsilon_{\theta}(z_{t},t,c)$, $t=1,\cdots,T$, which are trained to predict a denoised variant of their input $z_{t}$, a noisy version of $z$. The objective of instructional supervised training is:

$L=\mathbb{E}_{\mathcal{E}(\tilde{\boldsymbol{x}}),c,\epsilon\sim\mathcal{N}(0,1),t}\Big[\left\|\epsilon-\epsilon_{\theta}(z_{t},t,c)\right\|_{2}^{2}\Big]\,.$
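As a minimal numerical sketch of this objective (not the authors' training code), the snippet below noises a latent and evaluates the squared error for a single sample. The linear beta schedule, the toy latent shape, and the `predict_noise` callable standing in for $\epsilon_{\theta}$ are assumptions made purely for illustration; the paper does not specify them in this section.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def noise_prediction_loss(z, cond, predict_noise, alpha_bar, rng):
    """Single-sample estimate of E[ ||eps - eps_theta(z_t, t, c)||_2^2 ]."""
    t = rng.integers(0, len(alpha_bar))      # uniformly sampled timestep index
    eps = rng.standard_normal(z.shape)       # eps ~ N(0, I)
    # Noisy latent z_t built from the clean latent z = E(x_tilde).
    z_t = np.sqrt(alpha_bar[t]) * z + np.sqrt(1.0 - alpha_bar[t]) * eps
    residual = eps - predict_noise(z_t, t, cond)
    return float(np.sum(residual ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z = rng.standard_normal((4, 8, 8))           # toy latent of an edited image

# Hypothetical stand-in for eps_theta that always predicts zero noise.
zero_model = lambda z_t, t, c: np.zeros_like(z_t)
loss_value = noise_prediction_loss(z, None, zero_model, alpha_bar, rng)
```

In a real setup `predict_noise` would be the conditioned U-Net and `cond` would carry both $c_{I}$ and $c_{E}$; here `cond` is passed through untouched.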

3.2.1 Cycle Consistency Augmentation

Cycle consistency is a powerful technique that has been widely applied in image-to-image generation [63, 19]. It involves coupling and inverting bi-directional mappings of two variables $X$ and $Y$, $G:X\rightarrow Y$ and $F:Y\rightarrow X$, such that $F(G(X))\approx X$ and vice versa. This approach has been shown to enhance generative mapping in both directions.

While InstructPix2Pix [7] considers instructional image editing as a single-direction mapping, we propose adding cycle consistency. Our approach involves a forward-pass editing step, $F:x\stackrel{inst}{\longrightarrow}\tilde{\boldsymbol{x}}$. We then introduce instruction reversion to enable a reverse-pass mapping, $R:\tilde{\boldsymbol{x}}\stackrel{\sim inst}{\longrightarrow}x$. In this way, we can close the loop of image editing as $x\stackrel{inst}{\longrightarrow}\tilde{\boldsymbol{x}}\stackrel{\sim inst}{\longrightarrow}x$, e.g., “add a dog” to “remove the dog”.

To ensure the effectiveness of this technique, we need to separate invertible and non-invertible instructions in the dataset. We devised a rule-based method that combines part-of-speech tagging and template matching. We found that most instructions adhere to a particular structure, with the verb appearing at the start, followed by objects and prepositions. Thus, we grammatically tagged all instructions using the Natural Language Toolkit (NLTK, https://www.nltk.org/). We identified all invertible verbs and pairing verbs, and also analyzed the semantics of the objects and the prepositions used. By summarizing invertible instructions in predefined templates, we matched the desired instructions. Our analysis revealed that 29.1% of the instructions in the dataset were invertible. We augmented this data to create more comprehensive training data, which facilitated cycle consistency. For more information, see Appendix B.1.
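The idea of template-based instruction reversion can be sketched as follows. This is only an illustration under strong assumptions: the verb pair table and the single verb-first template are hypothetical, and a regular expression replaces the part-of-speech tagging step; the rule set actually used is the one described in Appendix B.1.

```python
import re

# Hypothetical verb pairs, chosen to cover the "add a dog" / "remove the dog" example.
INVERSE_VERB = {"add": "remove", "remove": "add"}

# Assumed template: verb first, then an optional article, then the object phrase.
TEMPLATE = re.compile(
    r"^(?P<verb>\w+)\s+(?:(?:a|an|the)\s+)?(?P<obj>.+)$", re.IGNORECASE
)

def article_for(noun_phrase):
    """Pick an indefinite article with a crude first-letter heuristic."""
    return "an" if noun_phrase[0].lower() in "aeiou" else "a"

def invert_instruction(instruction):
    """Return the reversed instruction, or None if it is treated as non-invertible."""
    match = TEMPLATE.match(instruction.strip())
    if match is None:
        return None
    verb, obj = match.group("verb").lower(), match.group("obj")
    if verb not in INVERSE_VERB:
        return None
    new_verb = INVERSE_VERB[verb]
    # Removing refers to an existing object ("the"); adding introduces a new one.
    article = "the" if new_verb == "remove" else article_for(obj)
    return f"{new_verb} {article} {obj}"

forward = "add a dog"
reverse = invert_instruction(forward)
```

Each invertible triplet (input image, instruction, edited image) can then be mirrored into (edited image, reversed instruction, input image) to form the bi-directional training pair.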

3.3 Human Feedback Reward Learning

The second step of HIVE is to learn a reward function $\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)$, which takes as input the condition $c=\left[c_{I},c_{E}\right]$ (the original input image and the text instruction) together with the edited image $\tilde{\boldsymbol{x}}$ generated by the fine-tuned stable diffusion model, and outputs a scalar that reflects human preference.

Unlike InstructGPT which only takes text as input, our reward model $\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)$ needs to measure the alignment between instructions and the edited images. To address the challenge, we present a reward model architecture in Fig. 3, which leverages pre-trained vision-language models such as BLIP [28]. More specifically, the reward model employs an image-grounded text encoder as the multi-modal encoder to take the joint image embedding and the text instruction as input and produce a multi-modal embedding. A linear layer is then applied to the multi-modal embedding to map it to a scalar value. More details are in Appendix B.2.

With the specifically designed network architecture, we train the reward function $\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)$ with our collected reward fine-tuning dataset $\mathcal{D}_{\mathrm{human}}$ introduced in Sec. 3.1. For each input image $c_{I}$ and instruction $c_{E}$ pair, we have $K$ edited images $\{\tilde{\boldsymbol{x}}_{k}\}_{k=1}^{K}$ ranked by human annotators, and denote the human preference of edited image $\tilde{\boldsymbol{x}}_{i}$ over $\tilde{\boldsymbol{x}}_{j}$ by $\tilde{\boldsymbol{x}}_{i}\succ\tilde{\boldsymbol{x}}_{j}$. Then we can follow the Bradley-Terry model of preferences [6, 37] to define the pairwise loss function:

$\ell_{\mathrm{RM}}(\phi):=-\sum_{\tilde{\boldsymbol{x}}_{i}\succ\tilde{\boldsymbol{x}}_{j}}\log\left[\frac{\exp(\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}}_{i},c))}{\sum_{k=i,j}\exp(\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}}_{k},c))}\right]\,,$

where $i,j\in\{1,\ldots,K\}$, and we obtain $\binom{K}{2}$ comparison pairs for each condition $c$. Similar to [37], we put all the $\binom{K}{2}$ pairs for each condition $c$ in a single batch to learn the reward function. We provide a detailed reward model training discussion in Appendix B.2.
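The pairwise loss can be illustrated with a short NumPy sketch. It assumes the $K$ reward scores for one condition $c$ are already computed and listed from most to least preferred; the function name and the toy scores are illustrative, not part of the paper.

```python
from itertools import combinations

import numpy as np

def pairwise_ranking_loss(rewards_ranked):
    """Bradley-Terry loss over all K-choose-2 pairs for one condition c.

    rewards_ranked[i] is the reward of the i-th most preferred edited image,
    so every pair (i, j) with i < j encodes the preference x_i > x_j.
    """
    r = np.asarray(rewards_ranked, dtype=float)
    loss = 0.0
    for i, j in combinations(range(len(r)), 2):
        # -log( exp(r_i) / (exp(r_i) + exp(r_j)) ) = log(1 + exp(r_j - r_i))
        loss += np.logaddexp(0.0, r[j] - r[i])
    return float(loss)

# Scores that agree with the human ranking give a lower loss than reversed scores.
agreeing = pairwise_ranking_loss([2.0, 1.0, 0.0])
reversed_order = pairwise_ranking_loss([0.0, 1.0, 2.0])
```

With uninformative (all equal) scores each of the $\binom{K}{2}$ pairs contributes $\log 2$, which gives a convenient sanity check for an implementation.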

3.4 Human Feedback based Model Fine-tuning

With the learned reward function $\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)$, the next step is to improve the instructional supervised training model by reward maximization. As a result, we can obtain an instructional diffusion model that aligns with human preferences.

The RL fine-tuning techniques we present are built upon recent offline RL techniques [27, 38, 9, 20]. With an input image and editing instruction condition $c=[c_{I},c_{E}]$, we define the edited image data distribution generated by the instructional supervised diffusion model as $p(\tilde{\boldsymbol{x}}|c)$, and the edited image data distribution generated by the current diffusion model we want to optimize as $\rho(\tilde{\boldsymbol{x}}|c)$. Then, under the pessimistic principle of offline RL, we can optimize $\rho$ with the following objective:

$J(\rho):=\max_{\rho}\mathbb{E}_{c}\Big[\mathbb{E}_{\tilde{\boldsymbol{x}}\sim\rho(\cdot|c)}\big[\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)\big]-\eta\,\mathrm{KL}\big(\rho(\tilde{\boldsymbol{x}}|c)\,\|\,p(\tilde{\boldsymbol{x}}|c)\big)\Big]\,,$ (1)

where $\eta$ is a hyper-parameter. The first term in Eq. (1) is the standard reward maximization in RL. The second term is a regularization to stabilize learning, which is a widely used technique in offline RL [24], and is also adopted for PPO fine-tuning of InstructGPT (a.k.a “PPO-ptx”) [37].

To avoid using sampling-based methods to optimize $\rho$, we can differentiate $J(\rho)$ w.r.t. $\rho(\tilde{\boldsymbol{x}}|c)$ and solve for the optimal $\rho^{*}(\tilde{\boldsymbol{x}}|c)$, resulting in the following expression for the optimal solution of Eq. (1):

$\rho^{*}(\tilde{\boldsymbol{x}}|c)\propto p(\tilde{\boldsymbol{x}}|c)\exp\left(\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)/\eta\right)\,,$ (2)

or $\rho^{*}(\tilde{\boldsymbol{x}}|c)=\frac{1}{Z(c)}p(\tilde{\boldsymbol{x}}|c)\exp\left(\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)/\eta\right)$, with $Z(c)=\int p(\tilde{\boldsymbol{x}}|c)\exp\left(\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)/\eta\right)d\tilde{\boldsymbol{x}}$ being the partition function. A detailed derivation is in Appendix C.2.
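The closed-form solution in Eq. (2) can be checked numerically on a toy problem. The sketch below replaces the continuous image space by six discrete outcomes for a single fixed condition $c$ (an assumption made only so the objective can be evaluated exactly) and compares $\rho^{*}$ against randomly drawn alternatives; all values are synthetic.

```python
import numpy as np

def objective(rho, p, reward, eta):
    """E_rho[R] - eta * KL(rho || p) for discrete distributions with full support."""
    return float(np.sum(rho * reward) - eta * np.sum(rho * np.log(rho / p)))

def optimal_rho(p, reward, eta):
    """rho* proportional to p * exp(R / eta), normalised by the partition function Z."""
    unnormalised = p * np.exp(reward / eta)
    return unnormalised / unnormalised.sum()

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))       # toy reference distribution p(x | c)
reward = rng.normal(size=6)         # toy reward R(x, c) per outcome
eta = 0.5

rho_star = optimal_rho(p, reward, eta)
best = objective(rho_star, p, reward, eta)

# Randomly drawn candidate distributions should not beat the closed form.
others = [objective(rng.dirichlet(np.ones(6)), p, reward, eta) for _ in range(1000)]
```

In this discrete setting the attained optimum equals $\eta\log Z(c)$, which follows by substituting $\rho^{*}$ back into the objective.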

Weighted Reward Loss. The optimal target distribution $\rho^{*}(\tilde{\boldsymbol{x}}|c)$ in Eq. (2) can be viewed as an exponential reward-weighted version of $p(\tilde{\boldsymbol{x}}|c)$. Moreover, we have already obtained empirical edited image data drawn from $p(\tilde{\boldsymbol{x}}|c)$ when constructing the instructional editing dataset, so we can view the exponentially reward-weighted edited images $\tilde{\boldsymbol{x}}$ from that dataset as an empirical approximation of samples drawn from $\rho^{*}(\tilde{\boldsymbol{x}}|c)$. Formally, we can fine-tune a diffusion model so that it generates data from $\rho^{*}(\tilde{\boldsymbol{x}}|c)$, resulting in the weighted reward loss:

$\ell_{\mathrm{WR}}(\theta):=\mathbb{E}_{\mathcal{E}(\tilde{\boldsymbol{x}}),c,\epsilon\sim\mathcal{N}(0,1),t}\left[\omega(\tilde{\boldsymbol{x}},c)\cdot\left\|\epsilon-\epsilon_{\theta}(z_{t},t,c)\right\|_{2}^{2}\right]\,,$

with $\omega(\tilde{\boldsymbol{x}},c)=\exp\left(\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)/\eta\right)$ being the exponential reward weight for edited image $\tilde{\boldsymbol{x}}$ and condition $c$. Different from the RL literature [39, 38], which uses exponential reward or advantage weights to learn a policy function, our weighted reward loss is derived for fine-tuning stable diffusion.
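A minimal sketch of the weighted reward loss for one mini-batch is given below, assuming the true noise, the predicted noise, and the per-sample rewards are already available as arrays. The function name, shapes, and values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def weighted_reward_loss(eps, eps_pred, rewards, eta):
    """Batch mean of exp(R / eta) * ||eps - eps_pred||_2^2.

    eps, eps_pred: arrays of shape (batch, ...) with the true and predicted noise.
    rewards:       array of shape (batch,) with the reward-model score per sample.
    """
    eps = np.asarray(eps, dtype=float)
    eps_pred = np.asarray(eps_pred, dtype=float)
    # Squared L2 norm of the residual, computed separately for each sample.
    per_sample = np.sum((eps - eps_pred).reshape(len(eps), -1) ** 2, axis=1)
    weights = np.exp(np.asarray(rewards, dtype=float) / eta)
    return float(np.mean(weights * per_sample))

eps = np.ones((2, 3))
eps_pred = np.zeros((2, 3))
# Second sample has reward log(2), so with eta = 1 it counts twice as much.
loss_value = weighted_reward_loss(eps, eps_pred, [0.0, np.log(2.0)], eta=1.0)
```

Smaller $\eta$ sharpens the weights towards high-reward samples, while a large $\eta$ makes all weights approach one and recovers the unweighted supervised objective.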

Condition Reward Loss. We can also leverage the control-as-inference perspective of RL [26] to transform Eq. (2) into a conditional reward expression, so that we can directly view the reward as a conditional label when fine-tuning diffusion models. Similar to [26], we introduce a new binary variable $R^{*}$ indicating whether humans prefer the edited image: $R^{*}=1$ denotes that humans prefer the edited image, and $R^{*}=0$ denotes that they do not. Thus we have $p(R^{*}=1\,|\,\tilde{\boldsymbol{x}},c)\propto\exp\left(\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)\right)$. Together with Eq. (2), applying Bayes' rule gives the following derivation:

$p(\tilde{\boldsymbol{x}}|c)\exp\left(\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)/\eta\right):=p(\tilde{\boldsymbol{x}}|c)\left(p(R^{*}=1\,|\,\tilde{\boldsymbol{x}},c)\right)^{1/\eta}$
$=p(\tilde{\boldsymbol{x}}|c)\left(\frac{p(\tilde{\boldsymbol{x}}\,|\,R^{*}=1,c)\,p(R^{*}=1\,|\,c)}{p(\tilde{\boldsymbol{x}}|c)}\right)^{1/\eta}$
$\propto p(\tilde{\boldsymbol{x}}|c)^{1-1/\eta}\,p(\tilde{\boldsymbol{x}}\,|\,R^{*}=1,c)^{1/\eta}\,,$

where we drop $p(R^{*}=1\,|\,c)$ since it is a constant w.r.t. $\tilde{\boldsymbol{x}}$. We can now view the reward for each edited image as an additional condition. Defining the new condition $\tilde{c}=[c_{I},c_{E},c_{R}]$, with $c_{R}$ as the reward label, we can fine-tune the diffusion model with the condition reward loss:

$\ell_{\mathrm{CR}}(\theta)=\mathbb{E}_{\mathcal{E}(\tilde{\boldsymbol{x}}),\tilde{c},\epsilon\sim\mathcal{N}(0,1),t}\Big[\left\|\epsilon-\epsilon_{\theta}(z_{t},t,\tilde{c})\right\|_{2}^{2}\Big]\,.$

We quantize the reward into five categories, based on the quantiles of the empirical reward distribution of the training dataset, and convert the reward value into a text prompt. For instance, if the reward value of a training pair lies in the bottom 20% of the reward distribution of the dataset, then we convert the reward value into the text prompt condition $c_{R}:=$ “The image quality is one out of five”. At inference time, we fix the text prompt as $c_{R}:=$ “The image quality is five out of five”, indicating that we want the generated edited images to have the highest reward. We empirically find that this technique improves the stability of fine-tuning.
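The quantile-based conversion from reward values to prompts can be sketched as follows. Only the “one out of five” and “five out of five” prompts are quoted in the text; the wording of the three intermediate prompts and the helper name `reward_prompts` are assumptions that simply extend the same pattern.

```python
import numpy as np

LEVELS = ["one", "two", "three", "four", "five"]

# Prompt fixed at inference time to request the highest-reward edits.
INFERENCE_PROMPT = "The image quality is five out of five"

def reward_prompts(rewards):
    """Map each reward to a text prompt via the dataset's 20/40/60/80% quantiles."""
    r = np.asarray(rewards, dtype=float)
    edges = np.quantile(r, [0.2, 0.4, 0.6, 0.8])
    # digitize returns 0 for the bottom quintile and 4 for the top quintile.
    bins = np.digitize(r, edges)
    return [f"The image quality is {LEVELS[b]} out of five" for b in bins]

# Toy rewards standing in for reward-model scores on the training set.
toy_rewards = np.arange(100)
prompts = reward_prompts(toy_rewards)
```

During fine-tuning the resulting prompt would be supplied as $c_{R}$ alongside $c_{I}$ and $c_{E}$; how the prompt is combined with the editing instruction is not specified here and is left open in this sketch.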

4 Experiments

This section presents the experimental results and ablation studies of HIVE’s technical choices, demonstrating the effectiveness of our method. We adopt the default guidance scale parameters in InstructPix2Pix for a fair comparison. Through our experiments, we found that the condition reward loss performs slightly better than the weighted reward loss, and therefore we present our results based on the condition reward loss. The detailed comparisons can be found in Sec. 4.2 and Appendix D.3.

We evaluate our method using two datasets: a synthetic evaluation dataset with 15,652 image pairs from [7] and a self-collected 1K evaluation dataset with real image-instruction pairs. For the synthetic dataset, we follow InstructPix2Pix’s quantitative evaluation metric and plot the trade-offs between CLIP image similarity and directional CLIP similarity [14]. For the 1K dataset, we conduct a user study where, for each instruction, the images generated by competing methods are reviewed and voted on by three human annotators, and the winner is determined by majority vote.
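For readers unfamiliar with the two quantitative metrics, the following sketch shows how they are commonly computed once CLIP embeddings are available. It assumes the image and caption embeddings have been extracted beforehand by a CLIP model (not shown), and the toy vectors are placeholders; it is not the evaluation script used for the reported curves.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_image_similarity(img_in, img_out):
    """How close the edited image stays to the input image in CLIP space."""
    return cosine(img_in, img_out)

def directional_clip_similarity(img_in, img_out, txt_in, txt_out):
    """Agreement between the change in images and the change in captions."""
    image_direction = np.asarray(img_out, dtype=float) - np.asarray(img_in, dtype=float)
    text_direction = np.asarray(txt_out, dtype=float) - np.asarray(txt_in, dtype=float)
    return cosine(image_direction, text_direction)

# Placeholder 2-D "embeddings"; real CLIP embeddings have hundreds of dimensions.
img_in, img_out = np.array([1.0, 0.0]), np.array([0.0, 1.0])
txt_in, txt_out = np.array([1.0, 0.0]), np.array([0.0, 1.0])
direction_score = directional_clip_similarity(img_in, img_out, txt_in, txt_out)
```

The two scores pull in opposite directions: stronger edits tend to raise the directional score while lowering image similarity, which is why the results are reported as a trade-off curve.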

4.1 Baseline Comparisons

We perform experiments with the same setup as InstructPix2Pix, where stable diffusion (SD) v1.5 is adopted. We compare three models: the official InstructPix2Pix model (IP2P-Official), InstructPix2Pix using our data (IP2P-Ours, which is the same as HIVE without human feedback), and HIVE. We report the quantitative results on the synthetic evaluation dataset in Fig. 5. We observe that IP2P-Ours improves notably over IP2P-Official (blue curve vs. green curve). Moreover, human feedback further boosts the performance of HIVE over IP2P-Ours by a large margin (red curve vs. blue curve). In other words, with the same directional similarity value, HIVE obtains better image consistency than InstructPix2Pix.

To test the effectiveness of HIVE on real-world images, we report the user study results on the 1K evaluation dataset. We use “Tie” to represent that users think results are equally good or equally bad. As shown in Fig. 6(a), IP2P-Ours gets around 30% more votes than the IP2P-Official. The result is consistent with the user study on the synthetic dataset. We also demonstrate the user study outcome between HIVE and IP2P-Ours in Fig. 6(b). The user study indicates similar conclusions to the consistency plot, where HIVE gets around 25% more favorites than IP2P-Ours.

In Fig. 4, we present representative edits that demonstrate the effectiveness of HIVE. The results show that while using more data can partially improve adherence to editing instructions without human feedback, the reward model leads to better alignment between the instruction and the edited image. For example, in the second row, IP2P-Ours generates a door-like object, but with the guidance of human feedback, the generated door matches human perception better. In the fourth row, which is taken from the failure examples in [7], HIVE can locate the tie and change its color correctly.

Additionally, our visual analysis of the results (Fig. 7) indicates that the HIVE model tends to preserve the parts of the original image that are not instructed to be edited, while IP2P-Ours leads to excessive image editing more often. For instance, in the first example of Fig. 7, HIVE blends a pond naturally into the original image. The two InstructPix2Pix models fulfill the same instruction but, at the same time, alter the uninstructed part of the original background.

4.2 Ablation Study

Weighted Reward and Condition Reward Loss. We perform a user study on HIVE with these two losses individually. As shown in Fig. 8, these two losses obtain similar human preferences on the evaluation dataset. More comparisons are in Appendix D.

Cycle Consistency. We analyze the impact of the cycle consistency augmentation introduced in Sec. 3.2.1. The top five augmentations in the cycle consistency are shown in Fig. 9(a). We perform evaluation on both the synthetic dataset and the 1K evaluation dataset. The user study in Fig. 9(b) shows that the cycle consistency augmentation improves the performance of HIVE by a notable margin.

Success Rate on Verbs. We observe that five verbs account for around 85% of all verbs; details can be found in Appendix A. We compare HIVE with IP2P-Ours on these five verbs, and report the success rate of the two methods on each of them. As shown in Fig. 10, HIVE improves the most on “add”, from 23.5% to 28.7%.

Other Baselines. To test the effectiveness of HIVE, we experiment with two additional baselines. In Fig. 11(a), we upgrade the backbone of stable diffusion from v1.5 to v2.1. We observe that the upgraded backbone slightly improves the results. In Fig. 11(b), we directly use the reward scalar instead of the reward prompt as the condition for training, and condition on the highest reward scalar when generating the image. We adopt the user study to compare this baseline (named HIVE-reward) with HIVE. HIVE obtains 25.8% more votes than the baseline model conditioned on the reward score. This is mainly because directly conditioning on the highest reward might cause overfitting.

Failure Cases and Limitations. We summarize representative failure cases in Fig. 12. First, some instructions cannot be understood. In the upper left example in Fig. 12, the prompt “zoom in” and similar instructions rarely succeed. We believe the root cause is that the current training data generation method fails to generate image pairs for this type of instruction. Second, counting and spatial reasoning are common failure cases (see the upper right example in Fig. 12). We find that instructions containing “one”, “two”, or “on the right” can lead to many undesired results. Third, object understanding is sometimes wrong. In the bottom left example, the red color is applied to the wrong object. This is a common error in HIVE, where the objects to be edited are wrongly recognized.

We find some other limitations as well. One limitation of HIVE is that it cannot bring benefits in cases where all outputs of the model without human feedback share the same wrong result. In such cases, user preferences cannot always be beneficial to the results. We believe that improving the data as well as the base model is an important step in the future. Another limitation is that, compared to Prompt-to-Prompt [16], which is used to generate our training data, HIVE sometimes leads to unstructured changes in the image. We think that this is because of the limitations of the current training data. Instructed editing can involve more diverse and ambiguous scenarios than traditional image editing problems. Using a fine-tuned GPT-3 to generate the prompts for the training data is limited by the model and the labeled data. More ablation studies are in Appendix D.

5 Conclusion and Discussion

In our paper, we introduce a novel framework called HIVE that enables instructional image editing with human feedback. Our framework integrates human feedback, which is quantified as reward values, into the diffusion model fine-tuning process. We design two variants of the approach, and both of them improve performance over previous state-of-the-art instructional image editing methods. Our work demonstrates that instructional image editing with human feedback is a viable approach to align image generation with human preference, thus unlocking new opportunities and potential to scale up the model capabilities towards more powerful applications such as conversational image editing. While our method demonstrates impressive performance, we have also identified failure scenarios, as discussed in Sec. 4.2. In addition, it is possible that our trained model inherits bias and suffers from harmful content from pre-trained foundation models such as Stable Diffusion, GPT-3, and BLIP. These limitations should be considered when interpreting our results, and we expect red teaming with human feedback to mitigate some of the risks in future work.

References

Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.
Avrahami et al. [2022a] Omri Avrahami, Ohad Fried, and Dani Lischinski. Blended latent diffusion. arXiv preprint arXiv:2206.02779, 2022a.
Avrahami et al. [2022b] Omri Avrahami, Dani Lischinski, and Ohad Fried. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022b.
Bai et al. [2022] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
Bar-Tal et al. [2022] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. Text2live: Text-driven layered image and video editing. In European Conference on Computer Vision, pages 707–723. Springer, 2022.
Bradley and Terry [1952] Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
Brooks et al. [2022] Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. arXiv preprint arXiv:2211.09800, 2022.
Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Chen et al. [2021] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34:15084–15097, 2021.
Christiano et al. [2017] Paul Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. NeurIPS, 2017.
Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. NeurIPS, 34:8780–8794, 2021.
Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
Gal et al. [2022] Rinon Gal, Or Patashnik, Haggai Maron, Amit Bermano, Gal Chechik, and Daniel Cohen-Or. Stylegan-nada: Clip-guided domain adaptation of image generators. ACM Transactions on Graphics, 41(4):1–13, 2022.
Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NeurIPS, 2014.
Hertz et al. [2022] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.
Ibarz et al. [2018] Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in atari. NeurIPS, 2018.
Isola et al. [2017] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
Janner et al. [2021] Michael Janner, Qiyang Li, and Sergey Levine. Offline reinforcement learning as one big sequence modeling problem. In Advances in Neural Information Processing Systems, 2021.
Karras et al. [2022] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. arXiv preprint arXiv:2206.00364, 2022.
Kim et al. [2022] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2426–2435, 2022.
Kingma and Welling [2013] Diederik Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
Kumar et al. [2020] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020.
Lee et al. [2023] Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
Levine [2018] Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
Levine et al. [2020] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022.
Li et al. [2023] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. arXiv preprint arXiv:2301.07093, 2023.
Liew et al. [2022] Jun Hao Liew, Hanshu Yan, Daquan Zhou, and Jiashi Feng. Magicmix: Semantic mixing with diffusion models. arXiv preprint arXiv:2210.16056, 2022.
Lu et al. [2022] Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022.
Meng et al. [2021] Chenlin Meng, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021.
Meng et al. [2022] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
Nichol et al. [2021] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
OpenAI [a] OpenAI. Chatgpt. https://openai.com/blog/chatgpt/, a.
OpenAI [b] OpenAI. Openaiapi. https://platform.openai.com/docs/guides/fine-tuning, b.
Ouyang et al. [2022] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155, 2022.
Peng et al. [2019] Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.
Peters et al. [2010] Jan Peters, Katharina Mulling, and Yasemin Altun. Relative entropy policy search. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 1607–1612, 2010.
Pinto et al. [2023] Andre Susano Pinto, Alexander Kolesnikov, Yuge Shi, Lucas Beyer, and Xiaohua Zhai. Tuning computer vision models with task rewards. arXiv preprint arXiv:2302.08242, 2023.
Qin et al. [2023] Can Qin, Shu Zhang, Ning Yu, Yihao Feng, Xinyi Yang, Yingbo Zhou, Huan Wang, Juan Carlos Niebles, Caiming Xiong, Silvio Savarese, et al. Unicontrol: A unified diffusion model for controllable visual generation in the wild. In NeurIPS, 2023.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
Ramesh et al. [2021] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In ICML, pages 8821–8831. PMLR, 2021.
Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
Reed et al. [2016] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text-to-image synthesis. In ICML, 2016.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
Saharia et al. [2022] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
Scheurer et al. [2022] Jeremy Scheurer, Jon Ander Campos, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. Training language models with language feedback. arXiv preprint arXiv:2204.14146, 2022.
Schuhmann et al. [2021] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-5b: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2111.02114, 2021.
Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of International Conference on Machine Learning, pages 2256–2265, 2015.
Song et al. [2020] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
Song and Ermon [2019] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
Stiennon et al. [2020] Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul Christiano. Learning to summarize from human feedback. NeurIPS, 2020.
Wallace et al. [2022] Bram Wallace, Akash Gokul, and Nikhil Naik. Edict: Exact diffusion inversion via coupled transformations. arXiv preprint arXiv:2211.12446, 2022.
Wu et al. [2023] Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Better aligning text-to-image models with human preference. arXiv preprint arXiv:2303.14420, 2023.
Xiao et al. [2021] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion gans. arXiv preprint arXiv:2112.07804, 2021.
Xu et al. [2023] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. arXiv preprint arXiv:2304.05977, 2023.
Xu et al. [2018] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR, 2018.
Yu et al. [2022] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
Zhang et al. [2017] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
Zhang et al. [2023] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In IEEE International Conference on Computer Vision (ICCV), 2023.
Zhu et al. [2017] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

Appendix

Appendix A Data Collection and User Study

In the evaluation steps, we collect real-world images with instructions using Amazon Mechanical Turk (Mturk, https://www.mturk.com). We randomly collect 200 real-world images. Then we ask Mturk annotators to write five instructions for each image, and encourage them to have wild imaginations and diversify the instruction types. We encourage annotators to not be limited to making the image realistic. For example, annotators can write “add a horse in the sky”. A screenshot of the interface is illustrated in Fig. 13. We analyze the top five verbs and nouns in the evaluation dataset. Fig. 15(a) shows that the verbs “add”, “change”, “make”, “remove” and “put” make up around 85% of all verbs, which means that the editing instruction verbs have a long-tail distribution. In contrast, the distribution of nouns in Fig. 15(b) is close to uniform, where the top five nouns represent only around 20% of all nouns.

In user studies, we use Mturk to ask annotators to evaluate edited images. A screenshot of the interface is shown in Fig. 14. The annotators are provided with the original image, two edited images, and the editing instruction. They are asked to select the better edited image. The third option indicates that the edited images are equally good or equally bad. We ask three annotators to label one data sample, and use the majority votes to determine the results. We shuffle the edited images to avoid choosing the left image over the right and vice versa.
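The voting protocol above can be sketched in a few lines of Python. This is a minimal illustration, not the actual annotation pipeline: the function names are hypothetical, and falling back to a tie when the three annotators all disagree is an assumption, since the text does not state how that case is resolved.

```python
import random
from collections import Counter

def present_pair(edit_a, edit_b, rng):
    """Randomly place the two edits on the left/right; return (left, right, flipped)."""
    flipped = rng.random() < 0.5
    return (edit_b, edit_a, True) if flipped else (edit_a, edit_b, False)

def unshuffle(vote, flipped):
    """Map an annotator's 'left'/'right'/'tie' vote back to model 'a'/'b'/'tie'."""
    if vote == "tie":
        return "tie"
    picked_left = vote == "left"
    if flipped:
        return "b" if picked_left else "a"
    return "a" if picked_left else "b"

def majority_vote(votes):
    """Label chosen by at least two of the three annotators.

    Assumption: a 1-1-1 split is recorded as a tie.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else "tie"
```

Shuffling per sample and mapping votes back before aggregation keeps any left/right position bias from favoring one model.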

Appendix B Implementation Details

B.1 Instructional Supervised Training

We use pre-trained stable diffusion models as the initial checkpoint to start instructional supervised training. We train HIVE on 40GB NVIDIA A100 GPUs for 500 epochs. We use the learning rate of $10^{-4}$ and the image size of 256. In the inference, we use 512 as the default image resolution.

B.2 Human Feedback Rewards Learning

As shown in Fig. 3, the reward model takes in an input image $c_{I}$ , a text instruction $c_{E}$ , and an edited image $\tilde{x}$ and outputs a scalar value. Inspired by the recent work on the vision-language model, especially BLIP [28], we employ a visual transformer [13] as our image encoder and an image-grounded text encoder as the multimodal encoder for images and text. Finally, we set a linear layer on top of the image-grounded text encoder to map the multimodal embedding to a scalar value.

(1) Visual transformer. We encode both the input image $c_{I}$ and edited image $\tilde{x}$ with the same visual transformer. Then we obtain the joint image embedding by concatenating the two image embeddings $vit(c_{I})$ , $vit(\tilde{x})$ .

(2) Image-grounded text encoder. The image-grounded text encoder is a multimodal encoder that inserts one additional cross-attention layer between the self-attention layer and the feed-forward network for each transformer block of BERT [11]. The additional cross-attention layer incorporates visual information into the text model. The output embedding of the image-grounded text encoder is used as the multimodal representation of the ( $c_{I}$ , $c_{E}$ , $\tilde{x}$ ) triplet.

We gather a dataset comprising 3,634 images for the purpose of ranking. For each image, we generate five variant edited images, and ask an annotator to rank images from best to worst. Additionally, we ask annotators to indicate if any of the following scenarios apply: (1) all edited images are edited but none of them follow the instruction; (2) all edited images are visually the same as the original image; (3) all images are edited beyond the scope of instruction; (4) edited images have harmful content containing sex, violence, porn, etc; and (5) all edited images look similar to each other. We compare training reward models by filtering some/all of these options.

We note that a considerable portion of the collected data falls under at least one of the aforementioned categories, indicating that even for humans, ranking these images is challenging. As a result, we only use the data that did not include any non-rankable options in the reward model training. From a pool of 1,412 images, we select 1,285 for the training set, while the remaining images were used for the validation set. The reward model is trained on a dataset of comparisons between multiple model outputs on the same input. Each comparison sample contains an input image, an instruction, five edited versions of the image, and the corresponding rankings. We divide the dataset into training and validation sets based on the distribution of the corresponding instructions.
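As an illustration of how one ranked sample can feed a comparison-based reward loss, the sketch below expands a ranking of five edits into all ordered pairs and evaluates a Bradley-Terry style pairwise loss. The exact loss of Sec. 3.3 is not reproduced here, so the loss form, the function names, and the scalar scores are assumptions for illustration only.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_ids):
    """ranked_ids is ordered best to worst; return (preferred, rejected) pairs."""
    return list(combinations(ranked_ids, 2))

def pairwise_loss(scores, ranked_ids):
    """Mean of -log sigmoid(r_preferred - r_rejected) over all pairs (assumed loss)."""
    pairs = ranking_to_pairs(ranked_ids)
    total = 0.0
    for win, lose in pairs:
        diff = scores[win] - scores[lose]
        total += math.log1p(math.exp(-diff))  # equals -log(sigmoid(diff))
    return total / len(pairs)
```

With five edited images per input, each sample yields ten comparisons, and the loss decreases as the predicted scores agree with the annotated order.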

We apply the method in Sec. 3.3 on the reward data to develop a reward model. We initialize the reward model from the pre-trained BLIP, which was trained on paired images and captions using three objectives: image-text contrastive learning, image-text matching, and masked language modeling. There is a domain gap between BLIP’s pre-training data and our reward data: the captions in BLIP’s data describe a single image, whereas the instructions in our data refer to the difference between image pairs. Nevertheless, we hypothesize that leveraging the learned alignment between text and image in BLIP can enhance the reward model’s ability to comprehend the relationship between the instruction and the image pairs.

The reward model is trained using 4 A100 GPUs for 10 epochs, employing a learning rate of $10^{-4}$ and weight decay of 0.05. The image encoder’s and multimodal encoder’s last layer outputs are utilized as image and multimodal representations, respectively. The encoders’ final layer is the only fine-tuned component.

We use the trained reward model to generate a reward score on our training data. We perform two experiments. The first experiment takes the exponential rewards as weights and fine-tunes the diffusion model with weighted reward loss as described in Sec. 3.4. See Fig. 16 for the visualization of the method. The second experiment transforms the rewards to text prompts and fine-tunes the diffusion model with the condition reward loss as described in Sec. 3.4. The method is introduced in Fig. 2. We compare those two experiment settings, and results can be found in Sec. D.3.
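The two ways of using the reward scores can be sketched as follows. This is a hedged illustration only: the temperature `eta`, the normalization of the weights to mean one, the number of quality levels, and the prompt template are all assumptions and are not taken from the paper.

```python
import math

def exponential_weights(rewards, eta=1.0):
    """Per-sample weights proportional to exp(r / eta).

    Assumption: weights are rescaled to have mean 1 so the loss scale is unchanged.
    """
    top = max(rewards)                       # subtract the max to avoid overflow
    raw = [math.exp((r - top) / eta) for r in rewards]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]

def reward_to_prompt(instruction, reward, lo, hi, levels=5):
    """Quantize a reward in [lo, hi] into `levels` bins and append it as text.

    The template below is hypothetical.
    """
    frac = (reward - lo) / (hi - lo) if hi > lo else 1.0
    level = min(levels, max(1, 1 + int(frac * levels)))
    return f"{instruction}. Image quality is {level} out of {levels}."
```

The first function rescales each training sample's loss, while the second turns the reward into an extra text condition, so that the highest level can be requested at inference time.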

Appendix C Reward Maximization for Diffusion-Based Generative Models

C.1 Discussion on On-Policy based Reward Maximization for Diffusion Models

Directly adapting on-policy RL methods to the current training pipeline might be computationally expensive, but we do not conclude that sampling-based approaches are infeasible for diffusion models. We consider developing more scalable sampling-based methods as future work.

We start the sampling methods derivation with the following objective:

$$J(\theta):=\max_{\pi_{\theta}}\mathbb{E}_{c\sim p_{c}}\Big[\mathbb{E}_{\tilde{\boldsymbol{x}}\sim\pi_{\theta}(\cdot|c)}\left[\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)\right]-\eta\,\mathrm{KL}\left[p_{\mathcal{D}}(\tilde{\boldsymbol{x}}|c)\,\|\,\pi_{\theta}(\tilde{\boldsymbol{x}}|c)\right]\Big]\,, \tag{3}$$

where $p_{c}(c)p_{\mathcal{D}}(\tilde{\boldsymbol{x}}|c)$ is the joint distribution of the condition and edited image pairs, and $\pi_{\theta}$ denotes the policy, i.e., the diffusion model we want to optimize. Note that $p_{\mathcal{D}}(\tilde{\boldsymbol{x}}|c)$ and $\pi_{\theta}(\tilde{\boldsymbol{x}}|c)$ are swapped compared with the objective in Eq. (1). The second term in Eq. (3) is the KL minimization formulation of maximum likelihood estimation, which is equivalent to the loss of diffusion models. We represent the policy $\pi_{\theta}$ via the reverse process of a conditional diffusion model:

$$\pi_{\theta}(\tilde{\boldsymbol{x}}|c):=p_{\theta}(\tilde{\boldsymbol{x}}^{0:T}|c)=p_{0}(\tilde{\boldsymbol{x}}^{T})\prod_{t=1}^{T}p_{\theta}(\tilde{\boldsymbol{x}}^{t-1}|\tilde{\boldsymbol{x}}^{t};c)\,,$$

where $p_{0}(\tilde{\boldsymbol{x}}^{T}):=\mathcal{N}(\tilde{\boldsymbol{x}}^{T};\mathbf{0},\mathbf{I})$ , and $p_{\theta}(\tilde{\boldsymbol{x}}^{t-1}|\tilde{\boldsymbol{x}}^{t};c):=\mathcal{N}(\tilde{\boldsymbol{x}}^{t-1};\mu_{\theta}(\tilde{\boldsymbol{x}}^{t},c,t),\sigma_{t}^{2}\mathbf{I})$ is a Gaussian distribution whose parameters are defined by the score function $\epsilon_{\theta}$ and the step sizes of the noise scaling. We can therefore obtain an edited image sample $\tilde{\boldsymbol{x}}^{0}$ by running the reverse diffusion chain:

$$\tilde{\boldsymbol{x}}^{t-1}|\tilde{\boldsymbol{x}}^{t}=\frac{1}{\sqrt{\alpha_{t}}}\left(\tilde{\boldsymbol{x}}^{t}-\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\epsilon_{\theta}(\tilde{\boldsymbol{x}}^{t},c,t)\right)+\sigma_{t}\boldsymbol{z}_{t},\quad\boldsymbol{z}_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I}),\quad\text{for}~t=T,\ldots,1\,,$$

and $\tilde{\boldsymbol{x}}^{T}\sim\mathcal{N}(\textbf{0},\textbf{I})$ .
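To make the cost of sampling concrete, the reverse chain above can be sketched on a scalar toy problem. This is an illustrative sketch under stated assumptions, not the training code: `eps_model` stands in for the network $\epsilon_{\theta}$, the choice $\sigma_{t}=\sqrt{\beta_{t}}$ with $\beta_{t}=1-\alpha_{t}$ is one common convention, and adding no noise at the last step is also an assumption.

```python
import math
import random

def reverse_chain(eps_model, c, betas, rng):
    """Run x^T -> x^0 with the ancestral update; return [x^T, ..., x^0]."""
    T = len(betas)
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:                      # cumulative products of alpha_t
        prod *= a
        alpha_bars.append(prod)
    x = rng.gauss(0.0, 1.0)               # x^T ~ N(0, 1)
    traj = [x]
    for t in range(T, 0, -1):
        a, abar = alphas[t - 1], alpha_bars[t - 1]
        eps = eps_model(x, c, t)          # one network evaluation per step
        mean = (x - (1.0 - a) / math.sqrt(1.0 - abar) * eps) / math.sqrt(a)
        sigma = math.sqrt(betas[t - 1]) if t > 1 else 0.0  # assumed sigma_t
        x = mean + sigma * rng.gauss(0.0, 1.0)
        traj.append(x)
    return traj
```

Each sample requires $T$ sequential evaluations of `eps_model`, which is the source of the cost discussed in this section.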

As a result, the reverse diffusion process can be viewed as a black-box function defined by $\epsilon_{\theta}$ and the noises $\boldsymbol{\epsilon}:=(\boldsymbol{z}_{T},\ldots,\boldsymbol{z}_{1},\tilde{\boldsymbol{x}}^{T})$ , i.e., a network with shared parameters and injected noise, in which the parameters of every layer are those of the score function $\epsilon_{\theta}$ . Define the network as

$$\tilde{\boldsymbol{x}}^{0}:=f(c,\boldsymbol{\epsilon};\theta)\,,\quad\boldsymbol{\epsilon}\sim p_{\mathrm{noise}}(\cdot),\quad c\sim p_{c}(\cdot)\,,$$

With this definition, we can rewrite the first term of Eq. (3) as

$$\mathbb{E}_{c\sim p_{c}(\cdot),\,\boldsymbol{\epsilon}\sim p_{\mathrm{noise}}(\cdot)}\left[\mathcal{R}_{\phi}(f(c,\boldsymbol{\epsilon};\theta),c)\right]\,,$$

and we can optimize the parameter $\theta$ with the path gradient if $\mathcal{R}_{\phi}$ is differentiable. Similarly, suppose we want to optimize the first term via PPO. In that case, the main technical difficulty is to estimate $\nabla_{\theta}\log\pi_{\theta}(\tilde{\boldsymbol{x}}|c)$ , which can be done with the following derivation:

$$\nabla_{\theta}\log\pi_{\theta}(\tilde{\boldsymbol{x}}|c)=\nabla_{\theta}\log p_{\theta}(\tilde{\boldsymbol{x}}^{0:T}|c)=\sum_{t=1}^{T}\nabla_{\theta}\log p_{\theta}(\tilde{\boldsymbol{x}}^{t-1}|\tilde{\boldsymbol{x}}^{t};c)\,.$$

Note that both the end-to-end path gradient method and PPO require sampling the reverse chain from $\tilde{\boldsymbol{x}}^{T}$ to $\tilde{\boldsymbol{x}}^{0}$ , so that we can estimate $\nabla_{\theta}\log\pi_{\theta}(\tilde{\boldsymbol{x}}|c)$ using the empirical samples $\tilde{\boldsymbol{x}}^{0:T}$ .
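The decomposition of $\nabla_{\theta}\log\pi_{\theta}$ into per-step terms can be checked numerically on a scalar toy chain. The sketch below assumes a hypothetical one-parameter model $\epsilon_{\theta}(x)=\theta x$ (not the actual network) and fixed $\sigma_{t}>0$; it sums Gaussian transition log-densities over a sampled trajectory and accumulates the analytic gradient step by step.

```python
import math

def transition_mean(x_t, theta, alpha, alpha_bar):
    """Mean of p_theta(x^{t-1} | x^t) for the toy model eps_theta(x) = theta * x."""
    k = (1.0 - alpha) / math.sqrt(1.0 - alpha_bar)
    return (x_t - k * theta * x_t) / math.sqrt(alpha)

def chain_log_prob(traj, theta, alphas, alpha_bars, sigmas):
    """Sum over t of log N(x^{t-1}; mu_theta(x^t), sigma_t^2); traj = [x^T, ..., x^0]."""
    T = len(alphas)
    total = 0.0
    for i in range(T):
        t = T - i                          # traj[i] is x^t, traj[i + 1] is x^{t-1}
        mu = transition_mean(traj[i], theta, alphas[t - 1], alpha_bars[t - 1])
        s = sigmas[t - 1]
        z = (traj[i + 1] - mu) / s
        total += -0.5 * z * z - math.log(s) - 0.5 * math.log(2.0 * math.pi)
    return total

def chain_score(traj, theta, alphas, alpha_bars, sigmas):
    """Analytic d/d(theta) of chain_log_prob, accumulated one step at a time."""
    T = len(alphas)
    grad = 0.0
    for i in range(T):
        t = T - i
        a, abar, s = alphas[t - 1], alpha_bars[t - 1], sigmas[t - 1]
        mu = transition_mean(traj[i], theta, a, abar)
        dmu = -(1.0 - a) / math.sqrt(1.0 - abar) * traj[i] / math.sqrt(a)
        grad += (traj[i + 1] - mu) / (s * s) * dmu
    return grad
```

The prior term $\log p_{0}(\tilde{\boldsymbol{x}}^{T})$ does not depend on $\theta$, so it is omitted from both functions. Every term needs the stored intermediate state $\tilde{\boldsymbol{x}}^{t}$, which is why the whole chain has to be sampled before a single update.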

For the above two methods, to perform one policy gradient update step, we need to run the whole reverse chain to get an edited image sample $\tilde{\boldsymbol{x}}^{0}$ and estimate the parameter gradient for the first term. As a result, the computational cost exceeds the supervised fine-tuning cost by a factor equal to the number of diffusion steps. Fine-tuning the stable diffusion model already takes us more than two days, so for a standard LDM, where the number of steps is 1000, we cannot finish training within an acceptable time. Even with fast sampling methods such as DDIM or variance preserving (VP) based noise scaling, more than 5 or 10 diffusion steps are still needed. Further, we have not seen any previous work using such noise scaling to fine-tune stable diffusion. As a result, we think naive sampling methods run a high risk of failing to match the performance of our current offline RL based approaches.

C.2 Derivation for Eq. (2)

Taking a functional view of Eq. (2) and differentiating $J(\rho)$ w.r.t. $\rho$ , we get

$$\frac{\partial J(\rho)}{\partial\rho}=\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)-\eta\left(\log\rho(\tilde{\boldsymbol{x}}|c)+1-\log p(\tilde{\boldsymbol{x}}|c)\right)\,.$$

Setting $\frac{\partial J(\rho)}{\partial\rho}=0$ gives us

$$\begin{aligned}
\log\rho(\tilde{\boldsymbol{x}}|c)&=\frac{1}{\eta}\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)+\log p(\tilde{\boldsymbol{x}}|c)-1\,,\\
\rho(\tilde{\boldsymbol{x}}|c)&\propto p(\tilde{\boldsymbol{x}}|c)\exp\left(\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)/\eta\right)\,.
\end{aligned}$$

Thus we can get the optimal $\rho^{*}(\tilde{\boldsymbol{x}}|c)$ .
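The closed form can be sanity-checked on a finite support. The sketch below assumes that Eq. (2) has the form $J(\rho)=\mathbb{E}_{\rho}[\mathcal{R}_{\phi}]-\eta\,\mathrm{KL}[\rho\,\|\,p]$ for a fixed condition, which is consistent with the derivative above; the numbers used to exercise it are arbitrary.

```python
import math

def objective(rho, p, rewards, eta):
    """J(rho) = E_rho[R] - eta * KL(rho || p) on a finite support (assumed form)."""
    expected_reward = sum(q * r for q, r in zip(rho, rewards))
    kl = sum(q * math.log(q / pi) for q, pi in zip(rho, p) if q > 0.0)
    return expected_reward - eta * kl

def optimal_rho(p, rewards, eta):
    """rho*(x) proportional to p(x) * exp(R(x) / eta), normalized to sum to 1."""
    top = max(rewards)                    # subtract the max for numerical stability
    unnorm = [pi * math.exp((r - top) / eta) for pi, r in zip(p, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

The constant $-1$ in the log-density is absorbed by the normalization. At the optimum the objective equals $\eta\log\sum_{\tilde{\boldsymbol{x}}}p(\tilde{\boldsymbol{x}}|c)\exp(\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)/\eta)$, which gives a convenient check on any other candidate distribution.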

Appendix D Additional Ablation Study

D.1 SD v1.5 and v2.1.

In Sec. 4.2, we upgrade the backbone of stable diffusion from v1.5 to v2.1, where the OpenCLIP text encoder [49] replaces the CLIP text encoder [42]. In this section, we show the quantitative consistency plot on the synthetic evaluation dataset in Fig. 17(a), which leads to similar conclusions to the user study in Fig. 11(a). We also compare IP2P-Ours with v1.5 and v2.1. Interestingly, when we train IP2P-Ours with SD v2.1, Fig. 17(b) shows that its improvement over SD v1.5 is larger than that of HIVE in Fig. 17(a).

D.2 Model Adaptation

We demonstrate that HIVE can use a reward model trained on a backbone different from the one used in Step 3. We use the data generated by SD v1.5 to train the reward model, and run the remaining steps with SD v2.1. We report user study results in Fig. 18. We observe that users vote similarly for the models whose reward models are trained on the two SD backbones. In other words, the reward model is able to adapt from one backbone to another.

D.3 Weighted Reward and Conditional Reward Losses

We compare the weighted reward loss and conditional reward loss on the synthetic evaluation dataset. As shown in Fig. 19, the performances of these two losses are close to each other, while the conditional reward loss is slightly better. Therefore we adopt the conditional reward loss in all our experiments.

D.4 Training with Less Data

We analyze the effect of the training data size. We compare HIVE with SD v1.5 at four training dataset size ratios: 100%, 50%, 30% and 10%. As shown in Fig. 20, significantly decreasing the size of the dataset, e.g., to 10% of the data, leads to a worse ability to perform large image edits. On the other hand, a moderate decrease in dataset size, e.g., to 50% of the data, results in similar yet slightly worse performance.

D.5 Subcategory Analysis

We classify the edits into the following sub-categories: changing the global style, adjusting attributes of the main object, adding/removing objects, manipulating objects, and other challenging cases such as zooming and camera view changes. We use ChatGPT (https://chat.openai.com/) to determine which sub-category each instruction belongs to. Specifically, the numbers of instructions in each sub-category are as follows: changing the global style (133), adjusting attributes of the main object (134), adding/removing objects (508), manipulating objects (219), and others (6). We analyze user study results for each sub-category. Fig. 21 shows that the largest improvements come from the sub-categories “Add/remove objects” and “Manipulate objects”.
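As a quick check on the reported counts, the short script below sums them and prints each sub-category's share. The dictionary keys are shorthand labels chosen here for illustration.

```python
# Instruction counts per sub-category, as reported in the text.
counts = {
    "change global style": 133,
    "adjust attributes of the main object": 134,
    "add/remove objects": 508,
    "manipulate objects": 219,
    "others": 6,
}

total = sum(counts.values())
shares = {name: 100.0 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]} instructions ({pct:.1f}%)")
print(f"total: {total}")
```

The counts sum to 1,000, matching the evaluation set of 200 images with five instructions each, and roughly half of all instructions add or remove objects.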

D.6 Additional Visualized Results

We illustrate additional visualized results in Fig. 22, 23, 24, 25, 26, where each row illustrates three instructional editing examples.

$$\begin{aligned}
p(\tilde{\boldsymbol{x}}|c)\exp\left(\mathcal{R}_{\phi}(\tilde{\boldsymbol{x}},c)/\eta\right)&:=p(\tilde{\boldsymbol{x}}|c)\left(p(R^{*}=1~|~\tilde{\boldsymbol{x}},c)\right)^{1/\eta}\\
&=p(\tilde{\boldsymbol{x}}|c)\left(\frac{p(\tilde{\boldsymbol{x}}~|~R^{*}=1,c)\,p(R^{*}=1~|~c)}{p(\tilde{\boldsymbol{x}}|c)}\right)^{1/\eta}\\
&\propto p(\tilde{\boldsymbol{x}}|c)^{1-1/\eta}\,p(\tilde{\boldsymbol{x}}~|~R^{*}=1,c)^{1/\eta}\,,
\end{aligned}$$



[Figure panels: HIVE with weighted reward loss; HIVE with condition reward loss]
[Figure panels: IP2P-Ours with SD v1.5 and v2.1; InstructPix2Pix with SD v1.5 and v2.1]
[Figure panels: Change global style; Adjust attributes; Add/remove; Manipulate objects; Others]