Make Me Happier: Evoking Emotions Through Image Diffusion Models

Qing Lin
I2R and CFAR, Agency for Science,
Technology and Research (A*STAR), Singapore
Nanyang Technological University, Singapore
Jingfeng Zhang
School of Computer Science,
the University of Auckland, New Zealand
RIKEN AIP, Tokyo, Japan
Yew Soon Ong
CFAR, Agency for Science, Technology
and Research (A*STAR), Singapore
Nanyang Technological University,
Singapore
Mengmi Zhang
I2R and CFAR, Agency for Science,
Technology and Research (A*STAR), Singapore
Nanyang Technological University, Singapore
Corresponding author
Abstract

Despite the rapid progress in image generation, emotional image editing remains under-explored. The semantics, context, and structure of an image can evoke emotional responses, making emotional image editing techniques valuable for various real-world applications, including treatment of psychological disorders, commercialization of products, and artistic design. For the first time, we present a novel challenge of emotion-evoked image generation, aiming to synthesize images that evoke target emotions while retaining the semantics and structures of the original scenes. To address this challenge, we propose a diffusion model capable of effectively understanding and editing source images to convey desired emotions and sentiments. Moreover, due to the lack of emotion editing datasets, we provide a unique dataset consisting of 340,000 pairs of images and their emotion annotations. Furthermore, we conduct human psychophysics experiments and introduce four new evaluation metrics to systematically benchmark all the methods. Experimental results demonstrate that our method surpasses all competitive baselines. Our diffusion model is capable of identifying emotional cues from original images, editing images that elicit desired emotions, and meanwhile, preserving the semantic structure of the original images. All code, model, and dataset will be made public.

Figure 1: The generated images evoke a sense of happiness in viewers, contrasting with the negative emotions elicited by the source images. Given a source image that triggers negative emotions (framed in green), our method (Ours) synthesizes a new image that elicits the given positive target emotions (in red), while maintaining the essential elements and structures of the scene. For instance, in the first quadrant where trash on the ground evokes negative emotions, our model performs image editing by eliminating the trash while preserving the outdoor context, including trees and grass, in the same spatial arrangement as the original scene. For comparisons, we include other competitive methods (Sec.4). Zoom in for better views.

1 Introduction

“I am feeling down. Enhance my room to bring in more excitement.”

What we perceive not only entails useful visual information but also evokes profound emotional reactions. Incorporating visual cues that elicit emotions into images holds practical significance in various domains. For instance, in commercials, it amplifies product branding and shapes public opinions, captivating consumer attention. Similarly, integrating emotion-evoking elements into AR/VR experiences aids in addressing psychological disorders such as autism and schizophrenia. Despite significant advancements in image generation techniques Gal et al. (2022); Wang et al. (2023); Yu et al. (2018), emotion conditioning has received little attention. Here, we introduce the novel and important problem of emotion-evoked image generation: given an initial image and a target emotion, the task is to create an image that elicits the specified emotion in human viewers while maintaining the original scene’s semantics and structures.

Emotion-evoked image generation poses a challenge as it requires a deep understanding and identification of the subtle context and semantic elements present in the source images that evoke emotional responses. Additionally, this task also requires image editing capabilities to effectively remove, replace, add, or modify elements within the source images to elicit the intended emotional response. For instance, in Figure 1, the presence of a burning lawn significantly contributes to feelings of anger. Consequently, in the generated image, the burning lawn is substituted with a serene scene featuring a cluster of flowers, thereby transforming the overall emotional response to awe.

Studies Yang et al. (2023, 2018) have identified both global and local factors influencing emotional states, such as overall color and tonal shifts in image backgrounds, local facial expressions, and the presence of emotion-associated objects like graveyards correlated with sadness and balloons correlated with amusement. As also seen in Figure 2, introducing fearful elements like fire or a woman in white into forests (Column 2) can readily evoke negative emotions, as these elements prompt instinctual responses to unpleasant or adverse environments. However, not all concepts serve as emotional triggers; some may remain neutral, such as the sky, plants, or windows. In the same examples in Column 2, forests are neutral concepts and do not elicit emotional responses. Given that human emotions tend to exhibit greater sensitivity to negative stimuli over positive stimuli Baumeister et al. (2001); Ito et al. (1998), simply manipulating global intensity by adjusting the brightness of entire images in Column 2 would not necessarily alter the negative emotions.

Drawing inspiration from these emotion-associated factors, we propose our model, EmoEditor, for emotion-evoked image generation. We base EmoEditor on state-of-the-art stable diffusion models and introduce three technical novelties. (1) EmoEditor employs a novel dual-branch architecture in the reverse diffusion process that integrates emotion-conditioned global context and local emotional cues from the source images, to ensure coherence with the original content while reflecting the desired emotional outcome. (2) During training, the network’s behaviors are guided by aligning model creativity with human expectations in the neuro-symbolic space, implicitly learning the mapping from target emotions to human-annotated text instructions. (3) During inference, iterative emotion discrimination mechanisms are employed to autonomously select emotionally coherent images.

Before the introduction of our EmoEditor, traditional methods, such as color transfer and style transfer Pitié et al. (2007); Gatys et al. (2015); Weng et al. (2023), were commonly employed to modify the emotional tone of images. These methods typically require reference emotional images or textual cues as additional inputs. However, they face challenges in achieving satisfactory performance in eliciting desired emotions because they struggle to understand, identify, and manipulate the local regions of source images that evoke emotional responses. In Figure 2, while darkening brightness or deepening hues may easily evoke fear (Example 1), simply brightening colors in an image of a forest fire is inadequate for eliciting amusement (Example 4).

Figure 2: Emotions are influenced by both global and local cues from the source images. We present two image generation examples showcasing the transition from positive to negative emotions in Column 1 (shaded in blue) and two examples from negative to positive emotions in Column 2 (shaded in orange). In each example, the target emotions are in red, and source images are framed in green. Generated results from three methods are presented. Zoom in for better views.

With the rapid advancements in diffusion models, various effective text-to-image editing techniques have emerged Kawar et al. (2023); Brooks et al. (2023); Ruiz et al. (2023). However, these methods are not specifically tailored for understanding and identifying emotion-associated elements within images. To address this gap, we propose a competitive baseline (Large Model Series), concatenating multiple large language and vision models. This baseline first understands source images through image captioning, then generates emotion-prompted instructions using language models, and finally edits images prompted by these instructions with diffusion models. Upon comparison with existing methods through a series of human psychophysics experiments and four new quantitative evaluations, our end-to-end trained EmoEditor achieves outstanding image generation results. It effectively preserves the structural coherence and semantic consistency with the source images while eliciting the target emotion in human viewers.

To obtain training data for emotion-evoked image generation, we establish a dataset curation pipeline and introduce the first large-scale EmoPair dataset, comprising approximately 340,000 pairs of images annotated with emotion-conditioned editing instructions. Unlike existing datasets on visual emotion analysis Mikels et al. (2005); Machajdik, Hanbury (2010); You et al. (2016); Peng et al. (2015); Yang et al. (2023), which assign emotion labels to individual images for classification, our dataset provides paired images annotated with source and target emotion labels.

The main contributions of this paper are highlighted below: (1) We introduce a new and important problem of emotion-evoked image generation. To benchmark all methods, we establish a comprehensive framework for evaluating model performances, incorporating human psychophysics experiments and four newly introduced metrics. (2) We introduce EmoEditor, an emotion-evoked diffusion model, which employs a novel two-branch architecture to integrate global context with local semantic regions of source images that evoke emotional responses. (3) EmoEditor learns emotional cues during end-to-end training, eliminating the need for additional emotion reference images. During inference, EmoEditor generates emotion-evoked images without hand-crafted emotion-conditioned text instructions, while preserving context and structural coherence with the source image. Its performance is superior among all the competitive methods. (4) We curate the first large-scale EmoPair dataset, containing 340,000 image pairs with emotion annotations.

2 Related Work

2.1 Visual Emotion Analysis

Visual Emotion Analysis (VEA) aims to understand and predict human emotional responses from visual data. In psychology, Categorical Emotion States (CES) form the basis for emotion classification models like the 8-category Mikels model Mikels et al. (2005). Early research used manually crafted features such as color and content Machajdik, Hanbury (2010), while recent advancements employ deep learning to capture emotional cues more accurately You et al. (2015); Yang et al. (2018). Despite progress, generating images to evoke specific emotions remains underexplored. We propose EmoEditor, a two-branch diffusion model that integrates emotion recognition and image generation to address this. Unlike the Affective Image Filter Weng et al. (2023), which requires explicit textual instructions, EmoEditor does not need text guidance to achieve desired emotions.

Existing VEA datasets Peng et al. (2015); You et al. (2016) focus on emotion classification, including EmoSet Yang et al. (2023), which has 120k images with emotion labels and attributes. However, they lack image pairs with annotated emotions and text editing instructions essential for emotion-evoked image generation. Thus, we introduce the EmoPair dataset, comprising 340k image pairs with differing source and target emotions while ensuring consistent scene semantics and structure.

2.2 Image Editing

Color and Style Transfer. Methods like PDF-Transfer Pitié et al. (2007), Neural Style Transfer Gatys et al. (2015), and CLIP-Styler Kwon, Ye (2022) manipulate images using reference images, but they struggle with local region adjustments, limiting their versatility in generating emotion-evoking images. Our EmoEditor balances local and global manipulations without needing additional references.

Text-to-Image Editing. Text-to-image editing is increasingly popular. Earlier studies used GANs Goodfellow et al. (2014) and CLIP models Radford et al. (2021) for image manipulation based on detailed text descriptions Patashnik et al. (2021); Abdal et al. (2022). Recent diffusion models like Imagic Kawar et al. (2023), Prompt-to-prompt Hertz et al. (2022), InstructPix2Pix Brooks et al. (2023), and DreamBooth Ruiz et al. (2023) generate high-quality images from text prompts but often ignore emotional aspects and rely heavily on detailed instructions. In contrast, our EmoEditor needs only target emotion and source image inputs, using end-to-end training to understand and edit emotional cues while preserving image structure and coherence.

3 Method

We formulate emotion-evoked image generation as a supervised learning task. First, we curate the EmoPair dataset containing image pairs with emotion labels and corresponding text instructions for editing (Sec.3.1). Next, EmoEditor is trained on this dataset (Sec.3.2). During inference, it takes source images and target emotions, producing generated images without the need for human-annotated text instructions or emotion reference images.

Figure 3: Pipeline for Curating Our EmoPair Dataset. The dataset comprises two subsets: EmoPair-Annotated Subset (EPAS, left blue box) and EmoPair-Generated Subset (EPGS, right orange box). Each subset includes schematics depicting the creation, selection, and labeling of image pairs in the upper quadrants, with two example pairs in the lower quadrants. Each example pair comprises a source image (framed in green) and a target image. The classified source and target emotion labels (highlighted in red) and target-emotion-driven text instructions for image editing are provided. See more examples in Supp.Sec.S1. See Sec.3.1 for EmoPair dataset.

3.1 Curating Our EmoPair Dataset

Given the unavailability of a paired image dataset for emotion editing, we curate EmoPair. It consists of two subsets: EmoPair-Annotated Subset (EPAS), containing 331,595 image pairs from Ip2p Brooks et al. (2023) annotated with emotion labels; and EmoPair-Generated Subset (EPGS), consisting of 6,949 pairs generated based on text instructions given target emotions. All images from EmoPair are of size 224×224. Figure 3 shows the pipeline for curating our EmoPair dataset.

Emotion Predictor $\mathcal{P}$. To better construct our dataset, we first train an emotion predictor $\mathcal{P}$ on the existing VEA dataset EmoSet Yang et al. (2023). All images are classified into 8 emotion categories, encompassing four positive (amusement, awe, contentment, excitement) and four negative (anger, disgust, fear, sadness) emotions. $\mathcal{P}$ employs ResNet18 He et al. (2016) as a backbone and achieves a top-1 accuracy of 73% on EmoSet’s test set.

EmoPair-Annotated Subset (EPAS). The original Ip2p dataset lacks emotion labels, consisting only of image pairs with text instructions for editing. To address this, we use $\mathcal{P}$ to classify the source and target images from the Ip2p dataset into eight distinct emotions. We augment the Ip2p dataset with emotion labels, resulting in 331,595 image pairs. Notably, the text instructions with these image pairs serve as human expectations to guide EmoEditor’s training but are not utilized during inference.

EmoPair-Generated Subset (EPGS). To address the lack of emotional cues in the Ip2p dataset, we create the EPGS subset using emotionally rich images from EmoSet Yang et al. (2023). However, EmoSet lacks paired images with consistent scenes and target emotions, and it does not include text editing instructions. Single-word prompts like “[Target Emotion]” prove ineffective in existing text-to-image models due to their limited emotional reasoning. Single-emotion terms lack specificity, resulting in varied interpretations depending on the context of source images. For instance, enhancing an image for excitement can vary significantly based on context, such as prompting a happy smile for a person or adding fireworks to the sky for a landscape. Thus, to generate accurate image pairs, we develop 50 general instructions for transitioning to desired emotions of 8 categories using GPT-3 Brown et al. (2020). Human annotators rank these instructions, keeping the top ten per emotion. The pre-trained Ip2p model then uses these instructions to manipulate EmoSet source images, resulting in 6,949 high-quality image pairs after quality control measures (see Supp.Sec.S1).

3.2 Our Diffusion Model - EmoEditor

Figure 4: Architecture of Our Proposed EmoEditor. EmoEditor is an image diffusion model, consisting of local (shaded in green) and global (shaded in blue) branches. The pre-trained VAE’s encoder $\mathcal{E}$ and decoder $\mathcal{D}$ remain fixed throughout training and inference. Exclusively employed during inference, the fixed emotion predictor $\mathcal{P}$ predicts the emotions on generated images for the iterative emotion inference. See Sec.3 for details.

Our EmoEditor, in Figure 4, takes a source image $I_s$ and a target emotion $y$ to generate an image evoking the desired emotion. It employs the latent diffusion model (LDM) Rombach et al. (2022) with three key novelties. EmoEditor avoids using emotion labels for $I_s$ for two reasons: (1) Obtaining emotion labels for source images is often impractical in real-world scenarios. (2) Source images can be complex, conveying multiple emotions simultaneously. Utilizing a stereotypical source emotion as input potentially introduces biases and limits the model’s ability to generate diverse outputs.

Integration of Global and Local Emotion Cues. In the global branch of our EmoEditor, we convert the target emotion $y$ into a binary one-hot vector $e_{oh}$, where the entry corresponding to $y$ in $e_{oh}$ is set to 1 and the rest are set to 0. This simplifies model design, enhancing comprehension and generation of complex emotional images, as one-hot encoding can capture intricate emotion combinations through a multi-label representation. Conversely, text inputs demand extra transformation steps, potentially introducing noise or errors. We introduce an emotion encoder $\tau_\theta$ to encode $e_{oh}$ into $e$ via fully connected layers: $e = \tau_\theta(e_{oh})$ (see Supp.Sec.S2.1 for details), trained from scratch to learn neuro-symbolic embeddings. Later on, we introduce novel losses to regularize the behaviors of $\tau_\theta$. Notably, we intentionally avoid using explicit text editing instructions as conditional inputs to the model due to the varied interpretations of target emotions, where multiple solutions can elicit the same emotions. Including explicit text instructions may hinder the model’s creativity, restricting its capacity to produce diverse sets of images conveying the same target emotions.
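The one-hot conditioning described above can be sketched in a few lines. This is a minimal illustration and not the released implementation: the ordering of the eight categories in `EMOTIONS`, the helper names `one_hot` and `linear`, and the single toy layer are all assumptions made here (the actual layer sizes of the emotion encoder are given in the supplement and are not reproduced).

```python
# Category order is an assumption made for this sketch.
EMOTIONS = ["amusement", "awe", "contentment", "excitement",
            "anger", "disgust", "fear", "sadness"]


def one_hot(target_emotion):
    """Return e_oh: 1.0 at the index of the target emotion, 0.0 elsewhere."""
    if target_emotion not in EMOTIONS:
        raise ValueError(f"unknown emotion: {target_emotion!r}")
    return [1.0 if name == target_emotion else 0.0 for name in EMOTIONS]


def linear(x, weights, bias):
    """One fully connected layer y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


if __name__ == "__main__":
    # Hypothetical 3x8 weight matrix, used only to show the lookup behaviour.
    W = [[float(10 * r + c) for c in range(8)] for r in range(3)]
    print(linear(one_hot("awe"), W, [0.0, 0.0, 0.0]))
```

With a one-hot input, the first fully connected layer simply selects one column of its weight matrix (plus the bias), so each emotion category effectively receives its own learnable embedding.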

In the local branch of our EmoEditor, we introduce an image encoder $\mathcal{E}$ to extract visual features of the source image, acquiring $z$ for $I_s$: $z = \mathcal{E}(I_s)$. Similar to the LDM Rombach et al. (2022), $\mathcal{E}$ is based on a pre-trained VAE Kingma, Welling (2022). The forward diffusion process gradually adds noise to $z$, generating a noisy latent $z_t$, where the noise level increases over $t$ time steps ($t \in T$). Specifically, we express $z_t = \sqrt{\alpha_t} \cdot z_{t-1} + \sqrt{1-\alpha_t} \cdot \epsilon$, where $\alpha_t$ is the diffusion coefficient controlling the rate of noise increase, and $\epsilon$ is a random noise sampled from a normal distribution.
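The forward step above can be illustrated with a short sketch. It assumes the latent is flattened to a plain list of floats and that the schedule values passed as `alphas` are hypothetical; the function names `forward_step` and `forward_process` are introduced here for illustration only.

```python
import math
import random


def forward_step(z_prev, alpha_t, rng):
    """One forward step: z_t = sqrt(alpha_t) * z_{t-1} + sqrt(1 - alpha_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z_prev]  # eps ~ N(0, 1), one draw per element
    z_t = [math.sqrt(alpha_t) * z + math.sqrt(1.0 - alpha_t) * e
           for z, e in zip(z_prev, eps)]
    return z_t, eps


def forward_process(z0, alphas, seed=0):
    """Apply the forward step once per entry of the (hypothetical) schedule."""
    rng = random.Random(seed)
    z = list(z0)
    for alpha_t in alphas:
        z, _ = forward_step(z, alpha_t, rng)
    return z


if __name__ == "__main__":
    print(forward_process([0.5, -0.5, 1.0], [0.99] * 10))
```

The two extremes make the role of $\alpha_t$ concrete: with $\alpha_t = 1$ the latent is left unchanged, and with $\alpha_t = 0$ it is replaced entirely by noise.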

In the reverse diffusion process, the denoising network $\epsilon_\theta$ predicts noise added to the noisy latent $z_t$ using cross-attention with the target emotion condition $e$ and source image condition $z$. The features in $z$ and $z_t$ enhance understanding of intrinsic information in the image, leading to emotion-specific content for $I_s$. After $T-1$ denoising steps, decoder $\mathcal{D}$ produces an emotion-evoked image. We minimize the latent diffusion loss, which is the expected error between the predicted noise by $\epsilon_\theta$ and the actual noise $\epsilon$ sampled during training: $L_{noise} = E_{t,y,z,\epsilon(z_t),\epsilon\sim N(0,1)} \left\| \epsilon - \epsilon_\theta(z_t, t, z, \tau_\theta(y)) \right\|^2$.

Alignment Loss between Model and Human Judgements. A single emotion term lacks the depth to convey the nuanced causes underlying emotions, posing a challenge for diffusion models to grasp the diagnostic information triggering emotions. In contrast to existing text-to-image models, EmoEditor emulates human reasoning by aligning its thought process with the text instructions in our EmoPair dataset, which guide the transition from the source emotion to the target emotion. To achieve this, we introduce an alignment loss in a neuro-symbolic space, between the emotion embeddings $e$ from $\tau_\theta$ and the text embeddings $c$, obtained with pre-trained CLIP from the instructions of each image pair in our EmoPair dataset. Specifically, the alignment loss is formulated as one minus the cosine similarity between $e$ and $c$: $L_{emb} = 1 - \cos(e, c)$. Overall, we conduct joint training of $\tau_\theta$ and $\epsilon_\theta$ with the following loss: $L_{total} = L_{noise} + \lambda L_{emb}$, where $\lambda = 0.5$ is the weight balancing the two losses.
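The two loss terms and their combination can be written out for a single training sample as follows. This is a sketch under stated assumptions: embeddings and noise tensors are flattened to plain lists, the per-sample squared error stands in for the expectation in $L_{noise}$, and the function names are hypothetical.

```python
import math


def cosine_similarity(e, c):
    """cos(e, c) for two vectors of equal length."""
    dot = sum(a * b for a, b in zip(e, c))
    norm_e = math.sqrt(sum(a * a for a in e))
    norm_c = math.sqrt(sum(b * b for b in c))
    if norm_e == 0.0 or norm_c == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_e * norm_c)


def alignment_loss(e, c):
    """L_emb = 1 - cos(e, c): 0 when aligned, 1 when orthogonal, 2 when opposite."""
    return 1.0 - cosine_similarity(e, c)


def noise_loss(eps, eps_pred):
    """Squared L2 error between sampled and predicted noise for one sample."""
    return sum((a - b) ** 2 for a, b in zip(eps, eps_pred))


def total_loss(eps, eps_pred, e, c, lam=0.5):
    """L_total = L_noise + lambda * L_emb, with lambda = 0.5 as in the text."""
    return noise_loss(eps, eps_pred) + lam * alignment_loss(e, c)


if __name__ == "__main__":
    print(total_loss([1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
```

Because the cosine term depends only on the direction of $e$, the emotion encoder is pulled toward the instruction embedding without being forced to reproduce it exactly.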

Importantly, we differentiate our alignment loss from the losses typically employed in classical language models, which focus on predicting specific word tokens. Our approach loosens the constraint of precisely predicting the exact text instructions. This flexibility empowers our model to explore a wider array of image editing solutions, that can augment the source emotions of human viewers to target emotions. Consequently, our model learns to reason in a neuro-symbolic manner akin to humans, and meanwhile, preserves its creativity and capacity for exploring novel editing solutions.

Iterative Emotion Inference. Editing emotions is a complex process often requiring iterative image edits. To address this, we introduce a recurrent emotion critic process during inference. This critic iteratively infers the evoked emotions from generated images while assessing their structural coherence and semantic consistency. Specifically, we employ the emotion predictor $\mathcal{P}$ as the critic.

In the first iteration of image generation, the source image $I_s$ serves as both the input and condition image, producing latent variables $z$ and $z_t$. These, along with the emotion vector $e$, are fed to the EmoEditor to obtain the generated image $\hat{I}$. The critic evaluates the generated image quality based on two criteria: (1) the structural similarity (SSIM) between $\hat{I}$ and $I_s$ falls within 0.3-0.8, and (2) the predicted emotion by $\mathcal{P}$ on $\hat{I}$ matches the target emotion, with a confidence level exceeding 0.8. If both criteria are met, EmoEditor ceases image generation; otherwise, it continues to generate new images. In subsequent iterations, we use $\hat{I}$ as the input image, generating its noisy latent variable $\hat{z}_t$, with $I_s$ remaining as the condition. Along with the emotion vector $e$, the EmoEditor generates new edited images for the critic. This process iterates until both criteria are satisfied, or the number of critic iterations exceeds a predefined limit of 30. In the latter case, we select the generated result with the highest confidence level predicted by $\mathcal{P}$ given the target emotion among all past iterations.
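The critic loop above can be sketched as plain control flow. The three callables `edit_fn`, `predict_fn` and `ssim_fn` are hypothetical stand-ins for EmoEditor, the emotion predictor and an SSIM routine; treating the SSIM bounds as inclusive and capping the loop at exactly 30 edits are assumptions, since the text does not settle either detail.

```python
def iterative_emotion_inference(source, target_emotion, edit_fn, predict_fn, ssim_fn,
                                max_iters=30, ssim_range=(0.3, 0.8), min_conf=0.8):
    """Run the critic loop and return (image, accepted).

    edit_fn(input_image, condition_image, target_emotion) -> edited image
    predict_fn(image) -> dict mapping emotion name to probability
    ssim_fn(image_a, image_b) -> structural similarity score
    """
    current = source  # first iteration: the source is both input and condition
    best_image, best_conf = None, -1.0
    for _ in range(max_iters):
        candidate = edit_fn(current, source, target_emotion)
        probs = predict_fn(candidate)
        predicted = max(probs, key=probs.get)
        conf = probs.get(target_emotion, 0.0)
        if conf > best_conf:  # remember the fallback result
            best_image, best_conf = candidate, conf
        ssim = ssim_fn(candidate, source)
        structure_ok = ssim_range[0] <= ssim <= ssim_range[1]
        emotion_ok = predicted == target_emotion and conf > min_conf
        if structure_ok and emotion_ok:
            return candidate, True
        current = candidate  # later iterations re-edit the previous result
    # Limit reached: fall back to the most confident result seen so far.
    return best_image, False


if __name__ == "__main__":
    # Toy stand-ins: an "image" is just an integer that each edit increments.
    edit = lambda inp, cond, emo: inp + 1
    predict = lambda img: {"awe": img / 5, "fear": 1 - img / 5}
    print(iterative_emotion_inference(0, "awe", edit, predict, lambda a, b: 0.5))
```

Keeping the source image as the condition in every iteration is what ties each re-edit back to the original scene, even though the input image changes from one iteration to the next.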

See Supp.Sec.S2.2 for Implementation Details.

4 Experiment

We selected 504 images randomly from the EmoSet dataset Yang et al. (2023), ensuring they were distinct from those in our EmoPair dataset. According to the emotion labels of EmoSet, there are 63 images for each source emotion category. Subsequently, we evaluated both competitive baselines and our EmoEditor on these source images and presented the results in the following subsections.

Ideally, with source images and their classified emotions, our goal is to generate images evoking the other 7 target emotions. Considering emotional valence, we divided the test set into two subsets: within-valence emotion transfer and cross-valence emotion transfer. This division serves two purposes. Firstly, cross-valence emotion transfer is more common in real-world scenarios, like enhancing positive emotions in stressful environments or creating fear in lively settings during Halloween. Secondly, images of the same valence may contain subtle differences challenging for humans to discern. For instance, an amusement park scene might evoke a mixture of excitement and amusement.
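The split of the seven possible targets into the two subsets can be sketched as follows. The valence grouping follows the eight categories listed in Sec.3.1; the function name `target_emotions` and the dictionary keys are hypothetical.

```python
POSITIVE = ("amusement", "awe", "contentment", "excitement")
NEGATIVE = ("anger", "disgust", "fear", "sadness")


def target_emotions(source_emotion):
    """Split the 7 possible target emotions into within- and cross-valence groups."""
    if source_emotion in POSITIVE:
        same, other = POSITIVE, NEGATIVE
    elif source_emotion in NEGATIVE:
        same, other = NEGATIVE, POSITIVE
    else:
        raise ValueError(f"unknown emotion: {source_emotion!r}")
    within = [e for e in same if e != source_emotion]
    return {"within_valence": within, "cross_valence": list(other)}


if __name__ == "__main__":
    print(target_emotions("fear"))
```

Each source emotion therefore has three within-valence targets and four cross-valence targets.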

Baselines. We compare our EmoEditor with five state-of-the-art methods: (1) Color-transfer Pitié et al. (2007) (CT); (2) Neural-Style-Transfer Gatys et al. (2015) (NST). Both methods require an additional reference image from the target emotion category as inputs. We randomly selected these reference images from the EmoSet dataset. We also include representative text-to-image editing models (3) CLIP-Styler Kwon, Ye (2022) (Csty) and (4) Ip2p Brooks et al. (2023). We provide these models with the target emotion category as a one-word text prompt, such as “excitement”. Emotion-evoked image generation involves three key steps: image understanding, instruction generation, and instruction-based image editing. To tackle these, we concatenate existing large language and vision models and introduce the baseline (5) “Large Model Series” (LMS). This includes BLIP Li et al. (2022) for image captioning, followed by GPT-3 Brown et al. (2020) for text instruction generation, and Ip2p for image editing based on the instructions (See Supp.Sec.S2.3).

Figure 5: Human Psychophysics Experiment Results. On average, human participants prefer the image generated by our EmoEditor over that of a competing method 56% of the time. Chance is 50% (red dotted line). Error bars are standard errors.

Method                      EMR(%)↑   ESR(%)↑   ENRD↓   ESS↓
CT                          6.89      79.32     33.29   7.36
NST                         34.42     92.01     34.45   18.57
Csty                        11.51     85.52     41.47   36.64
Ip2p                        2.53      67.76     9.39    12.71
LMS                         11.51     77.38     26.13   19.74
w/o $\mathcal{P}$           5.06      69.15     19.45   14.93
w/o $L_{emb}$               43.35     91.62     22.53   16.00
Ours                        50.20     92.86     23.98   16.27

Table 1: Quantitative Evaluation of Generated Images for All Competitive Methods and Ablated Models of our EmoEditor. See Sec.4 for the methods and evaluation metrics. See Sec.5.1 for the result analysis.

Human Psychophysics Experiment. We evaluate the generated results of all methods on Amazon Mechanical Turk (MTurk) Turk (2012). We recruit 417 participants, with each participant undergoing 100 trials, yielding a total of 41,700 trials. All the experiments are conducted with the subjects’ informed consent and according to protocols approved by the Institutional Review Board of our institution. All participants are properly compensated. In each trial, participants engage in a “two-alternative forced choice” task. They are presented with two image stimuli and must select the one that most strongly evokes the target emotion in them. The two image stimuli consist of a result generated by our EmoEditor and a randomly sampled result from the five baselines introduced above. The trial presentation order and the binary choices for each trial are randomized. To assess the quality of data collection, we introduce control trials and use them as filtering criteria. We discard the data from participants who fail the control trials at least 17% of the time. See Supp.Sec.S2.4 for more details.

Evaluation Metrics. While CLIP Radford et al. (2021) is commonly used to assess text-image consistency, it doesn’t fully capture the complexity of emotions Widhoelzl, Takmaz (2024). Relying on CLIP for emotional alignment may overlook important nuances in images. Thus, we introduce four novel evaluation metrics to quantitatively assess emotion-evoked image generation performance.

Emotion Matching Ratio (EMR) and Emotion Similarity Ratio (ESR) assess the extent to which the generated images evoke target emotions. EMR: Similar to Goetschalckx et al. (2019), we use our emotion predictor $\mathcal{P}$ to assess the emotional category of generated images and determine their alignment with target emotions. We acknowledge that using $\mathcal{P}$ as an evaluator may introduce favorable biases toward our model. However, the absence of existing emotion evaluators leaves us no choice but to rely on $\mathcal{P}$. ESR: $\mathcal{P}$ generates emotion probability distributions $e_g$ and $e_s$ for generated and source images. We compute ESR using the Kullback-Leibler divergence $KLD(\cdot)$, checking whether $KLD(e_g, e_{oh}) < KLD(e_s, e_{oh})$, where $e_{oh}$ is the binary one-hot target emotion vector. This condition indicates that the edit has moved the generated image from the source emotion toward the target emotion.
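The two ratios can be sketched as follows, assuming the predictor outputs are already available as probability arrays. The function names, the argument order of the divergence (first argument as the reference distribution, following the order written above), and the epsilon smoothing needed for a one-hot target are all assumptions of this sketch, not details given in the text.

```python
import numpy as np

def kld(p, q, eps=1e-8):
    """KL divergence sum(p * log(p / q)) per row, with clipping for zeros."""
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=np.float64), eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def emr(e_g, targets):
    """Percentage of generated images whose top-1 emotion equals the target."""
    return 100.0 * np.mean(np.argmax(e_g, axis=-1) == np.asarray(targets))

def esr(e_g, e_s, targets):
    """Percentage of images with KLD(e_g, e_oh) < KLD(e_s, e_oh)."""
    e_g = np.asarray(e_g, dtype=np.float64)
    e_oh = np.eye(e_g.shape[-1])[np.asarray(targets)]  # one-hot targets
    return 100.0 * np.mean(kld(e_g, e_oh) < kld(e_s, e_oh))
```

Under this reading, EMR is the stricter criterion: an image can move closer to the target distribution and count toward ESR without the target becoming its top-1 prediction.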

Emotion-Neutral Region Deviation (ENRD) and Edge Structure Similarity (ESS) assess the structural coherence and semantic consistency between source and generated images. ENRD: Using Grad-CAM Selvaraju et al. (2017) with $\mathcal{P}$, we binarize Grad-CAM maps at a 0.5 threshold to identify emotionally neutral regions on source images (those valued at 0). Then, we compute the pixel-level L1 distance between these regions on source and generated images. ESS: Employing the Canny edge detection algorithm with thresholds of 200 and 500 on both source and generated images, we determine the L1 norm between their edge maps to quantify structural disparities.
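A minimal sketch of both structure metrics is given below, assuming the Grad-CAM map (scaled to [0, 1]) and the Canny edge maps have already been computed, for example with OpenCV's `cv2.Canny(image, 200, 500)`. Averaging the L1 distance over pixels is an assumption of this sketch, since the normalization is not stated above; the function names are hypothetical.

```python
import numpy as np

def enrd(source, generated, cam, threshold=0.5):
    """Mean absolute pixel difference inside emotion-neutral source regions.

    cam: Grad-CAM map in [0, 1] with shape (H, W). Pixels below the
    threshold binarize to 0 and are treated as emotion-neutral.
    source, generated: arrays of shape (H, W) or (H, W, C).
    """
    neutral = np.asarray(cam) < threshold
    if not neutral.any():
        return 0.0
    diff = np.abs(np.asarray(source, dtype=np.float64)
                  - np.asarray(generated, dtype=np.float64))
    return float(diff[neutral].mean())

def ess(edges_source, edges_generated):
    """Mean absolute difference between two precomputed edge maps."""
    a = np.asarray(edges_source, dtype=np.float64)
    b = np.asarray(edges_generated, dtype=np.float64)
    return float(np.abs(a - b).mean())
```

Lower values of both quantities mean the generated image stays closer to the source, which matches the downward arrows on ENRD and ESS in Table 1.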

5 Results

5.1 Quantitative Evaluation in Cross-Valence Scenarios

Figure 5 shows human psychophysics experimental results (see Supp.Sec.S3.1 for more results). If all methods generated images that elicit target emotions equally effectively, human participants would make random choices, with 50% chance in the “two-alternative forced choice” tasks. Complementary to the psychophysics experiments, Table 1 presents quantitative results using the four metrics from Sec.4. In Figure 5, human participants consistently prefer the generated results of our EmoEditor over all competitive baselines, indicating its proficiency in evoking target emotions. This is supported by EmoEditor’s highest EMR and ESR scores in Table 1. Additionally, EmoEditor achieves moderately low ENRD and ESS, suggesting maintained structural and semantic coherence in the generated images.

Figure 6: Visualization of Generated Images from Different Methods. The target emotion is highlighted in red and the source image is framed with green. See Supp.Fig.S10 for more examples.

From Figure 5, we note that our EmoEditor significantly outperforms Color Transfer (CT), indicating that emotion-evoked image generation involves more than global color and brightness adjustments. CT also has lower EMR and ESR than our EmoEditor in Table 1. As expected, CT only transfers colors, preserving the edges of the source image and hence resulting in the lowest ESS score. Similarly, our EmoEditor is significantly favored over Ip2p by human participants, indicating that current text-to-image diffusion models face challenges in understanding emotional cues and generating effective text instructions for editing emotion-evoked images. Although Ip2p has lower ENRD and ESS scores than our EmoEditor, its markedly lower EMR and ESR scores suggest that it makes minimal changes to source images, maintaining high structural and semantic similarity but failing to evoke target emotions. Additionally, our EmoEditor outperforms Large Model Series (LMS) in human preference scores in Figure 5 and, in Table 1, by 39% in EMR, 15% in ESR, 2 in ENRD, and 3 in ESS. This suggests that LMS accumulates biases and errors across all procedural steps, highlighting the importance of end-to-end training for emotion-evoked image generation.
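The margins over LMS quoted above follow from the Ours and LMS rows of Table 1. Writing each difference so that a positive value favors EmoEditor (higher EMR and ESR, lower ENRD and ESS), the arithmetic is:

```latex
\begin{align*}
\Delta\mathrm{EMR}  &= 50.20 - 11.51 = 38.69 \approx 39\%, \\
\Delta\mathrm{ESR}  &= 92.86 - 77.38 = 15.48 \approx 15\%, \\
\Delta\mathrm{ENRD} &= 26.13 - 23.98 = 2.15 \approx 2, \\
\Delta\mathrm{ESS}  &= 19.74 - 16.27 = 3.47 \approx 3.
\end{align*}
```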

Neural-Style-Transfer (NST) and CLIP-Styler (Csty) excel in inducing negative emotions through distortion and irregular textures on source images. They are preferred in positive-to-negative emotion manipulations, primarily in the upper quadrants of their confusion matrix (see Supp.Fig.S8). However, these methods render images challenging to comprehend, indicated by higher ENRD and ESS scores in Table 1, suggesting the distorted images lose structural similarities and semantic content. In contrast, our EmoEditor outperforms both methods in generating positively valenced images, with higher preference scores in the lower quadrants of the confusion matrix (see Supp.Fig.S8).

In ablation studies (Table 1), removing either the emotion predictor $\mathcal{P}$ (Row 6) or the alignment loss $\lambda L_{emb}$ (Row 7) significantly impairs performance. This leads to lower EMR and ESR scores, highlighting the crucial roles of both components in our EmoEditor for detecting emotional cues from source images, emulating human-like reasoning in generating editing instructions, and executing iterative refinements during inference. We also perform ablation studies on different max iterations for iterative emotion inference, showing that this iterative process improves the model’s performance, achieving stable and high-quality emotionally evocative images (see Supp.Sec.S3.3).

5.2 Visualization of Emotional Image Generation across Valence

Figures 1 and 6 present visualizations of generated images for all competitive methods in cross-valence scenarios (see Supp.Sec.S3.2 for more results). Color Transfer primarily replicates tonal characteristics from a reference image, which is often ineffective for significant emotional augmentation. Neural Style Transfer relies heavily on randomly selected reference images. While it produces artistic textures aligned with target emotions, it fails to preserve essential semantic content of the source images. CLIPstyler excels in producing negative emotions but introduces challenging-to-interpret textures, as seen in Column 4. Ip2p demonstrates limited image generation abilities with single-word emotion prompts. For example, the fire in the generated image remains almost identical to that in the source image, due to the inability to comprehend emotions (Row 1, Column 5). Even after concatenating a series of large vision and language models before Ip2p, LMS does not always generate desirable outputs that trigger target emotions. This can be seen in the generated image of a smiling face, which fails to evoke sadness in Row 2, Column 6.

In contrast, our EmoEditor needs only the target emotion and source image inputs, yet produces highly creative images that evoke target emotions while striving to maintain scene structures and semantic coherence as much as possible. For instance, in Row 1, Column 7, our method does not change the subject of the fire but turns the fire in the woods into a bonfire and generates a beautiful night sky in the background, thus evoking contentment. In Row 6, Column 7, it accurately adjusts the character’s facial expression while retaining the original content elements.

Figure 7: Our EmoEditor can generalize to more challenging emotion editing scenarios. (a) Three examples are showcased, highlighting our EmoEditor’s ability to produce images evoking emotions of the same positive valence as the source images. (b) Two examples demonstrate how our EmoEditor can transform neutral real-world images to evoke either positive or negative emotions. Source images are framed in green. Target emotions are in red. See Sec.5.3 for the result analysis.

5.3 Generalization to Real-world Scenarios

Editing images within the same emotional valence presents greater challenges compared to cross-valence emotion image editing due to subtle differences in emotional cues. Figure 7(a) showcases our EmoEditor’s results within the same valence. Our EmoEditor displays a nuanced understanding of emotional cues and produces impressive creative outcomes. Example 1 transitions from awe to contentment, depicting a sunset scene with afterglow. Example 2 generates a blooming white flower with additional purple flowers to evoke excitement. Example 3 transforms a lively spring scene into a tranquil night scene with moonlight to evoke awe.

In real-world scenarios, some naturalistic images may evoke mixed emotions or no strong emotional responses at all. We challenge our EmoEditor to generate emotion-evoked images using neutral-valence source images randomly selected from the MSCOCO dataset Lin et al. (2014), commonly used for object detection. These images depict everyday life scenes with neutral emotions. Our EmoEditor applies positive and negative emotion transformations to these source images. Figure 7(b) illustrates the generated results. In Example 1, to inject amusement into an office space, our EmoEditor embellishes the walls with delightful watercolor paintings and infuses lively and vibrant colors into the room furniture. Conversely, to evoke fear, our model creates an aged desk and dimly lit walls. In Example 2, for an image featuring a courtyard with a teddy bear observing the garden, our method enriches the scene by adding a beautiful flower bed, creating an awe-inspiring effect. Conversely, to evoke feelings of disgust, our EmoEditor removes the flowers from the courtyard and transforms the view into a dilapidated scene with shabby clothes hanging on the walls and trash bags in the corner. It is crucial to emphasize that our EmoEditor is not trained in these neutral scenarios. All the generated results stem solely from the model’s own creativity and imagination, without any human-annotated editing instructions or scene interpretations.

6 Discussion

We present a new and important problem of emotion-evoked image generation. To address this, we present EmoEditor, an image diffusion model that understands emotional cues, creates implicit editing instructions aligned with human decisions, and manipulates image regions to evoke emotions, while maintaining coherent scene structures and semantics. Moreover, we contribute the EmoPair dataset for model training. To benchmark all the methods, we introduce new evaluation metrics and establish standard protocols in emotion understanding, visualization, and reasoning. While our EmoEditor demonstrates superior quantitative performance and visually striking image generation results, we acknowledge two limitations: accurately handling fine-grained details of visual features on small faces within crowded scenes, and generating emotion-evoked images without exacerbating semantic and structural disparities between source and target images. See Supp.Sec.S3.4 for details.

The problem of emotion-evoked image generation introduced here opens doors for research across AI, psychology, design, arts, and neuroscience. However, EmoEditor, like any technology, carries inherent risks. Misusing it, such as manipulating images to evoke undesirable emotions, could harm individuals’ mental well-being or mislead public feelings and opinions.

References

  • Abdal et al. (2022) Abdal Rameen, Zhu Peihao, Femiani John, Mitra Niloy, Wonka Peter. Clip2stylegan: Unsupervised extraction of stylegan edit directions // ACM SIGGRAPH 2022 conference proceedings. 2022. 1–9.
  • Baumeister et al. (2001) Baumeister Roy F, Bratslavsky Ellen, Finkenauer Catrin, Vohs Kathleen D. Bad is stronger than good // Review of general psychology. 2001. 5, 4. 323–370.
  • Brooks et al. (2023) Brooks Tim, Holynski Aleksander, Efros Alexei A. Instructpix2pix: Learning to follow image editing instructions // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. 18392–18402.
  • Brown et al. (2020) Brown Tom, Mann Benjamin, Ryder Nick, Subbiah Melanie, Kaplan Jared D, Dhariwal Prafulla, Neelakantan Arvind, Shyam Pranav, Sastry Girish, Askell Amanda, others. Language models are few-shot learners // Advances in neural information processing systems. 2020. 33. 1877–1901.
  • Gal et al. (2022) Gal Rinon, Patashnik Or, Maron Haggai, Bermano Amit H, Chechik Gal, Cohen-Or Daniel. StyleGAN-NADA: CLIP-guided domain adaptation of image generators // ACM Transactions on Graphics (TOG). 2022. 41, 4. 1–13.
  • Gatys et al. (2015) Gatys Leon A., Ecker Alexander S., Bethge Matthias. A Neural Algorithm of Artistic Style. 2015.
  • Goetschalckx et al. (2019) Goetschalckx Lore, Andonian Alex, Oliva Aude, Isola Phillip. Ganalyze: Toward visual definitions of cognitive image properties // Proceedings of the ieee/cvf international conference on computer vision. 2019. 5744–5753.
  • Goodfellow et al. (2014) Goodfellow Ian, Pouget-Abadie Jean, Mirza Mehdi, Xu Bing, Warde-Farley David, Ozair Sherjil, Courville Aaron, Bengio Yoshua. Generative adversarial nets // Advances in neural information processing systems. 2014. 27.
  • He et al. (2016) He Kaiming, Zhang Xiangyu, Ren Shaoqing, Sun Jian. Deep residual learning for image recognition // Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. 770–778.
  • Hertz et al. (2022) Hertz Amir, Mokady Ron, Tenenbaum Jay, Aberman Kfir, Pritch Yael, Cohen-Or Daniel. Prompt-to-Prompt Image Editing with Cross Attention Control. 2022.
  • Ito et al. (1998) Ito Tiffany A, Larsen Jeff T, Smith N Kyle, Cacioppo John T. Negative information weighs more heavily on the brain: the negativity bias in evaluative categorizations. // Journal of personality and social psychology. 1998. 75, 4. 887.
  • Kawar et al. (2023) Kawar Bahjat, Zada Shiran, Lang Oran, Tov Omer, Chang Huiwen, Dekel Tali, Mosseri Inbar, Irani Michal. Imagic: Text-based real image editing with diffusion models // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. 6007–6017.
  • Kingma, Welling (2022) Kingma Diederik P, Welling Max. Auto-Encoding Variational Bayes. 2022.
  • Kwon, Ye (2022) Kwon Gihyun, Ye Jong Chul. Clipstyler: Image style transfer with a single text condition // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. 18062–18071.
  • Li et al. (2022) Li Junnan, Li Dongxu, Xiong Caiming, Hoi Steven. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation // International Conference on Machine Learning. 2022. 12888–12900.
  • Lin et al. (2014) Lin Tsung-Yi, Maire Michael, Belongie Serge, Hays James, Perona Pietro, Ramanan Deva, Dollár Piotr, Zitnick C Lawrence. Microsoft coco: Common objects in context // Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. 2014. 740–755.
  • Machajdik, Hanbury (2010) Machajdik Jana, Hanbury Allan. Affective image classification using features inspired by psychology and art theory // Proceedings of the 18th ACM international conference on Multimedia. 2010. 83–92.
  • Mikels et al. (2005) Mikels Joseph A, Fredrickson Barbara L, Larkin Gregory R, Lindberg Casey M, Maglio Sam J, Reuter-Lorenz Patricia A. Emotional category data on images from the International Affective Picture System // Behavior research methods. 2005. 37. 626–630.
  • Patashnik et al. (2021) Patashnik Or, Wu Zongze, Shechtman Eli, Cohen-Or Daniel, Lischinski Dani. Styleclip: Text-driven manipulation of stylegan imagery // Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. 2085–2094.
  • Peng et al. (2015) Peng Kuan-Chuan, Chen Tsuhan, Sadovnik Amir, Gallagher Andrew C. A mixed bag of emotions: Model, predict, and transfer emotion distributions // Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. 860–868.
  • Pitié et al. (2007) Pitié François, Kokaram Anil C, Dahyot Rozenn. Automated colour grading using colour distribution transfer // Computer Vision and Image Understanding. 2007. 107, 1-2. 123–137.
  • Radford et al. (2021) Radford Alec, Kim Jong Wook, Hallacy Chris, Ramesh Aditya, Goh Gabriel, Agarwal Sandhini, Sastry Girish, Askell Amanda, Mishkin Pamela, Clark Jack, others. Learning transferable visual models from natural language supervision // International conference on machine learning. 2021. 8748–8763.
  • Rombach et al. (2022) Rombach Robin, Blattmann Andreas, Lorenz Dominik, Esser Patrick, Ommer Björn. High-resolution image synthesis with latent diffusion models // Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022. 10684–10695.
  • Ruiz et al. (2023) Ruiz Nataniel, Li Yuanzhen, Jampani Varun, Pritch Yael, Rubinstein Michael, Aberman Kfir. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. 22500–22510.
  • Selvaraju et al. (2017) Selvaraju Ramprasaath R, Cogswell Michael, Das Abhishek, Vedantam Ramakrishna, Parikh Devi, Batra Dhruv. Grad-cam: Visual explanations from deep networks via gradient-based localization // Proceedings of the IEEE international conference on computer vision. 2017. 618–626.
  • Turk (2012) Turk Amazon Mechanical. Amazon mechanical turk // Retrieved August. 2012. 17. 2012.
  • Wang et al. (2023) Wang Su, Saharia Chitwan, Montgomery Ceslee, Pont-Tuset Jordi, Noy Shai, Pellegrini Stefano, Onoe Yasumasa, Laszlo Sarah, Fleet David J, Soricut Radu, others. Imagen editor and editbench: Advancing and evaluating text-guided image inpainting // Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. 18359–18369.
  • Wang et al. (2022) Wang Tengfei, Zhang Ting, Zhang Bo, Ouyang Hao, Chen Dong, Chen Qifeng, Wen Fang. Pretraining is All You Need for Image-to-Image Translation. 2022.
  • Wang et al. (2004) Wang Zhou, Bovik Alan C, Sheikh Hamid R, Simoncelli Eero P. Image quality assessment: from error visibility to structural similarity // IEEE transactions on image processing. 2004. 13, 4. 600–612.
  • Weng et al. (2023) Weng Shuchen, Zhang Peixuan, Chang Zheng, Wang Xinlong, Li Si, Shi Boxin. Affective Image Filter: Reflecting Emotions from Text to Images // Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. 10810–10819.
  • Widhoelzl, Takmaz (2024) Widhoelzl Hanna-Sophia, Takmaz Ece. Decoding Emotions in Abstract Art: Cognitive Plausibility of CLIP in Recognizing Color-Emotion Associations. 2024.
  • Yang et al. (2023) Yang Jingyuan, Huang Qirui, Ding Tingting, Lischinski Dani, Cohen-Or Danny, Huang Hui. EmoSet: A Large-scale Visual Emotion Dataset with Rich Attributes // Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. 20383–20394.
  • Yang et al. (2018) Yang Jufeng, She Dongyu, Sun Ming, Cheng Ming-Ming, Rosin Paul L, Wang Liang. Visual sentiment prediction based on automatic discovery of affective regions // IEEE Transactions on Multimedia. 2018. 20, 9. 2513–2525.
  • You et al. (2015) You Quanzeng, Luo Jiebo, Jin Hailin, Yang Jianchao. Robust image sentiment analysis using progressively trained and domain transferred deep networks // Proceedings of the AAAI conference on Artificial Intelligence. 29. 2015. 381–388.
  • You et al. (2016) You Quanzeng, Luo Jiebo, Jin Hailin, Yang Jianchao. Building a large scale dataset for image emotion recognition: The fine print and the benchmark // Proceedings of the AAAI conference on artificial intelligence. 30. 2016. 308–314.
  • Yu et al. (2018) Yu Jiahui, Lin Zhe, Yang Jimei, Shen Xiaohui, Lu Xin, Huang Thomas S. Generative Image Inpainting With Contextual Attention // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2018. 5505–5514.
  • Zhang et al. (2018) Zhang Richard, Isola Phillip, Efros Alexei A, Shechtman Eli, Wang Oliver. The unreasonable effectiveness of deep features as a perceptual metric // Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. 586–595.

Supplementary Material

Appendix S1 EmoPair Dataset

In response to the absence of an image dataset specifically designed for emotion-evoked image generation, we introduce EmoPair, comprising two distinct subsets: the EmoPair-Annotated Subset (EPAS), encompassing 331,595 image pairs sourced from Ip2p Brooks et al. (2023) and annotated with emotion labels; and the EmoPair-Generated Subset (EPGS), featuring 6,949 pairs generated through text instructions specifying target emotions.

For the EmoPair-Generated Subset (EPGS), we formulated 50 general instructions that are agnostic to source images, prompting transitions to desired emotions across 8 categories using GPT-3 Brown et al. (2020). Human annotators then ranked these instructions based on efficacy within each emotion category, ultimately retaining the top ten. Figure S1 illustrates examples of these instructions.

We employ the following selection criteria to control the quality of generated image pairs of EPGS: (1) Using our emotion predictor $\mathcal{P}$, we analyze the generated images and only select those with a Top-1 classification confidence over 90% for the target emotions. (2) To ensure the preservation of similar scene structures, we utilize Structural Similarity Index (SSIM) Wang et al. (2004) and Learned Perceptual Image Patch Similarity (LPIPS) Zhang et al. (2018) for filtering the remaining outcomes. SSIM measures structural similarity between images, while LPIPS quantifies perceptual differences. Specifically, we require the generated image $\hat{x}$ and the source image $x$ to meet the conditions below: $0.3 < \mathrm{SSIM}(x, \hat{x}) < 0.6$ and $\mathrm{LPIPS}(x, \hat{x}) > 0.1$. Ultimately, EPGS retains 6,949 image pairs.
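The two selection criteria can be expressed as a single predicate. This is a minimal sketch that assumes the predictor probabilities, SSIM, and LPIPS values have been computed elsewhere; the function name `keep_pair` and its argument layout are hypothetical.

```python
def keep_pair(pred_probs, target_idx, ssim, lpips):
    """Return True if a generated pair passes the EPGS quality filters.

    pred_probs: predicted emotion probabilities for the generated image.
    target_idx: index of the target emotion.
    ssim, lpips: similarity scores between source and generated image.
    """
    top1 = max(range(len(pred_probs)), key=lambda i: pred_probs[i])
    # (1) Top-1 prediction is the target emotion with confidence over 90%.
    confident = top1 == target_idx and pred_probs[target_idx] > 0.90
    # (2) Structure is neither too different nor nearly unchanged.
    structured = 0.3 < ssim < 0.6 and lpips > 0.1
    return confident and structured
```

The upper SSIM bound and lower LPIPS bound together reject pairs where the edit barely changed the image, while the lower SSIM bound rejects pairs that lost the source structure.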

To provide a more comprehensive overview of our dataset, additional image pair examples are presented in Figures S2 and S3.

Figure S1: Instruction examples of EmoPair-Generated Subset (EPGS). For each emotion category, we retained the top ten text instructions based on the rankings determined by human annotators according to the efficacy of each emotion category. We utilized these instructions for Ip2p in image editing to generate target images capable of evoking the desired emotions, thereby constructing our EPGS.
Figure S2: Sample of EmoPair-Annotated Subset (EPAS). On the left side of each pair of images is the source image (framed in green), while the right side shows the target image. The emotion labels for the source and target images (highlighted in red) are indicated above the images.
Figure S3: Sample of EmoPair-Generated Subset (EPGS). On the left side of each pair of images is the source image (framed in green), while the right side shows the target image. The emotion labels for the source and target images (highlighted in red) are indicated above the images.

Appendix S2 Experiment

S2.1 Emotion Encoder

Our Emotion Encoder $\tau_\theta$ is a fully connected network designed for transforming one-hot encoded emotion input vectors into the structured emotion embedding $e$. The architecture of the model is composed of a series of fully connected layers that progressively increase the dimensionality of the input. Figure S4 shows the network structure of our emotion encoder $\tau_\theta$.

The network begins with an input layer of size 8, which is then passed through a sequence of fully connected layers with increasing sizes: 256, 512, and 768 neurons, respectively. Each of these layers is followed by a ReLU activation function to introduce non-linearity and improve the model’s learning capacity. The final linear layer projects the output to a dimension of $77 \times 768$, structuring the embedding to fit the input size of our denoising network $\epsilon_\theta$.
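The layer sizes described above can be traced with a small forward-pass sketch. It is written in NumPy purely to show the shapes and is not the authors' PyTorch implementation; the class name, the random initialization, and the constructor parameters are assumptions made for illustration.

```python
import numpy as np

class EmotionEncoderSketch:
    """Shape-level sketch: 8 -> 256 -> 512 -> 768 (ReLU each) -> 77 x 768."""

    def __init__(self, n_emotions=8, hidden=(256, 512, 768),
                 n_tokens=77, embed_dim=768, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [n_emotions, *hidden, n_tokens * embed_dim]
        # Random placeholder weights; a trained model would load real ones.
        self.weights = [
            rng.standard_normal((i, o), dtype=np.float32) * np.float32(0.02)
            for i, o in zip(sizes[:-1], sizes[1:])
        ]
        self.biases = [np.zeros(o, dtype=np.float32) for o in sizes[1:]]
        self.n_tokens, self.embed_dim = n_tokens, embed_dim

    def __call__(self, one_hot):
        h = np.asarray(one_hot, dtype=np.float32)
        # Hidden fully connected layers, each followed by ReLU.
        for w, b in zip(self.weights[:-1], self.biases[:-1]):
            h = np.maximum(h @ w + b, 0.0)
        # Final linear projection, reshaped to (batch, tokens, embed_dim).
        out = h @ self.weights[-1] + self.biases[-1]
        return out.reshape(-1, self.n_tokens, self.embed_dim)
```

With the default arguments, a batch of one-hot vectors of shape (B, 8) maps to an embedding of shape (B, 77, 768). Note that the final projection alone holds roughly 45 million weights, so smaller `n_tokens` and `embed_dim` values are convenient for quick experiments.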

Figure S4: The network structure of our emotion encoder $\tau_\theta$.

S2.2 Implementation Details

Following Wang et al. (2022), we initialize the weights of $\mathcal{E}$, $\mathcal{D}$, and $\epsilon_\theta$ with the pre-trained Ip2p Brooks et al. (2023) weights. Throughout training, we keep the parameters of $\mathcal{E}$ and $\mathcal{D}$ fixed and train only $\tau_\theta$ and $\epsilon_\theta$. The frozen emotion predictor $\mathcal{P}$ is only used during inference. We conduct experiments on NVIDIA RTX A5000 GPUs, implementing our framework in PyTorch with the Adam optimizer. The max time step $T$ is set to 1,000.

S2.3 Baseline: Large Model Series

Emotion-evoked image generation involves three key steps: image understanding, instruction generation, and instruction-based image editing. To tackle these, we concatenate existing large language and vision models and introduce the baseline “Large Model Series” (LMS). This includes BLIP Li et al. (2022) for image captioning, followed by GPT-3 Brown et al. (2020) for text instruction generation, and Ip2p for image editing based on the instructions.

Figure S5 shows the workflow of LMS. First, we employ BLIP to comprehend the source image, generating an image caption corresponding to the source image. Subsequently, based on the generated caption, we utilize the sentence structure “There is an image with [Image caption]. To make me feel [Target Emotion], how to change the image? Give one instruction that starts with an action." to query GPT-3. Following this, GPT-3 generates an instruction to guide Ip2p in editing the source image, resulting in the final output.
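The query sent to GPT-3 can be assembled from the quoted sentence structure with simple string formatting. The sketch below covers only this prompt-building step; the function and constant names are hypothetical, and the calls to BLIP, GPT-3, and Ip2p are deliberately left out.

```python
# Sentence structure quoted in the text, with the two slots made explicit.
PROMPT_TEMPLATE = (
    "There is an image with {caption}. "
    "To make me feel {emotion}, how to change the image? "
    "Give one instruction that starts with an action."
)

def build_lms_query(caption, target_emotion):
    """Fill the LMS prompt with an image caption and a target emotion."""
    # Drop a trailing period so the caption fits inside the sentence.
    caption = caption.strip().rstrip(".")
    return PROMPT_TEMPLATE.format(caption=caption,
                                  emotion=target_emotion.strip())
```

For example, `build_lms_query("a fire in the woods", "contentment")` produces the question that would be passed to the language model, whose answer then serves as the editing instruction for Ip2p.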

Figure S5: Workflow of Large Model Series (LMS). We decompose emotion-evoked image generation into three steps: employing BLIP for image understanding, utilizing GPT-3 for instruction generation, and employing Ip2p for image editing. We refer to this approach of chaining multiple large models as the Large Model Series.

S2.4 Human Psychophysics Experiment

Figure S6(a) shows the MTurk Experiment schematic and Figure S6(b) shows the experiment instruction pages given to the participants. To control the quality of data collected, we have implemented preventive measures to screen participants:

(a) Each participant undergoes 6 randomly dispersed dummy trials within the real experiments. To assess if participants are making random selections, we use 6 image pairs with prominent emotional differences as references. Considering individual emotional response variability, participants are allowed a maximum of one incorrect choice in these dummy trials. Subjects exceeding an error rate of 1/6 are excluded, resulting in 11,400 trials. The outcomes of the 6 dummy trials are also excluded from the final statistics. Figure S6(c) provides an example of a dummy trial.

(b) Each participant can only take part in the experiment once.

(c) Image pairs used in the entire experiment are drawn from the 2,016 result pairs generated by our model and five other SOTA methods. Participants view randomly sampled pairs from these 2,016 pairs, and all trials are presented in a random order. The order of the two images for selection is also randomized. An example trial in the real experiment is illustrated in Figure S6(d).
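The screening rule in (a) can be sketched as a small filter: a participant is kept only with at most one wrong answer across the six dummy trials, and the dummy outcomes are dropped from the retained data. The record layout and function names below are hypothetical.

```python
def passes_control(dummy_correct, max_errors=1):
    """True if the participant made at most `max_errors` dummy-trial errors."""
    errors = sum(1 for ok in dummy_correct if not ok)
    return errors <= max_errors

def filter_participants(records):
    """Keep real-trial data only for participants who pass the dummy trials.

    records: dict mapping participant id to a dict with keys
      "dummy": list of 6 booleans (True = correct dummy-trial choice)
      "real":  list of real-trial outcomes for that participant
    Dummy outcomes are excluded from the returned data.
    """
    return {pid: r["real"] for pid, r in records.items()
            if passes_control(r["dummy"])}
```

One error out of six corresponds to an error rate of exactly 1/6, which is still accepted; two or more errors lead to exclusion.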

Figure S6: MTurk Experiment. (a) Schematic of MTurk experiment. Participants are presented with a set of images and a target emotion. They must choose between two images, one generated by our model and the other by one of five state-of-the-art methods, selecting the one that more effectively evokes the target emotion. (b) Instructions of the MTurk experiment. (c) Dummy trial example. We chose six image pairs with prominent emotional distinctions as benchmarks to assess participants’ comprehension of the task and their attentiveness to the experiment. (d) Real trial example.

Appendix S3 Results

S3.1 Quantitative Evaluation in Cross-Valence Scenarios

We assess the generated outputs of all methods using Amazon Mechanical Turk (MTurk) Turk (2012) (Online). We recruited 417 participants, each undergoing 100 trials. Additionally, we conducted in-lab experiments involving 10 volunteers, each of whom also completed 100 trials. The consistency between online and in-lab experimental results demonstrates the superiority of our approach in producing results that evoke the target emotion compared to other state-of-the-art methods.

Figure S7: Preference of Our Method Over SOTA Methods. We conducted both online experiments (417 participants) and in-lab experiments (10 participants). The left column represents the online results, while the right column displays the in-lab results. The first row shows the overall results, the second row only considers the trials with transitions from negative to positive emotions, and the third row focuses on the trials with transitions from positive to negative emotions. Chance is 50% in the red dotted line. Error bars are standard errors.

Figure S7 illustrates the preference of our method over other state-of-the-art methods in both online and in-lab experiments. From the overall results (Row 1), the in-lab experiment demonstrates a higher preference compared to the online experiment. This suggests that our method’s effectiveness in evoking the target emotion is more pronounced in a laboratory setting. Note that our in-lab and online experiments had identical setups. The observed difference may stem from the controlled environment provided by the lab conditions, where participants were more focused, reducing randomness and facilitating a more accurate assessment of emotion generation. This is further supported by the dummy trial success rates, with a 100% pass rate in the in-lab experiment compared to only 27% in the online experiment. Again, we emphasize that for the online data we only analyze results from participants who passed the dummy tests with an error rate of at most 1/6.

Results from the “To positive” (Row 2) and “To negative” (Row 3) conditions indicate that Neural-Style-Transfer Gatys et al. (2015) (NST) and CLIP-Styler Kwon, Ye (2022) (Csty) exhibit an advantage in generating negative results, but struggle with positive outcomes. Additionally, our EmoEditor surpasses Color-transfer Pitié et al. (2007) (CT) and Ip2p Brooks et al. (2023) in generating results evoking positive emotions, while the improvement in generation performance is even more significant in generating results evoking negative emotions. This indirectly confirms the greater challenge of generating positive emotion-evoked images compared to negative ones. Furthermore, the performance of our EmoEditor and the Large Model Series (LMS) is comparable; however, our approach allows for end-to-end training and testing, avoiding the complexity of chaining large models.

In addition, we show the confusion matrices of the different methods for the online and in-lab experiments in Figures S8 and S9, respectively. The confusion matrices provide more detailed insight into the emotion transition effects of our EmoEditor across specific emotion categories. In the upper quadrant of the confusion matrix, participants more often prefer Neural-Style-Transfer Gatys et al. (2015) and CLIP-Styler Kwon, Ye (2022) when emotions are manipulated from positive to negative. This preference may arise because these methods excel at inducing negative emotions by adding distortions and irregular textures to the source image. In contrast, our EmoEditor achieves higher human preference scores in the lower quadrant of the confusion matrix, demonstrating our method’s significant superiority over these two approaches in generating positive-valence images.
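A preference matrix of this kind can be assembled from per-trial records, as in the sketch below. The record format, function name, and toy trials are hypothetical; each cell holds the ratio of trials in which our method was selected for a given source-to-target transition, with `None` where no trials exist.

```python
def preference_matrix(trials, emotions):
    """Map each (source, target) pair to the ratio of trials where our method was chosen.

    trials: iterable of (source_emotion, target_emotion, chose_ours) tuples.
    Cells with no trials are set to None.
    """
    counts = {(s, t): [0, 0] for s in emotions for t in emotions}
    for source, target, chose_ours in trials:
        cell = counts[(source, target)]
        cell[0] += int(chose_ours)  # trials won by our method
        cell[1] += 1                # total trials for this transition
    return {pair: (won / total if total else None) for pair, (won, total) in counts.items()}


# Hypothetical toy trials using emotion labels that appear in the text.
emotions = ["sadness", "anger", "amusement", "awe"]
toy_trials = [
    ("sadness", "amusement", True),
    ("sadness", "amusement", False),
    ("sadness", "amusement", True),
    ("anger", "awe", True),
    ("amusement", "sadness", False),
]

matrix = preference_matrix(toy_trials, emotions)
for source in emotions:
    print(source, [matrix[(source, target)] for target in emotions])
```

Rows correspond to source emotions and columns to target emotions, matching the axis convention stated in the captions of Figures S8 and S9.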

Figure S8: Confusion matrix of different methods (Online). The confusion matrix reflects the human preference for our EmoEditor over other methods based on source and target emotion pairs. The top-left matrix presents the average overall results of our EmoEditor compared to the other five SOTA methods. The remaining five matrices show individual comparisons between our EmoEditor and each of the five SOTA methods. Average preferences for our EmoEditor and other methods are indicated above each matrix. The vertical axis corresponds to the source emotion, while the horizontal axis represents the target emotion. Each cell in the matrix denotes the ratio at which our method is selected for the corresponding transition from the source to the target emotion.
Figure S9: Confusion matrix of different methods (In-lab). The confusion matrix reflects the human preference for our EmoEditor over other methods based on source and target emotion pairs. The top-left matrix presents the average overall results of our EmoEditor compared to the other five SOTA methods. The remaining five matrices show individual comparisons between our EmoEditor and each of the five SOTA methods. Average preferences for our EmoEditor and other methods are indicated above each matrix. The vertical axis corresponds to the source emotion, while the horizontal axis represents the target emotion. Each cell in the matrix denotes the ratio at which our method is selected for the corresponding transition from the source to the target emotion.

S3.2 Visualization of Emotional Image Generation across Valence

Figure S10 visualizes the images generated by all methods across various scenarios. Color-Transfer Pitié et al. (2007) (CT) is ineffective for significant emotional enhancement because it primarily replicates the color features of the reference image. For instance, in Row 1, Column 2, it enhances brightness but fails to evoke amusement because it does not alter the elements that cause sadness, such as the crying woman. Neural-Style-Transfer Gatys et al. (2015) (NST) relies heavily on randomly selected reference images and fails to preserve the fundamental semantic content of the source image. For example, in Row 4, Column 3, it generates multiple differently colored blocks but fails to evoke sadness. CLIP-Styler Kwon, Ye (2022) (Csty) introduces inexplicable textures. Ip2p Brooks et al. (2023), due to its limited understanding of emotions, often produces results identical to the source image, showing restricted image generation capabilities, as seen in Column 5.

Even with a series of large models, LMS does not consistently generate outputs that trigger the target emotion, as evidenced by Row 1, Column 6, where it not only fails to change the crying woman but also amplifies the crying details. In addition, it often changes the scene of the source image drastically in order to change the emotion. For example, in Row 2, Column 6, it produces a building that looks very technological but is quite different from the source image.

In contrast, our EmoEditor, which requires only the target emotion and the source image as inputs, can generate highly creative images that evoke the target emotion while striving to maintain scene structure and semantic coherence. For example, in Row 1, Column 7, it changes the woman’s pose while retaining the main scene and character from the source image, generating a cartoon-like image that aligns with amusement. In Row 2, Column 7, our EmoEditor generates a technological-looking result that evokes awe while maintaining most of the structure of the source image.

Figure S10: Visualization of Generated Images from Different Methods. The target emotion is highlighted in red and the source image is framed with green.

S3.3 Ablation Study of Different Iterations

Due to the abstract and complex nature of emotions, we propose an iterative emotion inference approach. The maximum number of iterations significantly impacts the quality of images generated by the model. To assess this effect, we vary the maximum number of iterations from 1 to 30 and evaluate the results using four metrics.
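One way to picture iterative inference with a cap on the number of iterations is the loop below. It is only a sketch under stated assumptions: `edit_step` and `predict_emotion` are hypothetical stand-ins for the editing model and an emotion predictor, each pass is assumed to refine the previous output, and stopping early once the predicted emotion matches the target is an assumption rather than a documented detail of EmoEditor.

```python
def iterative_edit(source, target_emotion, edit_step, predict_emotion, max_iters):
    """Apply edit_step repeatedly until the predicted emotion matches the target.

    Returns (image, iterations_used, matched). All callables are hypothetical stand-ins.
    """
    if max_iters < 1:
        raise ValueError("max_iters must be at least 1")
    image = source
    for iteration in range(1, max_iters + 1):
        image = edit_step(image, target_emotion)
        if predict_emotion(image) == target_emotion:
            return image, iteration, True  # assumed early stop on a match
    return image, max_iters, False


# Toy stand-ins: an "image" is an integer, and each edit nudges it by one.
def toy_edit(image, target_emotion):
    return image + 1


def toy_predict(image):
    return "awe" if image >= 3 else "anger"


result_long = iterative_edit(0, "awe", toy_edit, toy_predict, max_iters=5)
result_short = iterative_edit(0, "awe", toy_edit, toy_predict, max_iters=2)
print(result_long, result_short)
```

In the toy run, a budget of five iterations reaches the target on the third pass, while a budget of two stops without a match, which mirrors why the iteration cap affects how often the target emotion is reached.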

Figure S11 presents our experimental findings. The Emotion Matching Ratio (EMR) shows a steady increase from 0% to approximately 50% over 30 iterations, indicating continuous improvement in the model’s ability to generate images that evoke the target emotions.
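As a rough illustration of how a matching ratio of this kind could be computed, the sketch below assumes that EMR is the percentage of generated images whose predicted emotion label equals the target label. That definition is an assumption made for illustration; the function name and the toy labels are hypothetical.

```python
def emotion_matching_ratio(predicted, targets):
    """Percentage of generated images whose predicted emotion equals the target.

    Assumed definition for illustration only; see the lead-in text.
    """
    if len(predicted) != len(targets):
        raise ValueError("predicted and targets must have the same length")
    if not targets:
        raise ValueError("at least one image is required")
    matches = sum(p == t for p, t in zip(predicted, targets))
    return 100.0 * matches / len(targets)


# Hypothetical predicted labels for four generated images and their targets.
predicted_labels = ["awe", "sadness", "amusement", "anger"]
target_labels = ["awe", "amusement", "amusement", "awe"]
emr = emotion_matching_ratio(predicted_labels, target_labels)
print(f"EMR = {emr:.1f}%")
```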

The Emotion Similarity Ratio (ESR) rises rapidly in the initial iterations, stabilizing around 90%. This suggests that the model quickly becomes proficient at producing images that closely match the target emotional content, with minimal improvements after the initial iterations.

The Emotion-Neutral Region Deviation (ENRD) also increases rapidly at first, plateauing around 22.5. This pattern indicates that the deviation in neutral regions of the generated images becomes consistent after a few iterations, reflecting the model’s stability in preserving non-emotional aspects.

The Edge Structure Similarity (ESS) follows a similar trend, rising quickly to around 16 and then stabilizing. This stability in ESS reflects the model’s ability to maintain structural coherence and semantic consistency between the source and generated images after the initial iterations.

In summary, these metrics demonstrate that the iterative process significantly improves the model’s performance in the initial iterations: ESR, ENRD, and ESS stabilize after the first few iterations, while EMR continues to rise steadily over all 30 iterations.

Figure S11: Performance Metrics Across Different Iterations.

S3.4 Limitations

Figure S12 shows some failure cases. Our EmoEditor has difficulty handling fine details of individuals, especially when their faces are small in the image. For example, in Row 1, although our EmoEditor attempts to make the boy appear sad, the boy’s face appears twisted and distorted.

Moreover, significant disparities between the source image and the target emotion make it difficult to generate results that align with the target emotion while preserving the structure of the source image. For example, in Row 2, the source image features a tiger, evoking anger. When our EmoEditor attempts to transform it to evoke awe, the image’s structure is retained, but the image is altered to depict steep coastal rocks, which changes its semantic content.

Figure S12: Failure Cases. The target emotion is highlighted in red and the source image is framed with green.