EmoLLM: Multimodal Emotional Understanding Meets Large Language Models

Qu Yang, Mang Ye (Corresponding Author), Bo Du
School of Computer Science, Wuhan University, Wuhan, China.
{yangqu, yemang, dubo}@whu.edu.cn
https://github.com/yan9qu/EmoLLM

Abstract

Multi-modal large language models (MLLMs) have achieved remarkable performance on objective multimodal perception tasks, but their ability to interpret subjective, emotionally nuanced multimodal content remains largely unexplored. This gap impedes their ability to understand and react to the intricate emotions that humans express through multimodal media. To bridge it, we introduce EmoBench, the first comprehensive benchmark designed specifically to evaluate the emotional capabilities of MLLMs across five popular emotional tasks, using a diverse dataset of ~287k images and videos paired with corresponding textual instructions. We also propose EmoLLM, a novel model for multimodal emotional understanding that incorporates two core techniques: 1) Multi-perspective Visual Projection, which captures diverse emotional cues from visual data from multiple perspectives; and 2) EmoPrompt, which guides MLLMs to reason about emotions in the correct direction. Experimental results demonstrate that EmoLLM significantly elevates multimodal emotional understanding performance, with an average improvement of 12.1% across multiple foundation models on EmoBench. Our work contributes to the advancement of MLLMs by facilitating a deeper and more nuanced comprehension of intricate human emotions, paving the way for the development of artificial emotional intelligence capabilities with wide-ranging applications in areas such as human-computer interaction, mental health support, and empathetic AI systems. Code, data, and model will be released.

1 Introduction

Do androids dream of electric sheep? This thought-provoking question from Philip K. Dick’s seminal novel underscores a fundamental divide between artificial intelligence and humanity – the capacity for genuine emotion. In our modern era, Multimodal Large Language Models (MLLMs) [1; 2; 3; 4; 5; 6] have achieved remarkable performance, even surpassing human capabilities in domains such as perception and cognition. However, when it comes to the realm of emotions, state-of-the-art MLLMs appear to be lacking in their ability to accurately interpret and respond to emotional cues. While existing MLLMs can generate basic responses to human queries regarding emotional aspects, the accuracy of their responses remains unsatisfactory, especially in nuanced categories such as fear and anger (Fig. 1(b)). Moreover, even LLMs that have been employed for text-based emotional analysis often fall short when confronted with the complexities of multimodal emotional tasks, which require the integration of visual, auditory, and textual cues. A primary factor contributing to this limitation is the scarcity of comprehensive emotional datasets for training MLLMs, as publicly available datasets generally focus on objective visual abilities [7]. This gap not only mirrors the philosophical questions raised by Dick’s narrative but also motivates us to explore the vast, uncharted territories of emotional intelligence within MLLMs.

To bridge this gap, we propose EmoBench, a comprehensive benchmark designed to serve two critical functions: providing a rich source of training materials to enhance the performance of MLLMs and evaluating their emotional understanding capabilities. EmoBench encompasses a diverse range of tasks (Fig. 1), which we categorize into Universal Emotional Tasks and Emotional Application Tasks. Universal Emotional Tasks include multimodal emotion recognition and intent understanding, both represented as a classification paradigm. Emotional Application Tasks, on the other hand, focus on specific challenges in social media applications, such as Hate, Sarcasm, and Humor Detection. To construct EmoBench, we first collected a diverse dataset for each subtask, as illustrated in Tab. 1. Subsequently, we employed GPT-4 [1] to generate a wide array of question templates for each subtask, ultimately compiling a dataset of approximately 287,000 multimodal instructions. By offering a large-scale, diverse, and carefully curated dataset, EmoBench enables rigorous enhancement and evaluation of the emotional understanding capabilities of MLLMs.

Figure 1: Qualitative (a) and quantitative (b) comparison of EmoLLM with GPT4-Vision and other SOTA MLLMs. EmoLLM outperforms other models, particularly in recognizing nuanced emotions such as anger and sadness. (c) Overview of the diverse tasks in EmoBench, including universal emotional tasks and emotional application tasks (hate, sarcasm, and humor detection).

With our proposed EmoBench, existing MLLMs can be empowered with better emotional understanding capabilities with downstream fine-tuning. Current MLLMs typically follow a two-step process: modality projection and LLM reasoning. However, these models still struggle to effectively capture and reason about the complex and nuanced emotions present in multimodal data. To address this challenge, we propose EmoLLM, a novel model that incorporates two key techniques: Multi-perspective Visual Projection and EmoPrompt. Multi-perspective Visual Projection captures diverse emotional cues by considering multiple viewpoints. Specifically, we use the features of objects in different feature maps as content information and construct the objects and their relationships as graph-based relational information. By jointly mining these two aspects of information, we can extract features that are more suitable for emotional tasks.

In the reasoning stage, Chain-of-Thought (CoT) [8] is a common and effective method. Inspired by CoT, we first let EmoLLM observe objects in multimedia data and then infer emotions based on these observations. However, a significant problem arises, i.e., the correctness of the first-stage observations determines the accuracy of the final inference. To mitigate this issue, EmoPrompt incorporates specific examples stored for the current task. To ensure the correctness of these examples, we present GPT-4V with data samples and ground truth labels to obtain an accurate CoT process. Whenever a prompt is required, EmoPrompt selects one such example to guide the reasoning process.

We summarize our contributions as follows:

• We introduce EmoBench, a comprehensive benchmark designed to enhance and evaluate the emotional understanding capabilities of MLLMs across a diverse range of tasks, providing a large-scale dataset of ~287k instructions.

• We propose EmoLLM, which incorporates Multi-perspective Visual Projection to capture diverse emotional cues and EmoPrompt to guide the reasoning process.

• We conduct extensive experiments on the EmoBench benchmark, demonstrating that EmoLLM achieves substantial improvements over baseline models, with an average improvement of 12.1% across multiple foundation models.

2 Related Works

2.1 Multi-modality Emotional Tasks and Methods

Multimodal emotion recognition, which analyzes feelings through speech, text, and visual cues, has been a growing area of research. Early datasets like IEMOCAP [9] provide vital audiovisual interaction data but are limited by their focus on scripted events and lack of speaker diversity. Subsequent datasets, such as CMU-MOSEI [10] and MELD [11], address these limitations by offering more naturalistic expressions from videos and television shows. Emotic [12; 13] and GoEmotions [14] further expand the scope of resources for emotion recognition. In the related field of intention understanding, contemporary datasets like CLINC150 [15], HWU64 [16], Intentonomy [17], Snips [18], MDID [19], MSED [20], and BANKING77 [21] are derived from a diverse range of sources, including online forums and social media to explore the user intent [22]. The MIntRec dataset [23] takes a unique approach by utilizing TV series clips to capture the complex intentions portrayed by actors.

Building upon these datasets, numerous methods [24; 25; 26] have been proposed to advance the field of multimodal emotion recognition. Lee et al. [27] introduce the Multimodal Transformer (MulT), which employs the vanilla Transformer [28] architecture and directional cross-modal attention to learn effective multimodal language representations. Hazarika et al. [29] propose Modality-Invariant and -Specific Representations (MISA), which differentiates modality features into invariant and specific subspaces to aid in fusion and prediction. Yang et al. [30] introduce MFSA, a transformer-based model that leverages adversarial learning to create modality-specific and -agnostic representations for sentiment recognition. Recently, Zhang et al. [31] attempted to use GPT [1] to convert multimodal emotion tasks into text emotion recognition. However, this approach relies on pre-processing by an MLLM and is not suitable for practical applications.

2.2 Multi-modality Large Language Models

Large language models (LLMs), such as GPT-4 [1], Gemini-Pro [32], and LLaVA [33], have demonstrated remarkable language abilities in capturing general knowledge. By incorporating visual and audio inputs into LLMs using techniques like CLIP [34] and additional adapting modules [35; 36], multi-modality large language models (MLLMs) [37; 38; 39] have been developed to tackle a variety of multi-modal tasks. These tasks include image captioning [40; 41], visual question answering (VQA) [42; 43], and other language-related capabilities [44]. However, as revealed by our previous research (Fig. 1 b), the emotional understanding abilities of MLLMs remain unsatisfactory, particularly when dealing with complex emotions such as anger and fear, or emotional categories that require reasoning. We attribute this limitation primarily to the lack of relevant data and specialized models. To address this issue, we introduce EmoBench, the first emotional instruction tuning dataset designed to enhance the emotional understanding capabilities of various MLLMs and enable them to better navigate the realm of emotional comprehension.

Table 1: The statistics of various data sources in EmoBench.

Category	Sub-task	Dataset	Text	Image	Video	Audio	Train (k)	Val & Test (k)
Universal Emotional Tasks	Emotion	Emotic [12; 13]	✗	✓	✗	✗	16.2	6.4
Universal Emotional Tasks	Emotion	CAER-S [45]	✗	✓	✗	✗	42.0	21.0
Universal Emotional Tasks	Emotion	MELD [11]	✗	✗	✓	✓	11.1	2.6
Universal Emotional Tasks	Emotion	Emotion_6 [46]	✗	✓	✗	✗	1.3	0.6
Universal Emotional Tasks	Intention	MIntRec [23]	✗	✓	✗	✗	1.7	0.4
Emotional Application Tasks	Humor	SMILE [47]	✗	✗	✓	✓	8.6	1.0
Emotional Application Tasks	Hate	MMHS [48]	✓	✓	✗	✗	139.8	10
Emotional Application Tasks	Sarcasm	MMSD [49]	✓	✗	✓	✓	22.2	2.4

3 EmoBench

As a cornerstone of emotional tasks, we introduce EmoBench, a pioneering large-scale dataset comprised of conversations focused on emotional dimensions. Initially, we explore the rationale and provide a detailed definition of the tasks associated with EmoBench in Sec. 3.1. Following the task definition, we ensure a diverse and balanced representation of emotional content by sub-sampling from eight distinct datasets. We organize these samples into conversations using generative models and automated scripts, as detailed in Sec. 3.2.

3.1 Data Preparation and Task Definition

The data in EmoBench is sourced from various emotion [12; 13; 11; 45; 46] and intention [23; 47; 48; 49] datasets. As outlined in Tab. 1, we define two major categories of tasks: universal emotional tasks and emotional application tasks. For the former, which involves multi-modal emotion and intention understanding, we select well-known, data-rich works [12; 45; 11; 46; 23] from the community. From a classification perspective, LLMs are expected to choose the label that best matches the data content. However, considering real-world applications where a predefined label list may not be available, we also explore open-set understanding, where LLMs directly provide the predicted category without a predefined label list. For emotional application tasks, we identify sub-tasks with significant applications in the industry, particularly those frequently encountered or crucial in social media, such as humor [47], hate [48], and sarcasm [49] detection.

3.2 Instruction Construction

With the assistance of LLMs, data annotation has become increasingly streamlined. We adopt a similar approach and utilize a GPT-participated pipeline to establish a paradigm akin to visual (multimodal) question answering. For the Universal Emotion Tasks outlined in Sec. 3.1, we manually create a question template, e.g., "Question: Question_base + [LABEL_SET]. <DATA> Answer: [LABEL]". The Question_base is derived from the diverse questions generated by GPT, such as: “Identify the only emotion depicted in the given image from the following options”; [LABEL_SET] represents the label set of the current subtask, such as [anger, disgust, fear, joy, sadness, surprise] in the emotion recognition task; <DATA> is the multi-modal data placeholder; and [LABEL] corresponds to the ground-truth label in the original sub-task dataset, reflecting the emotion category of the multi-modal data. For Emotional Application Subtasks, we modify the question format to a binary choice, e.g., “Does the given multi-modal data contain sarcasm? Please answer Yes or No”.
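The template above can be illustrated with a short sketch. This is only a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the `QUESTION_BASES` list (its second entry in particular), and the returned dictionary layout are hypothetical.

```python
import random

# Question bases: the first is quoted from the text, the second is a
# hypothetical paraphrase standing in for other GPT-generated variants.
QUESTION_BASES = [
    "Identify the only emotion depicted in the given image from the following options",
    "Which of the following emotions best describes the given image",
]

DATA_PLACEHOLDER = "<DATA>"  # stands in for the image, video, or audio input


def build_closed_set_instruction(label_set, label, rng):
    """Fill 'Question: Question_base + [LABEL_SET]. <DATA> Answer: [LABEL]'."""
    if label not in label_set:
        raise ValueError(f"ground-truth label {label!r} is not in the label set")
    question_base = rng.choice(QUESTION_BASES)
    options = ", ".join(label_set)
    question = f"Question: {question_base}: [{options}]. {DATA_PLACEHOLDER}"
    return {"question": question, "answer": f"Answer: {label}"}


def build_binary_instruction(phenomenon, is_positive):
    """Binary (Yes/No) variant for the application sub-tasks."""
    question = (
        f"Question: Does the given multi-modal data contain {phenomenon}? "
        f"Please answer Yes or No. {DATA_PLACEHOLDER}"
    )
    answer = "Answer: " + ("Yes" if is_positive else "No")
    return {"question": question, "answer": answer}


emotions = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]
sample = build_closed_set_instruction(emotions, "joy", random.Random(0))
binary = build_binary_instruction("sarcasm", False)
print(sample["question"])
print(sample["answer"])
print(binary["question"])
print(binary["answer"])
```

For the open-set setting described in Sec. 3.1, the same helper could simply omit the label list from the question; that variant is not shown here.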

4 Methodology

In this section, we provide a detailed overview of EmoLLM. We begin by describing the architecture of the model, then delve into each component of EmoLLM.

4.1 Model Overview

We present an overview of EmoLLM in this section, as shown in Fig. 3. There are three major modules in EmoLLM as follows:

Modality Encoding: To incorporate additional modalities such as visual and audio data, we integrate extra modality encoders into EmoLLM. This enhancement enables our model to effectively handle multiple modalities.

Multi-perspective Visual Projection: To effectively capture diverse emotional cues from visual data, we propose the MVP module. Unlike traditional methods that rely on a single perspective, MVP employs a multifaceted approach, analyzing visual data from multiple viewpoints. By constructing a graph-based representation of the relationships between object features, MVP enables EmoLLM to extract a rich set of emotionally relevant features.

EmoPrompt Reasoning: EmoPrompt leverages the capabilities of GPT-4V [1] to generate accurate and contextually appropriate prompts. By providing GPT-4V with carefully curated data samples and their corresponding ground truth labels, EmoPrompt facilitates a reliable Chain-of-Thought (CoT) process. This CoT process serves as a blueprint for EmoLLM’s reasoning, ensuring that it stays on track and arrives at emotionally coherent conclusions.

4.2 Modality Encoding

We design the corresponding modal encoding module for common modalities in emotional tasks, including the following three parts:

Visual Modality Encoding: To encode visual information, including images and video frames, we employ the CLIP ViT-L/14 model proposed by Radford et al. [34]. CLIP learns directly from raw text paired with images, which lets it exploit a much broader source of supervision. The details of the visual encoding process are described in Sec. 4.3.

Audio Modality Encoding: For encoding audio signals and extracting meaningful representations from audio data, we utilize the WHISPER-BASE model introduced by Radford et al. [36]. WHISPER is a multilingual speech recognition model trained on a vast audio dataset with weak supervision, making it well-suited for capturing rich information from audio inputs.

Textual Modality Encoding: Large Language Models (LLMs) are typically pre-trained on massive text corpora, enabling instruction-tuned LLMs to effectively process textual information. In EmoLLM, we use LLaMA2-7B [2] as the foundation model for textual modality encoding, leveraging its strong language understanding capabilities.

Given a video $\boldsymbol{x}_{v}\in\mathbb{R}^{L_{v}\times d_{v}}$, an image $\boldsymbol{x}_{i}\in\mathbb{R}^{L_{i}\times d_{i}}$, an audio signal $\boldsymbol{x}_{a}\in\mathbb{R}^{L_{a}\times d_{a}}$, and a user input text $\boldsymbol{x}_{t}\in\mathbb{R}^{L_{t}\times d_{t}}$, we employ pre-trained models to encode the multimodal features. Specifically, we use the Multi-perspective Visual Projection (MVP) module to encode the visual features. For the audio signal, we first apply the WHISPER model and then use a multilayer perceptron (MLP) to transform its output into the desired dimension. The encoding process can be formulated as follows:

$$\boldsymbol{h}_{i}=\operatorname{MVP}\left(\boldsymbol{x}_{i}\right),\quad \boldsymbol{h}_{v}=\operatorname{MVP}\left(\boldsymbol{x}_{v}\right),\quad \boldsymbol{h}_{a}=\operatorname{MLP}\left(\operatorname{WHISPER}\left(\boldsymbol{x}_{a}\right)\right), \tag{1}$$

where $\boldsymbol{h}_{i}\in\mathbb{R}^{L_{i}\times d_{h}}$, $\boldsymbol{h}_{v}\in\mathbb{R}^{L_{v}\times d_{h}}$, and $\boldsymbol{h}_{a}\in\mathbb{R}^{L_{a}\times d_{h}}$ denote the encoded image, video, and audio features, respectively. The dimension of the modality-specific features is represented by $d_{h}$.

4.3 Multi-perspective Visual Projection

In this section, we introduce Multi-perspective Visual Projection (MVP), which is designed for emotional tasks. We consider two important aspects of multimodal emotional tasks: (1) mining objective object information in multimodal data, which we call the content-based perspective, and (2) observing the connections and relationships between objects, which we refer to as the relation-based perspective. We believe that MLLMs should consider both perspectives to deepen their understanding of the emotional factors highlighted in the data.

Given an input image (or a video frame) $\boldsymbol{x}_{i}$, we adopt the vision encoder of CLIP [34] to extract the original visual tokens $\boldsymbol{Z}=\left\{z_{i}\right\}_{i=1}^{L}$, where $L$ is the number of visual tokens. Following ChatUniVi [37], we then utilize DPC-KNN [50], a k-nearest neighbor-based density peaks clustering algorithm, to cluster the visual tokens and obtain the content-based representation. The local density $\rho_{i}$ and distance index $\delta_{i}$ of each token $z_{i}$ are computed as follows:

$$\rho_{i}=\exp\Big(-\frac{1}{K}\sum_{z_{k}\in\textrm{KNN}(z_{i},\boldsymbol{Z})}\|z_{k}-z_{i}\|^{2}\Big),\qquad \delta_{i}=\begin{cases}\min\limits_{j:\rho_{j}>\rho_{i}}\|z_{j}-z_{i}\|^{2},&\text{if }\exists j\text{ s.t. }\rho_{j}>\rho_{i},\\ \max\limits_{j}\|z_{j}-z_{i}\|^{2},&\text{otherwise,}\end{cases} \tag{2}$$

where $\textrm{KNN}(z_{i},\boldsymbol{Z})$ denotes the K-nearest neighbors of $z_{i}$ in $\boldsymbol{Z}$ after removing $z_{i}$. Tokens with relatively high $\rho_{i}\times\delta_{i}$ are identified as cluster centers, and other tokens are allocated to their nearest cluster center based on Euclidean distances. The mean of the tokens in each cluster, $z^{\prime}_{i}$, represents that cluster.
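The clustering step of Eq. 2 can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation: the function name `dpc_knn`, the fixed number of clusters, and the toy 2-D tokens are assumptions made for readability, and a practical version would operate on batched tensors.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dpc_knn(tokens, num_clusters, k):
    """Cluster tokens with DPC-KNN; returns (center indices, assignment, means)."""
    n = len(tokens)
    dist = [[sq_dist(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]

    # Local density rho_i from the K nearest neighbours (token itself excluded).
    rho = []
    for i in range(n):
        nearest = sorted(dist[i][j] for j in range(n) if j != i)[:k]
        rho.append(math.exp(-sum(nearest) / k))

    # Distance index delta_i: distance to the closest denser token,
    # or the largest distance if no denser token exists.
    delta = []
    for i in range(n):
        denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(denser) if denser else max(dist[i]))

    # Tokens with the highest rho * delta become cluster centers.
    order = sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)
    centers = order[:num_clusters]

    # Every token joins its nearest center.
    assign = [
        min(range(num_clusters), key=lambda c: dist[i][centers[c]])
        for i in range(n)
    ]

    # The mean token of each cluster represents it (fall back to the center
    # itself in the degenerate case of an empty cluster).
    dim = len(tokens[0])
    means = []
    for c in range(num_clusters):
        members = [tokens[i] for i in range(n) if assign[i] == c]
        members = members or [tokens[centers[c]]]
        means.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
    return centers, assign, means


# Toy tokens: a tight group near the origin and a looser group near (4, 4).
toy_tokens = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [4.0, 4.0], [4.2, 4.0], [4.0, 4.2]]
centers, assign, means = dpc_knn(toy_tokens, num_clusters=2, k=2)
print(centers, assign)
```

On this toy input the densest token of each group is picked as a center, and the remaining tokens attach to the center of their own group.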

To obtain the relation-based representation, we construct a graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ using the cluster centers. Each cluster center $z^{\prime}_{i}$ becomes a node $v_{i}\in\mathcal{V}$, with the feature of each cluster center used as the node’s value. To determine the edge weights, we first calculate the Euclidean distance between all cluster centers:

$$d_{ij}=\|z^{\prime}_{i}-z^{\prime}_{j}\|_{2}. \tag{3}$$

We then normalize the distances to the range [0, 1] using min-max normalization:

$$\tilde{d}_{ij}=\frac{d_{ij}-\min_{i,j}(d_{ij})}{\max_{i,j}(d_{ij})-\min_{i,j}(d_{ij})}. \tag{4}$$

To determine the adjacency matrix $\boldsymbol{A}$, we set a threshold $\tau$ and consider nodes $i$ and $j$ as adjacent if their normalized distance $\tilde{d}_{ij}$ is less than or equal to $\tau$:

$$\boldsymbol{A}_{ij}=\begin{cases}1,&\text{if }\tilde{d}_{ij}\leq\tau,\\ 0,&\text{otherwise.}\end{cases} \tag{5}$$
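The graph construction in Eqs. 3 to 5 can be sketched as below. This is a hedged illustration only: `build_adjacency` is a hypothetical helper name, the three 2-D cluster centers are made up, and the threshold value 0.2 is chosen for the toy data rather than taken from the experiments.

```python
import math


def build_adjacency(centers, tau):
    """Threshold min-max normalised Euclidean distances between cluster centers."""
    n = len(centers)
    # Eq. 3: pairwise Euclidean distances.
    dist = [[math.dist(centers[i], centers[j]) for j in range(n)] for i in range(n)]
    flat = [d for row in dist for d in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    adjacency = []
    for i in range(n):
        row = []
        for j in range(n):
            # Eq. 4: min-max normalisation (guarding against identical centers).
            norm = (dist[i][j] - lo) / span if span > 0 else 0.0
            # Eq. 5: connect nodes whose normalised distance is at most tau.
            row.append(1 if norm <= tau else 0)
        adjacency.append(row)
    return adjacency


toy_centers = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
adj = build_adjacency(toy_centers, tau=0.2)
print(adj)  # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
```

Because the minimum in Eq. 4 is taken over all pairs, including $i=j$, the diagonal entries have normalized distance 0 and are always set to 1 in this sketch.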

We apply a multi-layer graph convolutional network (GCN) [51] to the constructed graph. The graph convolution operation at layer $l$ can be formulated as:

$$\boldsymbol{H}^{(l+1)}=\sigma\left(\hat{\boldsymbol{D}}^{-\frac{1}{2}}\hat{\boldsymbol{A}}\hat{\boldsymbol{D}}^{-\frac{1}{2}}\boldsymbol{H}^{(l)}\boldsymbol{W}^{(l)}\right), \tag{6}$$

where $\hat{\boldsymbol{A}}=\boldsymbol{A}+\boldsymbol{I}$ is the adjacency matrix with added self-connections, $\hat{\boldsymbol{D}}$ is the degree matrix of $\hat{\boldsymbol{A}}$, $\boldsymbol{H}^{(l)}$ is the feature matrix at layer $l$, $\boldsymbol{W}^{(l)}$ is the trainable weight matrix at layer $l$, and $\sigma$ is the activation function. The output of the last layer $\boldsymbol{H}^{(m)}$ serves as the relation-based representation, where $m$ is the number of layers.
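A single propagation step of Eq. 6 can be written out in a few lines. The sketch below is illustrative only: it assumes ReLU for the activation $\sigma$ (the text does not name one), uses plain Python lists instead of a tensor library, and the helper names `matmul` and `gcn_layer` are hypothetical.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def gcn_layer(adjacency, features, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adjacency)
    # A_hat = A + I adds self-connections.
    a_hat = [
        [adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    # D_hat is the diagonal degree matrix of A_hat; only D^-1/2 is needed.
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_hat]
    norm = [
        [inv_sqrt_deg[i] * a_hat[i][j] * inv_sqrt_deg[j] for j in range(n)]
        for i in range(n)
    ]
    out = matmul(matmul(norm, features), weight)
    # ReLU activation (an assumption; the activation is left generic in Eq. 6).
    return [[max(0.0, v) for v in row] for row in out]


# Two connected nodes with one-hot features and an identity weight matrix.
A = [[0, 1], [1, 0]]
H0 = [[1.0, 0.0], [0.0, 1.0]]
W0 = [[1.0, 0.0], [0.0, 1.0]]
H1 = gcn_layer(A, H0, W0)
print(H1)
```

With two mutually connected nodes, each output row is the average of both input rows, so every entry of `H1` is approximately 0.5. Stacking several such layers gives the multi-layer network described above.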

For a video with the $m$-th frame $\boldsymbol{Z}^{m}=\{z_{i}^{m}\}_{i=1}^{L}$, following [37], we apply mean-pooling over all tokens to obtain the frame-level representation $f^{m}$:

$$f^{m}=\frac{1}{L}\sum_{i=1}^{L}z_{i}^{m}. \tag{7}$$
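As a small worked instance of Eq. 7, the sketch below mean-pools the tokens of each frame of a toy video. The function name and the two-frame, two-token input are assumptions made purely for illustration.

```python
def frame_representation(frame_tokens):
    """Mean-pool the L visual tokens of one frame into a single vector f^m."""
    num_tokens = len(frame_tokens)
    dim = len(frame_tokens[0])
    return [sum(tok[d] for tok in frame_tokens) / num_tokens for d in range(dim)]


toy_video = [
    [[1.0, 2.0], [3.0, 4.0]],  # frame 1: two tokens of dimension 2
    [[0.0, 0.0], [2.0, 2.0]],  # frame 2
]
frame_level = [frame_representation(frame) for frame in toy_video]
print(frame_level)  # [[2.0, 3.0], [1.0, 1.0]]
```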

We then use DPC-KNN [50; 37] to cluster the frames and identify critical events. The set of visual tokens within the $n$-th event $\boldsymbol{F}_{n}$ is denoted as $\tilde{\boldsymbol{Z}}_{n}=\{z_{i}^{m}\mid m\in\boldsymbol{F}_{n},\ i\in\{1,2,\ldots,L\}\}$. To let the visual tokens span the frames within each event, we adapt the local density and distance index calculations of Eq. 2 to the tokens of each event. The expanded visual tokens are concatenated in event order to preserve temporal information. To provide multi-scale visual features, we adopt a three-step aggregation process for each input image or video. The outputs from each merging step are concatenated and transformed using a trainable projection matrix $\boldsymbol{W}$ to obtain the content-based representation $\boldsymbol{R}_{content}$. The relation-based representation $\boldsymbol{R}_{relation}$ is obtained by aggregating the GCN output $\boldsymbol{H}^{(m)}$ of each stage. The final feature representation $\boldsymbol{h}_{i}$ combines the content-based and relation-based representations, with a coefficient $\alpha$ weighting the content-based term:

$$\boldsymbol{h}_{i}=(\alpha\times\boldsymbol{R}_{content})\oplus\boldsymbol{R}_{relation}. \tag{8}$$

By integrating content-based and relation-based representations, MVP aims to enhance the ability of the model to reason about the relationships between visual elements and improve its performance on downstream emotional tasks. The resulting feature representation provides a comprehensive understanding of the visual input, incorporating both local and global relationships.

4.4 EmoPrompt Reasoning

Chain-of-Thought (CoT) [8] is a popular and efficient technique for enhancing the reasoning power of LLMs without fine-tuning. It involves adding step-by-step reasoning instructions to the user’s prompt, guiding the LLM through a logical thought process. Given the delicate and unintuitive nature of emotional tasks, this kind of reasoning is crucial for accurate emotion understanding.

For emotional tasks, we first design a task-specific CoT as a baseline. Drawing inspiration from how humans identify emotions in images and videos, we observe that people often focus first on the content, such as facial expressions, atmosphere, and other visual cues. Accordingly, we guide the MLLM to reason about the objective content in the data first, and then to reason about the emotional task based on that conclusion combined with the data. The advantage of this approach is that it directs what the LLM observes, leading to more robust reasoning.

However, this step-by-step thinking heavily depends on the observations made in the first step. If the LLM hallucinates or generates inaccurate observations during the initial stage, it can greatly affect the judgment of the emotional task. To address this issue, we propose EmoPrompt, which aims to provide correct guidance for the reasoning process.

To achieve this goal, we first collect data on a subset of emotional tasks along with their corresponding ground truth labels. By presenting both the “question” (emotion data) and “answer” (ground truth label) to GPT-4V, we obtain objective-to-subjective reasoning in the correct direction, as shown in Fig. 4. This ensures the correctness of the step-by-step reasoning process. Using this methodology, we collect hundreds of examples of reasoning for each emotional task. These examples serve as demonstrations of correct reasoning during the EmoLLM reasoning process.

By incorporating EmoPrompt, we guide EmoLLM to follow a correct reasoning path, mitigating the impact of potential hallucinations or inaccuracies in the initial observation stage. This approach enhances the ability of LLMs to accurately understand and interpret emotions in multimodal data.
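The prompting flow described in this section can be sketched as follows. Everything concrete here is an assumption for illustration: the store layout, the example text inside it, the random selection rule, and the wording of the final instruction are hypothetical, since the text only states that one stored example is selected to guide the reasoning.

```python
import random

# Hypothetical store of verified reasoning examples, keyed by task. In the
# paper such examples come from showing GPT-4V a sample with its ground truth.
EMOPROMPT_STORE = {
    "emotion": [
        {
            "observation": "A child is laughing while opening a gift.",
            "reasoning": "Laughter and a gift suggest a pleasant surprise.",
            "answer": "joy",
        },
    ],
}


def build_prompt(task, question, rng):
    """Prepend one stored demonstration, then ask for observe-then-infer reasoning."""
    demo = rng.choice(EMOPROMPT_STORE[task])
    return (
        "Example:\n"
        f"Observation: {demo['observation']}\n"
        f"Reasoning: {demo['reasoning']}\n"
        f"Answer: {demo['answer']}\n\n"
        "Now first describe the objective content of the data, "
        "then infer the answer from that description.\n"
        f"{question}"
    )


question_text = "Question: Identify the emotion in the given image. <DATA>"
prompt = build_prompt("emotion", question_text, random.Random(0))
print(prompt)
```

The demonstration shows the model a complete objective-to-subjective chain before it produces its own observation, which is the mechanism the section relies on to keep the first reasoning step on track.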

Table 2: Comparison of the emotional ability of baseline MLLMs and our EmoLLM on the EmoBench test set (30K).

Methods	Emo-C	Emo-O	Intention	Hate	Humor	Sarcasm	Overall
Vicuna [52] zero-shot	29.21	21.55	17.48	45.39	49.68	55.23	28.63
ChatUniVi [37] fine-tune	47.62	39.26	57.85	63.03	63.85	77.87	46.66
MacawLLM [38] fine-tune	42.42	31.05	52.91	57.54	55.60	71.75	40.28
OneLLM [39] fine-tune	51.16	40.30	56.95	59.01	60.89	73.93	48.20
EmoLLM fine-tune	64.06	52.58	73.99	67.43	75.69	86.67	60.36

5 Experiments

5.1 Experimental Setup

We adopt CLIP (ViT-L/14) [34] and WHISPER [36] as the visual and acoustic encoders, respectively. For the language foundation model, we choose Vicuna-v1.5 [52] with 7B parameters. During the emotional fine-tuning stage, we use the data from EmoBench. EmoLLM is trained for 5 epochs with a batch size of 16, using the AdamW [53; 54] optimizer with a cosine learning rate schedule, a learning rate of 2e-5, and a warmup ratio of 0.03. All input images and frames are resized to 224 $\times$ 224. With LoRA [55], one training epoch takes approximately 5 hours on 4 $\times$ RTX 4090 GPUs. Hyperparameters are selected on the validation set, and final results are reported on the test set. Each result is the average of three runs with different random seeds.
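To make the schedule concrete, the sketch below computes a learning rate per step from the values stated above (peak 2e-5, warmup fraction 0.03). The linear warmup shape, the decay to zero, and the function name are assumptions; the setup only specifies a cosine schedule with warmup.

```python
import math

PEAK_LR = 2e-5       # learning rate stated in the setup
WARMUP_RATIO = 0.03  # warmup fraction stated in the setup


def learning_rate(step, total_steps):
    """Linear warmup to PEAK_LR, then cosine decay to zero (both assumed)."""
    warmup_steps = max(1, round(total_steps * WARMUP_RATIO))
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for s in (0, 29, 30, 515, 1000):
    print(s, learning_rate(s, total))
```

With 1000 total steps, the first 30 steps ramp up linearly, the rate peaks at 2e-5, falls to about half of the peak midway through the decay phase, and reaches zero at the final step.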

Table 3: Comparison of the emotional ability between SOTA MLLMs and EmoLLM.

Method	#Param	Emo-C	Emo-O
GPT-4V	$\sim{10}^{12}$	57.90	45.10
Gemini1.0	$\sim{10}^{11}$	45.47	44.83
Gemini1.5	$\sim{10}^{11}$	45.47	44.83
EmoLLM	$\sim{10}^{10}$	75.03	67.14

5.2 Main Results

To quantitatively measure the emotional capability of EmoLLM, we evaluate its performance on six sub-tasks from EmoBench, including close-set and open-set emotion classification, intention recognition, and three special emotional application tasks. As shown in Tab. 2, EmoLLM achieves superior performance compared to baselines with the same 7B parameter scale, demonstrating the effectiveness of our proposed approach.

We also compare EmoLLM with state-of-the-art MLLMs on an emotion sub-test set. Because some MLLMs do not support video and audio, we use an image-only subset of the EmoBench test set, which contains 6 emotion categories with hundreds of images in each. As presented in Tab. 3, EmoLLM outperforms GPT-4V, Gemini-1.0, and Gemini-1.5 on both close-set (Emo-C) and open-set (Emo-O) emotion classification tasks while maintaining a smaller parameter count.

5.3 Ablation Studies

We conduct ablation studies to explore the key design choices in EmoLLM. All experiments are conducted on the Emo-C part of the EmoBench test set, with other settings unchanged unless specified.

Multi-perspective Visual Projection. We investigate the impact of the hyperparameter $\tau$ in the Multi-perspective Visual Projection module by varying its value from 0.05 to 0.5. As shown in Fig. 5 (left), the performance of EmoLLM is sensitive to the choice of $\tau$, with the highest accuracy of 64.06% achieved when $\tau$ is set to 0.1. The accuracy tends to decline as $\tau$ increases, indicating that a suitable value of $\tau$ is beneficial for the emotional understanding capabilities of EmoLLM.

Quantity Effects in EmoPrompt. To examine the impact of the number of EmoPrompts on the performance of EmoLLM, we vary the number of prompts from 100 to 1000 and evaluate the emotional capability. As depicted in Fig. 5 (right), increasing the number of EmoPrompts generally leads to improved performance, with the peak accuracy of 64.06% achieved when all prompts are used. This finding highlights the importance of utilizing a diverse set of prompts to enhance the emotional reasoning ability of LLMs. However, the performance gains diminish as the number of prompts exceeds 600, suggesting an optimal range for balancing computational efficiency and emotional understanding.

Effect of the Tuning Strategy. We investigate whether the order in which objective and emotional data are used for training affects the emotional understanding ability of LLMs. In Tab. 4, we compare the performance of three training strategies: emo (training with only EmoBench), mix (training with objective fine-tuning data mixed with EmoBench), and sequential (fine-tuning with objective data first and then with emotional tasks). The results suggest that sequential training substantially benefits emotional understanding. A possible explanation is that it simulates the way humans learn, starting with easy tasks and progressing to more difficult ones, while also moving from general knowledge to domain-specific knowledge.

Table 4: Various training strategies affect emotional understanding ability of LLMs. Training on traditional tasks first and then emotional tasks (sequential) leads to the best results.

Training Strategy	Emo-C	Emo-O	Intention	Hate	Humor	Sarcasm	Overall
emo	61.65	47.40	67.71	62.44	70.82	80.22	56.24
mix	63.05 (+1.40)	49.32 (+1.92)	73.54 (+5.83)	65.90 (+3.46)	67.65 (-3.17)	83.74 (+3.52)	58.11 (+1.87)
sequential	64.06 (+2.41)	52.58 (+5.18)	73.99 (+6.28)	67.43 (+4.99)	75.69 (+4.87)	86.67 (+6.45)	60.36 (+4.12)
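The gains listed next to the mix and sequential rows are differences from the emo row. The short script below recomputes them from the table values as a sanity check; the dictionary layout and the helper name `gains` are illustrative choices only.

```python
TASKS = ["Emo-C", "Emo-O", "Intention", "Hate", "Humor", "Sarcasm", "Overall"]

# Scores copied from Table 4.
emo = dict(zip(TASKS, [61.65, 47.40, 67.71, 62.44, 70.82, 80.22, 56.24]))
mix = dict(zip(TASKS, [63.05, 49.32, 73.54, 65.90, 67.65, 83.74, 58.11]))
sequential = dict(zip(TASKS, [64.06, 52.58, 73.99, 67.43, 75.69, 86.67, 60.36]))


def gains(strategy, baseline):
    """Per-task difference from the baseline, rounded to two decimals."""
    return {task: round(strategy[task] - baseline[task], 2) for task in TASKS}


mix_gains = gains(mix, emo)
sequential_gains = gains(sequential, emo)
print(mix_gains)
print(sequential_gains)
```

The only negative entry is Humor under the mix strategy, while sequential training improves every column.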

6 Conclusion

In this work, we introduce EmoBench, a comprehensive benchmark for enhancing and evaluating the emotional understanding capabilities of Multimodal Large Language Models (MLLMs), and propose EmoLLM, a novel model incorporating Multi-perspective Visual Projection and EmoPrompt techniques. Through extensive experiments on EmoBench, we demonstrated substantial improvements of EmoLLM over baselines, with an average improvement of 12.1% across multiple foundation models.

Limitations. One notable limitation is that the answers to the instructions in EmoBench may lack diversity, since they were generated by GPT-4 and automated scripts rather than collected from human annotators. Combining automation with manual labeling is a promising direction for future work. Another limitation is the inherent vulnerabilities of LLMs, such as hallucination and sensitivity to prompts, which may affect the performance of EmoLLM.

Future Work. Despite these limitations, we believe our work takes a significant step towards enabling MLLMs to achieve a deeper understanding of complex emotions in multimodal data, paving the way for emotionally intelligent AI systems. Future work could focus on addressing the limitations mentioned above, such as increasing the diversity of EmoBench through a combination of automated and manual labeling, and mitigating the vulnerabilities of LLMs. Furthermore, exploring the application of emotionally intelligent AI systems in real-world scenarios and evaluating their impact on user experience and well-being could be valuable avenues for future research.

References

[1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint, 2023.
[2] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint, 2023.
[3] R. Zhang, J. Han, C. Liu, P. Gao, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, and Y. Qiao, “Llama-adapter: Efficient fine-tuning of language models with zero-init attention,” arXiv preprint, 2023.
[4] P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou, W. Zhang, P. Lu, C. He, X. Yue et al., “Llama-adapter v2: Parameter-efficient visual instruction model,” arXiv preprint, 2023.
[5] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. N. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,” NIPS, 2024.
[6] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” JMLR, 2024.
[7] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, “S3fd: Single shot scale-invariant face detector,” in ICCV, 2017.
[8] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” NIPS, 2022.
[9] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language resources and evaluation, 2008.
[10] A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L.-P. Morency, “Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph,” in ACL, 2018.
[11] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “Meld: A multimodal multi-party dataset for emotion recognition in conversations,” in ACL, 2019.
[12] R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza, “Emotion recognition in context,” in CVPR, 2017.
[13] R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza, “Context based emotion recognition using emotic dataset,” IEEE TPAMI, 2019.
[14] D. Demszky, D. Movshovitz-Attias, J. Ko, A. Cowen, G. Nemade, and S. Ravi, “Goemotions: A dataset of fine-grained emotions,” arXiv preprint, 2020.
[15] S. Larson, A. Mahendran, J. J. Peper, C. Clarke, A. Lee, P. Hill, J. K. Kummerfeld, K. Leach, M. A. Laurenzano, L. Tang et al., “An evaluation dataset for intent classification and out-of-scope prediction,” in EMNLP, 2019.
[16] X. Liu, A. Eshghi, P. Swietojanski, and V. Rieser, “Benchmarking natural language understanding services for building conversational agents,” arXiv preprint, 2019.
[17] M. Jia, Z. Wu, A. Reiter, C. Cardie, S. Belongie, and S.-N. Lim, “Intentonomy: a dataset and study towards human intent understanding,” in CVPR, 2021.
[18] A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril et al., “Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces,” arXiv preprint, 2018.
[19] J. Kruk, J. Lubin, K. Sikka, X. Lin, D. Jurafsky, and A. Divakaran, “Integrating text and image: Determining multimodal document intent in Instagram posts,” in EMNLP, 2019.
[20] A. Jia, Y. He, Y. Zhang, S. Uprety, D. Song, and C. Lioma, “Beyond emotion: A multi-modal dataset for human desire understanding,” in NAACL, 2022.
[21] I. Casanueva, T. Temcinas, D. Gerz, M. Henderson, and I. Vulic, “Efficient intent detection with dual sentence encoders,” in ACL Workshop, 2020.
[22] Q. Yang, M. Ye, Z. Cai, K. Su, and B. Du, “Composed image retrieval via cross relation network with hierarchical aggregation transformer,” IEEE Transactions on Image Processing, 2023.
[23] H. Zhang, H. Xu, X. Wang, Q. Zhou, S. Zhao, and J. Teng, “Mintrec: A new dataset for multimodal intent recognition,” in ACM MM, 2022.
[24] M. Ye, Q. Shi, K. Su, and B. Du, “Cross-modality pyramid alignment for visual intention understanding,” IEEE Transactions on Image Processing.
[25] Q. Shi, M. Ye, Z. Zhang, and B. Du, “Learnable hierarchical label embedding and grouping for visual intention understanding,” IEEE Transactions on Affective Computing.
[26] K. Sun, Z. Xie, M. Ye, and H. Zhang, “Contextual augmented global contrast for multimodal intent recognition,” in CVPR, 2024.
[27] Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” in ACL, 2019.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017.
[29] D. Hazarika, R. Zimmermann, and S. Poria, “Misa: Modality-invariant and-specific representations for multimodal sentiment analysis,” in ACM MM, 2020.
[30] D. Yang, H. Kuang, S. Huang, and L. Zhang, “Learning modality-specific and-agnostic representations for asynchronous multimodal language sequences,” in ACM MM, 2022.
[31] Y. Zhang, M. Wang, P. Tiwari, Q. Li, B. Wang, and J. Qin, “Dialoguellm: Context and emotion knowledge-tuned llama models for emotion recognition in conversations,” arXiv preprint, 2023.
[32] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint, 2023.
[33] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, 2024.
[34] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
[35] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in NIPS, 2020.
[36] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in ICML, 2023.
[37] P. Jin, R. Takanobu, C. Zhang, X. Cao, and L. Yuan, “Chat-univi: Unified visual representation empowers large language models with image and video understanding,” arXiv preprint, 2023.
[38] C. Lyu, M. Wu, L. Wang, X. Huang, B. Liu, Z. Du, S. Shi, and Z. Tu, “Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration,” arXiv preprint, 2023.
[39] J. Han, K. Gong, Y. Zhang, J. Wang, K. Zhang, D. Lin, Y. Qiao, P. Gao, and X. Yue, “Onellm: One framework to align all modalities with language,” arXiv preprint, 2023.
[40] S. Bianco, L. Celona, M. Donzella, and P. Napoletano, “Improving image captioning descriptiveness by ranking and llm-based fusion,” arXiv preprint arXiv:2306.11593, 2023.
[41] M. Dzabraev, A. Kunitsyn, and A. Ivaniuta, “Vlrm: Vision-language models act as reward models for image captioning,” arXiv preprint arXiv:2404.01911, 2024.
[42] W. Hu, Y. Xu, Y. Li, W. Li, Z. Chen, and Z. Tu, “Bliva: A simple multimodal llm for better handling of text-rich visual questions,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2256–2264.
[43] I. Sterner, W. Lin, J. Chen, and B. Byrne, “Few-shot vqa with frozen llms: A tale of two approaches,” arXiv preprint arXiv:2403.11317, 2024.
[44] H. Wu, Z. Zhang, E. Zhang, C. Chen, L. Liao, A. Wang, K. Xu, C. Li, J. Hou, G. Zhai et al., “Q-instruct: Improving low-level visual abilities for multi-modality foundation models,” arXiv preprint arXiv:2311.06783, 2023.
[45] J. Lee, S. Kim, S. Kim, J. Park, and K. Sohn, “Context-aware emotion recognition networks,” in ICCV, 2019.
[46] K.-C. Peng, T. Chen, A. Sadovnik, and A. C. Gallagher, “A mixed bag of emotions: Model, predict, and transfer emotion distributions,” in CVPR, 2015.
[47] L. Hyun, K. Sung-Bin, S. Han, Y. Yu, and T.-H. Oh, “Smile: Multimodal dataset for understanding laughter in video with language models,” arXiv preprint, 2023.
[48] M.-H. Van and X. Wu, “Detecting and correcting hate speech in multimodal memes with large visual language model,” arXiv preprint, 2023.
[49] L. Qin, S. Huang, Q. Chen, C. Cai, Y. Zhang, B. Liang, W. Che, and R. Xu, “Mmsd2.0: Towards a reliable multi-modal sarcasm detection system,” arXiv preprint, 2023.
[50] J. Jiang, Y. Chen, X. Meng, L. Wang, and K. Li, “A novel density peaks clustering algorithm based on k nearest neighbors for improving assignment process,” Physica A: Statistical Mechanics and its Applications, 2019.
[51] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint, 2016.
[52] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” https://vicuna.lmsys.org, 2023.
[53] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[54] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint, 2017.
[55] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint, 2021.