EmoLLM: Multimodal Emotional Understanding Meets Large Language Models

Qu Yang, Mang Ye (Corresponding Author), Bo Du
School of Computer Science, Wuhan University, Wuhan, China.
{yangqu, yemang, dubo}@whu.edu.cn
https://github.com/yan9qu/EmoLLM
Abstract

Multi-modal large language models (MLLMs) have achieved remarkable performance on objective multimodal perception tasks, but their ability to interpret subjective, emotionally nuanced multimodal content remains largely unexplored. This limitation restricts their ability to understand and react to the intricate emotions that humans express through multimodal media. To bridge this gap, we introduce EmoBench, the first comprehensive benchmark designed specifically to evaluate the emotional capabilities of MLLMs across five popular emotional tasks, using a diverse dataset of ~287k images and videos paired with corresponding textual instructions. We also propose EmoLLM, a novel model for multimodal emotional understanding that incorporates two core techniques: 1) Multi-perspective Visual Projection, which captures diverse emotional cues from visual data from multiple perspectives, and 2) EmoPrompt, which guides MLLMs to reason about emotions in the correct direction. Experimental results demonstrate that EmoLLM significantly elevates multimodal emotional understanding performance, with an average improvement of 12.1% across multiple foundation models on EmoBench. Our work contributes to the advancement of MLLMs by facilitating a deeper and more nuanced comprehension of intricate human emotions, paving the way for the development of artificial emotional intelligence capabilities with wide-ranging applications in areas such as human-computer interaction, mental health support, and empathetic AI systems. Code, data, and model will be released.

1 Introduction

Do androids dream of electric sheep? This thought-provoking question from Philip K. Dick’s seminal novel underscores a fundamental divide between artificial intelligence and humanity – the capacity for genuine emotion. In our modern era, Multimodal Large Language Models (MLLMs) [1; 2; 3; 4; 5; 6] have achieved remarkable performance, even surpassing human capabilities in domains such as perception and cognition. However, when it comes to the realm of emotions, state-of-the-art MLLMs appear to be lacking in their ability to accurately interpret and respond to emotional cues. While existing MLLMs can generate basic responses to human queries regarding emotional aspects, the accuracy of their responses remains unsatisfactory, especially in nuanced categories such as fear and anger (Fig. 1(b)). Moreover, even LLMs that have been employed for text-based emotional analysis often fall short when confronted with the complexities of multimodal emotional tasks, which require the integration of visual, auditory, and textual cues. A primary factor contributing to this limitation is the scarcity of comprehensive emotional datasets for training MLLMs, as publicly available datasets generally focus on objective visual abilities [7]. This gap not only mirrors the philosophical questions raised by Dick’s narrative but also motivates us to explore the vast, uncharted territories of emotional intelligence within MLLMs.

To bridge this gap, we propose EmoBench, a comprehensive benchmark designed to serve two critical functions: providing a rich source of training materials to enhance the performance of MLLMs and evaluating their emotional understanding capabilities. EmoBench encompasses a diverse range of tasks (Fig. 1), which we categorize into Universal Emotional Tasks and Emotional Application Tasks. Universal Emotional Tasks include multimodal emotion recognition and intent understanding, both represented as a classification paradigm. Emotional Application Tasks, on the other hand, focus on specific challenges in social media applications, such as Hate, Sarcasm, and Humor Detection. To construct EmoBench, we first collected a diverse dataset for each subtask, as illustrated in Tab. 1. Subsequently, we employed GPT-4 [1] to generate a wide array of question templates for each subtask, ultimately compiling a dataset of approximately 287,000 multimodal instructions. By offering a large-scale, diverse, and carefully curated dataset, EmoBench enables rigorous enhancement and evaluation of the emotional understanding capabilities of MLLMs.

Figure 1: Qualitative (a) and quantitative (b) comparison of EmoLLM with GPT4-Vision and other SOTA MLLMs. EmoLLM outperforms other models, particularly in recognizing nuanced emotions such as anger and sadness. (c) Overview of the diverse tasks in EmoBench, including universal emotional tasks and emotional application tasks (hate, sarcasm, and humor detection).

With our proposed EmoBench, existing MLLMs can be empowered with better emotional understanding capabilities with downstream fine-tuning. Current MLLMs typically follow a two-step process: modality projection and LLM reasoning. However, these models still struggle to effectively capture and reason about the complex and nuanced emotions present in multimodal data. To address this challenge, we propose EmoLLM, a novel model that incorporates two key techniques: Multi-perspective Visual Projection and EmoPrompt. Multi-perspective Visual Projection captures diverse emotional cues by considering multiple viewpoints. Specifically, we use the features of objects in different feature maps as content information and construct the objects and their relationships as graph-based relational information. By jointly mining these two aspects of information, we can extract features that are more suitable for emotional tasks.

In the reasoning stage, Chain-of-Thought (CoT) [8] is a common and effective method. Inspired by CoT, we first let EmoLLM observe objects in multimedia data and then infer emotions based on these observations. However, a significant problem arises, i.e., the correctness of the first-stage observations determines the accuracy of the final inference. To mitigate this issue, EmoPrompt incorporates specific examples stored for the current task. To ensure the correctness of these examples, we present GPT-4V with data samples and ground truth labels to obtain an accurate CoT process. Whenever a prompt is required, EmoPrompt selects one such example to guide the reasoning process.

We summarize our contributions as follows:

  • We introduce EmoBench, a comprehensive benchmark designed to enhance and evaluate the emotional understanding capabilities of MLLMs across a diverse range of tasks, providing a large-scale dataset of ~287k instructions.

  • We propose EmoLLM, which incorporates Multi-perspective Visual Projection to capture diverse emotional cues and EmoPrompt to guide the reasoning process.

  • We conduct extensive experiments on the EmoBench benchmark, demonstrating that EmoLLM achieves substantial improvements over baseline models, with an average improvement of 12.1% across multiple foundation models.

2 Related Works

2.1 Multi-modality Emotional Tasks and Methods

Multimodal emotion recognition, which analyzes feelings through speech, text, and visual cues, has been a growing area of research. Early datasets like IEMOCAP [9] provide vital audiovisual interaction data but are limited by their focus on scripted events and lack of speaker diversity. Subsequent datasets, such as CMU-MOSEI [10] and MELD [11], address these limitations by offering more naturalistic expressions from videos and television shows. Emotic [12; 13] and GoEmotions [14] further expand the scope of resources for emotion recognition. In the related field of intention understanding, contemporary datasets like CLINC150 [15], HWU64 [16], Intentonomy [17], Snips [18], MDID [19], MSED [20], and BANKING77 [21] are derived from a diverse range of sources, including online forums and social media to explore the user intent [22]. The MIntRec dataset [23] takes a unique approach by utilizing TV series clips to capture the complex intentions portrayed by actors.

Building upon these datasets, numerous methods [24; 25; 26] have been proposed to advance the field of multimodal emotion recognition. Lee et al. [27] introduce the Multimodal Transformer (MulT), which employs the vanilla Transformer [28] architecture and directional cross-modal attention to learn effective multimodal language representations. Hazarika et al. [29] propose Modality-Invariant and -Specific Representations (MISA), which differentiates modality features into invariant and specific subspaces to aid in fusion and prediction. Yang et al. [30] introduce MFSA, a transformer-based model that leverages adversarial learning to create modality-specific and -agnostic representations for sentiment recognition. Recently, Zhang et al. [31] attempted to use GPT [1] to convert multimodal emotion tasks into text emotion recognition. However, this approach relies on pre-processing by an MLLM and is not suitable for practical applications.

2.2 Multi-modality Large Language Models

Large language models (LLMs), such as GPT-4 [1], Gemini-Pro [32], and LLaVA [33], have demonstrated remarkable language abilities in capturing general knowledge. By incorporating visual and audio inputs into LLMs using techniques like CLIP [34] and additional adapting modules [35; 36], multi-modality large language models (MLLMs) [37; 38; 39] have been developed to tackle a variety of multi-modal tasks. These tasks include image captioning [40; 41], visual question answering (VQA) [42; 43], and other language-related capabilities [44]. However, as revealed by our previous research (Fig. 1 b), the emotional understanding abilities of MLLMs remain unsatisfactory, particularly when dealing with complex emotions such as anger and fear, or emotional categories that require reasoning. We attribute this limitation primarily to the lack of relevant data and specialized models. To address this issue, we introduce EmoBench, the first emotional instruction tuning dataset designed to enhance the emotional understanding capabilities of various MLLMs and enable them to better navigate the realm of emotional comprehension.

Table 1: The statistics of various data sources in EmoBench.
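Category | Sub-task | Dataset | Modality (Text / Image / Video / Audio) | Train (k) | Val & Test (k)
Universal Emotional Tasks | Emotion | Emotic [12; 13] | | 16.2 | 6.4
Universal Emotional Tasks | Emotion | Caer-S [45] | | 42.0 | 21.0
Universal Emotional Tasks | Emotion | Meld [11] | | 11.1 | 2.6
Universal Emotional Tasks | Emotion | Emotion_6 [46] | | 1.3 | 0.6
Universal Emotional Tasks | Intention | MintRec [23] | | 1.7 | 0.4
Emotional Application Tasks | Humor | SMILE [47] | | 8.6 | 1.0
Emotional Application Tasks | Hate | MMHS [48] | | 139.8 | 10
Emotional Application Tasks | Sarcasm | MMSD [49] | | 22.2 | 2.4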
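
As a quick sanity check on the "~287k" figure quoted in the text, the sampled sizes in Tab. 1 can be summed directly. The snippet below is only an arithmetic sketch; the dictionary name is illustrative, and the numbers are copied from the table as printed.

```python
# Sampled sizes in thousands, copied from Tab. 1 as (train, val_and_test).
SIZES_K = {
    "Emotic": (16.2, 6.4),
    "Caer-S": (42.0, 21.0),
    "Meld": (11.1, 2.6),
    "Emotion_6": (1.3, 0.6),
    "MintRec": (1.7, 0.4),
    "SMILE": (8.6, 1.0),
    "MMHS": (139.8, 10.0),
    "MMSD": (22.2, 2.4),
}

train_k = sum(train for train, _ in SIZES_K.values())
eval_k = sum(held_out for _, held_out in SIZES_K.values())
total_k = train_k + eval_k

print(f"train={train_k:.1f}k  val+test={eval_k:.1f}k  total={total_k:.1f}k")
```

The total comes to roughly 287.3k samples, which agrees with the approximate dataset size stated in the abstract and introduction.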
Figure 2: Overview of the EmoBench benchmark and its applications. (a) EmoBench is built upon a diverse content database. (c) The process of creating EmoBench involves expert template definition, diverse template set generation, and instruction generation. (d) The proposed EmoLLM is designed to leverage the EmoBench for improving the multi-modal emotional understanding capabilities.

3 EmoBench

As a cornerstone of emotional tasks, we introduce EmoBench, a pioneering large-scale dataset comprised of conversations focused on emotional dimensions. Initially, we explore the rationale and provide a detailed definition of the tasks associated with EmoBench in Sec. 3.1. Following the task definition, we ensure a diverse and balanced representation of emotional content by sub-sampling from eight distinct datasets. We organize these samples into conversations using generative models and automated scripts, as detailed in Sec. 3.2.

3.1 Data Preparation and Task Definition

The data in EmoBench is sourced from various emotion [12; 13; 11; 45; 46] and intention [23; 47; 48; 49] datasets. As outlined in Tab. 1, we define two major categories of tasks: universal emotional tasks and emotional application tasks. For the former, which involves multi-modal emotion and intention understanding, we select well-known, data-rich works [12; 45; 11; 46; 23] from the community. From a classification perspective, LLMs are expected to choose the label that best matches the data content. However, considering real-world applications where a predefined label list may not be available, we also explore open-set understanding, where LLMs directly provide the predicted category without a predefined label list. For emotional application tasks, we identify sub-tasks with significant applications in the industry, particularly those frequently encountered or crucial in social media, such as humor [47], hate [48], and sarcasm [49] detection.

3.2 Instruction Construction

With the assistance of LLMs, data annotation has become increasingly streamlined. We adopt a similar approach and utilize a GPT-participated pipeline to establish a paradigm akin to visual (multimodal) question answering. For the Universal Emotion Tasks outlined in Sec. 3.1, we manually create a question template, e.g., "Question: Question_base + [LABEL_SET]. <DATA> Answer: [LABEL]". The Question_base is derived from the diverse questions generated by GPT, such as: “Identify the only emotion depicted in the given image from the following options”; [LABEL_SET] represents the label set of the current subtask, such as [anger, disgust, fear, joy, sadness, surprise] in the emotion recognition task; <DATA> is the multi-modal data placeholder; and [LABEL] corresponds to the ground-truth label in the original sub-task dataset, reflecting the emotion category of the multi-modal data. For Emotional Application Subtasks, we modify the question format to a binary choice, e.g., “Does the given multi-modal data contain sarcasm? Please answer Yes or No”.
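
The template above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the `<DATA>` token handling, and the exact punctuation are assumptions chosen to mirror the template as written.

```python
def build_universal_instruction(question_base, label_set, label, data_token="<DATA>"):
    """Fill the template 'Question: Question_base + [LABEL_SET]. <DATA> Answer: [LABEL]'.

    All argument names are hypothetical; `label` must come from `label_set`.
    """
    if label not in label_set:
        raise ValueError(f"label {label!r} is not in the label set")
    options = "[" + ", ".join(label_set) + "]"
    return f"Question: {question_base} {options}. {data_token} Answer: {label}"


def build_application_instruction(phenomenon, is_positive, data_token="<DATA>"):
    """Binary-choice variant for hate, sarcasm, or humor detection."""
    question = (
        f"Does the given multi-modal data contain {phenomenon}? "
        "Please answer Yes or No"
    )
    answer = "Yes" if is_positive else "No"
    return f"Question: {question}. {data_token} Answer: {answer}"


emotion_labels = ["anger", "disgust", "fear", "joy", "sadness", "surprise"]
print(build_universal_instruction(
    "Identify the only emotion depicted in the given image from the following options",
    emotion_labels,
    "joy",
))
print(build_application_instruction("sarcasm", False))
```

In practice the question base would be sampled from the GPT-generated pool so that the same sample can appear with differently worded questions.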

4 Methodology

In this section, we provide a detailed overview of EmoLLM. We begin by describing the architecture of the model, then delve into each component of EmoLLM.

Figure 3: Overview of the EmoLLM framework. (a) EmoLLM takes a user query and multimodal data as input, which are processed by an LLM and modality-specific encoders, respectively. (b) The Multi-perspective Visual Projection consists of various stages, each extracting features from visual tokens and building a graph connecting cluster centers. The combined representations form a comprehensive understanding of the emotional aspects.

4.1 Model Overview

We present an overview of EmoLLM in this section, as shown in Fig. 3. There are three major modules in EmoLLM as follows:

Modality Encoding: To incorporate additional modalities such as visual and audio data, we integrate extra modality encoders into EmoLLM. This enhancement enables our model to effectively handle multiple modalities.

Multi-perspective Visual Projection: To effectively capture diverse emotional cues from visual data, we propose the MVP module. Unlike traditional methods that rely on a single perspective, MVP employs a multifaceted approach, analyzing visual data from multiple viewpoints. By constructing a graph-based representation of the relationships between object features, MVP enables EmoLLM to extract a rich set of emotionally relevant features.

EmoPrompt Reasoning: EmoPrompt leverages the capabilities of GPT-4V [1] to generate accurate and contextually appropriate prompts. By providing GPT-4V with carefully curated data samples and their corresponding ground truth labels, EmoPrompt facilitates a reliable Chain-of-Thought (CoT) process. This CoT process serves as a blueprint for EmoLLM’s reasoning, ensuring that it stays on track and arrives at emotionally coherent conclusions.

4.2 Modality Encoding

We design the corresponding modal encoding module for common modalities in emotional tasks, including the following three parts:

Visual Modality Encoding: To encode visual information, including images and video frames, we employ the CLIP ViT-L/14 model proposed by Radford et al. [34]. CLIP learns visual representations directly from raw text paired with images, which lets it exploit a much broader source of supervision than fixed label sets. The details of the visual encoding process are described in Sec. 4.3.

Audio Modality Encoding: For encoding audio signals and extracting meaningful representations from audio data, we utilize the WHISPER-BASE model introduced by Radford et al. [36]. WHISPER is a multilingual speech recognition model trained on a vast audio dataset with weak supervision, making it well-suited for capturing rich information from audio inputs.

Textual Modality Encoding: Large Language Models (LLMs) are typically pre-trained on massive text corpora, enabling instruction-tuned LLMs to effectively process textual information. In EmoLLM, we use LLaMA2-7B [2] as the foundation model for textual modality encoding, leveraging its strong language understanding capabilities.

Given a video $\boldsymbol{x}_v \in \mathbb{R}^{L_v \times d_v}$, an image $\boldsymbol{x}_i \in \mathbb{R}^{L_i \times d_i}$, an audio signal $\boldsymbol{x}_a \in \mathbb{R}^{L_a \times d_a}$, and a user input text $\boldsymbol{x}_t \in \mathbb{R}^{L_t \times d_t}$, we employ pre-trained models to encode the multimodal features. Specifically, we use the Multi-perspective Visual Projection (MVP) module to encode the visual features. For the audio signal, we first apply the WHISPER model and then use a multilayer perceptron (MLP) to transform them into the desired dimension. The encoding process can be formulated as follows:

$$\boldsymbol{h}_i = \operatorname{MVP}(\boldsymbol{x}_i), \quad \boldsymbol{h}_v = \operatorname{MVP}(\boldsymbol{x}_v), \quad \boldsymbol{h}_a = \operatorname{MLP}(\operatorname{WHISPER}(\boldsymbol{x}_a)), \tag{1}$$

where $\boldsymbol{h}_i \in \mathbb{R}^{L_i \times d_h}$, $\boldsymbol{h}_v \in \mathbb{R}^{L_v \times d_h}$, and $\boldsymbol{h}_a \in \mathbb{R}^{L_a \times d_h}$ denote the encoded image, video, and audio features, respectively. The dimension of the modality-specific features is represented by $d_h$.
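
To make the shapes in Eq. 1 concrete, the following sketch projects a stand-in for the WHISPER output from $d_a$ to $d_h$ with a small MLP. Everything here is an assumption for illustration: the random array replaces real encoder features, the two-layer ReLU MLP is one plausible choice (the paper does not specify the MLP architecture here), and the dimensions are arbitrary.

```python
import numpy as np


def mlp_project(features, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, applied independently to each time step."""
    hidden = np.maximum(features @ w1 + b1, 0.0)
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
L_a, d_a, d_h = 50, 512, 64  # hypothetical sequence length and dimensions

# Stand-in for WHISPER(x_a); a real pipeline would use encoder outputs here.
whisper_out = rng.standard_normal((L_a, d_a))

w1 = rng.standard_normal((d_a, d_h)) * 0.02
b1 = np.zeros(d_h)
w2 = rng.standard_normal((d_h, d_h)) * 0.02
b2 = np.zeros(d_h)

h_a = mlp_project(whisper_out, w1, b1, w2, b2)
print(h_a.shape)  # (50, 64): sequence length is kept, feature size becomes d_h
```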

4.3 Multi-perspective Visual Projection

In this section, we introduce Multi-perspective Visual Projections designed for emotional tasks. We consider two important aspects of multimodal emotional tasks: (1) mining objective object information in multimodal data, which we call content-based perspective, and (2) observing the connections and relationships between objects, which we refer to as relation-based perspective. To better understand the emotional aspects highlighted in the data, we believe that MLLMs should consider both the content-based and relation-based perspectives to deepen their understanding of emotional factors.

Given an input image (or a frame of video) $\boldsymbol{x}_i$, we adopt the vision encoder of CLIP [34] to extract the original visual tokens $\boldsymbol{Z} = \{z_i\}_{i=1}^{L}$, where $L$ is the number of visual tokens. Following [37], we then utilize DPC-KNN [50], a k-nearest neighbor-based density peaks clustering algorithm, to cluster the visual tokens and obtain the content-based representation. The local density $\rho_i$ and distance index $\delta_i$ of each token $z_i$ are computed as follows:

$$\rho_i = \exp\Big(-\frac{1}{K} \sum_{z_k \in \mathrm{KNN}(z_i, \boldsymbol{Z})} \|z_k - z_i\|^2\Big), \quad \delta_i = \begin{cases} \min\limits_{j: \rho_j > \rho_i} \|z_j - z_i\|^2, & \text{if } \exists j \text{ s.t. } \rho_j > \rho_i, \\ \max\limits_{j} \|z_j - z_i\|^2, & \text{otherwise,} \end{cases} \tag{2}$$

where $\mathrm{KNN}(z_i, \boldsymbol{Z})$ denotes the K-nearest neighbors of $z_i$ in $\boldsymbol{Z}$ after removing $z_i$. Tokens with relatively high $\rho_i \times \delta_i$ are identified as cluster centers, and other tokens are allocated to their nearest cluster center based on Euclidean distances. The average token within each cluster represents the corresponding cluster $z'_i$.
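
The clustering step in Eq. 2 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function name `dpc_knn`, the choice of taking exactly the top `num_clusters` scores as centers, and the toy data are all illustrative.

```python
import numpy as np


def dpc_knn(tokens, num_clusters, k):
    """Cluster tokens of shape (L, d) following Eq. 2.

    Returns the indices of the chosen centers, the cluster id of every token,
    and the averaged token of each cluster.
    """
    diff = tokens[:, None, :] - tokens[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)  # (L, L) squared Euclidean distances

    # Local density: mean squared distance to the k nearest neighbours,
    # skipping column 0 of the sorted row (the token itself, distance 0).
    knn_dist2 = np.sort(dist2, axis=1)[:, 1:k + 1]
    rho = np.exp(-knn_dist2.mean(axis=1))

    # Distance index: closest token with higher density, or the farthest
    # token overall when no denser token exists.
    higher = rho[None, :] > rho[:, None]  # higher[i, j] is True if rho_j > rho_i
    delta = np.where(higher, dist2, np.inf).min(axis=1)
    is_peak = ~higher.any(axis=1)
    delta[is_peak] = dist2[is_peak].max(axis=1)

    # Tokens with the largest rho * delta become cluster centers.
    centers = np.argsort(-(rho * delta))[:num_clusters]
    assign = dist2[:, centers].argmin(axis=1)
    assign[centers] = np.arange(len(centers))  # each center labels itself

    merged = np.stack(
        [tokens[assign == c].mean(axis=0) for c in range(len(centers))]
    )
    return centers, assign, merged


# Toy example: two well-separated groups of 10 tokens each.
rng = np.random.default_rng(0)
tokens = np.concatenate([
    rng.normal(0.0, 0.5, size=(10, 4)),
    rng.normal(10.0, 0.5, size=(10, 4)),
])
centers, assign, merged = dpc_knn(tokens, num_clusters=2, k=3)
print(merged.shape)  # (2, 4)
```

The rows of `merged` play the role of the cluster representatives $z'_i$ used in the relation-based perspective.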

To obtain the relation-based representation, we construct a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ using the cluster centers. Each cluster center $z'_i$ becomes a node $v_i \in \mathcal{V}$, with the feature of each cluster center used as the node’s value. To determine the edge weights, we first calculate the Euclidean distance between all cluster centers:

$$d_{ij} = \|z'_i - z'_j\|_2. \tag{3}$$

We then normalize the distances to the range [0, 1] using min-max normalization:

$$\tilde{d}_{ij} = \frac{d_{ij} - \min_{i,j}(d_{ij})}{\max_{i,j}(d_{ij}) - \min_{i,j}(d_{ij})}. \tag{4}$$

To determine the adjacency matrix $\boldsymbol{A}$, we set a threshold $\tau$ and consider nodes $i$ and $j$ as adjacent if their normalized distance $\tilde{d}_{ij}$ is less than or equal to $\tau$:

$$\boldsymbol{A}_{ij} = \begin{cases} 1, & \text{if } \tilde{d}_{ij} \leq \tau, \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$
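
Eqs. 3 to 5 translate almost line by line into NumPy. The sketch below assumes the cluster centers are stacked into an array of shape (N, d); the function name and the guard for the degenerate case where all distances are equal are illustrative additions.

```python
import numpy as np


def build_adjacency(centers, tau):
    """Threshold graph over cluster centers, following Eqs. 3 to 5."""
    diff = centers[:, None, :] - centers[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))  # Eq. 3: pairwise Euclidean distances

    span = d.max() - d.min()
    # Eq. 4: min-max normalization; all-equal distances map to zero.
    d_norm = (d - d.min()) / span if span > 0 else np.zeros_like(d)

    return (d_norm <= tau).astype(np.float32)  # Eq. 5: binary adjacency


centers = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
A = build_adjacency(centers, tau=0.5)
print(A)
```

Read literally, the minimum in Eq. 4 ranges over all pairs including $i = j$, so the diagonal of $\boldsymbol{A}$ is 1 for any $\tau \geq 0$. Whether self-pairs are excluded in the original implementation is not stated in the text.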

We apply a multi-layer graph convolutional network (GCN) [51] to the constructed graph. The graph convolution operation at layer $l$ can be formulated as:

$$\boldsymbol{H}^{(l+1)} = \sigma\big(\hat{\boldsymbol{D}}^{-\frac{1}{2}} \hat{\boldsymbol{A}} \hat{\boldsymbol{D}}^{-\frac{1}{2}} \boldsymbol{H}^{(l)} \boldsymbol{W}^{(l)}\big), \tag{6}$$

where $\hat{\boldsymbol{A}} = \boldsymbol{A} + \boldsymbol{I}$ is the adjacency matrix with added self-connections, $\hat{\boldsymbol{D}}$ is the degree matrix of $\hat{\boldsymbol{A}}$, $\boldsymbol{H}^{(l)}$ is the feature matrix at layer $l$, $\boldsymbol{W}^{(l)}$ is the trainable weight matrix at layer $l$, and $\sigma$ is the activation function. The output of the last layer $\boldsymbol{H}^{(m)}$ serves as the relation-based representation, where $m$ is the number of layers.
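
A single propagation step of Eq. 6 can be written compactly in NumPy. The sketch assumes ReLU for $\sigma$, which the text leaves unspecified, and uses fixed matrices in place of trained weights.

```python
import numpy as np


def gcn_layer(A, H, W):
    """One graph convolution step (Eq. 6) with ReLU assumed as the activation."""
    A_hat = A + np.eye(A.shape[0])           # add self-connections
    deg = A_hat.sum(axis=1)                  # diagonal of the degree matrix
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization D^{-1/2} A_hat D^{-1/2} via broadcasting.
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(A_norm @ H @ W, 0.0)


A = np.array([[0.0, 1.0], [1.0, 0.0]])  # two connected nodes
H0 = np.eye(2)                          # toy node features
W0 = np.eye(2)                          # identity weights for illustration
H1 = gcn_layer(A, H0, W0)
print(H1)
```

Stacking $m$ such layers, each with its own weight matrix, yields the $\boldsymbol{H}^{(m)}$ that the text uses as the relation-based representation.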

For a video with the m𝑚mitalic_m-th frame 𝒁m={zim}i=1Lsuperscript𝒁𝑚superscriptsubscriptsuperscriptsubscript𝑧𝑖𝑚𝑖1𝐿\boldsymbol{Z}^{m}=\{z_{i}^{m}\}_{i=1}^{L}bold_italic_Z start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT = { italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT, following [37], we apply mean-pooling over all tokens to obtain the frame-level representation fmsuperscript𝑓𝑚f^{m}italic_f start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT:

fm=1Li=1Lzim.superscript𝑓𝑚1𝐿superscriptsubscript𝑖1𝐿superscriptsubscript𝑧𝑖𝑚f^{m}=\frac{1}{L}\sum_{i=1}^{L}z_{i}^{m}.italic_f start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT = divide start_ARG 1 end_ARG start_ARG italic_L end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L end_POSTSUPERSCRIPT italic_z start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_m end_POSTSUPERSCRIPT . (7)

We then use DPC-KNN [50; 37] to cluster the frames and identify critical events. The set of visual tokens within the $n$-th event $\boldsymbol{F}_{n}$ is denoted as $\tilde{\boldsymbol{Z}}_{n}=\{z_{i}^{m}\mid m\in\boldsymbol{F}_{n},\ i\in\{1,2,\dots,L\}\}$. To make the visual tokens expand over frames within each event, we adjust the local density and distance index calculations according to eq. 2. The expanded visual tokens are concatenated in order of events to ensure temporal understanding. To provide multi-scale visual features, we adopt a three-step aggregation process for each input image or video. The outputs from each merging step are concatenated and transformed using a trainable projection matrix $\boldsymbol{W}$ to obtain the content-based representation $\boldsymbol{R}_{content}$. The relation-based representation $\boldsymbol{R}_{relation}$ is obtained from the aggregation of the GCN output in each stage $\boldsymbol{H}^{(m)}$.
The final feature representation $\boldsymbol{h}_{i}$ is the linear combination of the content-based and relation-based representations with a coefficient $\alpha$:

$$\boldsymbol{h}_{i}=(\alpha\times\boldsymbol{R}_{content})\oplus\boldsymbol{R}_{relation}. \tag{8}$$

By integrating content-based and relation-based representations, MVP aims to enhance the model's ability to reason about the relationships between visual elements and to improve its performance on downstream emotional tasks. The resulting feature representation captures both local and global relationships in the visual input.
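The frame-level clustering step can be illustrated with a minimal NumPy sketch. It computes the frame features of eq. 7 by mean pooling and then runs a plain DPC-KNN over them. The function names, the toy input, and the values of `k` and `n_clusters` are assumptions made for illustration. The sketch uses the standard density and distance definitions of DPC-KNN and does not include the adjusted local density and distance index that the text attributes to eq. 2, so it should not be read as the authors' implementation.

```python
import numpy as np


def frame_features(tokens):
    """Eq. 7: mean of the L visual tokens of each frame; (M, L, D) -> (M, D)."""
    return tokens.mean(axis=1)


def dpc_knn(feats, n_clusters, k):
    """Density-peaks clustering with a k-nearest-neighbour density estimate.

    feats: (M, D) array of frame features.
    Returns (centers, labels): indices of the cluster-centre frames and,
    for every frame, the position of its nearest centre in `centers`.
    """
    m = feats.shape[0]
    diff = feats[:, None, :] - feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # (M, M) pairwise distances

    # Local density: exp(-mean squared distance to the k nearest neighbours).
    knn = np.sort(dist, axis=1)[:, 1:k + 1]  # column 0 is the frame itself
    rho = np.exp(-(knn ** 2).mean(axis=1))

    # Distance index: distance to the nearest frame with higher density;
    # the densest frame gets its largest distance instead.
    delta = np.empty(m)
    for i in range(m):
        higher = rho > rho[i]
        delta[i] = dist[i, higher].min() if higher.any() else dist[i].max()

    centers = np.argsort(-(rho * delta))[:n_clusters]  # highest rho * delta first
    labels = dist[:, centers].argmin(axis=1)
    return centers, labels


# Toy input (hypothetical): 9 frames, 4 tokens per frame, 8 dimensions,
# drawn around three well-separated offsets that play the role of events.
rng = np.random.default_rng(0)
offsets = np.repeat([0.0, 5.0, 10.0], 3)[:, None, None]
tokens = offsets + 0.1 * rng.standard_normal((9, 4, 8))
centers, labels = dpc_knn(frame_features(tokens), n_clusters=3, k=2)
```

In this toy case the three groups of frames fall into three separate clusters. Frames that share a label would form one event $\boldsymbol{F}_{n}$, whose tokens are then merged within the event.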

4.4 EmoPrompt Reasoning

Chain-of-Thought (CoT) [8] is a popular and efficient technique for enhancing the reasoning power of LLMs without fine-tuning. It involves adding step-by-step reasoning instructions to the user’s prompt, guiding the LLM through a logical thought process. Given the delicate and unintuitive nature of emotional tasks, this kind of reasoning is crucial for accurate emotion understanding.

Figure 4: Illustration of EmoPrompt. We utilize visual data and label pairs in EmoBench, and prompt GPT-4V [1] to generate logical chains.

For emotional tasks, we first design a task-specific CoT as a baseline. Drawing inspiration from how humans identify emotions in images and videos, we observe that people often focus first on objective content, such as facial expressions, atmosphere, and other visual cues. Accordingly, we guide the MLLM to reason about the objective content of the data first, and then to reason about the emotional task based on that conclusion together with the data. The advantage of this approach is that it directs what the LLM observes, leading to more robust reasoning.

However, this step-by-step thinking heavily depends on the observations made in the first step. If the LLM hallucinates or generates inaccurate observations during the initial stage, it can greatly affect the judgment of the emotional task. To address this issue, we propose EmoPrompt, which aims to provide correct guidance for the reasoning process.

To achieve this goal, we first collect data on a subset of emotional tasks along with their corresponding ground truth labels. By presenting both the “question” (emotion data) and “answer” (ground truth label) to GPT-4V, we obtain objective-to-subjective reasoning in the correct direction, as shown in Fig. 4. This ensures the correctness of the step-by-step reasoning process. Using this methodology, we collect hundreds of examples of reasoning for each emotional task. These examples serve as demonstrations of correct reasoning during the EmoLLM reasoning process.
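As a rough illustration of this label-conditioned step, the sketch below builds the text part of a request that presents the ground truth label and asks for objective-to-subjective reasoning. The function name `build_rationale_request` and the wording of the instruction are hypothetical; the source does not give the actual prompt sent to GPT-4V, and the image or video would be attached separately through whatever API is used.

```python
def build_rationale_request(task, label, options=None):
    """Compose a text instruction that asks a vision-capable LLM to
    explain a known label, moving from objective content to emotion.

    The wording below is a hypothetical template, not the authors' prompt.
    """
    lines = [
        f"Task: {task}.",
        f"The ground-truth answer for the attached image or video is: {label}.",
    ]
    if options:
        lines.append("Candidate answers: " + ", ".join(options) + ".")
    lines += [
        "Step 1: Describe the objective content (faces, objects, scene, atmosphere).",
        "Step 2: Using only those observations, explain why the ground-truth answer holds.",
        "Do not propose a different answer.",
    ]
    return "\n".join(lines)


request = build_rationale_request(
    task="close-set emotion classification",
    label="fear",
    options=["joy", "fear", "anger", "sadness"],
)
```

Because the label is fixed in the request, the returned reasoning chain is constrained to end at the correct answer, which is the property the paragraph above relies on.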

By incorporating EmoPrompt, we guide EmoLLM to follow a correct reasoning path, mitigating the impact of potential hallucinations or inaccuracies in the initial observation stage. This approach enhances the ability of LLMs to accurately understand and interpret emotions in multimodal data.
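One way to picture how collected reasoning examples could act as demonstrations is the sketch below, which prepends a few stored examples of the same task to a new question. This is an assumption about usage: the source does not specify how demonstrations are selected, formatted, or injected, and the `Demo` class, `build_inference_prompt`, and the prompt layout are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One stored reasoning example (hypothetical structure)."""
    task: str
    observation: str
    reasoning: str
    answer: str


def build_inference_prompt(task, question, demos, max_demos=2):
    """Prepend up to `max_demos` demonstrations of the same task to a question."""
    selected = [d for d in demos if d.task == task][:max_demos]
    parts = []
    for i, d in enumerate(selected, 1):
        parts.append(
            f"Example {i}\n"
            f"Observation: {d.observation}\n"
            f"Reasoning: {d.reasoning}\n"
            f"Answer: {d.answer}"
        )
    # The new query ends at "Observation:" so the model starts with objective content.
    parts.append(f"Now answer for the new input.\nQuestion: {question}\nObservation:")
    return "\n\n".join(parts)


demos = [
    Demo("emotion", "A child hides behind a door in a dark hallway.",
         "Hiding and darkness suggest a perceived threat.", "fear"),
    Demo("sarcasm", "Caption praises the weather; the photo shows a flooded street.",
         "The text contradicts the image.", "sarcastic"),
]
prompt = build_inference_prompt("emotion", "Which emotion does the image convey?", demos)
```

Ending the prompt at `Observation:` mirrors the objective-first order of the reasoning chain, so the model's first generated tokens describe content before any emotional judgment.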

Table 2: Comparison of the emotional ability between baseline MLLMs and our EmoLLM on the EmoBench test set (EmoBench Testing, 30K).

| Methods | | Emo-C | Emo-O | Intention | Hate | Humor | Sarcasm | Overall |
|---|---|---|---|---|---|---|---|---|
| Vicuna [52] | zero-shot | 29.21 | 21.55 | 17.48 | 45.39 | 49.68 | 55.23 | 28.63 |
| ChatUniVi [37] | fine-tune | 47.62 | 39.26 | 57.85 | 63.03 | 63.85 | 77.87 | 46.66 |
| MacawLLM [38] | fine-tune | 42.42 | 31.05 | 52.91 | 57.54 | 55.60 | 71.75 | 40.28 |
| OneLLM [39] | fine-tune | 51.16 | 40.30 | 56.95 | 59.01 | 60.89 | 73.93 | 48.20 |
| EmoLLM | fine-tune | 64.06 | 52.58 | 73.99 | 67.43 | 75.69 | 86.67 | 60.36 |

5 Experiments

5.1 Experimental Setup

We adopt CLIP (ViT-L/14) [34] and WHISPER [36] as the visual and acoustic encoders, respectively. For the language foundation model, we choose the Vicuna-v1.5 model [52], which consists of 7B parameters. During the emotional fine-tuning stage, we utilize the data from EmoBench. EmoLLM is trained for 5 epochs with a batch size of 16, using the AdamW [53; 54] optimizer with a cosine learning rate schedule. The learning rate is set to 2e-5, and the warmup rate is 0.03. All input images or frames are resized to $224\times 224$. Training one epoch on $4\times$ RTX 4090 GPUs takes approximately 5 hours using LoRA [55]. Hyperparameters are determined on the validation set, and final results are obtained on the test set. Each result is the average of three runs with different random seeds.
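For readers who want to see the shape of this schedule, the following sketch evaluates a cosine learning rate schedule with warmup at a given step. It assumes that the "warmup rate" of 0.03 is the fraction of total optimizer steps spent in warmup, that warmup is linear from zero, and that the rate decays to zero at the end of training. None of these details are stated in the text, and the total step count used in the example is hypothetical.

```python
import math


def lr_at(step, total_steps, peak_lr=2e-5, warmup_ratio=0.03):
    """Learning rate at `step` for linear warmup followed by cosine decay.

    Assumptions (not stated in the paper): linear warmup from 0,
    warmup length = warmup_ratio * total_steps, decay to 0.
    """
    warmup_steps = max(1, int(round(total_steps * warmup_ratio)))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run of 1000 optimizer steps: the warmup covers the first 30.
schedule = [lr_at(s, total_steps=1000) for s in range(1001)]
```

Under these assumptions the rate reaches its peak of 2e-5 once warmup ends and falls smoothly afterwards, with the decay being fastest in the middle of training.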

Table 3: Comparison of the emotional ability between SOTA MLLMs and EmoLLM.

| Method | #Param | Emo-C | Emo-O |
|---|---|---|---|
| GPT-4V | $\sim 10^{12}$ | 57.90 | 45.10 |
| Gemini1.0 | $\sim 10^{11}$ | 45.47 | 44.83 |
| Gemini1.5 | $\sim 10^{11}$ | 45.47 | 44.83 |
| EmoLLM | $\sim 10^{10}$ | 75.03 | 67.14 |

5.2 Main Results

To quantitatively measure the emotional capability of EmoLLM, we evaluate its performance on six sub-tasks from EmoBench, including close-set and open-set emotion classification, intention recognition, and three special emotional application tasks. As shown in Tab. 2, EmoLLM achieves superior performance compared to baselines with the same 7B parameter scale, demonstrating the effectiveness of our proposed approach.

We also compare the emotional understanding abilities of state-of-the-art MLLMs on an emotion sub-test set. Considering that some MLLMs do not support video and audio, we take a subset of pure images from EmoBench test set. It contains 6 emotion categories with hundreds of images in each category. As presented in Tab. 3, EmoLLM outperforms GPT-4V, Gemini-1.0, and Gemini-1.5 on both close-set (Emo-C) and open-set (Emo-O) emotion classification tasks while maintaining a smaller parameter count.

5.3 Ablation Studies

We conduct ablation studies to explore the key design choices in EmoLLM. All experiments are conducted on the Emo-C part of EmoBench test set, with other settings unchanged unless specified.

Multi-perspective Visual Projection We investigate the impact of the hyperparameter $\tau$ in the Multi-perspective Visual Projection module by varying its value from 0.05 to 0.5. As shown in Fig. 5 (left), the performance of EmoLLM is sensitive to the choice of $\tau$, with the highest accuracy of 64.06% achieved when $\tau$ is set to 0.1. The accuracy tends to decline as $\tau$ increases, indicating that a suitable value of $\tau$ is beneficial for the emotional understanding capability of EmoLLM.

Quantity Effects in EmoPrompt To examine the impact of the number of EmoPrompts on the performance of EmoLLM, we vary the number of prompts from 100 to 1000 and evaluate the emotional capability. As depicted in Fig. 5 (right), increasing the number of EmoPrompts generally leads to improved performance, with the peak accuracy of 64.06% achieved when all prompts are used. This finding highlights the importance of utilizing a diverse set of prompts to enhance the emotional reasoning ability of LLMs. However, the performance gains diminish as the number of prompts exceeds 600, suggesting an optimal range for balancing computational efficiency and emotional understanding.

Effect of the Tuning Strategy We investigate whether the order in which objective and emotional data are used for training affects the emotional understanding ability of LLMs. In Tab. 4, we compare the performance of three training strategies: emo (training with only EmoBench), mix (training with objective fine-tuning data mixed with EmoBench), and sequential (fine-tuning with objective data first and then with emotional tasks). The results suggest that sequential training substantially benefits emotional understanding. A possible explanation is that it simulates the way humans learn, starting with easy tasks and progressing to more difficult ones, while also moving from general knowledge to domain-specific knowledge.
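The three strategies can be summarized as data schedules, as in the sketch below. The function `build_stages` and the use of a single shuffled pool for mix are assumptions for illustration only; the source does not describe the mixing ratio, the objective dataset, or the number of epochs per stage.

```python
import random


def build_stages(strategy, objective_data, emotion_data, seed=0):
    """Return the training stages for a strategy; each stage is a list of samples."""
    if strategy == "emo":
        # Emotional data only.
        return [list(emotion_data)]
    if strategy == "mix":
        # One stage over a shuffled union of both sources (assumed mixing scheme).
        mixed = list(objective_data) + list(emotion_data)
        random.Random(seed).shuffle(mixed)
        return [mixed]
    if strategy == "sequential":
        # Objective data first, emotional data second.
        return [list(objective_data), list(emotion_data)]
    raise ValueError(f"unknown strategy: {strategy}")


stages = build_stages("sequential", ["obj_a", "obj_b"], ["emo_a"])
```

The only structural difference between mix and sequential is whether the two sources share one stage or are visited in order, which is the variable the ablation isolates.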

Table 4: Various training strategies affect the emotional understanding ability of LLMs. Training on traditional tasks first and then emotional tasks (sequential) leads to the best results. Values in parentheses are differences from the emo row.

| Training Strategy | Emo-C | Emo-O | Intention | Hate | Humor | Sarcasm | Overall |
|---|---|---|---|---|---|---|---|
| emo | 61.65 | 47.40 | 67.71 | 62.44 | 70.82 | 80.22 | 56.24 |
| mix | 63.05 (+1.40) | 49.32 (+1.92) | 73.54 (+5.83) | 65.90 (+3.46) | 67.65 (-3.17) | 83.74 (+3.52) | 58.11 (+1.87) |
| sequential | 64.06 (+2.41) | 52.58 (+5.18) | 73.99 (+6.28) | 67.43 (+4.99) | 75.69 (+4.87) | 86.67 (+6.45) | 60.36 (+4.12) |
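The gains reported next to each score in Table 4 are differences from the emo row, which the short check below recomputes from the table values. The dictionary layout and the helper name `deltas` are illustrative choices; the numbers are copied from the table.

```python
# Scores from Table 4, in column order:
# Emo-C, Emo-O, Intention, Hate, Humor, Sarcasm, Overall.
scores = {
    "emo":        [61.65, 47.40, 67.71, 62.44, 70.82, 80.22, 56.24],
    "mix":        [63.05, 49.32, 73.54, 65.90, 67.65, 83.74, 58.11],
    "sequential": [64.06, 52.58, 73.99, 67.43, 75.69, 86.67, 60.36],
}


def deltas(strategy, baseline="emo"):
    """Per-column difference between a strategy and the baseline row."""
    return [round(s - b, 2) for s, b in zip(scores[strategy], scores[baseline])]


mix_gain = deltas("mix")
sequential_gain = deltas("sequential")
```

The only negative entry is Humor under mix, while sequential improves every column over emo.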
Figure 5: Hyperparameter ablation in Multi-perspective Visual Projection and EmoPrompts. EmoLLM has the best performance when $\tau$ is 0.1. For EmoPrompts, diversified prompts can enhance the emotional reasoning ability of LLM.

6 Conclusion

In this work, we introduce EmoBench, a comprehensive benchmark for enhancing and evaluating the emotional understanding capabilities of Multimodal Large Language Models (MLLMs), and propose EmoLLM, a novel model incorporating Multi-perspective Visual Projection and EmoPrompt techniques. Through extensive experiments on EmoBench, we demonstrate substantial improvements of EmoLLM over baselines, with an average improvement of 12.1% across multiple foundation models.

Limitations. One notable limitation is that the answers to the instructions in EmoBench may lack diversity, since they were generated by GPT-4 and automated scripts rather than collected from human annotators. Combining automation with manual labeling is a promising direction for future work. Another limitation is the inherent vulnerabilities of LLMs, such as hallucination and sensitivity to prompts, which may affect the performance of EmoLLM.

Future Work. Despite these limitations, we believe our work takes a significant step towards enabling MLLMs to achieve a deeper understanding of complex emotions in multimodal data, paving the way for emotionally intelligent AI systems. Future work could focus on addressing the limitations mentioned above, such as increasing the diversity of EmoBench through a combination of automated and manual labeling, and mitigating the vulnerabilities of LLMs. Furthermore, exploring the application of emotionally intelligent AI systems in real-world scenarios and evaluating their impact on user experience and well-being could be valuable avenues for future research.

References

  • [1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., “Gpt-4 technical report,” arXiv preprint, 2023.
  • [2] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint, 2023.
  • [3] R. Zhang, J. Han, C. Liu, P. Gao, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, and Y. Qiao, “Llama-adapter: Efficient fine-tuning of language models with zero-init attention,” arXiv preprint, 2023.
  • [4] P. Gao, J. Han, R. Zhang, Z. Lin, S. Geng, A. Zhou, W. Zhang, P. Lu, C. He, X. Yue et al., “Llama-adapter v2: Parameter-efficient visual instruction model,” arXiv preprint, 2023.
  • [5] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. N. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-language models with instruction tuning,” NIPS, 2024.
  • [6] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” JMLR, 2024.
  • [7] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, “S3fd: Single shot scale-invariant face detector,” in ICCV, 2017.
  • [8] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” NIPS, 2022.
  • [9] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “Iemocap: Interactive emotional dyadic motion capture database,” Language resources and evaluation, 2008.
  • [10] A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L.-P. Morency, “Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph,” in ACL, 2018.
  • [11] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “Meld: A multimodal multi-party dataset for emotion recognition in conversations,” in ACL, 2019.
  • [12] R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza, “Emotion recognition in context,” in CVPR, 2017.
  • [13] R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza, “Context based emotion recognition using emotic dataset,” IEEE TPAMI, 2019.
  • [14] D. Demszky, D. Movshovitz-Attias, J. Ko, A. Cowen, G. Nemade, and S. Ravi, “Goemotions: A dataset of fine-grained emotions,” arXiv preprint, 2020.
  • [15] S. Larson, A. Mahendran, J. J. Peper, C. Clarke, A. Lee, P. Hill, J. K. Kummerfeld, K. Leach, M. A. Laurenzano, L. Tang et al., “An evaluation dataset for intent classification and out-of-scope prediction,” in EMNLP, 2019.
  • [16] X. Liu, A. Eshghi, P. Swietojanski, and V. Rieser, “Benchmarking natural language understanding services for building conversational agents,” arXiv preprint, 2019.
  • [17] M. Jia, Z. Wu, A. Reiter, C. Cardie, S. Belongie, and S.-N. Lim, “Intentonomy: a dataset and study towards human intent understanding,” in CVPR, 2021.
  • [18] A. Coucke, A. Saade, A. Ball, T. Bluche, A. Caulier, D. Leroy, C. Doumouro, T. Gisselbrecht, F. Caltagirone, T. Lavril et al., “Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces,” arXiv preprint, 2018.
  • [19] J. Kruk, J. Lubin, K. Sikka, X. Lin, D. Jurafsky, and A. Divakaran, “Integrating text and image: Determining multimodal document intent in Instagram posts,” in EMNLP, 2019.
  • [20] A. Jia, Y. He, Y. Zhang, S. Uprety, D. Song, and C. Lioma, “Beyond emotion: A multi-modal dataset for human desire understanding,” in NAACL, 2022.
  • [21] I. Casanueva, T. Temcinas, D. Gerz, M. Henderson, and I. Vulic, “Efficient intent detection with dual sentence encoders,” in ACL WorkShop, 2020.
  • [22] Q. Yang, M. Ye, Z. Cai, K. Su, and B. Du, “Composed image retrieval via cross relation network with hierarchical aggregation transformer,” IEEE Transactions on Image Processing, 2023.
  • [23] H. Zhang, H. Xu, X. Wang, Q. Zhou, S. Zhao, and J. Teng, “Mintrec: A new dataset for multimodal intent recognition,” in ACM MM, 2022.
  • [24] M. Ye, Q. Shi, K. Su, and B. Du, “Cross-modality pyramid alignment for visual intention understanding,” IEEE Transactions on Image Processing.
  • [25] Q. Shi, M. Ye, Z. Zhang, and B. Du, “Learnable hierarchical label embedding and grouping for visual intention understanding,” IEEE Transactions on Affective Computing.
  • [26] K. Sun, Z. Xie, M. Ye, and H. Zhang, “Contextual augmented global contrast for multimodal intent recognition,” in CVPR, 2024.
  • [27] Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, “Multimodal transformer for unaligned multimodal language sequences,” in ACL, 2019.
  • [28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017.
  • [29] D. Hazarika, R. Zimmermann, and S. Poria, “Misa: Modality-invariant and-specific representations for multimodal sentiment analysis,” in ACM MM, 2020.
  • [30] D. Yang, H. Kuang, S. Huang, and L. Zhang, “Learning modality-specific and-agnostic representations for asynchronous multimodal language sequences,” in ACM MM, 2022.
  • [31] Y. Zhang, M. Wang, P. Tiwari, Q. Li, B. Wang, and J. Qin, “Dialoguellm: Context and emotion knowledge-tuned llama models for emotion recognition in conversations,” arXiv preprint, 2023.
  • [32] G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth et al., “Gemini: a family of highly capable multimodal models,” arXiv preprint, 2023.
  • [33] H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” Advances in neural information processing systems, vol. 36, 2024.
  • [34] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021.
  • [35] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in NIPS, 2020.
  • [36] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in ICML, 2023.
  • [37] P. Jin, R. Takanobu, C. Zhang, X. Cao, and L. Yuan, “Chat-univi: Unified visual representation empowers large language models with image and video understanding,” arXiv preprint, 2023.
  • [38] C. Lyu, M. Wu, L. Wang, X. Huang, B. Liu, Z. Du, S. Shi, and Z. Tu, “Macaw-llm: Multi-modal language modeling with image, audio, video, and text integration,” arXiv preprint, 2023.
  • [39] J. Han, K. Gong, Y. Zhang, J. Wang, K. Zhang, D. Lin, Y. Qiao, P. Gao, and X. Yue, “Onellm: One framework to align all modalities with language,” arXiv preprint, 2023.
  • [40] S. Bianco, L. Celona, M. Donzella, and P. Napoletano, “Improving image captioning descriptiveness by ranking and llm-based fusion,” arXiv preprint arXiv:2306.11593, 2023.
  • [41] M. Dzabraev, A. Kunitsyn, and A. Ivaniuta, “Vlrm: Vision-language models act as reward models for image captioning,” arXiv preprint arXiv:2404.01911, 2024.
  • [42] W. Hu, Y. Xu, Y. Li, W. Li, Z. Chen, and Z. Tu, “Bliva: A simple multimodal llm for better handling of text-rich visual questions,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 3, 2024, pp. 2256–2264.
  • [43] I. Sterner, W. Lin, J. Chen, and B. Byrne, “Few-shot vqa with frozen llms: A tale of two approaches,” arXiv preprint arXiv:2403.11317, 2024.
  • [44] H. Wu, Z. Zhang, E. Zhang, C. Chen, L. Liao, A. Wang, K. Xu, C. Li, J. Hou, G. Zhai et al., “Q-instruct: Improving low-level visual abilities for multi-modality foundation models,” arXiv preprint arXiv:2311.06783, 2023.
  • [45] J. Lee, S. Kim, S. Kim, J. Park, and K. Sohn, “Context-aware emotion recognition networks,” in ICCV, 2019.
  • [46] K.-C. Peng, T. Chen, A. Sadovnik, and A. C. Gallagher, “A mixed bag of emotions: Model, predict, and transfer emotion distributions,” in CVPR, 2015.
  • [47] L. Hyun, K. Sung-Bin, S. Han, Y. Yu, and T.-H. Oh, “Smile: Multimodal dataset for understanding laughter in video with language models,” arXiv preprint, 2023.
  • [48] M.-H. Van and X. Wu, “Detecting and correcting hate speech in multimodal memes with large visual language model,” arXiv preprint, 2023.
  • [49] L. Qin, S. Huang, Q. Chen, C. Cai, Y. Zhang, B. Liang, W. Che, and R. Xu, “Mmsd2.0: Towards a reliable multi-modal sarcasm detection system,” arXiv preprint, 2023.
  • [50] J. Jiang, Y. Chen, X. Meng, L. Wang, and K. Li, “A novel density peaks clustering algorithm based on k nearest neighbors for improving assignment process,” Physica A: Statistical Mechanics and its Applications, 2019.
  • [51] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” arXiv preprint, 2016.
  • [52] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez et al., “Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality,” https://vicuna.lmsys.org, 2023.
  • [53] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [54] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint, 2017.
  • [55] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint, 2021.