Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations
Abstract
The task of multi-modal emotion recognition in conversation (MERC) aims to analyze the genuine emotional state of each utterance based on the multi-modal information in the conversation, which is crucial for conversation understanding. Existing methods focus on using graph neural networks (GNN) to model conversational relationships and capture contextual latent semantic relationships. However, due to the complexity of GNN, existing methods cannot efficiently capture the potential dependencies between long-distance utterances, which limits the performance of MERC. In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversations. Specifically, we first use pre-extracted text, video and audio features as input to Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. Then, we use low-level utterance features to construct a conversational emotion interaction graph. To efficiently capture the potential dependencies between long-distance utterances, we use the dilated generalized forward push algorithm to precompute the emotional propagation between global utterances and design an emotional relation-aware operator to capture the potential semantic associations between different utterances. Furthermore, we combine early fusion and adaptive late fusion mechanisms to fuse latent dependency information between speaker relationship information and context. Finally, we obtain high-level discourse features and feed them into MLP for emotion prediction. Extensive experimental results show that ELR-GNN achieves state-of-the-art performance on the benchmark datasets IEMOCAP and MELD, with running times reduced by 52% and 35%, respectively. In addition, ELR-GNN can effectively improve the accuracy of the MERC task by capturing and fusing the latent semantic relationships between utterances.
keywords:
Graph Neural Network, Multi-modal Emotion Recognition, Relation-aware, Efficiency, Information Fusion1 Introduction
Multi-modal emotion recognition in conversations (ME-RC) has received research attention [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] due to its wide application in the fields of intelligent customer service and emotion analysis [11], human-computer interaction (HCI) [12], and security monitoring [13]. For instance, in HCI, MERC can help computers better understand the emotional state of human users, thereby enabling more intelligent interactions and improving user experience. Unlike traditional non-conversational or unimodal emotion recognition tasks [14], MERC requires identifying the speaker’s genuine emotions based on textual, auditory, and visual information in the conversation utterances [15].
The current mainstream research methods mainly use RNN, Transformer, and GCN to model conversation context and multi-modal information in MERC. For example, DialogueRNN [16] uses a sequential approach to track conversation context and captures the most important emotional features through a memory mechanism. Although RNN-based methods can model the speaker’s contextual information, they have limited memory ability for long-distance conversations, which limits the application of RNN in MERC tasks [17]. To solve the above problems, the Transformer architecture [18] is proposed to model long-distance context dependencies in MERC. For instance, CTNet [19] builds a Single Transformer and Cross Transformer to capture long-distance context dependencies and realize intra-module and inter-module information interaction for emotion recognition. However, methods based on Transformer architecture ignore conversational relationship information between speakers, which limits the model’s emotion recognition performance [20, 21]. To tackle this limitation, many GCN methods have been proposed to model interaction information between speakers. For example, DialogueGCN [22] uses a graph structure to model conversation context and uses GCN to learn conversation graphs to achieve semantic understanding and emotional recognition of conversations. In addition, LR-GCN [23] believes that the context latent dependencies of utterences should also be considered. LR-GCN uses multi-head attention to construct multiple full association graphs to model potential conversational relationships, and then uses GCN to learn latent relationships to achieve emotion recognition. However, limited by the complexity of GCN, these methods usually adopt a fixed window size strategy and then fully connect the utterances within the window to construct a conversation graph, which significantly limits the ability to obtain long-distance contextual information.
Inspired by LR-GCN, we also use GCN to model dialogue relationship information between speakers for MERC. Furthermore, long-distance context potential dependencies can provide more information for emotion classification and help reveal the genuine emotion of utterances. Therefore, how to comprehensively consider long-distance contextual dependencies while ensuring that the number of model parameters does not increase dramatically remains a challenge.
In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversation. Specifically, we first use RoBERTa, 3D-CNN, and openSMILE to perform pre-feature extraction of text, video, and audio features, respectively. Next, we use Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. We then use low-level utterence features to construct a speaker graph. In the constructed speaker relationship graph, low-level utterence features are used as node features, while dialogue relationship information between speakers is used for edge construction. To capture the latent dependency information between long-distance contexts, we use the graph random neural network algorithm to randomly sample top- nodes for information extraction. In addition, we combine early fusion and adaptive late fusion mechanisms to simultaneously fuse speaker relationship information and latent dependency information between contexts. Finally, we fine-grained obtained high-level utterance features and fed them into the MLP and softmax function for emotion prediction.
-
1.
We propose a novel Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for MER. ELR-GCN not only considers conversational relationship information between speakers, but also captures long-distance context latent dependency information.
-
2.
We propose a graph random neural network architecture in which long-distance latent dependencies between utterances are captured by randomly sampling top- node features. Furthermore, we combine early fusion and adaptive late fusion mechanisms to simultaneously exploit speaker information and context’s latent dependency information during information propagation.
-
3.
We perform extensive experiments on two publicly available datasets to verify the effectiveness of the ELR-GNN method.
2 Related Work
2.1 Multi-modal Emotion Recognition
Single-modality emotion recognition may be limited. For example, text-based emotion recognition alone may not capture the emotional cues in speech and facial expressions [24]. Multimodal emotion recognition (MER) can integrate multiple information sources to improve the accuracy and robustness of emotion recognition.
Current mainstream MER research mainly focuses on RNN, Transformer and GCN. For instance, DialogueRNN [16] modeled individual speakers and uses three different GRUs to achieve more effective correlations between speakers. DialogueGCN [22] was proposed to solve the problem that RNN-based methods cannot consider the dialogue relationship between speakers. DialogueGCN improves the performance of MER by modeling the interactive relationship between speakers through the inherent properties of the graph structure and using graph convolution operations to transfer contextual semantic information. TL-ERC [25] used transfer learning methods to solve problems in supervised learning that require large amounts of high-quality annotated data. CTNet [19] proposed a multi-modal learning framework, which achieves cross-modal contextual semantic information interaction by building a single Transformer and a cross Transformer.
However, current mainstream methods only consider contextual semantic information, latent dependencies of local utterances, and conversational relationships between speakers, and their focus is on exploring the semantic information between utterances and the correlation between speakers. The above approach ignores latent dependencies of the global context, which limits the performance of MER.
![Refer to caption](x1.png)
2.2 Scalable Graph Neural Network
The current mainstream scalable GNN methods include three types of methods: 1) Node sampling strategy: Accelerate the aggregation process of node features by sampling nodes. The representative methods include GraphSAGE [26], FastGCN [27] and LADIES [28]. The main idea of GraphSage is to update the representation of each node through multiple rounds of neighbor sampling and aggregation of neighbor node information, thereby capturing the structure and relationships between nodes in graph data. FastGCN proposes a structured graph node sampling strategy, which selects sampling nodes by considering the structural information of the graph to retain important structural features of the graph. This sampling strategy can preserve the graph information as much as possible while ensuring sampling efficiency. LADIES adopts an adaptive density modeling method to capture local and global information by learning the density distribution of neighbors around a node. LADIES can effectively update the representation of nodes to one that takes into account both local and global information. 2) Graph partitioning method: Divide the original large graph into several small subgraphs and run GNN on the subgraphs. The mainstream graph partitioning methods include Cluster-GCN [29] and GraphSAINT [30]. The mainstream graph partitioning methods include Cluster-GCN and GraphSA-INT. Cluster-GCN divides the original large-scale graph data into multiple subgraphs, each subgraph contains a part of nodes and corresponding edges, thereby reducing computational and memory overhead. GraphSAINT processes large-scale graph data through graph sampling and iterative coarsening. 3) Matrix approximation method: Accelerate feature propagation by decoupling feature propagation and nonlinear transformation. SGC simplifies the nonlinear activation function in traditional graph convolutional networks, retaining only graph convolution operations.
2.3 Multi-head Attention
The multi-head attention in MER can help the model effectively capture the correlation information between different modalities and adaptively focus on the most important parts for the emotion classification task. For example, TEMMA [31] proposesd a multi-modal multi-head attention for MER to comprehensively consider the complementarity and redundancy between modalities. TE-MMA can realize the semantic information interaction between modalities and capture the temporal dependence within the modalities. GA2MIF [32] constructed a multi-head directed graph attention network and a multi-head pairwise cross-modal attention network respectively to ach-ieve contextual semantic information extraction and cross-modal information fusion. EEANet [33] used a multi-head self-attention mechanism to capture the discriminative features in contextual semantic information that are most suitable for emotion classification.
We apply an attention mechanism to the utterance features obtained through graph convolution operations to calculate the correlation between contexts and capture the utterances with the strongest emotional features among the global context latent dependencies. Our method ELR-GNN can simultaneously consider contextual semantic information, interaction information between speakers, and latent dependency information of the global context.
3 Proposed Method
The overall framework of the ELR-GNN proposed in this paper is shown in Fig. 1. ELR-GNN consists of four stages, including sequential contextual feature extraction, graph construction, long-distance contextual latent relationship exploration, and information fusion. In the following subsections, we describe these four key parts in detail.
3.1 Sequential context information extraction
The speaker’s emotional state is not only related to the textual semantic information at the current moment, but also related to the previous contextual semantic information. Therefore, we use Bi-LSTM to capture contextual semantic information in multi-modal features to more accurately understand the speaker’s emotional changes. The formula of LSTM is defined as follows:
(1) | ||||
where represents the concatenated multi-modal features, represents the hidden layer state, represents the input gate, represents the forgetting gate, represents the cell state, and represents Hadamard product, is a learnable network parameter.
Bi-LSTM is composed of forward and reverse LSTM, and its formula is defined as follows:
(2) |
where is obtained by concatenating the contextual semantic features extracted by forward and reverse LSTM.
3.2 Graph Construction
We use the inherent properties of the graph structure to construct a speaker relationship graph, in which the contextual semantic features extracted through Bi-LSTM are used as node features of the graph, and the dialogue relationships between speakers are used as edges. Specifically, given a speaker dialogue graph , where the node is composed of contextual semantic features (i.e., ), the edge indicates that there is a conversation relationship between node and node , otherwise , represents the weight of edge , and represents the edge relationship.
3.3 Long-distance Latent Context Relationship Extraction
Unlike previous work that set the context window size to 10 (i.e., the number of nodes), to capture long-distance latent dependencies of contexts, we adopt a larger context window to explore potential correlations between contexts. Specifically, we first construct an original graph with a larger context window, the generalized forward push algorithm is then used to calculate the propagation matrix of the row vectors, and top- sparsification is used to further reduce the training time of the network, so as to comprehensively consider the latent correlation of the context.
3.3.1 Propagation Matrix
We use a mixed-order matrix of feature propagation to aggregate neighbor node information of different orders in the graph to obtain long-distance contextual latent dependency information. The formula of the propagation matrix is defined as follows:
(3) |
where and , is the adjacency matrix, and is the degree matrix. The propagation matrix can fuse different orders of neighbor node information and capture important contextual potential dependency information by adjusting the weights.
Then we aggregate the node features and update the node features, defined as follows:
(4) |
where , is the row vectors of the node , is the indices of non-zero value of . Through Eq. 4, we can solve the problem of slow inference speed caused by the high computational complexity of GCN and achieve rapid training of the model. Therefore, we can construct larger graphs to capture long-distance context latent dependencies.
However, is actually a difficult estimation problem. To address the problem, We use a two-stage estimation step for calculation, which includes Generalized Forward Push (GFP) and Top- sparsification. First, GFP gives the error bound of , and then Top- sparsification only retains top- elements to achieve faster calculation speed.
3.3.2 Generalized Forward Push
Since the row-normalized adjacency matrix is also an inverse random walk transition probability matrix on , we design an efficient GFP estimation algorithm to estimate . The key step of GFP is to accelerate the random walk probability diffusion process through pruning operation. Specifically, we first give two initial vectors and , and both and are initialized to , where and for . Furthermore, . Then, the GFP algorithm begins to iteratively update the and vectors through and until node satisfies , where . When the GFP iteration is complete, we get .
3.3.3 Top- Sparsification
To reduce the computational complexity of GCN, we perform top- sparsification on to accelerate model training. The core idea of Top- sparsification is to retain only the top- largest elements of , and set other elements to 0. Therefore, has only non-zero elements, which preserves the most important emotion features in the latent dependencies of the context.
3.3.4 Learnable Information Propagation
Therefore, we introduce a learnable parameter to achieve dimensionality reduction of multi-modal features while improving the learning ability of the model. The formula is defined as follows:
(5) |
3.4 Auxiliary Information Module
Graph random neural networks can effectively extract dialogue relationship information between speakers and long-distance context potential dependency information, but it is easy to ignore some discriminative original full-emotion features. Therefore, we use AIM to extract and fuse higher-level emotional features, adaptively aggregating original emotional features, speaker relationship information, and long-distance context potential dependency information.
3.4.1 Feature Extractor (AIM-FE)
Multimodal data are characterized by noise and high dimensionality. To achieve denoising and capture discriminative emotional features in multi-modal data, we introduce gated convolutional networks to capture auxiliary information. In the gated convolutional network, we use sigmoid and tanh functions, which can retain the most important emotional feature information and improve the nonlinear fitting ability of the model. The formula of the gated convolutional network is defined as follows:
(6) | ||||
where represents 1D convolution operations, represents Hadama product.
3.4.2 Late Adaptive Fusion (AIM-LAF)
To capture finer-grained semantic information in multi-modal data, early and late adaptive fusion mechanisms are combined to capture auxiliary information with fine-grained emotional features. Specifically, late fusion fuses highly abstract time and space information, ignoring detailed information. Therefore, the combination of early and late adaptive fusion mechanisms proposed in this paper can more effectively capture more discriminative emotional features adaptively from multi-modal data.
In the early fusion process, we map the contextual features through the gated convolutional network and the latent features through the graph random neural network to the same dimension, obtain and and fuse them. The formula is defined as follows:
(7) |
where represents summation average operation. Then we use a FCN to achieve feature dimensionality reduction and obtain , , and . Then we use the attention mechanism to obtain the corresponding attention score as follows:
(8) |
where represents the query matrix, and are the learnable parameters. Likewise, and are calculated using Eq. 8. Then we use the softmax function to normalize the attention coefficient as follows:
(9) |
Finally, we perform a weighted sum of and to obtain the final emotional feature vector representation. The formula is defined as follows:
(10) |
3.5 Model Training
The final emotional feature vector with contextual semantic information, dialogue relationship information between speakers, and long-distance latent dependency information is fed into the MLP with residual connections for feature conversion, and then use the softmax layer to get the probability of C-class emotion category:
(11) | ||||
where , , , is the learnable parameters. We then obtain the index of the maximum emotion probability by using the argmax function.
(12) |
Finally, we use cross-entropy loss to complete the optimization of the model:
(13) |
where represents the number of dialogues, and represents the number of utterances in the -th dialogue.
4 Experiments
4.1 Datasets
We evaluate the ELR-GNN model proposed in this paper on two benchmark datasets, IEMOCAP [34] and MELD [35]. All these data sets contain three modal data sets: text, video, and audio.
IEMOCAP is a public dataset widely used in emotion recognition research. This dataset was created by the Sippy team at the University of Southern California and aims to provide detailed annotations of emotional interactions and speech/non-verbal behaviors. The IEMOCAP dataset emotionally annotates speech and video, including six emotion categories: happy, sad, angry, excited, frustrated, and neutral. Emotional annotation is accomplished through consistent annotation of data by multiple evaluators. The IEMOCAP dataset contains text, audio, and video data from 10 different actors. Each actor participated in a series of emotional interaction tasks.
MELD is an open multi-modal dataset for emotion analysis research. It was created by researchers at the University of Toronto to advance research into natural language and speech emotion recognition. The MELD data set contains data in three modalities: text, video and audio. The text is the script text from the movie dialogue. The MELD dataset contains annotations for six emotion categories: joy, sadness, anger, fear, surprise and neutral. These emotion annotations are performed independently by multiple annotators.
Methods | IEMOCAP | ||||||
---|---|---|---|---|---|---|---|
Happy | Sad | Neutral | Angry | Excited | Frustrated | Average(w) | |
Acc. F1 | Acc. F1 | Acc. F1 | Acc. F1 | Acc. F1 | Acc. F1 | Acc. F1 | |
TextCNN | 27.7 29..8 | 57.1 53.8 | 34.3 40.1 | 61.1 52.4 | 46.1 50.0 | 62.9 55.7 | 48.9 48.1 |
bc-LSTM | 29.1 34.4 | 57.1 60.8 | 54.1 51.8 | 57.0 56.7 | 51.1 57.9 | 67.1 58.9 | 55.2 54.9 |
MFN | 24.0 34.1 | 65.6 70.5 | 55.5 52.1 | 72.3 66.8 | 64.3 62.1 | 67.9 62.5 | 60.1 59.9 |
CMN | 25.0 30.3 | 55.9 62.4 | 52.8 52.3 | 61.7 59.8 | 55.5 60.2 | 71.1 60.6 | 56.5 56.1 |
LFM | 25.6 33.1 | 75.1 78.8 | 58.5 59.2 | 64.7 65.2 | 80.2 71.8 | 61.1 58.9 | 63.4 62.7 |
ICON | 22.2 29.9 | 58.8 64.6 | 62.8 57.4 | 64.7 63.0 | 58.9 63.4 | 67.2 60.8 | 59.1 58.5 |
A-DMN | 43.1 50.6 | 69.4 76.8 | 63.0 62.9 | 63.5 56.5 | 88.3 77.9 | 53.3 55.7 | 64.6 64.3 |
DialogueGCN | 40.6 42.7 | 89.1 84.5 | 62.0 63.5 | 67.5 64.1 | 65.5 63.1 | 64.1 66.9 | 65.2 64.1 |
RGAT | 60.1 51.6 | 78.8 77.3 | 60.1 65.4 | 70.7 63.0 | 78.0 68.0 | 64.3 61.2 | 65.0 65.2 |
AGHMN | 48.3 52.1 | 68.3 73.3 | 61.6 58.4 | 57.5 61.9 | 68.1 69.7 | 67.1 62.3 | 63.5 63.5 |
BiERU | 54.2 31.5 | 80.6 84.2 | 64.7 60.2 | 67.9 65.7 | 62.8 74.1 | 61.9 61.3 | 66.1 64.7 |
CoMPM | 59.9 60.7 | 78.0 82.2 | 60.4 63.0 | 70.2 59.9 | 85.8 78.2 | 62.9 59.5 | 67.7 67.2 |
EmoBERTa | 56.9 56.4 | 79.1 83.0 | 64.0 61.5 | 70.6 69.6 | 86.0 78.0 | 63.8 68.7 | 67.3 67.3 |
COGMEN | 57.4 51.9 | 81.4 81.7 | 65.4 68.6 | 69.5 66.0 | 83.3 75.3 | 63.8 68.2 | 68.2 67.6 |
CTNet | 47.9 51.3 | 78.0 79.9 | 69.0 65.8 | 72.9 67.2 | 85.3 78.7 | 52.2 58.8 | 68.0 67.5 |
LR-GCN | 54.2 55.5 | 81.6 79.1 | 59.1 63.8 | 69.4 69.0 | 76.3 74.0 | 68.2 68.9 | 68.5 68.3 |
DER-GCN | 60.7 58.8 | 75.9 79.8 | 66.5 61.5 | 71.3 72.1 | 71.1 73.3 | 66.1 67.8 | 69.7 69.4 |
ELR-GCN | 64.7 62.9 | 75.7 80.8 | 66.2 62.4 | 70.7 70.0 | 76.8 78.6 | 67.9 68.1 | 70.6 70.9 |
4.2 Baselines and Evaluation Metrics
bc-LSTM [36] performs final emotion recognition by extracting the sequential context information of the utterance, which is context-sensitive.
Text-CNN [37] uses convolution filters to extract local semantic information from utterances, which is context-independent.
MFN [38] designs a multi-view learning mechanism to capture view-specific and cross-view semantic information, but MFN does not consider contextual information.
CMN [39] achieves the fusion of speaker information and multi-modal features by introducing an attention mechanism.
ICON [40] uses GRU to extract contextual information of multi-modal features and uses attention layers to achieve the fusion of multi-modal semantic information.
DialogueRNN [16] constructs three different gating units to achieve the extraction and fusion of speaker information, emotional information and global information.
DialogueGCN [22] DialogueGCN constructs a speaker relationship graph by using contextual semantic features, and utilizes contextual semantic information and speaker relationship information to achieve emotion classification.
Methods | MELD | |||||||
---|---|---|---|---|---|---|---|---|
Neutral | Surprise | Fear | Sadness | Joy | Disgust | Anger | Average(w) | |
Acc. F1 | Acc. F1 | Acc. F1 | Acc. F1 | Acc. F1 | Acc. F1 | Acc. F1 | Acc. F1 | |
TextCNN | 76.2 74.9 | 43.3 45.5 | 4.6 3.7 | 18.2 21.1 | 46.1 49.4 | 8.9 8.3 | 35.3 34.5 | 56.3 55.0 |
bc-LSTM | 78.4 73.8 | 46.8 47.7 | 3.8 5.4 | 22.4 25.1 | 51.6 51.3 | 4.3 5.2 | 36.7 38.4 | 57.5 55.9 |
DialogueRNN | 72.1 73.5 | 54.4 49.4 | 1.6 1.2 | 23.9 23.8 | 52.0 50.7 | 1.5 1.7 | 41.0 41.5 | 56.1 55.9 |
DialogueGCN | 70.3 72.1 | 42.4 41.7 | 3.0 2.8 | 20.9 21.8 | 44.7 44.2 | 6.5 6.7 | 39.0 36.5 | 54.9 54.7 |
RGAT | 76.0 78.1 | 40.1 41.5 | 3.0 2.4 | 32.1 30.7 | 68.1 58.6 | 4.5 2.2 | 40.0 44.6 | 60.3 61.1 |
CoMPM | 78.3 82.0 | 48.3 49.2 | 1.7 2.9 | 35.9 32.3 | 71.4 61.5 | 3.1 2.8 | 42.2 45.8 | 64.1 65.3 |
EmoBERTa | 78.9 82.5 | 50.2 50.2 | 1.8 1.9 | 33.3 31.2 | 72.1 61.7 | 9.1 2.5 | 43.3 46.4 | 64.1 65.2 |
ConGCN | 46.8 45.4 | 10.6 8.8 | 8.7 8.1 | 53.1 54.6 | 76.7 75.2 | 28.5 26.3 | 50.3 48.4 | 59.4 58.7 |
A-DMN | 76.5 78.9 | 56.2 55.3 | 8.2 8.6 | 22.1 24.9 | 59.8 57.4 | 1.2 3.4 | 41.3 40.9 | 61.5 60.4 |
LR-GCN | 76.7 80.0 | 53.3 55.2 | 0.0 0.0 | 49.6 35.1 | 68.0 64.4 | 10.7 2.7 | 48.0 51.0 | 65.7 65.6 |
DER-GCN | 76.8 80.6 | 50.5 51.0 | 14.8 10.4 | 56.7 41.5 | 69.3 64.3 | 17.2 10.3 | 52.5 57.4 | 66.8 66.1 |
ELR-GCN | 80.2 83.6 | 36.8 35.4 | 19.2 13.1 | 80.2 83.6 | 76.5 69.7 | 55.6 13.0 | 52.1 57.7 | 68.7 69.9 |
ConGCN [41] treats multimodal features as node features in the graph and utilizes heterogeneous graphs to model conversational relationship information between spe-akers.
LR-GCN [23] captures the latent dependencies between contexts by constructing multiple graphs and constructs densely connected layers to extract speaker relationship information and structural information of the graph.
AGHMN [42] uses BiGRU to fuse the correlation information between historical contexts and uses the attention mechanism to give higher weight to important context information.
BiERU [43] uses emotion recurrent units and emotion feature extractors to extract contextual semantic information respectively. and refine contextual emotion feature vectors.
EmoBERTa [44] uses RoBERTa to extract sequential contextual semantic information from text. This method does not use multi-modal data.
LFM [45] uses low-rank decomposition to effectively reduce the dimensionality disaster problem that occurs during the fusion process of multi-modal features.
RGAT [46] integrates position encoding information into the graph attention network to improve the model’s context understanding ability.
CoMPM [47] uses a pre-trained model to extract pre-trained context memory information and combines it with the context model to understand the global contextual emotional features in a fine-grained manner.
COGMEN [48] improves the representation ability of emotional feature vectors by building context GCN to extract global and local context information and fuse them.
DER-GCN [49] improves the model’s emotional representation capabilities by constructing speaker relationship graphs and event graphs.
A-DMN [50] A-DMN comprehensively considers the intra-speaker and inter-speaker contextual information, and uses GRU to achieve cross-modal feature fusion.
CTNet [19] realizes semantic information interaction within and between modalities by building Single Transformer and Cross Transformer.
![Refer to caption](x2.png)
4.3 Comparison with the State-of-the-Art Methods
To verify the superiority of the ELR-GNN method proposed in this paper, we report the experimental results of ELR-GNN and other comparative methods on the IEMOCAP and MELD data sets. Experimental results are presented in Tables 1 and 2.
IEMOCAP: As shown in Table 1, the multi-modal emotion recognition method proposed in this paper achie-ved the best emotion recognition effect on the IEMOCAP data set, with an average accuracy of 70.6% and an average F1 value of 70.9%. ELR-GCN proposes an effective modeling method of long-distance context latent dependencies for multi-modal emotion recognition. In addition, ELR-GCN also combines early and adaptive late fusion methods to achieve the capture of fine-grained emotional features. Among other comparison methods, the emotion recognition effect of DER-GCN is slightly lower than that of ELR-GNN, with an average accuracy of 69.7% and an average F1 value of 69.4%. Although DER-GCN comprehensively considers event relationships and dialogue relationships between speakers to enhance the model’s emotional understanding, it ignores latent context dependencies. The emotion recognition effect of LR-GCN is lower than ELR-GNN and DER-GCN, with an average accuracy of 68.5% and an average F1 value of 68.3%. Although LR-GCN considers latent dependencies between contexts, due to the high computational complexity of GCN, LR-GCN can only capture local latent dependencies. The emotion recognition effects of other comparison methods are lower than ELR-GNN. Likewise, none of them take into account potential dependencies on context. Overall, the accuracy of ELR-GNN on the happy emotion analogy is much higher than that of other comparison algorithms, while the accuracy of other emotion categories is also relatively close to that of other comparison algorithms. In addition, the F1 value of ELR-GNN on the happy and excited emotional analogies is much higher than that of other comparison algorithms. At the same time, the F1 value of ELR-GNN on other emotional categories is also relatively close to other comparison algorithms. The experimental results prove the superiority of the ELR-GNN method proposed in this paper.
MELD: As shown in Table 2, The ELR-GNN method proposed in this article has the best emotion recognition effect on the MELD data set, with an average accuracy of 68.7% and an average F1 value of 69.9%. The emotion recognition effect of DER-GCN is second, with an average accuracy of 69.7% and an average F1 value of 69.4%. The emotion recognition effect of LR-GCN is lower than that of ELR-GNN and DER-GCN, with an average accuracy of 68.5% and an average F1 value of 68.3%. The emotion recognition effects of other comparison methods are relatively poor, and the average accuracy and F1 value are lower than ELR-GNN. The performance improvement may be attributed to ELR-GNN’s ability to capture long-distance contextual latent dependencies and fine-grained fusion of dialogue relationships between speakers, contextual latent dependencies and contextual semantic information. Overall, the accuracy of ELR-GNN on the neutral, fear, sadness, joy, and disgust emotion analogy is much higher than that of other comparison algorithms, while the accuracy of other emotion categories is also relatively close to that of other comparison algorithms. In addition, the F1 value of ELR-GNN on the neutral, fear, sadness, joy, and anger emotional analogies is much higher than that of other comparison algorithms. At the same time, the F1 value of ELR-GNN on other emotional categories is also relatively close to other comparison algorithms. In addition, we find that ELR-GNN has better emotion recognition effects on the minority emotions fear and disgust, with relatively high accuracy and F1 value. The experimental results prove the superiority of the ELR-GNN method proposed in this paper.
In addition, to intuitively illustrate that the running time of the ELR-GNN method proposed in this paper is better than other comparative methods, we statistics in Table 3 the running time of other comparative methods of the ELR-GNN method on the IEMOCAP and MELD data sets. As shown in Table 3, the running time of the ELR-GNN method proposed in this paper on the IEMOCAP and MELD data sets is 41s and 91s respectively, which is significantly better than other comparison methods. The running times of DialogueGCN are 58s and 127s respectively, which are lower than LR-GCN and DER-GCN, but the emotion recognition effect is relatively poor. The running times of LR-GCN are 87s and 142s respectively. The running times of DER-GCN are 125s and 189s respectively. The experimental results prove the efficiency and effectiveness of the ELR-GNN method proposed in this paper.
Methods | Running time (s) | |
---|---|---|
IEMOCAP | MELD | |
DialogueGCN | 58 | 127 |
LR-GCN | 87 | 142 |
DER-GCN | 125 | 189 |
ELR-GNN | 41 | 91 |
4.4 Analysis of the Experimental Results
To intuitively understand the ability of the feeling model for each emotion category, we analyzed the emotion classification of ELR-GNN and LR-GCN on the test set. Fig. 2 shows the confusion matrix of ELR-GNN and LR-GCN for emotion classification on IEMOCAP and MELD data sets.
![Refer to caption](x3.png)
Overall, on the IEMOCAP data set, ELR-GCN has a higher number of correct classifications for each emotion category than LR-GCN. On the MELD dataset, ELR-GCN has a higher number of correct classifications than LR-GCN in most emotional categories. The performance improvement may be attributed to ELR-GNN’s ability to understand the semantic representation of each emotion category in a fine-grained manner.
On the IEMOCAP dataset, the confusion matrix shows that ELR-GNN easily misclassifies happy emotions into excited emotions. Similarly, LR-GCN also easily misclassifies happy emotions into excited emotions, and even the number of misclassifications is greater than that of ELR-GNN. We speculate that this is because the semantics of happy emotions and excited emotions are relatively similar, and the model cannot differentiate between these two types of emotions in a fine-grained manner. In addition, we also find that ELR-GNN easily misclassifies neutral emotions into frustated emotions.
On the MELD data set, the confusion matrix shows that ELR-GNN has a very poor classification effect on disgust and fear emotions, and can only correctly classify a few samples. This is because the number of disgust and fear emotion categories is relatively small, and the data set has a serious imbalance problem, which leads to deviations in the model’s emotional understanding ability. The number of correct classifications of ELR-GNN on neutral emotions is very large, and there are very few misclassified samples. Experimental results prove that ELR-GNN has a relatively strong ability to understand neutral emotional categories.
4.5 Ablation Study
4.5.1 Importance of the Modalities
To verify the importance of the three modal features of text, video and audio for ELR-GNN, we conducted ablation experiments on the IEMOCAP and MELD data sets to compare the performance of the combination of different modal features. The experimental results are shown in Table 4. In single-modal experiments, ELR-GNN with text modality features has the best emotion recognition effect. The average accuracy on the IEMOCAP and MELD data sets are 64.1% and 63.5%, respectively, and the average F1 value is 63.9% and 62.4%, respectively. The emotion recognition effect of ELR-GNN with audio modal features is second, with average accuracy rates of 61.1% and 62.7% on the IEMOCAP and MELD data sets, and average F1 values of 60.8% and 62.0% respectively. ELR-GNN with video modality features has the worst emotion recognition effect, with average accuracy rates of 59.4% and 60.1% on the IEMOCAP and MELD data sets, and average F1 values of 59.7% and 61.4% respectively. Experimental results show that text features contain the most emotional semantic information. In the dual-modal experiment, ELR-GNN with text and audio modal features has the best emotion recognition effect. The average accuracy on the IEMOCAP and MELD data sets are 65.0% and 64.1%, respectively, and the average F1 values are are 64.4% and 63.2%, respectively. Experimental results demonstrate the effectiveness of multimodal features.
Modality | IEMOCAP | MELD | ||
---|---|---|---|---|
Acc. | F1 | Acc | F1 | |
T | 64.1 | 63.9 | 63.5 | 62.4 |
A | 61.1 | 60.8 | 62.7 | 62.0 |
V | 59.4 | 59.7 | 60.1 | 61.4 |
T+A | 65.0 | 64.4 | 64.1 | 63.2 |
T+V | 64.3 | 64.6 | 64.0 | 62.9 |
V+A | 63.0 | 62.7 | 61.5 | 61.9 |
T+A+V | 70.6 | 70.9 | 68.7 | 69.9 |
4.5.2 Parameter Analysis
We tested the impact of the maximum neighborhood size and parameter in ELR-GNN on the accuracy and running time of emotion recognition. As shown in Figs. 3(a), and 3(b), we tested the impact of different neighborhood sizes and on emotion recognition accuracy on the IEMOCAP and MELD datasets. Experimental results show that when , ELR-GNN has the best emotion recognition effect. When , the emotion recognition effect of ELR-GNN is second. When , ELR-GNN has the worst emotion recognition effect. Furthermore, as the size of the neighborhood continues to increase, the model’s emotion recognition performance also improves. Experimental results demonstrate the necessity of capturing long-range latent context dependencies.
As shown in Figs. 3(c), and 3(d), We also calculated the impact of different neighborhood sizes on running time and emotion recognition accuracy. Experimental results show that as the neighborhood size increases, the running time of the model also increases, but it is lower than the running time of LR-GCN and DER-GCN. In addition, as the neighborhood size increases, the emotion recognition effect of the model also improves.
5 Conclusions
In this paper, we propose a novel Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition. Specifically, we first use RoBERTa, 3D-CNN and openSMILE to perform pre-feature extraction of text, video and audio features respectively. Next, we use Bi-LSTM to capture contextual semantic information and obtain low-level utterence features. We then use low-level utterence features to construct a speaker graph. In the constructed speaker relationship graph, low-level utterence features are used as node features, while dialogue relationship information between speakers is used for edge construction. To capture the latent dependency information between long-distance contexts, we use the graph random neural network algorithm to randomly sample top- nodes for information extraction. In addition, we combine early fusion and adaptive late fusion mechanisms to simultaneously fuse speaker relationship information and latent dependency information between contexts. On the IEMOCAP and MELD data sets, the ELR-GNN method proposed in this paper is better than other comparative methods, and the experimental results prove the superiority of the ELR-GNN method.
CRediT authorship contribution statement
Yuntao Shou: Conceptualization, Methodology, Software, Data curation,Visualization, Validation, Writing - original draft, Writing - review & editing. Wei Ai: Conceptualization, Methodology, Writing - review & editing. Jiayi Du: Conceptualization, Methodology, Writing - review & editing. Haiyan Liu: Conceptualization, Methodology, Writing - review & editing. Tao Meng: Conceptualization, Methodology, Software, Visualization, Validation, Writing - original draft, Writing - review & editing, Funding acquisition.
Data availability
Data will be made available on request.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by National Natural Science Foundation of China (Grant No. 61802444), the Research Foundation of Education Bureau of Hunan Province of China (GrantNo. 22B0275).
References
- [1] Z. Lian, B. Liu, J. Tao, Smin: Semi-supervised multi-modal interaction network for conversational emotion recognition, IEEE Transactions on Affective Computing (2022).
- [2] Y. Shou, T. Meng, W. Ai, S. Yang, K. Li, Conversational emotion recognition studies based on graph convolutional neural networks and a dependent syntactic analysis, Neurocomputing 501 (2022) 629–639.
- [3] Y. Shou, X. Cao, D. Meng, B. Dong, Q. Zheng, A low-rank matching attention based cross-modal feature fusion method for conversational emotion recognition, arXiv preprint arXiv:2306.17799 (2023).
- [4] T. Meng, Y. Shou, W. Ai, N. Yin, K. Li, Deep imbalanced learning for multimodal emotion recognition in conversations, arXiv preprint arXiv:2312.06337 (2023).
- [5] T. Meng, Y. Shou, W. Ai, J. Du, H. Liu, K. Li, A multi-message passing framework based on heterogeneous graphs in conversational emotion recognition, Neurocomputing 569 (2024) 127109.
- [6] W. Ai, Y. Shou, T. Meng, K. Li, Der-gcn: Dialog and event relation-aware graph convolutional neural network for multimodal dialog emotion recognition, IEEE Transactions on Neural Networks and Learning Systems (2024).
- [7] Y. Shou, T. Meng, W. Ai, K. Li, Adversarial representation with intra-modal and inter-modal graph contrastive learning for multimodal emotion recognition, arXiv preprint arXiv:2312.16778 (2023).
- [8] Y. Shou, T. Meng, F. Zhang, N. Yin, K. Li, Revisiting multi-modal emotion learning with broad state space models and probability-guidance fusion, arXiv preprint arXiv:2404.17858 (2024).
- [9] W. Ai, F. Zhang, T. Meng, Y. Shou, H. Shao, K. Li, A two-stage multimodal emotion recognition model based on graph contrastive learning, in: 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), IEEE, 2023, pp. 397–404.
- [10] T. Meng, F. Zhang, Y. Shou, W. Ai, N. Yin, K. Li, Revisiting multimodal emotion recognition in conversation from the perspective of graph spectrum, arXiv preprint arXiv:2404.17862 (2024).
- [11] D. Priyasad, T. Fernando, S. Denman, S. Sridharan, C. Fookes, Attention driven fusion for multi-modal emotion recognition, in: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 3227–3231.
- [12] J. Zhang, Z. Yin, P. Chen, S. Nichele, Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review, Information Fusion 59 (2020) 103–126.
- [13] V. Chudasama, P. Kar, A. Gudmalwar, N. Shah, P. Wasnik, N. Onoe, M2fnet: Multi-modal fusion network for emotion recognition in conversation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4652–4661.
- [14] C. Li, Z. Bao, L. Li, Z. Zhao, Exploring temporal representations by leveraging attention-based bidirectional lstm-rnns for multi-modal emotion recognition, Information Processing & Management 57 (3) (2020) 102185.
- [15] Y. Shou, T. Meng, W. Ai, C. Xie, H. Liu, Y. Wang, Object detection in medical images based on hierarchical transformer and mask mechanism, Computational Intelligence and Neuroscience 2022 (1) (2022) 5863782.
- [16] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, E. Cambria, Dialoguernn: An attentive rnn for emotion detection in conversations, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 33, 2019, pp. 6818–6825.
- [17] Z. Yang, X. Li, Y. Cheng, T. Zhang, X. Wang, Emotion recognition in conversation based on a dynamic complementary graph convolutional network, IEEE Transactions on Affective Computing (2024).
- [18] H. Ma, J. Wang, H. Lin, B. Zhang, Y. Zhang, B. Xu, A transformer-based model with self-distillation for multimodal emotion recognition in conversations, IEEE Transactions on Multimedia (2023).
- [19] Z. Lian, B. Liu, J. Tao, Ctnet: Conversational transformer network for emotion recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 985–1000.
- [20] J. Li, X. Wang, G. Lv, Z. Zeng, Graphcfc: A directed graph based cross-modal feature complementation approach for multimodal conversational emotion recognition, IEEE Transactions on Multimedia (2023).
- [21] Y. Shou, W. Ai, T. Meng, F. Zhang, K. Li, Graphunet: Graph make strong encoders for remote sensing segmentation, in: 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), IEEE, 2023, pp. 2734–2737.
- [22] D. Ghosal, N. Majumder, S. Poria, N. Chhaya, A. Gelbukh, Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation, in: EMNLP-IJCNLP 2019-2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 2020.
- [23] M. Ren, X. Huang, W. Li, D. Song, W. Nie, Lr-gcn: Latent relation-aware graph convolutional network for conversational emotion recognition, IEEE Transactions on Multimedia 24 (2021) 4422–4432.
- [24] D. Zhang, F. Chen, J. Chang, X. Chen, Q. Tian, Structure aware multi-graph network for multi-modal emotion recognition in conversations, IEEE Transactions on Multimedia (2023).
- [25] D. Hazarika, S. Poria, R. Zimmermann, R. Mihalcea, Emotion recognition in conversations with transfer learning from generative conversation modeling, arXiv preprint arXiv:1910.04980 (2019).
- [26] W. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs, Advances in neural information processing systems 30 (2017).
- [27] J. Chen, T. Ma, C. Xiao, Fastgcn: Fast learning with graph convolutional networks via importance sampling, in: International Conference on Learning Representations, 2018.
- [28] D. Zou, Z. Hu, Y. Wang, S. Jiang, Y. Sun, Q. Gu, Layer-dependent importance sampling for training deep and large graph convolutional networks, Advances in neural information processing systems 32 (2019).
- [29] W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, C.-J. Hsieh, Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 257–266.
- [30] H. Zeng, H. Zhou, A. Srivastava, R. Kannan, V. Prasanna, Graphsaint: Graph sampling based inductive learning method, in: International Conference on Learning Representations, 2019.
- [31] H. Chen, D. Jiang, H. Sahli, Transformer encoder with multi-modal multi-head attention for continuous affect recognition, IEEE Transactions on Multimedia 23 (2021) 4171–4183. doi:10.1109/TMM.2020.3037496.
- [32] J. Li, X. Wang, G. Lv, Z. Zeng, Ga2mif: Graph and attention based two-stage multi-source information fusion for conversational emotion detection, IEEE Transactions on Affective Computing (2023).
- [33] Z. Yang, D. Li, F. Hou, Y. Song, Q. Gao, Deep feature extraction and attention fusion for multimodal emotion recognition, IEEE Transactions on Circuits and Systems II: Express Briefs (2023).
- [34] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, S. S. Narayanan, Iemocap: Interactive emotional dyadic motion capture database, Language resources and evaluation 42 (2008) 335–359.
- [35] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, R. Mihalcea, Meld: A multimodal multi-party dataset for emotion recognition in conversations, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 527–536.
- [36] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, L.-P. Morency, Context-dependent sentiment analysis in user-generated videos, in: Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers), 2017, pp. 873–883.
- [37] Y. Kim, Convolutional neural networks for sentence classification, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2014.
- [38] A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, L.-P. Morency, Memory fusion network for multi-view sequential learning, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 32, 2018.
- [39] D. Hazarika, S. Poria, A. Zadeh, E. Cambria, L.-P. Morency, R. Zimmermann, Conversational memory network for emotion recognition in dyadic dialogue videos, in: Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, Vol. 2018, NIH Public Access, 2018, p. 2122.
- [40] D. Hazarika, S. Poria, R. Mihalcea, E. Cambria, R. Zimmermann, Icon: Interactive conversational memory network for multimodal emotion detection, in: Proceedings of the 2018 conference on empirical methods in natural language processing, 2018, pp. 2594–2604.
- [41] D. Zhang, L. Wu, C. Sun, S. Li, Q. Zhu, G. Zhou, Modeling both context-and speaker-sensitive dependence for emotion detection in multi-speaker conversations., in: IJCAI, 2019, pp. 5415–5421.
- [42] W. Jiao, M. Lyu, I. King, Real-time emotion recognition via attention gated hierarchical memory network, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 34, 2020, pp. 8002–8009.
- [43] W. Li, W. Shao, S. Ji, E. Cambria, Bieru: Bidirectional emotional recurrent unit for conversational sentiment analysis, Neurocomputing 467 (2022) 73–82.
- [44] T. Kim, P. Vossen, Emoberta: Speaker-aware emotion recognition in conversation with roberta, Computing Research Repository 2108 (12009) (2021).
- [45] Z. Liu, Y. Shen, V. B. Lakshminarasimhan, P. P. Liang, A. Bagher Zadeh, L.-P. Morency, Efficient low-rank multimodal fusion with modality-specific factors, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL, 2018, pp. 2247–2256.
- [46] T. Ishiwatari, Y. Yasuda, T. Miyazaki, J. Goto, Relation-aware graph attention networks with relational position encodings for emotion recognition in conversations, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 7360–7370.
- [47] J. Lee, W. Lee, Compm: Context modeling with speaker’s pre-trained memory tracking for emotion recognition in conversation, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 5669–5679.
- [48] A. Joshi, A. Bhat, A. Jain, A. Singh, A. Modi, Cogmen: Contextualized gnn based multimodal emotion recognition, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 4148–4164.
- [49] W. Ai, Y. Shou, T. Meng, K. Li, Der-gcn: Dialogue and event relation-aware graph convolutional neural network for multimodal dialogue emotion recognition, arXiv preprint arXiv:2312.10579 (2023).
- [50] S. Xing, S. Mai, H. Hu, Adapted dynamic memory network for emotion recognition in conversation, IEEE Transactions on Affective Computing 13 (3) (2020) 1426–1439.