Efficient Long-distance Latent Relation-aware Graph Neural Network for Multi-modal Emotion Recognition in Conversations

Yuntao Shou [email protected] Wei Ai [email protected] Jiayi Du [email protected] Tao Meng [email protected] Haiyan Liu [email protected] College of Computer and Mathematics, Central South University of Forestry and Technology, ChangSha, Hunan 410004, China College of Information Engineering, Changsha Medical University, ChangSha, Hunan 410203, China
Abstract

The task of multi-modal emotion recognition in conversation (MERC) aims to analyze the genuine emotional state of each utterance based on the multi-modal information in the conversation, which is crucial for conversation understanding. Existing methods focus on using graph neural networks (GNN) to model conversational relationships and capture contextual latent semantic relationships. However, due to the complexity of GNN, existing methods cannot efficiently capture the potential dependencies between long-distance utterances, which limits the performance of MERC. In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversations. Specifically, we first use pre-extracted text, video and audio features as input to Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. Then, we use low-level utterance features to construct a conversational emotion interaction graph. To efficiently capture the potential dependencies between long-distance utterances, we use the dilated generalized forward push algorithm to precompute the emotional propagation between global utterances and design an emotional relation-aware operator to capture the potential semantic associations between different utterances. Furthermore, we combine early fusion and adaptive late fusion mechanisms to fuse latent dependency information between speaker relationship information and context. Finally, we obtain high-level discourse features and feed them into MLP for emotion prediction. Extensive experimental results show that ELR-GNN achieves state-of-the-art performance on the benchmark datasets IEMOCAP and MELD, with running times reduced by 52% and 35%, respectively. In addition, ELR-GNN can effectively improve the accuracy of the MERC task by capturing and fusing the latent semantic relationships between utterances.

keywords:
Graph Neural Network, Multi-modal Emotion Recognition, Relation-aware, Efficiency, Information Fusion
journal: Journal of Parallel and Distributed Computing

1 Introduction

Multi-modal emotion recognition in conversations (ME-RC) has received research attention [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] due to its wide application in the fields of intelligent customer service and emotion analysis [11], human-computer interaction (HCI) [12], and security monitoring [13]. For instance, in HCI, MERC can help computers better understand the emotional state of human users, thereby enabling more intelligent interactions and improving user experience. Unlike traditional non-conversational or unimodal emotion recognition tasks [14], MERC requires identifying the speaker’s genuine emotions based on textual, auditory, and visual information in the conversation utterances [15].

The current mainstream research methods mainly use RNN, Transformer, and GCN to model conversation context and multi-modal information in MERC. For example, DialogueRNN [16] uses a sequential approach to track conversation context and captures the most important emotional features through a memory mechanism. Although RNN-based methods can model the speaker’s contextual information, they have limited memory ability for long-distance conversations, which limits the application of RNN in MERC tasks [17]. To solve the above problems, the Transformer architecture [18] is proposed to model long-distance context dependencies in MERC. For instance, CTNet [19] builds a Single Transformer and Cross Transformer to capture long-distance context dependencies and realize intra-module and inter-module information interaction for emotion recognition. However, methods based on Transformer architecture ignore conversational relationship information between speakers, which limits the model’s emotion recognition performance [20, 21]. To tackle this limitation, many GCN methods have been proposed to model interaction information between speakers. For example, DialogueGCN [22] uses a graph structure to model conversation context and uses GCN to learn conversation graphs to achieve semantic understanding and emotional recognition of conversations. In addition, LR-GCN [23] believes that the context latent dependencies of utterences should also be considered. LR-GCN uses multi-head attention to construct multiple full association graphs to model potential conversational relationships, and then uses GCN to learn latent relationships to achieve emotion recognition. However, limited by the complexity of GCN, these methods usually adopt a fixed window size strategy and then fully connect the utterances within the window to construct a conversation graph, which significantly limits the ability to obtain long-distance contextual information.

Inspired by LR-GCN, we also use GCN to model dialogue relationship information between speakers for MERC. Furthermore, long-distance context potential dependencies can provide more information for emotion classification and help reveal the genuine emotion of utterances. Therefore, how to comprehensively consider long-distance contextual dependencies while ensuring that the number of model parameters does not increase dramatically remains a challenge.

In this paper, we propose an Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition in conversation. Specifically, we first use RoBERTa, 3D-CNN, and openSMILE to perform pre-feature extraction of text, video, and audio features, respectively. Next, we use Bi-LSTM to capture contextual semantic information and obtain low-level utterance features. We then use low-level utterence features to construct a speaker graph. In the constructed speaker relationship graph, low-level utterence features are used as node features, while dialogue relationship information between speakers is used for edge construction. To capture the latent dependency information between long-distance contexts, we use the graph random neural network algorithm to randomly sample top-k𝑘kitalic_k nodes for information extraction. In addition, we combine early fusion and adaptive late fusion mechanisms to simultaneously fuse speaker relationship information and latent dependency information between contexts. Finally, we fine-grained obtained high-level utterance features and fed them into the MLP and softmax function for emotion prediction.

  • 1.

    We propose a novel Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for MER. ELR-GCN not only considers conversational relationship information between speakers, but also captures long-distance context latent dependency information.

  • 2.

    We propose a graph random neural network architecture in which long-distance latent dependencies between utterances are captured by randomly sampling top-k𝑘kitalic_k node features. Furthermore, we combine early fusion and adaptive late fusion mechanisms to simultaneously exploit speaker information and context’s latent dependency information during information propagation.

  • 3.

    We perform extensive experiments on two publicly available datasets to verify the effectiveness of the ELR-GNN method.

2 Related Work

2.1 Multi-modal Emotion Recognition

Single-modality emotion recognition may be limited. For example, text-based emotion recognition alone may not capture the emotional cues in speech and facial expressions [24]. Multimodal emotion recognition (MER) can integrate multiple information sources to improve the accuracy and robustness of emotion recognition.

Current mainstream MER research mainly focuses on RNN, Transformer and GCN. For instance, DialogueRNN [16] modeled individual speakers and uses three different GRUs to achieve more effective correlations between speakers. DialogueGCN [22] was proposed to solve the problem that RNN-based methods cannot consider the dialogue relationship between speakers. DialogueGCN improves the performance of MER by modeling the interactive relationship between speakers through the inherent properties of the graph structure and using graph convolution operations to transfer contextual semantic information. TL-ERC [25] used transfer learning methods to solve problems in supervised learning that require large amounts of high-quality annotated data. CTNet [19] proposed a multi-modal learning framework, which achieves cross-modal contextual semantic information interaction by building a single Transformer and a cross Transformer.

However, current mainstream methods only consider contextual semantic information, latent dependencies of local utterances, and conversational relationships between speakers, and their focus is on exploring the semantic information between utterances and the correlation between speakers. The above approach ignores latent dependencies of the global context, which limits the performance of MER.

Refer to caption
Figure 1: The overall architecture of ELR-GCN for milti-modal emotion recognition. ELR-GCN contains auxiliary information module and graph random neural network module. The auxiliary information module is used to achieve further extraction of contextual semantic information and fusion of speaker relationships and long-distance latent relationships through early and adaptive late fusion. The graph random neural network module is used to model speaker relationships and long-distance contextual latent dependencies.

2.2 Scalable Graph Neural Network

The current mainstream scalable GNN methods include three types of methods: 1) Node sampling strategy: Accelerate the aggregation process of node features by sampling nodes. The representative methods include GraphSAGE [26], FastGCN [27] and LADIES [28]. The main idea of GraphSage is to update the representation of each node through multiple rounds of neighbor sampling and aggregation of neighbor node information, thereby capturing the structure and relationships between nodes in graph data. FastGCN proposes a structured graph node sampling strategy, which selects sampling nodes by considering the structural information of the graph to retain important structural features of the graph. This sampling strategy can preserve the graph information as much as possible while ensuring sampling efficiency. LADIES adopts an adaptive density modeling method to capture local and global information by learning the density distribution of neighbors around a node. LADIES can effectively update the representation of nodes to one that takes into account both local and global information. 2) Graph partitioning method: Divide the original large graph into several small subgraphs and run GNN on the subgraphs. The mainstream graph partitioning methods include Cluster-GCN [29] and GraphSAINT [30]. The mainstream graph partitioning methods include Cluster-GCN and GraphSA-INT. Cluster-GCN divides the original large-scale graph data into multiple subgraphs, each subgraph contains a part of nodes and corresponding edges, thereby reducing computational and memory overhead. GraphSAINT processes large-scale graph data through graph sampling and iterative coarsening. 3) Matrix approximation method: Accelerate feature propagation by decoupling feature propagation and nonlinear transformation. SGC simplifies the nonlinear activation function in traditional graph convolutional networks, retaining only graph convolution operations.

2.3 Multi-head Attention

The multi-head attention in MER can help the model effectively capture the correlation information between different modalities and adaptively focus on the most important parts for the emotion classification task. For example, TEMMA [31] proposesd a multi-modal multi-head attention for MER to comprehensively consider the complementarity and redundancy between modalities. TE-MMA can realize the semantic information interaction between modalities and capture the temporal dependence within the modalities. GA2MIF [32] constructed a multi-head directed graph attention network and a multi-head pairwise cross-modal attention network respectively to ach-ieve contextual semantic information extraction and cross-modal information fusion. EEANet [33] used a multi-head self-attention mechanism to capture the discriminative features in contextual semantic information that are most suitable for emotion classification.

We apply an attention mechanism to the utterance features obtained through graph convolution operations to calculate the correlation between contexts and capture the utterances with the strongest emotional features among the global context latent dependencies. Our method ELR-GNN can simultaneously consider contextual semantic information, interaction information between speakers, and latent dependency information of the global context.

3 Proposed Method

The overall framework of the ELR-GNN proposed in this paper is shown in Fig. 1. ELR-GNN consists of four stages, including sequential contextual feature extraction, graph construction, long-distance contextual latent relationship exploration, and information fusion. In the following subsections, we describe these four key parts in detail.

3.1 Sequential context information extraction

The speaker’s emotional state is not only related to the textual semantic information at the current moment, but also related to the previous contextual semantic information. Therefore, we use Bi-LSTM to capture contextual semantic information in multi-modal features to more accurately understand the speaker’s emotional changes. The formula of LSTM is defined as follows:

[C~tOtrtzt]=[tanhsigmoidsigmoidsigmoid]WT[uithit1]delimited-[]subscript~𝐶𝑡subscript𝑂𝑡subscript𝑟𝑡subscript𝑧𝑡delimited-[]𝑠𝑖𝑔𝑚𝑜𝑖𝑑𝑠𝑖𝑔𝑚𝑜𝑖𝑑𝑠𝑖𝑔𝑚𝑜𝑖𝑑subscript𝑊𝑇delimited-[]superscriptsubscript𝑢𝑖𝑡superscriptsubscript𝑖𝑡1\displaystyle\left.\left[\begin{array}[]{c}\widetilde{C}_{t}\\ O_{t}\\ r_{t}\\ z_{t}\end{array}\right.\right]\left.=\left[\begin{array}[]{c}\tanh\\ sigmoid\\ sigmoid\\ sigmoid\end{array}\right.\right]W_{T}\left[\begin{array}[]{c}u_{i}^{t}\\ h_{i}^{t-1}\end{array}\right][ start_ARRAY start_ROW start_CELL over~ start_ARG italic_C end_ARG start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_O start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_CELL end_ROW end_ARRAY ] = [ start_ARRAY start_ROW start_CELL roman_tanh end_CELL end_ROW start_ROW start_CELL italic_s italic_i italic_g italic_m italic_o italic_i italic_d end_CELL end_ROW start_ROW start_CELL italic_s italic_i italic_g italic_m italic_o italic_i italic_d end_CELL end_ROW start_ROW start_CELL italic_s italic_i italic_g italic_m italic_o italic_i italic_d end_CELL end_ROW end_ARRAY ] italic_W start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT [ start_ARRAY start_ROW start_CELL italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t - 1 end_POSTSUPERSCRIPT end_CELL end_ROW end_ARRAY ] (1)
Ct=Ctrt+Ct1ztsubscript𝐶𝑡direct-productsubscript𝐶𝑡subscript𝑟𝑡direct-productsubscript𝐶𝑡1subscript𝑧𝑡\displaystyle\begin{aligned} C_{t}=C_{t}\odot r_{t}+C_{t-1}\odot z_{t}\end{aligned}start_ROW start_CELL italic_C start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT = italic_C start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ⊙ italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + italic_C start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT ⊙ italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_CELL end_ROW
hit=Ottanh(Ct)superscriptsubscript𝑖𝑡direct-productsubscript𝑂𝑡tanhsubscript𝐶𝑡\displaystyle h_{i}^{t}=O_{t}\odot\mathrm{tanh}\left(C_{t}\right)italic_h start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = italic_O start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ⊙ roman_tanh ( italic_C start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT )

where uitsuperscriptsubscript𝑢𝑖𝑡u_{i}^{t}italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT represents the concatenated multi-modal features, hitsubscriptsuperscript𝑡𝑖h^{t}_{i}italic_h start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the hidden layer state, rtsubscript𝑟𝑡r_{t}italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT represents the input gate, ztsubscript𝑧𝑡z_{t}italic_z start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT represents the forgetting gate, Ctsubscript𝐶𝑡C_{t}italic_C start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT represents the cell state, and direct-product\odot represents Hadamard product, W𝑊Witalic_W is a learnable network parameter.

Bi-LSTM is composed of forward and reverse LSTM, and its formula is defined as follows:

h~it=[hit,hit]superscriptsubscript~𝑖𝑡superscriptsubscript𝑖𝑡superscriptsubscript𝑖𝑡\tilde{h}_{i}^{t}=[\overrightarrow{h}_{i}^{t},\overleftarrow{h}_{i}^{t}]over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = [ over→ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT , over← start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ] (2)

where h~itsuperscriptsubscript~𝑖𝑡\tilde{h}_{i}^{t}over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT is obtained by concatenating the contextual semantic features extracted by forward and reverse LSTM.

3.2 Graph Construction

We use the inherent properties of the graph structure to construct a speaker relationship graph, in which the contextual semantic features extracted through Bi-LSTM are used as node features of the graph, and the dialogue relationships between speakers are used as edges. Specifically, given a speaker dialogue graph 𝒢={𝒲,𝒱,,}𝒢𝒲𝒱\mathcal{G}=\{\mathcal{W},\mathcal{V},\mathcal{E},\mathcal{R}\}caligraphic_G = { caligraphic_W , caligraphic_V , caligraphic_E , caligraphic_R }, where the node vi(vi𝒱)subscript𝑣𝑖subscript𝑣𝑖𝒱v_{i}(v_{i}\in\mathcal{V})italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ caligraphic_V ) is composed of contextual semantic features (i.e., h~isubscript~𝑖\tilde{h}_{i}over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT), the edge eij=1(eij)subscript𝑒𝑖𝑗1subscript𝑒𝑖𝑗e_{ij}=1(e_{ij}\in\mathcal{R})italic_e start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 1 ( italic_e start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ∈ caligraphic_R ) indicates that there is a conversation relationship between node visubscript𝑣𝑖v_{i}italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT and node vjsubscript𝑣𝑗v_{j}italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT, otherwise eij=0subscript𝑒𝑖𝑗0e_{ij}=0italic_e start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 0, ωij(ωij𝒲,0ωij1)subscript𝜔𝑖𝑗formulae-sequencesubscript𝜔𝑖𝑗𝒲0subscript𝜔𝑖𝑗1\omega_{ij}(\omega_{ij}\in\mathcal{W},0\leq\omega_{ij}\leq 1)italic_ω start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ( italic_ω start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ∈ caligraphic_W , 0 ≤ italic_ω start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ≤ 1 ) represents the weight of edge eijsubscript𝑒𝑖𝑗e_{ij}italic_e start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT, and r𝑟r\in\mathcal{R}italic_r ∈ caligraphic_R represents the edge relationship.

3.3 Long-distance Latent Context Relationship Extraction

Unlike previous work that set the context window size to 10 (i.e., the number of nodes), to capture long-distance latent dependencies of contexts, we adopt a larger context window to explore potential correlations between contexts. Specifically, we first construct an original graph 𝒢𝒢\mathcal{G}caligraphic_G with a larger context window, the generalized forward push algorithm is then used to calculate the propagation matrix of the row vectors, and top-k𝑘kitalic_k sparsification is used to further reduce the training time of the network, so as to comprehensively consider the latent correlation of the context.

3.3.1 Propagation Matrix

We use a mixed-order matrix of feature propagation to aggregate neighbor node information of different orders in the graph to obtain long-distance contextual latent dependency information. The formula of the propagation matrix is defined as follows:

Π=n=0Nwn(D~𝟏A~)nΠsuperscriptsubscript𝑛0𝑁subscript𝑤𝑛superscriptsuperscript~D1~A𝑛\Pi=\sum_{n=0}^{N}w_{n}\cdot\left(\mathbf{\widetilde{\mathrm{D}}^{-1}% \widetilde{\mathrm{A}}}\right)^{n}roman_Π = ∑ start_POSTSUBSCRIPT italic_n = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ⋅ ( over~ start_ARG roman_D end_ARG start_POSTSUPERSCRIPT - bold_1 end_POSTSUPERSCRIPT over~ start_ARG roman_A end_ARG ) start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT (3)

where wn0subscript𝑤𝑛0w_{n}\geq 0italic_w start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ≥ 0 and n=0Nwn=1superscriptsubscript𝑛0𝑁subscript𝑤𝑛1\sum_{n=0}^{N}w_{n}=1∑ start_POSTSUBSCRIPT italic_n = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = 1, A~~A\widetilde{\mathrm{A}}over~ start_ARG roman_A end_ARG is the adjacency matrix, and D~~D\widetilde{\mathrm{D}}over~ start_ARG roman_D end_ARG is the degree matrix. The propagation matrix can fuse different orders of neighbor node information and capture important contextual potential dependency information by adjusting the weights.

Then we aggregate the node features and update the node features, defined as follows:

𝐗¯s=υ𝒩sπ𝐳sΠ(s,υ)hssubscript¯𝐗𝑠subscript𝜐superscriptsubscript𝒩𝑠𝜋subscript𝐳𝑠Π𝑠𝜐subscript𝑠\overline{\mathbf{X}}_{s}=\sum_{\upsilon\in\mathcal{N}_{s}^{\pi}}\mathbf{z}_{s% }\cdot\Pi(s,\upsilon)\cdot h_{s}over¯ start_ARG bold_X end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_υ ∈ caligraphic_N start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT end_POSTSUBSCRIPT bold_z start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ⋅ roman_Π ( italic_s , italic_υ ) ⋅ italic_h start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT (4)

where 𝐳υBernoulli(1δ)similar-tosubscript𝐳𝜐Bernoulli1𝛿\mathbf{z}_{\mathcal{\upsilon}}\sim\text{Bernoulli}(1-\delta)bold_z start_POSTSUBSCRIPT italic_υ end_POSTSUBSCRIPT ∼ Bernoulli ( 1 - italic_δ ), ΠΠ\Piroman_Π is the row vectors of the node s𝑠sitalic_s, 𝒩sπsuperscriptsubscript𝒩𝑠𝜋\mathcal{N}_{s}^{\pi}caligraphic_N start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT is the indices of non-zero value of ΠssubscriptΠ𝑠\Pi_{s}roman_Π start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT. Through Eq. 4, we can solve the problem of slow inference speed caused by the high computational complexity of GCN and achieve rapid training of the model. Therefore, we can construct larger graphs to capture long-distance context latent dependencies.

However, ΠssubscriptΠ𝑠\Pi_{s}roman_Π start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT is actually a difficult estimation problem. To address the problem, We use a two-stage estimation step for calculation, which includes Generalized Forward Push (GFP) and Top-k𝑘kitalic_k sparsification. First, GFP gives the error bound of ΠssubscriptΠ𝑠\Pi_{s}roman_Π start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT, and then Top-k𝑘kitalic_k sparsification only retains top-k𝑘kitalic_k elements to achieve faster calculation speed.

3.3.2 Generalized Forward Push

Since the row-normalized adjacency matrix D~𝟏A~superscript~D1~A\mathbf{\widetilde{\mathrm{D}}^{-1}\widetilde{\mathrm{A}}}over~ start_ARG roman_D end_ARG start_POSTSUPERSCRIPT - bold_1 end_POSTSUPERSCRIPT over~ start_ARG roman_A end_ARG is also an inverse random walk transition probability matrix on 𝒢𝒢\mathcal{G}caligraphic_G, we design an efficient GFP estimation algorithm to estimate ΠssubscriptΠ𝑠\Pi_{s}roman_Π start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT. The key step of GFP is to accelerate the random walk probability diffusion process through pruning operation. Specifically, we first give two initial vectors q(n)|V|superscript𝑞𝑛superscript𝑉q^{(n)}\in\mathbb{R}^{|V|}italic_q start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT | italic_V | end_POSTSUPERSCRIPT and r(n)|V|superscript𝑟𝑛superscript𝑉r^{(n)}\in\mathbb{R}^{|V|}italic_r start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT | italic_V | end_POSTSUPERSCRIPT, and both q(0)superscript𝑞0q^{(0)}italic_q start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT and r(0)superscript𝑟0r^{(0)}italic_r start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT are initialized to e(s)superscript𝑒𝑠e^{(s)}italic_e start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT, where e(s)=1superscript𝑒𝑠1e^{(s)}=1italic_e start_POSTSUPERSCRIPT ( italic_s ) end_POSTSUPERSCRIPT = 1 and e(v)=0superscript𝑒𝑣0e^{(v)}=0italic_e start_POSTSUPERSCRIPT ( italic_v ) end_POSTSUPERSCRIPT = 0 for sv𝑠𝑣s\neq vitalic_s ≠ italic_v. Furthermore, q(n)=0,r(n)=0,1nNformulae-sequencesuperscript𝑞𝑛0formulae-sequencesuperscript𝑟𝑛01𝑛𝑁q^{(n)}=0,r^{(n)}=0,1\leq n\leq Nitalic_q start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT = 0 , italic_r start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT = 0 , 1 ≤ italic_n ≤ italic_N. Then, the GFP algorithm begins to iteratively update the q(n)superscript𝑞𝑛q^{(n)}italic_q start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT and r(n)superscript𝑟𝑛r^{(n)}italic_r start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT vectors through ru(n)ru(n)+rυ(n1)/dυsuperscriptsubscriptr𝑢𝑛superscriptsubscriptr𝑢𝑛superscriptsubscriptr𝜐𝑛1subscriptd𝜐\mathrm{r}_{u}^{(n)}\leftarrow\mathrm{r}_{u}^{(n)}+\mathrm{r}_{\upsilon}^{(n-1% )}/\mathrm{d}_{\upsilon}roman_r start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ← roman_r start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT + roman_r start_POSTSUBSCRIPT italic_υ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_n - 1 ) end_POSTSUPERSCRIPT / roman_d start_POSTSUBSCRIPT italic_υ end_POSTSUBSCRIPT and qu(n)ru(n)superscriptsubscript𝑞𝑢𝑛superscriptsubscript𝑟𝑢𝑛q_{u}^{(n)}\leftarrow r_{u}^{(n)}italic_q start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT ← italic_r start_POSTSUBSCRIPT italic_u end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT until node v𝑣vitalic_v satisfies rυ(n1)>dυrmaxsuperscriptsubscript𝑟𝜐𝑛1subscript𝑑𝜐subscript𝑟𝑚𝑎𝑥r_{\upsilon}^{(n-1)}>d_{\upsilon}\cdot r_{max}italic_r start_POSTSUBSCRIPT italic_υ end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_n - 1 ) end_POSTSUPERSCRIPT > italic_d start_POSTSUBSCRIPT italic_υ end_POSTSUBSCRIPT ⋅ italic_r start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT, where 𝐝υ=𝐃~(υ,υ)subscript𝐝𝜐~𝐃𝜐𝜐\mathbf{d}_{\mathcal{\upsilon}}=\widetilde{\mathbf{D}}(\upsilon,\upsilon)bold_d start_POSTSUBSCRIPT italic_υ end_POSTSUBSCRIPT = over~ start_ARG bold_D end_ARG ( italic_υ , italic_υ ). When the GFP iteration is complete, we get Π~sn=0Nwn𝐪(n)subscript~Π𝑠superscriptsubscript𝑛0𝑁subscript𝑤𝑛superscript𝐪𝑛\tilde{\Pi}_{s}\leftarrow\sum_{n=0}^{N}w_{n}\cdot\mathbf{q}^{(n)}over~ start_ARG roman_Π end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ← ∑ start_POSTSUBSCRIPT italic_n = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N end_POSTSUPERSCRIPT italic_w start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ⋅ bold_q start_POSTSUPERSCRIPT ( italic_n ) end_POSTSUPERSCRIPT.

3.3.3 Top-k𝑘kitalic_k Sparsification

To reduce the computational complexity of GCN, we perform top-k𝑘kitalic_k sparsification on Π~ssubscript~Π𝑠\tilde{\Pi}_{s}over~ start_ARG roman_Π end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT to accelerate model training. The core idea of Top-k𝑘kitalic_k sparsification is to retain only the top-k𝑘kitalic_k largest elements of ΠΠ\Piroman_Π, and set other elements to 0. Therefore, Π~(k)superscript~Π𝑘\tilde{\Pi}^{(k)}over~ start_ARG roman_Π end_ARG start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT has only k𝑘kitalic_k non-zero elements, which preserves the most important emotion features in the latent dependencies of the context.

3.3.4 Learnable Information Propagation

Therefore, we introduce a learnable parameter W𝑊Witalic_W to achieve dimensionality reduction of multi-modal features while improving the learning ability of the model. The formula is defined as follows:

X¯s=υ𝒩s(k)𝐳vΠ~(k)(s,υ)hv𝐖subscript¯X𝑠subscript𝜐superscriptsubscript𝒩𝑠𝑘subscript𝐳𝑣superscript~Π𝑘𝑠𝜐subscript𝑣𝐖\overline{\mathrm{X}}_{s}=\sum_{\upsilon\in\mathcal{N}_{s}^{(k)}}\mathbf{z}_{v% }\cdot\widetilde{\Pi}^{(k)}(s,\upsilon)\cdot h_{v}\cdot\mathbf{W}over¯ start_ARG roman_X end_ARG start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT = ∑ start_POSTSUBSCRIPT italic_υ ∈ caligraphic_N start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT end_POSTSUBSCRIPT bold_z start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ⋅ over~ start_ARG roman_Π end_ARG start_POSTSUPERSCRIPT ( italic_k ) end_POSTSUPERSCRIPT ( italic_s , italic_υ ) ⋅ italic_h start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ⋅ bold_W (5)

3.4 Auxiliary Information Module

Graph random neural networks can effectively extract dialogue relationship information between speakers and long-distance context potential dependency information, but it is easy to ignore some discriminative original full-emotion features. Therefore, we use AIM to extract and fuse higher-level emotional features, adaptively aggregating original emotional features, speaker relationship information, and long-distance context potential dependency information.

3.4.1 Feature Extractor (AIM-FE)

Multimodal data are characterized by noise and high dimensionality. To achieve denoising and capture discriminative emotional features in multi-modal data, we introduce gated convolutional networks to capture auxiliary information. In the gated convolutional network, we use sigmoid and tanh functions, which can retain the most important emotional feature information and improve the nonlinear fitting ability of the model. The formula of the gated convolutional network is defined as follows:

𝐙𝒞subscript𝐙𝒞\displaystyle\mathbf{Z}_{\mathcal{C}}bold_Z start_POSTSUBSCRIPT caligraphic_C end_POSTSUBSCRIPT =tanh(Conv1D(h~it))absent𝑡𝑎𝑛𝐶𝑜𝑛𝑣1𝐷superscriptsubscript~𝑖𝑡\displaystyle=tanh\left(Conv1D\left(\tilde{h}_{i}^{t}\right)\right)= italic_t italic_a italic_n italic_h ( italic_C italic_o italic_n italic_v 1 italic_D ( over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) ) (6)
sigmoid(Conv1D(h~it))direct-productabsent𝑠𝑖𝑔𝑚𝑜𝑖𝑑𝐶𝑜𝑛𝑣1𝐷superscriptsubscript~𝑖𝑡\displaystyle\odot sigmoid\left(Conv1D\left(\tilde{h}_{i}^{t}\right)\right)⊙ italic_s italic_i italic_g italic_m italic_o italic_i italic_d ( italic_C italic_o italic_n italic_v 1 italic_D ( over~ start_ARG italic_h end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT ) )

where Conv1D𝐶𝑜𝑛𝑣1𝐷Conv1Ditalic_C italic_o italic_n italic_v 1 italic_D represents 1D convolution operations, direct-product\odot represents Hadama product.

3.4.2 Late Adaptive Fusion (AIM-LAF)

To capture finer-grained semantic information in multi-modal data, early and late adaptive fusion mechanisms are combined to capture auxiliary information with fine-grained emotional features. Specifically, late fusion fuses highly abstract time and space information, ignoring detailed information. Therefore, the combination of early and late adaptive fusion mechanisms proposed in this paper can more effectively capture more discriminative emotional features adaptively from multi-modal data.

In the early fusion process, we map the contextual features 𝐙𝒞subscript𝐙𝒞\mathbf{Z}_{\mathcal{C}}bold_Z start_POSTSUBSCRIPT caligraphic_C end_POSTSUBSCRIPT through the gated convolutional network and the latent features 𝐙𝒢subscript𝐙𝒢\mathbf{Z}_{\mathcal{G}}bold_Z start_POSTSUBSCRIPT caligraphic_G end_POSTSUBSCRIPT through the graph random neural network to the same dimension, obtain 𝐳g~~subscript𝐳𝑔\tilde{\mathbf{z}_{g}}over~ start_ARG bold_z start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT end_ARG and 𝐳c~~subscript𝐳𝑐\tilde{\mathbf{z}_{c}}over~ start_ARG bold_z start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_ARG and fuse them. The formula is defined as follows:

𝐳f~=Ω(𝐳g~,𝐳c~)~subscript𝐳𝑓Ω~subscript𝐳𝑔~subscript𝐳𝑐\tilde{\mathbf{z}_{f}}=\Omega(\tilde{\mathbf{z}_{g}},\tilde{\mathbf{z}_{c}})over~ start_ARG bold_z start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT end_ARG = roman_Ω ( over~ start_ARG bold_z start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT end_ARG , over~ start_ARG bold_z start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_ARG ) (7)

where Ω()Ω\Omega(\cdot)roman_Ω ( ⋅ ) represents summation average operation. Then we use a FCN to achieve feature dimensionality reduction and obtain 𝐳gsubscript𝐳𝑔\mathbf{z}_{g}bold_z start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT, 𝐳csubscript𝐳𝑐\mathbf{z}_{c}bold_z start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT, and 𝐳fsubscript𝐳𝑓\mathbf{z}_{f}bold_z start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT. Then we use the attention mechanism to obtain the corresponding attention score as follows:

eg=𝐪Ttanh(𝐖𝐳gT+𝐛)subscript𝑒𝑔superscript𝐪𝑇𝐖superscriptsubscript𝐳𝑔𝑇𝐛e_{g}=\mathbf{q}^{T}\cdot\tanh(\mathbf{W}\cdot\mathbf{z}_{g}^{T}+\mathbf{b})italic_e start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT = bold_q start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ⋅ roman_tanh ( bold_W ⋅ bold_z start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT + bold_b ) (8)

where q𝑞qitalic_q represents the query matrix, W𝑊Witalic_W and b𝑏bitalic_b are the learnable parameters. Likewise, ecsubscript𝑒𝑐e_{c}italic_e start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and efsubscript𝑒𝑓e_{f}italic_e start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT are calculated using Eq. 8. Then we use the softmax function to normalize the attention coefficient as follows:

φg=exp(eg)exp(eg)+exp(ec)+exp(ef)subscript𝜑𝑔subscript𝑒𝑔subscript𝑒𝑔subscript𝑒𝑐subscript𝑒𝑓\varphi_{g}=\frac{\exp(e_{g})}{\exp(e_{g})+\exp(e_{c})+\exp(e_{f})}italic_φ start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT = divide start_ARG roman_exp ( italic_e start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ) end_ARG start_ARG roman_exp ( italic_e start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ) + roman_exp ( italic_e start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ) + roman_exp ( italic_e start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ) end_ARG (9)

Finally, we perform a weighted sum of 𝐳g,𝐳csubscript𝐳𝑔subscript𝐳𝑐{\mathbf{z}_{g}},{\mathbf{z}_{c}}bold_z start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT , bold_z start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT and 𝐳fsubscript𝐳𝑓{\mathbf{z}_{f}}bold_z start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT to obtain the final emotional feature vector representation. The formula is defined as follows:

𝐳=φg𝐳g+φc𝐳c+φf𝐳f𝐳subscript𝜑𝑔subscript𝐳𝑔subscript𝜑𝑐subscript𝐳𝑐subscript𝜑𝑓subscript𝐳𝑓\mathbf{z}=\varphi_{g}\cdot\mathbf{z}_{g}+\varphi_{c}\cdot\mathbf{z}_{c}+% \varphi_{f}\cdot\mathbf{z}_{f}bold_z = italic_φ start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ⋅ bold_z start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT + italic_φ start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ⋅ bold_z start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT + italic_φ start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT ⋅ bold_z start_POSTSUBSCRIPT italic_f end_POSTSUBSCRIPT (10)

3.5 Model Training

The final emotional feature vector 𝐳𝐳\mathbf{z}bold_z with contextual semantic information, dialogue relationship information between speakers, and long-distance latent dependency information is fed into the MLP with residual connections for feature conversion, and then use the softmax layer to get the probability of C-class emotion category:

Z𝑍\displaystyle Zitalic_Z =𝐳+ReLU(𝐳Wz+bz)absent𝐳𝑅𝑒𝐿𝑈𝐳subscript𝑊𝑧subscript𝑏𝑧\displaystyle=\mathbf{z}+ReLU(\mathbf{z}W_{z}+b_{z})= bold_z + italic_R italic_e italic_L italic_U ( bold_z italic_W start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) (11)
P𝑃\displaystyle Pitalic_P =softmax(ZWZ+bZ)absentsoftmax𝑍subscript𝑊𝑍subscript𝑏𝑍\displaystyle=\text{softmax}(ZW_{Z}+b_{Z})= softmax ( italic_Z italic_W start_POSTSUBSCRIPT italic_Z end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT italic_Z end_POSTSUBSCRIPT )

where Wzsubscript𝑊𝑧W_{z}italic_W start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT, bzsubscript𝑏𝑧b_{z}italic_b start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT, WZsubscript𝑊𝑍W_{Z}italic_W start_POSTSUBSCRIPT italic_Z end_POSTSUBSCRIPT, bZsubscript𝑏𝑍b_{Z}italic_b start_POSTSUBSCRIPT italic_Z end_POSTSUBSCRIPT is the learnable parameters. We then obtain the index of the maximum emotion probability by using the argmax function.

y^(j)=argmax(P(j))superscript^𝑦𝑗𝑎𝑟𝑔𝑚𝑎𝑥superscript𝑃𝑗\hat{y}^{(j)}=argmax(P^{(j)})over^ start_ARG italic_y end_ARG start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT = italic_a italic_r italic_g italic_m italic_a italic_x ( italic_P start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ) (12)

Finally, we use cross-entropy loss to complete the optimization of the model:

L=1i=1MLii=1Mj=1Lic=1Cyi,c(j)log2(y^i,c(j))𝐿1superscriptsubscript𝑖1𝑀subscript𝐿𝑖superscriptsubscript𝑖1𝑀superscriptsubscript𝑗1subscript𝐿𝑖superscriptsubscript𝑐1𝐶superscriptsubscript𝑦𝑖𝑐𝑗𝑙𝑜subscript𝑔2superscriptsubscript^𝑦𝑖𝑐𝑗L=-\frac{1}{\sum_{i=1}^{M}L_{i}}\sum_{i=1}^{M}\sum_{j=1}^{L_{i}}\sum_{c=1}^{C}% y_{i,c}^{(j)}log_{2}(\hat{y}_{i,c}^{(j)})italic_L = - divide start_ARG 1 end_ARG start_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_M end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_c = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_C end_POSTSUPERSCRIPT italic_y start_POSTSUBSCRIPT italic_i , italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT italic_l italic_o italic_g start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( over^ start_ARG italic_y end_ARG start_POSTSUBSCRIPT italic_i , italic_c end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_j ) end_POSTSUPERSCRIPT ) (13)

where M𝑀Mitalic_M represents the number of dialogues, and Lisubscript𝐿𝑖L_{i}italic_L start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT represents the number of utterances in the i𝑖iitalic_i-th dialogue.

4 Experiments

4.1 Datasets

We evaluate the ELR-GNN model proposed in this paper on two benchmark datasets, IEMOCAP [34] and MELD [35]. All these data sets contain three modal data sets: text, video, and audio.

IEMOCAP is a public dataset widely used in emotion recognition research. This dataset was created by the Sippy team at the University of Southern California and aims to provide detailed annotations of emotional interactions and speech/non-verbal behaviors. The IEMOCAP dataset emotionally annotates speech and video, including six emotion categories: happy, sad, angry, excited, frustrated, and neutral. Emotional annotation is accomplished through consistent annotation of data by multiple evaluators. The IEMOCAP dataset contains text, audio, and video data from 10 different actors. Each actor participated in a series of emotional interaction tasks.

MELD is an open multi-modal dataset for emotion analysis research. It was created by researchers at the University of Toronto to advance research into natural language and speech emotion recognition. The MELD data set contains data in three modalities: text, video and audio. The text is the script text from the movie dialogue. The MELD dataset contains annotations for six emotion categories: joy, sadness, anger, fear, surprise and neutral. These emotion annotations are performed independently by multiple annotators.

Table 1: Comparison with other baseline models on the IEMOCAP dataset.
Methods IEMOCAP
Happy Sad Neutral Angry Excited Frustrated Average(w)
Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1
TextCNN 27.7 29..8 57.1 53.8 34.3 40.1 61.1 52.4 46.1 50.0 62.9 55.7 48.9 48.1
bc-LSTM 29.1 34.4 57.1 60.8 54.1 51.8 57.0 56.7 51.1 57.9 67.1 58.9 55.2 54.9
MFN 24.0 34.1 65.6 70.5 55.5 52.1 72.3 66.8 64.3 62.1 67.9 62.5 60.1 59.9
CMN 25.0 30.3 55.9 62.4 52.8 52.3 61.7 59.8 55.5 60.2 71.1 60.6 56.5 56.1
LFM 25.6 33.1 75.1 78.8 58.5 59.2 64.7 65.2 80.2 71.8 61.1 58.9 63.4 62.7
ICON 22.2 29.9 58.8 64.6 62.8 57.4 64.7 63.0 58.9 63.4 67.2 60.8 59.1 58.5
A-DMN 43.1 50.6 69.4 76.8 63.0 62.9 63.5 56.5 88.3 77.9 53.3 55.7 64.6 64.3
DialogueGCN 40.6 42.7 89.1 84.5 62.0 63.5 67.5 64.1 65.5 63.1 64.1 66.9 65.2 64.1
RGAT 60.1 51.6 78.8 77.3 60.1 65.4 70.7 63.0 78.0 68.0 64.3 61.2 65.0 65.2
AGHMN 48.3 52.1 68.3 73.3 61.6 58.4 57.5 61.9 68.1 69.7 67.1 62.3 63.5 63.5
BiERU 54.2 31.5 80.6 84.2 64.7 60.2 67.9 65.7 62.8 74.1 61.9 61.3 66.1 64.7
CoMPM 59.9 60.7 78.0 82.2 60.4 63.0 70.2 59.9 85.8 78.2 62.9 59.5 67.7 67.2
EmoBERTa 56.9 56.4 79.1 83.0 64.0 61.5 70.6 69.6 86.0 78.0 63.8 68.7 67.3 67.3
COGMEN 57.4 51.9 81.4 81.7 65.4 68.6 69.5 66.0 83.3 75.3 63.8 68.2 68.2 67.6
CTNet 47.9 51.3 78.0 79.9 69.0 65.8 72.9 67.2 85.3 78.7 52.2 58.8 68.0 67.5
LR-GCN 54.2 55.5 81.6 79.1 59.1 63.8 69.4 69.0 76.3 74.0 68.2 68.9 68.5 68.3
DER-GCN 60.7 58.8 75.9 79.8 66.5 61.5 71.3 72.1 71.1 73.3 66.1 67.8 69.7 69.4
ELR-GCN 64.7 62.9 75.7 80.8 66.2 62.4 70.7 70.0 76.8 78.6 67.9 68.1 70.6 70.9

4.2 Baselines and Evaluation Metrics

bc-LSTM [36] performs final emotion recognition by extracting the sequential context information of the utterance, which is context-sensitive.

Text-CNN [37] uses convolution filters to extract local semantic information from utterances, which is context-independent.

MFN [38] designs a multi-view learning mechanism to capture view-specific and cross-view semantic information, but MFN does not consider contextual information.

CMN [39] achieves the fusion of speaker information and multi-modal features by introducing an attention mechanism.

ICON [40] uses GRU to extract contextual information of multi-modal features and uses attention layers to achieve the fusion of multi-modal semantic information.

DialogueRNN [16] constructs three different gating units to achieve the extraction and fusion of speaker information, emotional information and global information.

DialogueGCN [22] DialogueGCN constructs a speaker relationship graph by using contextual semantic features, and utilizes contextual semantic information and speaker relationship information to achieve emotion classification.

Table 2: Comparison with other baseline models on the MELD dataset.
Methods MELD
Neutral Surprise Fear Sadness Joy Disgust Anger Average(w)
Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1 Acc. F1
TextCNN 76.2 74.9 43.3 45.5 4.6 3.7 18.2 21.1 46.1 49.4 8.9 8.3 35.3 34.5 56.3 55.0
bc-LSTM 78.4 73.8 46.8 47.7 3.8 5.4 22.4 25.1 51.6 51.3 4.3 5.2 36.7 38.4 57.5 55.9
DialogueRNN 72.1 73.5 54.4 49.4 1.6 1.2 23.9 23.8 52.0 50.7 1.5 1.7 41.0 41.5 56.1 55.9
DialogueGCN 70.3 72.1 42.4 41.7 3.0 2.8 20.9 21.8 44.7 44.2 6.5 6.7 39.0 36.5 54.9 54.7
RGAT 76.0 78.1 40.1 41.5 3.0 2.4 32.1 30.7 68.1 58.6 4.5 2.2 40.0 44.6 60.3 61.1
CoMPM 78.3 82.0 48.3 49.2 1.7 2.9 35.9 32.3 71.4 61.5 3.1 2.8 42.2 45.8 64.1 65.3
EmoBERTa 78.9 82.5 50.2 50.2 1.8 1.9 33.3 31.2 72.1 61.7 9.1 2.5 43.3 46.4 64.1 65.2
ConGCN 46.8 45.4 10.6 8.8 8.7 8.1 53.1 54.6 76.7 75.2 28.5 26.3 50.3 48.4 59.4 58.7
A-DMN 76.5 78.9 56.2 55.3 8.2 8.6 22.1 24.9 59.8 57.4 1.2 3.4 41.3 40.9 61.5 60.4
LR-GCN 76.7 80.0 53.3 55.2 0.0 0.0 49.6 35.1 68.0 64.4 10.7 2.7 48.0 51.0 65.7 65.6
DER-GCN 76.8 80.6 50.5 51.0 14.8 10.4 56.7 41.5 69.3 64.3 17.2 10.3 52.5 57.4 66.8 66.1
ELR-GCN 80.2 83.6 36.8 35.4 19.2 13.1 80.2 83.6 76.5 69.7 55.6 13.0 52.1 57.7 68.7 69.9

ConGCN [41] treats multimodal features as node features in the graph and utilizes heterogeneous graphs to model conversational relationship information between spe-akers.

LR-GCN [23] captures the latent dependencies between contexts by constructing multiple graphs and constructs densely connected layers to extract speaker relationship information and structural information of the graph.

AGHMN [42] uses BiGRU to fuse the correlation information between historical contexts and uses the attention mechanism to give higher weight to important context information.

BiERU [43] uses emotion recurrent units and emotion feature extractors to extract contextual semantic information respectively. and refine contextual emotion feature vectors.

EmoBERTa [44] uses RoBERTa to extract sequential contextual semantic information from text. This method does not use multi-modal data.

LFM [45] uses low-rank decomposition to effectively reduce the dimensionality disaster problem that occurs during the fusion process of multi-modal features.

RGAT [46] integrates position encoding information into the graph attention network to improve the model’s context understanding ability.

CoMPM [47] uses a pre-trained model to extract pre-trained context memory information and combines it with the context model to understand the global contextual emotional features in a fine-grained manner.

COGMEN [48] improves the representation ability of emotional feature vectors by building context GCN to extract global and local context information and fuse them.

DER-GCN [49] improves the model’s emotional representation capabilities by constructing speaker relationship graphs and event graphs.

A-DMN [50] A-DMN comprehensively considers the intra-speaker and inter-speaker contextual information, and uses GRU to achieve cross-modal feature fusion.

CTNet [19] realizes semantic information interaction within and between modalities by building Single Transformer and Cross Transformer.

Refer to caption
Figure 2: Confusion matrix of ELR-GNN and LR-GNN classification on IEMOCAP and MELD datasets.

4.3 Comparison with the State-of-the-Art Methods

To verify the superiority of the ELR-GNN method proposed in this paper, we report the experimental results of ELR-GNN and other comparative methods on the IEMOCAP and MELD data sets. Experimental results are presented in Tables 1 and 2.

IEMOCAP: As shown in Table 1, the multi-modal emotion recognition method proposed in this paper achie-ved the best emotion recognition effect on the IEMOCAP data set, with an average accuracy of 70.6% and an average F1 value of 70.9%. ELR-GCN proposes an effective modeling method of long-distance context latent dependencies for multi-modal emotion recognition. In addition, ELR-GCN also combines early and adaptive late fusion methods to achieve the capture of fine-grained emotional features. Among other comparison methods, the emotion recognition effect of DER-GCN is slightly lower than that of ELR-GNN, with an average accuracy of 69.7% and an average F1 value of 69.4%. Although DER-GCN comprehensively considers event relationships and dialogue relationships between speakers to enhance the model’s emotional understanding, it ignores latent context dependencies. The emotion recognition effect of LR-GCN is lower than ELR-GNN and DER-GCN, with an average accuracy of 68.5% and an average F1 value of 68.3%. Although LR-GCN considers latent dependencies between contexts, due to the high computational complexity of GCN, LR-GCN can only capture local latent dependencies. The emotion recognition effects of other comparison methods are lower than ELR-GNN. Likewise, none of them take into account potential dependencies on context. Overall, the accuracy of ELR-GNN on the happy emotion analogy is much higher than that of other comparison algorithms, while the accuracy of other emotion categories is also relatively close to that of other comparison algorithms. In addition, the F1 value of ELR-GNN on the happy and excited emotional analogies is much higher than that of other comparison algorithms. At the same time, the F1 value of ELR-GNN on other emotional categories is also relatively close to other comparison algorithms. The experimental results prove the superiority of the ELR-GNN method proposed in this paper.

MELD: As shown in Table 2, The ELR-GNN method proposed in this article has the best emotion recognition effect on the MELD data set, with an average accuracy of 68.7% and an average F1 value of 69.9%. The emotion recognition effect of DER-GCN is second, with an average accuracy of 69.7% and an average F1 value of 69.4%. The emotion recognition effect of LR-GCN is lower than that of ELR-GNN and DER-GCN, with an average accuracy of 68.5% and an average F1 value of 68.3%. The emotion recognition effects of other comparison methods are relatively poor, and the average accuracy and F1 value are lower than ELR-GNN. The performance improvement may be attributed to ELR-GNN’s ability to capture long-distance contextual latent dependencies and fine-grained fusion of dialogue relationships between speakers, contextual latent dependencies and contextual semantic information. Overall, the accuracy of ELR-GNN on the neutral, fear, sadness, joy, and disgust emotion analogy is much higher than that of other comparison algorithms, while the accuracy of other emotion categories is also relatively close to that of other comparison algorithms. In addition, the F1 value of ELR-GNN on the neutral, fear, sadness, joy, and anger emotional analogies is much higher than that of other comparison algorithms. At the same time, the F1 value of ELR-GNN on other emotional categories is also relatively close to other comparison algorithms. In addition, we find that ELR-GNN has better emotion recognition effects on the minority emotions fear and disgust, with relatively high accuracy and F1 value. The experimental results prove the superiority of the ELR-GNN method proposed in this paper.

In addition, to intuitively illustrate that the running time of the ELR-GNN method proposed in this paper is better than other comparative methods, we statistics in Table 3 the running time of other comparative methods of the ELR-GNN method on the IEMOCAP and MELD data sets. As shown in Table 3, the running time of the ELR-GNN method proposed in this paper on the IEMOCAP and MELD data sets is 41s and 91s respectively, which is significantly better than other comparison methods. The running times of DialogueGCN are 58s and 127s respectively, which are lower than LR-GCN and DER-GCN, but the emotion recognition effect is relatively poor. The running times of LR-GCN are 87s and 142s respectively. The running times of DER-GCN are 125s and 189s respectively. The experimental results prove the efficiency and effectiveness of the ELR-GNN method proposed in this paper.

Table 3: We tested the running time of the ELR-GNN method proposed in this paper and other comparative methods on the IEMOCAP and MELD data sets. In particular, ELR-GNN sets rmaxsubscript𝑟𝑚𝑎𝑥r_{max}italic_r start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT to 105superscript10510^{-5}10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT and neighbor size to 64.
      Methods       Running time (s)
      IEMOCAP       MELD
      DialogueGCN       58       127
      LR-GCN       87       142
      DER-GCN       125       189
      ELR-GNN       41       91

4.4 Analysis of the Experimental Results

To intuitively understand the ability of the feeling model for each emotion category, we analyzed the emotion classification of ELR-GNN and LR-GCN on the test set. Fig. 2 shows the confusion matrix of ELR-GNN and LR-GCN for emotion classification on IEMOCAP and MELD data sets.

Refer to caption
Figure 3: We tested the impact of the maximum neighborhood size and parameter rmaxsubscript𝑟𝑚𝑎𝑥r_{max}italic_r start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT in ELR-GNN on the accuracy and running time of emotion recognition.

Overall, on the IEMOCAP data set, ELR-GCN has a higher number of correct classifications for each emotion category than LR-GCN. On the MELD dataset, ELR-GCN has a higher number of correct classifications than LR-GCN in most emotional categories. The performance improvement may be attributed to ELR-GNN’s ability to understand the semantic representation of each emotion category in a fine-grained manner.

On the IEMOCAP dataset, the confusion matrix shows that ELR-GNN easily misclassifies happy emotions into excited emotions. Similarly, LR-GCN also easily misclassifies happy emotions into excited emotions, and even the number of misclassifications is greater than that of ELR-GNN. We speculate that this is because the semantics of happy emotions and excited emotions are relatively similar, and the model cannot differentiate between these two types of emotions in a fine-grained manner. In addition, we also find that ELR-GNN easily misclassifies neutral emotions into frustated emotions.

On the MELD data set, the confusion matrix shows that ELR-GNN has a very poor classification effect on disgust and fear emotions, and can only correctly classify a few samples. This is because the number of disgust and fear emotion categories is relatively small, and the data set has a serious imbalance problem, which leads to deviations in the model’s emotional understanding ability. The number of correct classifications of ELR-GNN on neutral emotions is very large, and there are very few misclassified samples. Experimental results prove that ELR-GNN has a relatively strong ability to understand neutral emotional categories.

4.5 Ablation Study

4.5.1 Importance of the Modalities

To verify the importance of the three modal features of text, video and audio for ELR-GNN, we conducted ablation experiments on the IEMOCAP and MELD data sets to compare the performance of the combination of different modal features. The experimental results are shown in Table 4. In single-modal experiments, ELR-GNN with text modality features has the best emotion recognition effect. The average accuracy on the IEMOCAP and MELD data sets are 64.1% and 63.5%, respectively, and the average F1 value is 63.9% and 62.4%, respectively. The emotion recognition effect of ELR-GNN with audio modal features is second, with average accuracy rates of 61.1% and 62.7% on the IEMOCAP and MELD data sets, and average F1 values of 60.8% and 62.0% respectively. ELR-GNN with video modality features has the worst emotion recognition effect, with average accuracy rates of 59.4% and 60.1% on the IEMOCAP and MELD data sets, and average F1 values of 59.7% and 61.4% respectively. Experimental results show that text features contain the most emotional semantic information. In the dual-modal experiment, ELR-GNN with text and audio modal features has the best emotion recognition effect. The average accuracy on the IEMOCAP and MELD data sets are 65.0% and 64.1%, respectively, and the average F1 values are are 64.4% and 63.2%, respectively. Experimental results demonstrate the effectiveness of multimodal features.

Table 4: The effect of ELR-GNN on IEMOCAP and MELD datasets using unimodal features and multimodal features, respectively. We report average accuracy and F1-score.
Modality IEMOCAP MELD
Acc. F1 Acc F1
T 64.1 63.9 63.5 62.4
A 61.1 60.8 62.7 62.0
V 59.4 59.7 60.1 61.4
T+A 65.0 64.4 64.1 63.2
T+V 64.3 64.6 64.0 62.9
V+A 63.0 62.7 61.5 61.9
T+A+V 70.6 70.9 68.7 69.9

4.5.2 Parameter Analysis

We tested the impact of the maximum neighborhood size and parameter rmaxsubscript𝑟𝑚𝑎𝑥r_{max}italic_r start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT in ELR-GNN on the accuracy and running time of emotion recognition. As shown in Figs. 3(a), and 3(b), we tested the impact of different neighborhood sizes and rmaxsubscript𝑟𝑚𝑎𝑥r_{max}italic_r start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT on emotion recognition accuracy on the IEMOCAP and MELD datasets. Experimental results show that when r=105𝑟superscript105r=10^{-5}italic_r = 10 start_POSTSUPERSCRIPT - 5 end_POSTSUPERSCRIPT, ELR-GNN has the best emotion recognition effect. When r=104𝑟superscript104r=10^{-4}italic_r = 10 start_POSTSUPERSCRIPT - 4 end_POSTSUPERSCRIPT, the emotion recognition effect of ELR-GNN is second. When r=103𝑟superscript103r=10^{-3}italic_r = 10 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT, ELR-GNN has the worst emotion recognition effect. Furthermore, as the size of the neighborhood continues to increase, the model’s emotion recognition performance also improves. Experimental results demonstrate the necessity of capturing long-range latent context dependencies.

As shown in Figs. 3(c), and 3(d), We also calculated the impact of different neighborhood sizes on running time and emotion recognition accuracy. Experimental results show that as the neighborhood size increases, the running time of the model also increases, but it is lower than the running time of LR-GCN and DER-GCN. In addition, as the neighborhood size increases, the emotion recognition effect of the model also improves.

5 Conclusions

In this paper, we propose a novel Efficient Long-distance Latent Relation-aware Graph Neural Network (ELR-GNN) for multi-modal emotion recognition. Specifically, we first use RoBERTa, 3D-CNN and openSMILE to perform pre-feature extraction of text, video and audio features respectively. Next, we use Bi-LSTM to capture contextual semantic information and obtain low-level utterence features. We then use low-level utterence features to construct a speaker graph. In the constructed speaker relationship graph, low-level utterence features are used as node features, while dialogue relationship information between speakers is used for edge construction. To capture the latent dependency information between long-distance contexts, we use the graph random neural network algorithm to randomly sample top-k𝑘kitalic_k nodes for information extraction. In addition, we combine early fusion and adaptive late fusion mechanisms to simultaneously fuse speaker relationship information and latent dependency information between contexts. On the IEMOCAP and MELD data sets, the ELR-GNN method proposed in this paper is better than other comparative methods, and the experimental results prove the superiority of the ELR-GNN method.

CRediT authorship contribution statement

Yuntao Shou: Conceptualization, Methodology, Software, Data curation,Visualization, Validation, Writing - original draft, Writing - review & editing. Wei Ai: Conceptualization, Methodology, Writing - review & editing. Jiayi Du: Conceptualization, Methodology, Writing - review & editing. Haiyan Liu: Conceptualization, Methodology, Writing - review & editing. Tao Meng: Conceptualization, Methodology, Software, Visualization, Validation, Writing - original draft, Writing - review & editing, Funding acquisition.

Data availability

Data will be made available on request.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by National Natural Science Foundation of China (Grant No. 61802444), the Research Foundation of Education Bureau of Hunan Province of China (GrantNo. 22B0275).

References

  • [1] Z. Lian, B. Liu, J. Tao, Smin: Semi-supervised multi-modal interaction network for conversational emotion recognition, IEEE Transactions on Affective Computing (2022).
  • [2] Y. Shou, T. Meng, W. Ai, S. Yang, K. Li, Conversational emotion recognition studies based on graph convolutional neural networks and a dependent syntactic analysis, Neurocomputing 501 (2022) 629–639.
  • [3] Y. Shou, X. Cao, D. Meng, B. Dong, Q. Zheng, A low-rank matching attention based cross-modal feature fusion method for conversational emotion recognition, arXiv preprint arXiv:2306.17799 (2023).
  • [4] T. Meng, Y. Shou, W. Ai, N. Yin, K. Li, Deep imbalanced learning for multimodal emotion recognition in conversations, arXiv preprint arXiv:2312.06337 (2023).
  • [5] T. Meng, Y. Shou, W. Ai, J. Du, H. Liu, K. Li, A multi-message passing framework based on heterogeneous graphs in conversational emotion recognition, Neurocomputing 569 (2024) 127109.
  • [6] W. Ai, Y. Shou, T. Meng, K. Li, Der-gcn: Dialog and event relation-aware graph convolutional neural network for multimodal dialog emotion recognition, IEEE Transactions on Neural Networks and Learning Systems (2024).
  • [7] Y. Shou, T. Meng, W. Ai, K. Li, Adversarial representation with intra-modal and inter-modal graph contrastive learning for multimodal emotion recognition, arXiv preprint arXiv:2312.16778 (2023).
  • [8] Y. Shou, T. Meng, F. Zhang, N. Yin, K. Li, Revisiting multi-modal emotion learning with broad state space models and probability-guidance fusion, arXiv preprint arXiv:2404.17858 (2024).
  • [9] W. Ai, F. Zhang, T. Meng, Y. Shou, H. Shao, K. Li, A two-stage multimodal emotion recognition model based on graph contrastive learning, in: 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), IEEE, 2023, pp. 397–404.
  • [10] T. Meng, F. Zhang, Y. Shou, W. Ai, N. Yin, K. Li, Revisiting multimodal emotion recognition in conversation from the perspective of graph spectrum, arXiv preprint arXiv:2404.17862 (2024).
  • [11] D. Priyasad, T. Fernando, S. Denman, S. Sridharan, C. Fookes, Attention driven fusion for multi-modal emotion recognition, in: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2020, pp. 3227–3231.
  • [12] J. Zhang, Z. Yin, P. Chen, S. Nichele, Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review, Information Fusion 59 (2020) 103–126.
  • [13] V. Chudasama, P. Kar, A. Gudmalwar, N. Shah, P. Wasnik, N. Onoe, M2fnet: Multi-modal fusion network for emotion recognition in conversation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 4652–4661.
  • [14] C. Li, Z. Bao, L. Li, Z. Zhao, Exploring temporal representations by leveraging attention-based bidirectional lstm-rnns for multi-modal emotion recognition, Information Processing & Management 57 (3) (2020) 102185.
  • [15] Y. Shou, T. Meng, W. Ai, C. Xie, H. Liu, Y. Wang, Object detection in medical images based on hierarchical transformer and mask mechanism, Computational Intelligence and Neuroscience 2022 (1) (2022) 5863782.
  • [16] N. Majumder, S. Poria, D. Hazarika, R. Mihalcea, A. Gelbukh, E. Cambria, Dialoguernn: An attentive rnn for emotion detection in conversations, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 33, 2019, pp. 6818–6825.
  • [17] Z. Yang, X. Li, Y. Cheng, T. Zhang, X. Wang, Emotion recognition in conversation based on a dynamic complementary graph convolutional network, IEEE Transactions on Affective Computing (2024).
  • [18] H. Ma, J. Wang, H. Lin, B. Zhang, Y. Zhang, B. Xu, A transformer-based model with self-distillation for multimodal emotion recognition in conversations, IEEE Transactions on Multimedia (2023).
  • [19] Z. Lian, B. Liu, J. Tao, Ctnet: Conversational transformer network for emotion recognition, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 985–1000.
  • [20] J. Li, X. Wang, G. Lv, Z. Zeng, Graphcfc: A directed graph based cross-modal feature complementation approach for multimodal conversational emotion recognition, IEEE Transactions on Multimedia (2023).
  • [21] Y. Shou, W. Ai, T. Meng, F. Zhang, K. Li, Graphunet: Graph make strong encoders for remote sensing segmentation, in: 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), IEEE, 2023, pp. 2734–2737.
  • [22] D. Ghosal, N. Majumder, S. Poria, N. Chhaya, A. Gelbukh, Dialoguegcn: A graph convolutional neural network for emotion recognition in conversation, in: EMNLP-IJCNLP 2019-2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 2020.
  • [23] M. Ren, X. Huang, W. Li, D. Song, W. Nie, Lr-gcn: Latent relation-aware graph convolutional network for conversational emotion recognition, IEEE Transactions on Multimedia 24 (2021) 4422–4432.
  • [24] D. Zhang, F. Chen, J. Chang, X. Chen, Q. Tian, Structure aware multi-graph network for multi-modal emotion recognition in conversations, IEEE Transactions on Multimedia (2023).
  • [25] D. Hazarika, S. Poria, R. Zimmermann, R. Mihalcea, Emotion recognition in conversations with transfer learning from generative conversation modeling, arXiv preprint arXiv:1910.04980 (2019).
  • [26] W. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs, Advances in neural information processing systems 30 (2017).
  • [27] J. Chen, T. Ma, C. Xiao, Fastgcn: Fast learning with graph convolutional networks via importance sampling, in: International Conference on Learning Representations, 2018.
  • [28] D. Zou, Z. Hu, Y. Wang, S. Jiang, Y. Sun, Q. Gu, Layer-dependent importance sampling for training deep and large graph convolutional networks, Advances in neural information processing systems 32 (2019).
  • [29] W.-L. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, C.-J. Hsieh, Cluster-gcn: An efficient algorithm for training deep and large graph convolutional networks, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 257–266.
  • [30] H. Zeng, H. Zhou, A. Srivastava, R. Kannan, V. Prasanna, Graphsaint: Graph sampling based inductive learning method, in: International Conference on Learning Representations, 2019.
  • [31] H. Chen, D. Jiang, H. Sahli, Transformer encoder with multi-modal multi-head attention for continuous affect recognition, IEEE Transactions on Multimedia 23 (2021) 4171–4183. doi:10.1109/TMM.2020.3037496.
  • [32] J. Li, X. Wang, G. Lv, Z. Zeng, Ga2mif: Graph and attention based two-stage multi-source information fusion for conversational emotion detection, IEEE Transactions on Affective Computing (2023).
  • [33] Z. Yang, D. Li, F. Hou, Y. Song, Q. Gao, Deep feature extraction and attention fusion for multimodal emotion recognition, IEEE Transactions on Circuits and Systems II: Express Briefs (2023).
  • [34] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, S. S. Narayanan, Iemocap: Interactive emotional dyadic motion capture database, Language resources and evaluation 42 (2008) 335–359.
  • [35] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, R. Mihalcea, Meld: A multimodal multi-party dataset for emotion recognition in conversations, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 527–536.
  • [36] S. Poria, E. Cambria, D. Hazarika, N. Majumder, A. Zadeh, L.-P. Morency, Context-dependent sentiment analysis in user-generated videos, in: Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers), 2017, pp. 873–883.
  • [37] Y. Kim, Convolutional neural networks for sentence classification, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, 2014.
  • [38] A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, L.-P. Morency, Memory fusion network for multi-view sequential learning, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 32, 2018.
  • [39] D. Hazarika, S. Poria, A. Zadeh, E. Cambria, L.-P. Morency, R. Zimmermann, Conversational memory network for emotion recognition in dyadic dialogue videos, in: Proceedings of the conference. Association for Computational Linguistics. North American Chapter. Meeting, Vol. 2018, NIH Public Access, 2018, p. 2122.
  • [40] D. Hazarika, S. Poria, R. Mihalcea, E. Cambria, R. Zimmermann, Icon: Interactive conversational memory network for multimodal emotion detection, in: Proceedings of the 2018 conference on empirical methods in natural language processing, 2018, pp. 2594–2604.
  • [41] D. Zhang, L. Wu, C. Sun, S. Li, Q. Zhu, G. Zhou, Modeling both context-and speaker-sensitive dependence for emotion detection in multi-speaker conversations., in: IJCAI, 2019, pp. 5415–5421.
  • [42] W. Jiao, M. Lyu, I. King, Real-time emotion recognition via attention gated hierarchical memory network, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 34, 2020, pp. 8002–8009.
  • [43] W. Li, W. Shao, S. Ji, E. Cambria, Bieru: Bidirectional emotional recurrent unit for conversational sentiment analysis, Neurocomputing 467 (2022) 73–82.
  • [44] T. Kim, P. Vossen, Emoberta: Speaker-aware emotion recognition in conversation with roberta, Computing Research Repository 2108 (12009) (2021).
  • [45] Z. Liu, Y. Shen, V. B. Lakshminarasimhan, P. P. Liang, A. Bagher Zadeh, L.-P. Morency, Efficient low-rank multimodal fusion with modality-specific factors, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL, 2018, pp. 2247–2256.
  • [46] T. Ishiwatari, Y. Yasuda, T. Miyazaki, J. Goto, Relation-aware graph attention networks with relational position encodings for emotion recognition in conversations, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 7360–7370.
  • [47] J. Lee, W. Lee, Compm: Context modeling with speaker’s pre-trained memory tracking for emotion recognition in conversation, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 5669–5679.
  • [48] A. Joshi, A. Bhat, A. Jain, A. Singh, A. Modi, Cogmen: Contextualized gnn based multimodal emotion recognition, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 4148–4164.
  • [49] W. Ai, Y. Shou, T. Meng, K. Li, Der-gcn: Dialogue and event relation-aware graph convolutional neural network for multimodal dialogue emotion recognition, arXiv preprint arXiv:2312.10579 (2023).
  • [50] S. Xing, S. Mai, H. Hu, Adapted dynamic memory network for emotion recognition in conversation, IEEE Transactions on Affective Computing 13 (3) (2020) 1426–1439.