
Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset

Rui Liu
Inner Mongolia University
[email protected]
Haolin Zuo
Inner Mongolia University
[email protected]
Zheng Lian
Chinese Academy of Sciences
[email protected]
Xiaofen Xing
South China University of Technology
[email protected]
Björn W. Schuller
Technical University of Munich
[email protected]
Haizhou Li
The Chinese University of Hong Kong, Shenzhen
[email protected]
Abstract

Emotion and Intent Joint Understanding in Multimodal Conversation (MC-EIU) aims to decode the semantic information manifested in a multimodal conversational history, while inferring the emotions and intents simultaneously for the current utterance. MC-EIU is an enabling technology for many human-computer interfaces. However, there is a lack of available datasets in terms of annotation, modality, language diversity, and accessibility. In this work, we propose an MC-EIU dataset, which features 7 emotion categories, 9 intent categories, 3 modalities, i.e., textual, acoustic, and visual content, and two languages, i.e., English and Mandarin. Furthermore, it is completely open-source for free access. To our knowledge, MC-EIU is the first comprehensive and rich emotion and intent joint understanding dataset for multimodal conversation. Together with the release of the dataset, we also develop an Emotion and Intent Interaction (EI2) network as a reference system by modeling the deep correlation between emotion and intent in the multimodal conversation. With comparative experiments and ablation studies, we demonstrate the effectiveness of the proposed EI2 method on the MC-EIU dataset. The dataset and code will be made available at: https://github.com/MC-EIU/MC-EIU.

1 Introduction

Emotion and Intent Joint Understanding in Multimodal Conversation (MC-EIU) aims to infer the emotional state and intent information simultaneously by modeling the semantic dependency within the multimodal conversation Singh et al. (2022). Unlike the separate tasks of emotion or intent recognition Lian et al. (2023); Zuo et al. (2023); Zhang et al. (2022a), the MC-EIU task provides richer information to help machines better understand human needs and improve empathy in human-machine conversation Poria et al. (2017a); Deng et al. (2023). It holds significant potential for application in various human-computer interaction scenarios, such as call center dialog systems Danieli et al. (2015), conversational agents Cowie et al. (2001), and mental health counseling Ringeval et al. (2018).

The MC-EIU task is challenging and involves two key problems: 1) modeling multimodal contextual information, and 2) modeling the interaction between emotion and intent. To support such research, previous researchers have done extensive work on both datasets and methodology. 1) In terms of datasets, multiple datasets have been proposed that annotate human dialogue video clips with emotion or intent labels. However, the available datasets fall short in terms of annotation, modality, language diversity, and accessibility. For example, the IEMOCAP Busso et al. (2008), MEISD Firdaus et al. (2020), MELD Poria et al. (2019), and M3ED Zhao et al. (2022) datasets annotate multimodal dialogue data with 5, 8, and 7 emotion categories respectively. However, they have limited annotation diversity as they only provide emotion labels and lack intent labels. OSED Welivita et al. (2020) is an open-source dataset that includes 32 emotions and 9 intents. However, it is limited in terms of modality and language diversity, as it only consists of English textual data. While EmoInt-MD Singh et al. (2022) aligns well with the MC-EIU task, it only provides English data and therefore lacks language diversity. Additionally, the open-source address provided for EmoInt-MD is currently unavailable. 2) In terms of methodology, some works use state-of-the-art network architectures to build multi-task frameworks for joint recognition of emotion and intent. Specifically, EmoInt-Trans Singh et al. (2022) only uses the embeddings of neighboring utterances to boost the representation of the current utterance, and then adopts a Transformer with two projection heads to predict the emotion and intent labels. This approach neglects the complex dependency relationships within the multimodal dialog history and the deep interaction between emotion and intent information, which have been proven to be crucial in multimodal dialogue understanding Deng et al. (2023).

In this work, we propose the MC-EIU dataset that simultaneously fulfills four attributes, namely annotation, modality, language diversity, and accessibility, to support the research on the MC-EIU task. This dataset consists of 4,970 conversational video clips from 3 English and 4 Chinese TV series, offering dialog scenarios closely related to the real world. It comprises annotations for 7 emotions and 9 intents, encompassing textual, acoustic, and visual modalities, in a total of 45,009 English utterances and 11,003 Mandarin utterances. Importantly, it is fully open-source, allowing for accessibility. To the best of our knowledge, this dataset is the first comprehensive and rich emotion and intent joint understanding dataset for multimodal conversation, which can promote research of affective computing for the English and Chinese languages.

Additionally, we develop an Emotion and Intent Interaction (EI2) framework as a reference system by modeling the crucial deep correlation between emotion and intent in the multimodal conversation while considering the multimodal dialog context. Specifically, we design a multimodal history encoder to model the conversation history adequately. Then, we thoroughly integrate the conversation history features with the emotion and intent features of the current utterance. Finally, we propose an emotion-intent interaction encoder to learn the complex interaction correlations between emotion and intent. In summary, the main contributions of this work are as follows:

  • We construct the first comprehensive and rich multimodal conversational emotion and intent joint understanding dataset, termed MC-EIU, which fulfills four attributes: annotation, modality, language diversity, and accessibility. It can promote research on affective computing and human-computer interaction.

  • We further propose an accompanying system, namely the Emotion and Intent Interaction (EI2) framework, for the MC-EIU task. It involves modeling multimodal conversational history and deep emotion-intent interaction.

  • The comparative experimental results demonstrate that our EI2 surpasses all advanced baselines on our benchmarking dataset. Ablation and case studies further validate our EI2.

2 Related Work

2.1 Related Dataset

Table 1 provides an overview and summary of some of the most important datasets related to our work. To highlight the significance of our MC-EIU dataset, we conduct a detailed comparison of these datasets based on annotation, modality, language diversity, and accessibility.

Annotation Diversity: The EmpatheticDialogues Rashkin et al. (2019), EmotionLines Chen et al. (2018), DailyDialog Li et al. (2017), EmoV-DB Adigwe et al. (2018), IEMOCAP, MELD, MEISD, and M3ED datasets are widely used for emotion recognition, but they primarily focus on emotion labels and lack annotation diversity as they do not include intent labels. Similarly, the Behance Intent Discovery dataset Maharana et al. (2022) and MIntRec Zhang et al. (2022a) are limited in annotation diversity because they are only used for intent recognition and do not provide emotion labels. In comparison, Twitter Customer Support Herzig et al. (2016), ED Welivita and Pu (2020), OSED, and EmoInt-MD have the advantage of including both emotion and intent labels.

Modality Diversity: The EmpatheticDialogues, EmotionLines, DailyDialog, Twitter Customer Support Herzig et al. (2016), ED Welivita and Pu (2020), and OSED datasets offer a vast collection of conversational textual data. EmoV-DB is built with speech data in five emotion categories. However, each of these datasets provides only a single modality, which limits its modality diversity. In contrast, IEMOCAP, MELD, MEISD, M3ED, the Behance Intent Discovery dataset Maharana et al. (2022), MIntRec Zhang et al. (2022a), and EmoInt-MD offer a broader range of modalities, including textual, acoustic, and visual data.

Language Diversity: Most datasets in Table 1 mainly comprise English data, while the M3ED dataset solely consists of Mandarin data. Consequently, these datasets exhibit limitations in terms of language diversity. Only EmoV-DB stands out as it includes both English and French data.

Accessibility: All the datasets listed in Table 1 are accessible except for MEISD and Twitter. It is worth mentioning that the dataset from the provided open-source address in EmoInt-MD is currently unavailable, which poses a challenge in accessing the data.

In a nutshell, there is a lack of available data on the aforementioned four aspects of the MC-EIU task. Our dataset addresses these gaps and provides data that fulfill all four aspects.

| Datasets | Emos | Ints | Modality | Language | Dias (k) | Utts (k) | AD | MD | LD | AC |
|---|---|---|---|---|---|---|---|---|---|---|
| EmpatheticDialogues Rashkin et al. (2019) | ✓ | - | T | English | 24.8 | 107.2 | × | × | × | ○ |
| EmotionLines Chen et al. (2018) | ✓ | - | T | English | 1.0 | 14.5 | × | × | × | ○ |
| DailyDialog Li et al. (2017) | ✓ | - | T | English | 12.2 | 103.6 | × | × | × | ○ |
| EmoV-DB Adigwe et al. (2018) | ✓ | - | A | English + French | - | 7.6 | × | × | ○ | ○ |
| IEMOCAP Busso et al. (2008) | ✓ | - | T+A+V | English | 0.2 | 7.4 | × | ○ | × | ○ |
| MELD Poria et al. (2019) | ✓ | - | T+A+V | English | 1.4 | 13.7 | × | ○ | × | ○ |
| MEISD Firdaus et al. (2020) | ✓ | - | T+A+V | English | 1.0 | 20.0 | × | ○ | × | × |
| M3ED Zhao et al. (2022) | ✓ | - | T+A+V | Mandarin | 1.0 | 24.4 | × | ○ | × | ○ |
| Behance Intent Discovery dataset Maharana et al. (2022) | - | ✓ | T+A+V | English | - | 20.0 | × | ○ | × | ○ |
| MIntRec Zhang et al. (2022a) | - | ✓ | T+A+V | English | - | 2.2 | × | ○ | × | ○ |
| Twitter Customer Support Herzig et al. (2016) | ✓ | ✓ | T | English | 2.4 | 14.1 | ○ | × | × | × |
| ED Welivita and Pu (2020) | ✓ | ✓ | T | English | 24.8 | 107.2 | ○ | × | × | ○ |
| OSED Welivita et al. (2020) | ✓ | ✓ | T | English | 1000.0 | 3488.3 | ○ | × | × | ○ |
| EmoInt-MD Singh et al. (2022) | ✓ | ✓ | T+A+V | English | 32.4 | 724.8 | ○ | ○ | × | × |
| MC-EIU (Ours) | ✓ | ✓ | T+A+V | English + Mandarin | 5.0 | 56.0 | ○ | ○ | ○ | ○ |

Table 1: Comparison with existing benchmark datasets. ‘T’, ‘V’, and ‘A’ refer to textual, visual, and acoustic information. Emos and Ints indicate whether emotion and intent labels are provided; Dias and Utts give the number of dialogues and utterances in thousands. AD, MD, LD, and AC denote Annotation Diversity, Modality Diversity, Language Diversity, and Accessibility, respectively.

2.2 Related Method

Multi-task Prediction: Multi-task learning aims to leverage the valuable information in multiple related tasks to enhance generalization performance across all tasks Yang and Hospedales (2017), and has a wide range of applications in fields such as Emotion Recognition Hazarika et al. (2020), Object Detection Li et al. (2023), and Conversational Speech Synthesis Liu et al. (2023). Multi-task learning methods can be broadly classified into Hard Parameter Sharing (HPS) and Soft Parameter Sharing (SPS) Zhang et al. (2022b); Pahari and Shimada (2022). In HPS, a foundational deep neural network is shared across the different tasks, while each task uses its own distinct top layer Zhang et al. (2022b); Hazarika et al. (2020). SPS allows the model to share some parameters between different tasks while preserving task-specific parameters Pahari and Shimada (2022).

Compared to HPS, the advantage of SPS lies in the ability of the network to exploit features shared among different tasks by sharing bottom-layer parameters, which allows better use of the interactive information between tasks Pahari and Shimada (2022).

Emotion-Intent Interaction: Previous works confirm that there is a strong interaction between emotion and intent Welivita and Pu (2020); Welivita et al. (2020); Singh et al. (2022); Deng et al. (2023). For instance, Singh et al. (2022) emphasize that the emotions of the speaker can be influenced by particular intents in dialogues. To explore the interaction between emotion and intent, they built the EmoInt-Trans model, which follows the HPS framework. It uses a MISA network Hazarika et al. (2020) as the backbone, where the emotion and intent prediction tasks share all the parameters of this backbone. However, the complex relationship between emotion and intent poses a challenge for HPS, as it may struggle to accommodate the distinct characteristics of each task. The biggest difference between our EI2 model and EmoInt-Trans is that we employ an SPS mode to learn the deep-level interactive information for emotion and intent.

3 MC-EIU Dataset Construction

3.1 Data Collection and Pre-processing

| Statistics | English Train | English Valid | English Test | Mandarin Train | Mandarin Valid | Mandarin Test |
|---|---|---|---|---|---|---|
| # Conversations | 2,807 | 400 | 806 | 667 | 95 | 195 |
| # Utterances | 31,451 | 4,509 | 9,049 | 7,643 | 1,148 | 2,212 |
| Duration (hours) | 28.51 | 4.02 | 8.22 | 8.51 | 1.36 | 2.42 |
| Avg. UL | 12.68 | 12.49 | 12.76 | 19.11 | 19.91 | 18.14 |
| Avg. DU (seconds) | 3.26 | 3.21 | 3.27 | 4.01 | 4.26 | 3.94 |
| Avg. # of UC | 11.20 | 11.27 | 11.23 | 11.46 | 12.08 | 11.34 |
| Avg. # of EC | 2.58 | 2.57 | 2.60 | 2.41 | 2.54 | 2.42 |
| Avg. # of IC | 3.29 | 3.86 | 3.87 | 3.18 | 3.24 | 3.10 |

Table 2: Statistics of the MC-EIU dataset. UL refers to the utterance length, DU denotes the duration per utterance, UC is the number of utterances per conversation, EC is the number of emotions per conversation, and IC is the number of intents per conversation.

To simulate emotional conversation scenarios in real-world situations Zhao et al. (2022); Singh et al. (2022), we select emotional video clips from 3 English TV series (716 episodes) and 4 Chinese TV series (119 episodes) across genres such as family, romance, and crime (more details are given in the Appendix). All videos are accompanied by corresponding subtitle files. Afterward, we preprocess the data to obtain the desired format and filter out low-quality data that does not meet the criteria. Specifically, 1) we first design regular-expression scripts to extract the text transcriptions and timestamps from the subtitles; 2) then, we leverage VideoFileClip (https://moviepy-tburrows13.readthedocs.io/en/improve-docs/ref/VideoClip/VideoFileClip.html) to partition the videos into multiple clips based on the timestamps; 3) finally, we engage crowd workers to pick out high-quality conversational segments from the same conversational scene. We provide the following specific guidelines to the crowd workers based on relevant work Zhao et al. (2022): a) conversational scenes should exclusively involve interaction between two speakers; b) the selected conversational scenes should be free from noise or special-effect sounds, so that video quality is not affected; c) each conversational segment should encompass at least two rounds of interaction between the speakers to provide abundant context information; d) the text content should be accurate and align with the video clips.
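The subtitle-extraction step can be sketched as follows. This is a minimal illustration only: the paper does not state the subtitle format or publish its scripts here, so the SRT-style layout, the pattern `SRT_BLOCK`, and the helper names `to_seconds` and `parse_srt` are all assumptions made for this example.

```python
import re

# Assumed SRT-style cue: an index line, a "HH:MM:SS,mmm --> HH:MM:SS,mmm" line,
# then one or more text lines terminated by a blank line or the end of the file.
SRT_BLOCK = re.compile(
    r"(\d+)\s*\n"
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*"
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*\n"
    r"(.+?)(?=\n\s*\n|\Z)",
    re.DOTALL,
)


def to_seconds(h, m, s, ms):
    """Convert timestamp fields (as strings) into seconds."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0


def parse_srt(text):
    """Return a list of (start_sec, end_sec, transcript) tuples."""
    cues = []
    for match in SRT_BLOCK.finditer(text.replace("\r\n", "\n")):
        groups = match.groups()
        start = to_seconds(*groups[1:5])
        end = to_seconds(*groups[5:9])
        # Join multi-line subtitle text into a single transcript string.
        transcript = " ".join(
            line.strip() for line in groups[9].splitlines() if line.strip()
        )
        cues.append((start, end, transcript))
    return cues


# Invented sample subtitle text, for illustration only.
SAMPLE = """1
00:00:01,000 --> 00:00:03,500
Are you okay?

2
00:00:04,200 --> 00:00:06,000
I'm fine,
really.
"""

cues = parse_srt(SAMPLE)
```

Each resulting (start, end) pair could then be handed to a video-cutting tool, such as the VideoFileClip class mentioned above, to produce one clip per utterance.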

3.2 Data Annotation

Annotation Scheme: The annotation scheme consists of emotion, intent, and speaker annotation. For emotion annotation, we choose Ekman’s six basic emotions (happy, surprise, sad, disgust, anger, and fear), along with the neutral label, which is a 7-emotion annotation scheme widely used in previous works Zhao et al. (2022); Busso et al. (2008); Poria et al. (2019). For intent annotation, we follow Welivita and Pu (2020); Welivita et al. (2020) and annotate each utterance based on nine intents: questioning, agreeing, acknowledging, sympathizing, encouraging, consoling, suggesting, wishing, and neutral. For speaker annotation, we follow the scheme described in Zhao et al. (2022) and assign the labels “0” and “1” to annotate the speaker for each dialog. The above annotation scheme ensures consistency with real conversation scenarios Zhao et al. (2022); Poria et al. (2019).

We recruit postgraduate students from the Foreign Languages College as annotators for our project. (Footnote: Although these annotators’ native language is Chinese, they have passed the Test for English Majors Grade Eight (TEM-8), indicating their proficiency in English language skills, including listening, speaking, reading, and writing, comparable to native English speakers.) Prior to commencing the annotation process, all annotators undergo a training and assessment phase. Only those who obtain a passing score are selected to proceed to the data annotation stage. Annotators who do not meet the passing criteria are required to undergo additional training and reassessment until they successfully pass the assessment.

Annotation Process: We recruited 21 annotators and divided them into seven groups. Each group consists of three volunteers, and non-duplicate data is assigned to each group for annotation. This ensures that each conversation is annotated by three volunteers. During the annotation process, each annotator can access the current utterance and the history in the multimodal conversation, including text, audio, and video clips. The annotators are asked to select the appropriate emotion and intent labels from the annotation scheme after watching the video clip. At the same time, annotators are required to assign speaker labels to each utterance in the dialog based on the speaking order. If annotators encounter difficulty in assigning a specific category to a video clip, they categorize the utterance as other, and the dialog containing this utterance is not included in our dataset.

3.3 Data Annotation Finalization

Figure 1: Visualization of the correlation between emotions and intents in the MC-EIU dataset, with panels (a) English and (b) Mandarin. Each circle in the graph represents the sample count for a specific ‘emotion-intent’ pair.

We employ a majority voting strategy on all utterance annotations to derive the final emotion and intent labels. If at least two annotators provide the same annotation for one utterance, it is considered the final label. If all three annotators have different annotations, an additional emotion expert is consulted to confirm the final label.
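The voting rule described above can be written as a short function. This is a sketch under stated assumptions: the function name `finalize_label` and the `expert_label` argument (standing in for the expert consultation) are hypothetical and do not come from the paper's released code.

```python
from collections import Counter


def finalize_label(annotations, expert_label=None):
    """Return the label chosen by at least two of the three annotators.

    When all three annotators disagree, fall back to `expert_label`, a
    hypothetical argument that stands in for the expert's decision.
    """
    if len(annotations) != 3:
        raise ValueError("expected exactly three annotations")
    label, count = Counter(annotations).most_common(1)[0]
    if count >= 2:
        return label
    if expert_label is None:
        raise ValueError("no majority; an expert decision is required")
    return expert_label


majority = finalize_label(["happy", "happy", "sad"])
tie_break = finalize_label(["happy", "sad", "fear"], expert_label="sad")
```

The same function applies unchanged to emotion and intent labels, since both are finalized independently per utterance.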

The statistics of the MC-EIU dataset are presented in Table 2. It comprises a total of 56,012 utterances across 4,013 conversations in English and 957 conversations in Mandarin. The total duration of the MC-EIU set is 53.06 hours. We partition the dataset into train, valid, and test sets with a ratio of 7:1:2, following Zhao et al. (2022); Poria et al. (2019). To assess the reliability of our data annotation, we compute Fleiss’s Kappa ($\kappa$) Fleiss et al. (2013) for emotion and intent annotation independently, as presented in Table 3. Note that $\kappa$ of our dataset is higher than or comparable to that of previous studies, which serves as strong evidence for the reliability and accuracy of our annotation.

| Dataset | Emotion | Intent |
|---|---|---|
| IEMOCAP Busso et al. (2008) | 0.48 | - |
| MELD Poria et al. (2019) | 0.43 | - |
| M3ED Zhao et al. (2022) | 0.59 | - |
| OSED Welivita et al. (2020) | 0.46 | 0.46 |
| EmoInt-MD Singh et al. (2022) | 0.69 | 0.57 |
| MC-EIU (English) | 0.57 | 0.59 |
| MC-EIU (Mandarin) | 0.54 | 0.57 |

Table 3: Comparison of Fleiss’s Kappa coefficients between other datasets and MC-EIU.
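For readers unfamiliar with the agreement measure in Table 3, the standard Fleiss's Kappa formula can be sketched in a few lines. The function name `fleiss_kappa` and the toy rating counts are illustrative assumptions; they are not the paper's evaluation script or data.

```python
def fleiss_kappa(counts):
    """Fleiss's Kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j (same rater count per item)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of raters")
    n_cats = len(counts[0])
    total = n_items * n_raters

    # Proportion of all assignments that fall into each category.
    p_j = [sum(row[j] for row in counts) / total for j in range(n_cats)]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items      # mean observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance
    # Note: undefined (division by zero) if every rating is in one category.
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example: 4 utterances, 3 annotators, 2 categories (invented numbers).
toy_counts = [[3, 0], [2, 1], [1, 2], [0, 3]]
kappa = fleiss_kappa(toy_counts)
```

In the paper's setting each utterance has three annotators, so a row would hold the vote counts over the 7 emotion categories (or the 9 intent categories), and the score would be computed once per label type.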

We further explore the correlation between emotion and intent. We create two 7×9 matrices, one per language, with each element representing the number of samples in the dataset for the corresponding “emotion-intent” pair. Using the sample count as the radius, we draw circles at the corresponding matrix positions. The resulting visualization is depicted in Figure 1. Larger circles indicate more samples and higher correlation. It can be observed that emotions and intents do not correspond strictly one-to-one, and that different intents vary in their influence on specific emotions, and vice versa. Taking Figure 1(a) as an example, “Hap-Agr” occurs at a higher frequency than “Hap-Sym”, indicating that “Agreeing” is more likely to drive the expression of “Happy”. Similarly, “Sad-Que” has a higher occurrence rate than “Dis-Que”, suggesting that “Questioning” is more closely associated with “Sad” than with “Disgust”. Additionally, we observe that the correlation between emotion and intent is more intricate in the English dataset than in the Mandarin dataset. For example, the emotion “Sur” is associated with all intent categories in the English dataset, whereas in the Mandarin dataset it is only linked to 6 intent categories (“Que”, “Agr”, “Con”, “Sug”, “Wis”, and “Neu”). Such a complex relationship introduces a significant challenge to the MC-EIU task. More details of the data annotation finalization are provided in the Appendix.
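The counting behind such a 7×9 matrix can be illustrated with a short sketch. The label names follow the annotation scheme in Section 3.2, but the helper names `cooccurrence_matrix` and `linked_intents` and the toy label pairs are assumptions made for this example, not the paper's data or plotting code.

```python
from collections import Counter

EMOTIONS = ["happy", "surprise", "sad", "disgust", "anger", "fear", "neutral"]
INTENTS = ["questioning", "agreeing", "acknowledging", "sympathizing",
           "encouraging", "consoling", "suggesting", "wishing", "neutral"]


def cooccurrence_matrix(labels):
    """Build a 7x9 count matrix from (emotion, intent) pairs, one per utterance."""
    counts = Counter(labels)
    return [[counts[(e, i)] for i in INTENTS] for e in EMOTIONS]


def linked_intents(matrix, emotion):
    """Intents that co-occur at least once with the given emotion."""
    row = matrix[EMOTIONS.index(emotion)]
    return [intent for intent, n in zip(INTENTS, row) if n > 0]


# Invented toy labels, for illustration only.
toy_labels = [
    ("happy", "agreeing"),
    ("happy", "agreeing"),
    ("happy", "sympathizing"),
    ("sad", "questioning"),
    ("surprise", "questioning"),
]
matrix = cooccurrence_matrix(toy_labels)
happy_intents = linked_intents(matrix, "happy")
```

Each cell count would then set the radius of the circle drawn at that matrix position, and a helper like `linked_intents` gives the kind of per-emotion summary used in the “Sur” comparison above.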

Figure 2: Overview of the Emotion-Intent Interaction (EI2) Network. The modules with fire symbols indicate the need for pre-training.

4 Emotion-Intent Interaction (EI2) Network

4.1 Overall Architecture

To model the multimodal dialog history and the deep-level interaction between emotion and intent, we propose an EI2 network as shown in Figure 2. The network consists of the following components: 1) the Emotion & Intent Encoders generate the multimodal emotion and intent representations for the current utterance; 2) the Multimodal History Encoder captures the multimodal contextual semantic information from the multimodal history; 3) the Emotion-Intent Interaction Encoder learns the deep interaction between emotions and intents in a conversation; 4) the Emotion & Intent Classifiers make predictions based on the emotion-intent interactive information.

Emotion & Intent Encoders: In contrast to previous work Singh et al. (2022) that extracts general semantic representations from utterances, we add separate feature encoders for emotion and intent understanding to extract more explicit emotion and intent features.

The emotion and intent encoders share a similar structure, which includes a Visual Encoder, a Textual Encoder, an Acoustic Encoder, and a Transformer fusion network. Assume that $u_n^a$, $u_n^t$, and $u_n^v$ represent the acoustic, textual, and visual features of the $n$-th utterance $u_n$. The multimodal emotion representation $f^*_e$ and intent representation $f^*_i$ can then be expressed as:

$$f^{*}_{s}=F^{*}_{s}(\mathrm{Concat}(F^{v}_{s}(u^{v}_{n}),F^{a}_{s}(u^{a}_{n}),F^{t}_{s}(u^{t}_{n}))), \tag{1}$$

where $s\in\{e,i\}$, $F^a_s$ is the Acoustic Encoder based on LSTM Sak et al. (2014) and max-pooling, $F^v_s$ is the Visual Encoder that has the same structure as $F^a_s$, $F^t_s$ is the TextCNN-based Text Encoder Kim (2014), and $F^*_s$ is the Transformer fusion network.

It is worth noting that the modules with fire symbols in Figure 2 indicate the need for pre-training our intent and emotion encoders. Further details will be provided in Section 4.2.

Multimodal History Encoder: Different from EmoInt-Trans Singh et al. (2022), which solely models the context information of adjacent utterances, our Multimodal History Encoder takes into account a broader range of historical information. For the current utterance $u_n$, the multimodal conversational history information $f_h$ can be represented as:

$$f^{m}_{h}=F_{h}^{m}(\{u^{m}_{1},u^{m}_{2},\ldots,u^{m}_{n-1}\}) \tag{2}$$

$$f_{h}=\mathrm{Concat}(f^{v}_{h},f^{a}_{h},f^{t}_{h}) \tag{3}$$

where $m\in\{v,a,t\}$ and $F_h^m$ denotes the GRU-based History Encoder Lee et al. (2023). Subsequently, $f_h$ is independently fused with $f^*_e$ and $f^*_i$ to obtain the robust emotion feature $f_e$ and intent feature $f_i$: $f_s=f^*_s+f_h$.
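To make the data flow of Eqs. (2) and (3) and the additive fusion concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the class `TinyGRU`, the random weights, the feature sizes, and the choice of making $f^*_s$ the same length as $f_h$ so that the sum is defined are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyGRU:
    """Minimal single-layer GRU that returns its final hidden state.
    Weights are random; this only illustrates shapes and data flow."""

    def __init__(self, input_dim, hidden_dim, rng):
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.W = rng.normal(0.0, 0.1, (3, hidden_dim, input_dim))
        self.U = rng.normal(0.0, 0.1, (3, hidden_dim, hidden_dim))
        self.b = np.zeros((3, hidden_dim))
        self.hidden_dim = hidden_dim

    def __call__(self, sequence):
        h = np.zeros(self.hidden_dim)
        for x in sequence:  # one feature vector per past utterance
            z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])
            r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])
            cand = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])
            h = (1.0 - z) * h + z * cand
        return h


def encode_history(history, encoders):
    """Eq. (2) per modality, then Eq. (3): f_h = Concat(f_h^v, f_h^a, f_h^t)."""
    return np.concatenate([encoders[m](history[m]) for m in ("v", "a", "t")])


rng = np.random.default_rng(0)
dims = {"v": 6, "a": 5, "t": 8}   # hypothetical per-modality feature sizes
hidden = 4                        # hypothetical GRU hidden size
encoders = {m: TinyGRU(d, hidden, rng) for m, d in dims.items()}
# Three past utterances u_1..u_{n-1} per modality (random stand-ins).
history = {m: rng.normal(size=(3, d)) for m, d in dims.items()}

f_h = encode_history(history, encoders)
f_star_e = rng.normal(size=3 * hidden)  # stand-in for the emotion encoder output
f_e = f_star_e + f_h                    # f_s = f*_s + f_h with s = e
```

With an empty history (the first utterance of a dialog), this sketch returns a zero vector for $f_h$, so the fused feature reduces to $f^*_s$.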

Emotion-Intent Interaction Encoder: The approach of simply sharing the hidden state is not sufficient to achieve explicit information transfer between two tasks. Therefore, we propose the Emotion-Intent Interaction Encoder to learn the deep interactive information between them by considering their complex correlations. As shown in Figure 2, the Emotion-Intent Interaction Encoder contains two branches for emotion or intent prediction, where each branch consists of Binary Correlation Attention, Triple Interaction Attention, and Gate Regulator.

Binary Correlation Attention first employs cross-attention to learn the mutual influence between emotions and intentions, which can also be referred to as the binary correlation between the two tasks. Specifically, we first adopt linear projection to map $f_e$ and $f_i$ to generate the corresponding $Q$, $K$, and $V$. Then, the attention mechanism is used to extract the correlation between $f_e$ and $f_i$:

$$f_{\gamma-\beta}=\mathrm{Attention}(f^{*}_{\gamma},f^{*}_{\beta},f^{*}_{\beta}), \tag{4}$$

where $\gamma,\beta\in\{e,i\}$; if $\gamma$ denotes $e$, then $\beta$ denotes $i$, and vice versa.

Triple Interaction Attention further explores the deep interactive information between the two tasks by integrating the binary correlation and task-specific information from each branch to obtain a more comprehensive and in-depth task interaction feature. Inspired by Jiang et al. (2023), we propose a triple interaction attention mechanism to compute cascaded interaction feature representations. It takes the outputs of binary correlation attention, along with $f_{e}$ and $f_{i}$, as inputs; the deep interactive information $f_{\gamma-\beta-\gamma}$ of the two tasks can be obtained by:

$$f_{\gamma-\beta-\gamma} = \mathrm{Attention}(f^{*}_{\gamma}, f^{*}_{\gamma-\beta}, f^{*}_{\gamma-\beta}) \tag{5}$$
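
As a rough illustration of Equations (4) and (5), the sketch below implements single-head scaled dot-product attention in plain Python and chains it twice for the emotion branch. It assumes that Attention denotes standard scaled dot-product attention, omits the learned linear projections, and uses made-up toy features; the names `attention`, `f_e_i`, and `f_e_i_e` are hypothetical and not taken from the released code.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    # Scaled dot-product attention; q, k, v are lists of equal-width vectors.
    d = len(k[0])
    out = []
    for q_vec in q:
        scores = [
            sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(d)
            for k_vec in k
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v_vec[j] for w, v_vec in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out

# Toy stand-ins for the projected features f*_e and f*_i (assumed values).
f_e = [[1.0, 0.0, 0.5], [0.2, 0.8, 0.1]]
f_i = [[0.3, 0.3, 0.9], [0.7, 0.1, 0.0]]

# Eq. (4): binary correlation, emotion queries attend to intent features.
f_e_i = attention(f_e, f_i, f_i)
# Eq. (5): triple interaction, emotion queries attend to the binary output.
f_e_i_e = attention(f_e, f_e_i, f_e_i)
```

The intent branch would follow by swapping the roles of `f_e` and `f_i`.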

Gate Regulator: It utilizes a gating mechanism to automatically learn the weights of the binary correlation features and ternary interaction features, to model the potential impact of the correlation between different emotion-intent pairs in Figure 1 on the interaction features. This allows us to adjust the contribution of the binary correlation to the final recognition of emotions or intents. Specifically, the gate regulator first adds $f_{\gamma-\beta-\gamma}$ and $f_{\gamma-\beta}$ together. Afterward, the Sigmoid function is applied to obtain a control gate value between them. Lastly, $f_{\gamma-\beta}$ is multiplied by the control gate value to effectively adjust the weight of the correlation information:

$$g^{*}_{\gamma} = f_{\gamma-\beta-\gamma} * \mathrm{sigmoid}(f_{\gamma-\beta-\gamma} + f_{\gamma-\beta}), \tag{6}$$

where $g^{*}_{\gamma}$ represents the final output of the Emotion-Intent Interaction Encoder. With the collaboration of these three components, our EI2 model captures deep-level interaction between emotion and intent.
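
The gating step in Equation (6) can be sketched in a few lines. This is a minimal illustration that follows the equation as printed and assumes the product is element-wise; the function name `gate_regulator` and the toy inputs are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_regulator(f_triple, f_binary):
    # Eq. (6): g* = f_triple * sigmoid(f_triple + f_binary), element-wise
    # (the element-wise reading of "*" is an assumption).
    return [
        [t * sigmoid(t + b) for t, b in zip(t_row, b_row)]
        for t_row, b_row in zip(f_triple, f_binary)
    ]

# Toy stand-ins for f_{e-i-e} and f_{e-i} (assumed values).
f_e_i_e = [[0.5, -1.0], [2.0, 0.0]]
f_e_i = [[-0.5, 1.0], [1.0, 0.0]]
g_e = gate_regulator(f_e_i_e, f_e_i)
```

When the two inputs cancel out, the gate equals 0.5 and the feature is halved; a large positive sum pushes the gate toward 1 and passes the feature almost unchanged.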

Emotion & Intent Classifiers: To preserve the specific information of the emotion and intent representations while incorporating the deep interactive information, we use a residual connection to combine $g^{*}$ with the emotion feature $f_{e}$ and the intent feature $f_{i}$ separately before making the final prediction. Finally, the Emotion and Intent Classifiers take the results of the residual connection to predict the final emotion and intent categories, respectively:

$$P_{e} = \mathrm{CLS}_{e}(g_{e} = g^{*}_{e} + f_{e}); \quad P_{i} = \mathrm{CLS}_{i}(g_{i} = g^{*}_{i} + f_{i}). \tag{7}$$
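
A small sketch of Equation (7) for the emotion branch follows. It assumes, purely for illustration, that the classifier is a single linear layer followed by an arg-max; the text does not specify the classifier architecture, and all weights and features below are made up.

```python
def residual(g_star, f):
    # g = g* + f, the element-wise residual connection from Eq. (7).
    return [a + b for a, b in zip(g_star, f)]

def linear_classifier(x, weights, bias):
    # One logit per class (weights[c] . x + bias[c]); returns the arg-max index.
    logits = [
        sum(w * xi for w, xi in zip(w_row, x)) + b
        for w_row, b in zip(weights, bias)
    ]
    return max(range(len(logits)), key=logits.__getitem__)

# Hypothetical 2-dim utterance features and a 3-class emotion classifier.
g_star_e = [0.5, -0.25]
f_e = [0.25, 1.5]
W_e = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b_e = [0.0, 0.0, 0.0]

g_e = residual(g_star_e, f_e)
p_e = linear_classifier(g_e, W_e, b_e)
```

The intent branch would use the same two steps with its own features and classifier.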

4.2 Training Strategy

We first pre-train the emotion and intent encoders to ensure the effective extraction of emotion and intent information. Given the presence of category imbalance in the dataset (shown in the Appendix), we employ the Focal Loss (FL) Lin et al. (2017) as the loss function $\mathcal{L}_{pre}$ in the pre-training phase. It constrains the predictions of emotion and intent to be close to the ground truths $\hat{P}_{e}$ and $\hat{P}_{i}$ and improves the model's ability to focus on categories with few samples:

$$P^{*}_{e} = \mathrm{CLS}^{*}_{e}(f^{*}_{e}); \quad P^{*}_{i} = \mathrm{CLS}^{*}_{i}(f^{*}_{i}) \tag{8}$$
$$\mathcal{L}_{pre} = FL(\hat{P}_{e}, P^{*}_{e}) + FL(\hat{P}_{i}, P^{*}_{i}), \tag{9}$$

where $\mathrm{CLS}^{*}_{e}$ and $\mathrm{CLS}^{*}_{i}$ represent the emotion and intent classifiers during the pre-training stage.
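
To make the pre-training objective in Equation (9) concrete, here is a minimal per-sample sketch of the focal loss of Lin et al. (2017) without the optional class-weighting term. The focusing parameter `gamma=2.0` is the default suggested in that paper and is only an assumption here, since the value used for EI2 is not stated in this section; the probabilities are invented.

```python
import math

def focal_loss(probs, target, gamma=2.0):
    # Focal loss for one sample: -(1 - p_t)^gamma * log(p_t),
    # where p_t is the predicted probability of the true class.
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def pretrain_loss(p_emo, y_emo, p_int, y_int, gamma=2.0):
    # Eq. (9): sum of the emotion and intent focal losses.
    return focal_loss(p_emo, y_emo, gamma) + focal_loss(p_int, y_int, gamma)

# A confidently correct sample versus a poorly predicted one (toy values).
easy = focal_loss([0.9, 0.05, 0.05], target=0)
hard = focal_loss([0.2, 0.5, 0.3], target=0)
```

With `gamma=0.0` the function reduces to the ordinary cross-entropy. Larger values shrink the loss of well-classified samples such as `easy`, so poorly predicted samples like `hard`, which are more common in rare categories, dominate the gradient.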

During the training phase of EI2, we initialize the emotion and intent encoders with pre-trained weights. These encoders are then further updated during the EI2 training. Finally, we adopt a joint training approach and again use the Focal Loss as the final loss function $\mathcal{L}_{total}$ for the MC-EIU task:

$$\mathcal{L}_{total} = FL(\hat{P}_{e}, P_{e}) + FL(\hat{P}_{i}, P_{i}) \tag{10}$$

5 Experiment and Analysis

5.1 Baseline

| Systems | English Emo | English Int | Mandarin Emo | Mandarin Int |
|---|---|---|---|---|
| bc-LSTM | 40.11 | 42.81 | 44.74 | 54.76 |
| MMIN | 40.94 | 43.98 | 46.78 | 56.95 |
| MISA | 41.46 | 37.64 | 48.51 | 56.26 |
| COGMEN | 41.70 | 42.23 | 49.01 | 55.80 |
| EmoInt-Trans | 40.40 | 44.46 | 50.47 | 58.45 |
| EI2 (Ours) | 42.09 | 45.53 | 55.08 | 61.63 |
| $\Delta_{sota}$ | 0.39 | 1.07 | 4.61 | 3.18 |
| w/o History | 41.50 | 45.31 | 54.42 | 60.76 |
| w/o Interaction | 41.44 | 45.25 | 53.77 | 61.03 |
| w/o Gating | 41.43 | 44.78 | 54.77 | 61.32 |
| w/o FL | 41.55 | 45.37 | 53.95 | 61.07 |
| w/o Pre-training | 40.48 | 44.99 | 53.57 | 59.62 |

Table 4: Comparison results of our EI2, along with all baseline systems and ablation systems, in terms of the WAF metric (%). The results in bold indicate the highest performance, while the underlined results represent the second highest performance. The row with $\Delta_{sota}$ means the improvement of EI2 compared to state-of-the-art systems.

To validate our MC-EIU dataset and the proposed EI2 network, we develop four MC-EIU systems based on state-of-the-art models and also include EmoInt-Trans as a baseline. bc-LSTM Poria et al. (2017b) is widely employed in conversational sentiment recognition tasks. It adopts a bi-directional LSTM and a multi-head attention mechanism, enabling the model to capture both contextual information and abundant semantic information within utterances. We add emotion and intent classifiers so that bc-LSTM supports multi-task prediction. MMIN Zhao et al. (2021) incorporates a cascade residual autoencoder network and cycle consistency constraints to learn a robust multimodal joint representation for multimodal emotion recognition. We extend MMIN with an intent classifier, enabling it to recognize emotion and intent simultaneously. MISA Hazarika et al. (2020) introduces a modality-invariant and modality-specific feature representation learning mechanism to achieve robust multimodal emotion recognition. We also add an intent classifier to MISA to support the MC-EIU task. COGMEN Joshi et al. (2022) is a multimodal sentiment analysis system based on a contextual Graph Neural Network (GNN) that predicts the sentiment of each speaker per utterance in a conversation. We add an intent classifier after the feature fusion layer to enable concurrent recognition of sentiment and intent information. EmoInt-Trans Singh et al. (2022) is the state-of-the-art baseline for the MC-EIU task. It adopts MISA Hazarika et al. (2020) as the backbone and uses adjacent sentences as contextual information.

We choose the Weighted Average F-score (WAF) Poria et al. (2019) as the evaluation metric. More implementation details are shown in the Appendix.
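
For readers unfamiliar with the metric, the sketch below computes a weighted average F-score in plain Python. It assumes that WAF is the per-class F1 averaged with weights proportional to class support, which is how a weighted F1 is commonly defined (for example by `f1_score(..., average="weighted")` in scikit-learn); the labels are invented for illustration.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    # Per-class F1, averaged with weights proportional to class support.
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is positive because n > 0.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score

# Hypothetical emotion labels for six utterances.
y_true = ["happy", "happy", "happy", "happy", "sad", "neutral"]
y_pred = ["happy", "happy", "happy", "sad", "sad", "happy"]
waf = weighted_f1(y_true, y_pred)
```

Because the weights follow class frequency, frequent categories dominate the score, which is worth keeping in mind for an imbalanced label set.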

5.2 Main Results

We first compare the recognition performance of our EI2 and the baselines in both English and Mandarin. As shown in Table 4, EI2 achieves the highest WAF for both emotion and intent recognition in both languages. For example, the WAF of emotion and intent for EI2 in English are 42.09% and 45.53%, respectively, and in Mandarin the scores are 55.08% for emotion and 61.63% for intent. The margin over the strongest baseline ranges from 0.39 points (English emotion) to 4.61 points (Mandarin emotion). These results demonstrate the effectiveness of our method. By learning from multimodal dialog history and employing soft parameter sharing to capture the interaction between emotion and intent, our model achieves the best joint recognition performance among the compared systems.

5.3 Ablation Study

We design three ablation experiments, namely module ablation, task ablation, and modality ablation, to further validate the components of the EI2 system.

Module Ablation: We conduct ablation experiments on the multimodal history encoder, the emotion-intent interaction encoder, the gate regulator, the loss function, and the pre-training strategy for the emotion and intent encoders: 1) w/o History: we remove the multimodal history encoder of EI2; 2) w/o Interaction: the features output by the emotion and intent encoders are directly used for final prediction; 3) w/o Gating: we remove the gate regulator; 4) w/o FL: we replace the Focal Loss referenced in Section 4.2 with the Cross-Entropy Loss; 5) w/o Pre-training: we randomly initialize the parameters of the Emotion and Intent Encoders during the EI2 training.

As shown in Table 4, EI2 outperforms all the ablation models in both languages. This supports the following assertions: 1) The inclusion of conversation history helps analyze the current speaker's emotion and intent states; 2) Modeling and integrating deep interactive information improves the joint understanding task, which demonstrates the significance of modeling the interaction between intent and emotion; 3) Removing the gating mechanism lowers the model's performance, highlighting the role of the gate regulator in adjusting the contribution of the binary correlation to the final recognition of emotion or intent; 4) Pre-training the emotion and intent encoders yields a notable performance improvement, suggesting that pre-training lets them learn substantial semantic information and distinct features for emotions and intents.

| Task Description | English Emo | English Int | Mandarin Emo | Mandarin Int |
|---|---|---|---|---|
| Single Task | 41.50 | 44.91 | 53.74 | 61.59 |
| Joint Task | 42.09 | 45.53 | 55.08 | 61.63 |

Table 5: Comparison results of single task and joint task, in terms of the WAF metric (%). (The bold numbers have the same meanings as in Table 4.)

Task Ablation: To further investigate the effectiveness of the joint recognition task, we conduct independent emotion and intent recognition tasks on the MC-EIU dataset. The results of these separate tasks are presented in Table 5. The joint task outperforms the single tasks in all four settings. This further substantiates the strong correlation between intent and emotion and shows that the joint task can improve prediction accuracy through the deep interaction between emotion and intent.

Modality Ablation: We conduct modality ablation experiments to verify the impact of different modal data, as depicted in Table 6. In the unimodal experiments, we find that the textual modality exhibits the highest overall performance, while the visual modality demonstrates the poorest performance. This aligns with previous research Fu et al. (2022), which concludes that the textual modality contains the most comprehensive semantic information, while the visual modality is comparatively limited in this regard. Additionally, the model’s performance improves as the number of modalities fused increases, with the highest performance achieved when all three modalities are combined. This highlights the effectiveness of integrating complementary information from different modalities.

| Task Description | Modality (T A V) | English Emo | English Int | Mandarin Emo | Mandarin Int |
|---|---|---|---|---|---|
| Unimodality | - - | 38.34 | 44.27 | 49.15 | 58.49 |
| | - - | 37.76 | 30.66 | 46.98 | 45.29 |
| | - - | 36.66 | 26.17 | 45.70 | 41.57 |
| Bimodality | - | 40.50 | 45.35 | 51.68 | 59.90 |
| | - | 39.64 | 30.46 | 49.98 | 45.42 |
| | - | 40.20 | 44.42 | 51.59 | 60.69 |
| Trimodality | | 42.09 | 45.53 | 55.08 | 61.63 |

Table 6: Modality ablation results of the EI2 system, in terms of WAF metric (%). (The bold numbers have the same meanings as in Table 4.)

6 Conclusion

This work introduced a novel dataset called Multimodal Conversational Emotion and Intent Dataset (MC-EIU), which possesses several key properties: annotation diversity (includes emotion and intent labels), modality diversity (encompasses textual, acoustic, and visual data), language diversity (comprises English and Mandarin data), and accessibility. Furthermore, we proposed an Emotion and Intent Interaction (EI2) Network for the MC-EIU task, which effectively captures the conversational history and the complex interaction between emotion and intent. Extensive experiments conducted on our dataset demonstrate the effectiveness of our EI2, showcasing its superior performance. Our work is dedicated to advancing the field of affective computing.

Large language models Touvron et al. (2023) model long-range dependencies in conversations effectively, a characteristic that plays a crucial role in understanding the intricate interaction between emotions and intentions, both within individuals and between speakers Deng et al. (2023). This suggests a promising avenue for future research. Furthermore, previous research has established that emotion stimulus and inertia are significant factors that influence the emotional state of speakers in dialogs Zhao et al. (2022). However, it remains an open question whether they impact the correlation between emotion and intent.

References

  • Adigwe et al. [2018] Adaeze Adigwe, Noé Tits, Kevin El Haddad, Sarah Ostadabbas, and Thierry Dutoit. The emotional voices database: Towards controlling the emotion dimension in voice generation systems. arXiv preprint arXiv:1806.09514, 2018.
  • Busso et al. [2008] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42:335–359, 2008.
  • Chen et al. [2018] Sheng-Yeh Chen, Chao-Chun Hsu, Chuan-Chun Kuo, Lun-Wei Ku, et al. Emotionlines: An emotion corpus of multi-party conversations. arXiv preprint arXiv:1802.08379, 2018.
  • Cowie et al. [2001] Roddy Cowie, Ellen Douglas-Cowie, Nicolas Tsapatsoulis, George Votsis, Stefanos Kollias, Winfried Fellenz, and John G Taylor. Emotion recognition in human-computer interaction. IEEE Signal processing magazine, 18(1):32–80, 2001.
  • Danieli et al. [2015] Morena Danieli, Giuseppe Riccardi, and Firoj Alam. Emotion unfolding and affective scenes: A case study in spoken conversations. In Proceedings of the International Workshop on Emotion Representations and Modelling for Companion Technologies, pages 5–11, 2015.
  • Deng et al. [2023] Yayue Deng, Jinlong Xue, Fengping Wang, Yingming Gao, and Ya Li. Cmcu-css: Enhancing naturalness via commonsense-based multi-modal context understanding in conversational speech synthesis. In Proceedings of the 31st ACM International Conference on Multimedia, pages 6081–6089, 2023.
  • Fernández et al. [2018] Alberto Fernández, Salvador Garcia, Francisco Herrera, and Nitesh V Chawla. Smote for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of artificial intelligence research, 61:863–905, 2018.
  • Firdaus et al. [2020] Mauajama Firdaus, Hardik Chauhan, Asif Ekbal, and Pushpak Bhattacharyya. Meisd: A multimodal multi-label emotion, intensity and sentiment dialogue dataset for emotion recognition and sentiment analysis in conversations. In Proceedings of the 28th international conference on computational linguistics, pages 4441–4453, 2020.
  • Fleiss et al. [2013] Joseph L Fleiss, Bruce Levin, and Myunghee Cho Paik. Statistical methods for rates and proportions. john wiley & sons, 2013.
  • Fu et al. [2022] Ziwang Fu, Feng Liu, Qing Xu, Jiayin Qi, Xiangling Fu, Aimin Zhou, and Zhibin Li. Nhfnet: a non-homogeneous fusion network for multimodal sentiment analysis. In 2022 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2022.
  • Hazarika et al. [2020] Devamanyu Hazarika, Roger Zimmermann, and Soujanya Poria. Misa: Modality-invariant and-specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM international conference on multimedia, pages 1122–1131, 2020.
  • He and Garcia [2009] Haibo He and Edwardo A Garcia. Learning from imbalanced data. IEEE Transactions on knowledge and data engineering, 21(9):1263–1284, 2009.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Herzig et al. [2016] Jonathan Herzig, Guy Feigenblat, Michal Shmueli-Scheuer, David Konopnicki, Anat Rafaeli, Daniel Altman, and David Spivak. Classifying emotions in customer support dialogues in social media. In Proceedings of the 17th annual meeting of the special interest group on discourse and dialogue, pages 64–73, 2016.
  • Japkowicz and Stephen [2002] Nathalie Japkowicz and Shaju Stephen. The class imbalance problem: A systematic study. Intelligent data analysis, 6(5):429–449, 2002.
  • Jiang et al. [2023] Liting Jiang, Di Wu, Bohui Mao, Yanbing Li, and Wushour Slamu. Empathy intent drives empathy detection. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6279–6290, 2023.
  • Johnson and Khoshgoftaar [2019] Justin M Johnson and Taghi M Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data, 6(1):1–54, 2019.
  • Joshi et al. [2022] Abhinav Joshi, Ashwani Bhat, Ayush Jain, Atin Singh, and Ashutosh Modi. Cogmen: Contextualized gnn based multimodal emotion recognition. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4148–4164, 2022.
  • Kim [2014] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, 2014.
  • Kingma and Ba [2015] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, 2015.
  • Lee et al. [2023] Keon Lee, Kyumin Park, and Daeyoung Kim. Dailytalk: Spoken dialogue dataset for conversational text-to-speech. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
  • Li et al. [2023] Xin Li, Tao Ma, Yuenan Hou, Botian Shi, Yuchen Yang, Youquan Liu, Xingjiao Wu, Qin Chen, Yikang Li, Yu Qiao, et al. Logonet: Towards accurate 3d object detection with local-to-global cross-modal fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17524–17534, 2023.
  • Li et al. [2017] Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. Dailydialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, 2017.
  • Lian et al. [2023] Zheng Lian, Haiyang Sun, Licai Sun, Kang Chen, Mingyu Xu, Kexin Wang, Ke Xu, Yu He, Ying Li, Jinming Zhao, et al. Mer 2023: Multi-label learning, modality robustness, and semi-supervised learning. In Proceedings of the 31st ACM International Conference on Multimedia, pages 9610–9614, 2023.
  • Lin et al. [2017] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • Liu et al. [2023] Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, and Haizhou Li. Emotion rendering for conversational speech synthesis with heterogeneous graph-based context modeling. arXiv preprint arXiv:2312.11947, 2023.
  • Maharana et al. [2022] Adyasha Maharana, Quan Hung Tran, Franck Dernoncourt, Seunghyun Yoon, Trung Bui, Walter Chang, and Mohit Bansal. Multimodal intent discovery from livestream videos. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 476–489, 2022.
  • Pahari and Shimada [2022] Niraj Pahari and Kazutaka Shimada. Multi-task learning using bert with soft parameter sharing between layers. In 2022 Joint 12th International Conference on Soft Computing and Intelligent Systems and 23rd International Symposium on Advanced Intelligent Systems (SCIS&ISIS), pages 1–6. IEEE, 2022.
  • Poria et al. [2017a] Soujanya Poria, Erik Cambria, Rajiv Bajpai, and Amir Hussain. A review of affective computing: From unimodal analysis to multimodal fusion. Information fusion, 37:98–125, 2017a.
  • Poria et al. [2017b] Soujanya Poria, Erik Cambria, Devamanyu Hazarika, Navonil Majumder, Amir Zadeh, and Louis-Philippe Morency. Context-dependent sentiment analysis in user-generated videos. In Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 873–883, 2017b.
  • Poria et al. [2019] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 527–536, 2019.
  • Rashkin et al. [2019] Hannah Rashkin, Eric Michael Smith, Margaret Li, and Y-Lan Boureau. Towards empathetic open-domain conversation models: A new benchmark and dataset. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5370–5381, 2019.
  • Ringeval et al. [2018] Fabien Ringeval, Björn Schuller, Michel Valstar, Roddy Cowie, Heysem Kaya, Maximilian Schmitt, Shahin Amiriparian, Nicholas Cummins, Denis Lalanne, Adrien Michaud, et al. Avec 2018 workshop and challenge: Bipolar disorder and cross-cultural affect recognition. In Proceedings of the 2018 on audio/visual emotion challenge and workshop, pages 3–13, 2018.
  • Sak et al. [2014] Haşim Sak, Andrew Senior, and Françoise Beaufays. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition. arXiv preprint arXiv:1402.1128, 2014.
  • Schneider et al. [2019] Steffen Schneider, Alexei Baevski, Ronan Collobert, and Michael Auli. wav2vec: Unsupervised pre-training for speech recognition. Interspeech 2019, 2019.
  • Singh et al. [2022] Gopendra Vikram Singh, Mauajama Firdaus, Asif Ekbal, and Pushpak Bhattacharyya. Emoint-trans: A multimodal transformer for identifying emotions and intents in social conversations. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:290–300, 2022.
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Welivita and Pu [2020] Anuradha Welivita and Pearl Pu. A taxonomy of empathetic response intents in human social conversations. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4886–4899, 2020.
  • Welivita et al. [2020] Anuradha Welivita, Yubo Xie, and Pearl Pu. Fine-grained emotion and intent learning in movie dialogues. arXiv preprint arXiv:2012.13624, 2020.
  • Wu et al. [2020] Neo Wu, Bradley Green, Xue Ben, and Shawn O’Banion. Deep transformer models for time series forecasting: The influenza prevalence case. arXiv preprint arXiv:2001.08317, 2020.
  • Yang and Hospedales [2017] Yongxin Yang and Timothy Hospedales. Trace norm regularised deep multi-task learning. In 5th International Conference on Learning Representations, 2017.
  • Yu et al. [2020] Wenmeng Yu, Hua Xu, Fanyang Meng, Yilin Zhu, Yixiao Ma, Jiele Wu, Jiyun Zou, and Kaicheng Yang. Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 3718–3727, 2020.
  • Zhang et al. [2022a] Hanlei Zhang, Hua Xu, Xin Wang, Qianrui Zhou, Shaojie Zhao, and Jiayan Teng. Mintrec: A new dataset for multimodal intent recognition. In Proceedings of the 30th ACM International Conference on Multimedia, pages 1688–1697, 2022a.
  • Zhang et al. [2022b] Lijun Zhang, Qizheng Yang, Xiao Liu, and Hui Guan. Rethinking hard-parameter sharing in multi-domain learning. In 2022 IEEE International Conference on Multimedia and Expo (ICME), pages 01–06. IEEE, 2022b.
  • Zhao et al. [2021] Jinming Zhao, Ruichen Li, and Qin Jin. Missing modality imagination network for emotion recognition with uncertain missing modalities. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2608–2618, 2021.
  • Zhao et al. [2022] Jinming Zhao, Tenggan Zhang, Jingwen Hu, Yuchen Liu, Qin Jin, Xinchao Wang, and Haizhou Li. M3ed: Multi-modal multi-scene multi-label emotional dialogue database. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5699–5710, 2022.
  • Zuo et al. [2023] Haolin Zuo, Rui Liu, Jinming Zhao, Guanglai Gao, and Haizhou Li. Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.

Appendix A Datasheets for datasets

A.1 Motivation

Previous works confirm a strong interaction between emotion and intent Welivita and Pu [2020], Welivita et al. [2020], Singh et al. [2022], Deng et al. [2023]. For instance, Singh et al. [2022] emphasize that particular intents can influence the emotions of the speaker in dialogues. However, the existing datasets for emotion and intent joint understanding, such as Twitter Customer Support Maharana et al. [2022], ED Welivita and Pu [2020], and OSED Welivita et al. [2020], only consist of English textual data, which cannot fulfill the requirements for multimodal and multilingual research. Furthermore, while Singh et al. [2022] proposed a multimodal dialogue-based dataset for emotion and intent joint understanding in multimodal conversation, namely EmoInt-MD, it also comprises only English data. Moreover, the open-source link provided for EmoInt-MD is currently unavailable. As a result, there is no available dataset for emotion and intent joint understanding in multimodal dialogue scenarios. Without task-specific datasets, the potential of multimodal emotion and intent joint understanding cannot be fully explored, nor can the understanding of complex human affect be deepened.

We fill the gap by constructing a large-scale benchmark MC-EIU dataset, which features 7 emotion categories, 9 intent categories, 3 modalities, i.e., textual, acoustic, and visual content, and two languages, i.e., English and Mandarin. Furthermore, it is completely open-source for free access. We aim for our work to facilitate a deeper exploration of the emotion and intent joint understanding, thereby advancing the field of affective computing.

A.2 Composition

The MC-EIU dataset comprises 56,012 utterances, including 4,013 conversations in English and 957 conversations in Mandarin, with a total duration of 53.06 hours. Each utterance is annotated with emotion, intent, and speaker labels, and contains textual, visual, and acoustic information. All utterances are recorded in a .CSV file, with each utterance uniquely identified by a Dia_No and an Utt_No. Corresponding video and audio clips for each utterance are stored in .MP4 and .WAV files, respectively. These files follow a specific naming convention: "dia_{Dia_No}_utt_{Utt_No}.mp4" or "dia_{Dia_No}_utt_{Utt_No}.wav". The samples of audio and video files in MC-EIU are shown in Figure 3. We partitioned the MC-EIU dataset into training, validation, and test subsets with proportions of 70%, 10%, and 20%, respectively. In dividing the dataset, we adhered to the following principles: 1) Random assignment of dialogues. 2) Ensuring that all utterances within the same dialogue are assigned to the same subset. 3) Striving to maintain a consistent distribution of labels across each subset. The statistics of the MC-EIU dataset are presented in Table 7.
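
The file naming convention and the dialogue-level partition described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `clip_name` and `split_dialogues`, the seed, and the absence of zero-padding in file names are hypothetical, and the sketch covers principles 1) and 2) but does not attempt the label-distribution balancing of principle 3).

```python
import random

def clip_name(dia_no, utt_no, ext):
    # File name pattern quoted in the text, e.g. dia_3_utt_0.mp4.
    return f"dia_{dia_no}_utt_{utt_no}.{ext}"

def split_dialogues(dia_nos, seed=0, train=0.7, val=0.1):
    # Shuffle dialogue IDs, then cut them 70/10/20 so that every utterance
    # of a dialogue ends up in the same subset.
    ids = sorted(set(dia_nos))
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train)
    n_val = round(len(ids) * val)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

# 100 hypothetical dialogue IDs.
subsets = split_dialogues(range(100))
```

An utterance is then assigned to whichever subset contains its Dia_No, which rules out leakage of a conversation across the training and test sets.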

Figure 3: The samples of audio and video files in the MC-EIU dataset: (a) samples of English audio; (b) samples of English video. All files are named according to a consistent format.
| Statistics | English Train | English Validation | English Test | Mandarin Train | Mandarin Validation | Mandarin Test |
| --- | --- | --- | --- | --- | --- | --- |
| Modalities | (t, a, v) | (t, a, v) | (t, a, v) | (t, a, v) | (t, a, v) | (t, a, v) |
| # Conversations | 2,807 | 400 | 806 | 667 | 95 | 195 |
| # Utterances | 31,451 | 4,509 | 9,049 | 7,643 | 1,148 | 2,212 |
| Duration (hours) | 28.51 | 4.02 | 8.22 | 8.51 | 1.36 | 2.42 |
| Avg. Words per Utterance | 12.68 | 12.49 | 12.76 | 19.11 | 19.91 | 18.14 |
| Avg. Duration per Utterance (seconds) | 3.26 | 3.21 | 3.27 | 4.01 | 4.26 | 3.94 |
| Avg. # of Utterances per Conversation | 11.20 | 11.27 | 11.23 | 11.46 | 12.08 | 11.34 |
| Avg. # of Emotions per Conversation | 2.58 | 2.57 | 2.60 | 2.41 | 2.54 | 2.42 |
| Avg. # of Intents per Conversation | 3.29 | 3.86 | 3.87 | 3.18 | 3.24 | 3.10 |

Table 7: Statistics of the MC-EIU dataset. t, a, and v stand for text, acoustic, and visual, respectively.

A.3 Collection Process and License

Collection Process: We choose 3 famous English TV series and 4 Mandarin TV series as our domain. These TV series consist of conversations with utterances in the form of text, video, and audio, and cover different genres (i.e., family, romance, comedy, and lifestyle). The English TV series total 716 episodes and the Mandarin TV series total 119 episodes. Table 8 presents the details of the TV series. In terms of data sources, our resources are more than 1.5 times larger than those of M3ED Zhao et al. [2022] and nearly 4 times larger than those of MELD Poria et al. [2019], and they cover the majority of emotional dialogue scenarios in real life. To avoid including offensive, discriminatory, or otherwise unethical data in our dataset, we exclude TV series belonging to genres such as horror, thriller, war, and crime. Furthermore, we promptly removed from the dataset any unethical clips found in the selected TV series.

License terms: To avoid copyright disputes, we ensure that our resources are sourced from publicly accessible platforms. The MC-EIU dataset is strictly available for research purposes only. We have adopted an appropriate license, CC BY-NC 4.0, which clearly outlines the proper and responsible usage of the MC-EIU dataset. It will help guide users of the MC-EIU dataset in making informed decisions about how the dataset can and cannot be used (https://github.com/MC-EIU/MC-EIU).

| Language | TV series | # Sea | # Epi | Genres |
| --- | --- | --- | --- | --- |
| Mandarin | "大江大河" (The Great River) | 1 | 39 | Lifestyle |
| Mandarin | "父母爱情" (Parental Love) | 1 | 44 | Family, Romance |
| Mandarin | "曾少年之小时候" (The Childhood of Once a Youth) | 1 | 24 | Family, Romance |
| Mandarin | "回家的女儿" (The Returning Daughter) | 1 | 12 | Family |
| English | Friends | 10 | 235 | Romance, Comedy |
| English | The Big Bang Theory | 10 | 231 | Comedy |
| English | Modern Family | 11 | 250 | Family, Romance |

Table 8: The details of the data collection for the MC-EIU dataset. ‘Sea’ and ‘Epi’ mean the Season and Episode, respectively.

A.4 Preprocessing/Cleaning/Labeling

Please see Section 3.2 of the main paper.

A.5 Uses and Distribution

We state that the MC-EIU dataset is suitable for multimodal emotion and intent joint understanding tasks, including emotion recognition, intent recognition, and emotion and intent joint recognition in multimodal conversations. The copyright of the MC-EIU dataset belongs to the S2 Lab of the College of Computer Science of Inner Mongolia University. To ensure standardized experimentation and evaluation, we currently release only the feature files for all the data, along with a .CSV file containing the text information and annotations. Please refer to Appendix H.4 for details on the feature extraction methodology. Following paper acceptance, the dataset will be made available at https://github.com/MC-EIU/MC-EIU.

A.6 Maintenance

Regarding dataset updates, our laboratory will assign dedicated personnel to inspect and maintain the MC-EIU dataset every six months. This includes correcting labeling errors, adding new instances, and deleting outdated instances. Meanwhile, we welcome other research teams to use this dataset.

A.7 Uses

  • Has the dataset been used for any tasks already? If so, please provide a description.

    It is proposed to be used for the multimodal emotion and intent joint understanding task.

  • Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.

    It is a new dataset. We develop an Emotion and Intent Interaction (EI2) framework as a reference system and release the code at Github.com.

  • What (other) tasks could the dataset be used for?

    The MC-EIU dataset can also be used for various other tasks such as multimodal emotion recognition, multimodal intent recognition, and dialogue understanding.

  • Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a future user might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other undesirable harms (e.g., financial harms, legal risks)? If so, please provide a description. Is there anything a future user could do to mitigate these undesirable harms?

    N/A

  • Are there tasks for which the dataset should not be used? If so, please provide a description.

    N/A

Appendix B Dataset nutrition labels

| Name | Description |
| --- | --- |
| Subtitle | The textual content of the utterance |
| Dia_No | The dialogue number to which the current utterance belongs |
| Utt_No | The position of the utterance within that dialogue |
| Video_name | The TV series from which the utterance is taken |
| Season | The season from which the utterance is taken |
| Episode | The episode from which the utterance is taken |
| Begin_timestamp | The starting time of the utterance in the TV series |
| End_timestamp | The ending time of the utterance in the TV series |
| Emotion | Emotion label (happy, surprise, sad, disgust, anger, fear, neutral) |
| Intent | Intent label (questioning, agreeing, acknowledging, sympathizing, encouraging, consoling, suggesting, wishing, neutral) |
| Speaker | Speaker label (0, 1) |
Table 9: Dataset Nutrition Labels

Table 9 shows the different modules of the MC-EIU dataset nutrition label and their corresponding description.
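The label vocabularies listed in Table 9 lend themselves to a simple validation step when reading the annotation file. The sketch below is illustrative only: the `Utterance` class and its lowercase field names are hypothetical and are not part of the released dataset.

```python
from dataclasses import dataclass

# Label sets copied from Table 9.
EMOTIONS = {"happy", "surprise", "sad", "disgust", "anger", "fear", "neutral"}
INTENTS = {
    "questioning", "agreeing", "acknowledging", "sympathizing",
    "encouraging", "consoling", "suggesting", "wishing", "neutral",
}
SPEAKERS = {0, 1}


@dataclass(frozen=True)
class Utterance:
    """One annotated utterance (hypothetical container, not an official API)."""
    subtitle: str
    dia_no: int
    utt_no: int
    emotion: str
    intent: str
    speaker: int

    def __post_init__(self):
        # Reject any label outside the vocabularies of Table 9.
        if self.emotion not in EMOTIONS:
            raise ValueError(f"unknown emotion label: {self.emotion!r}")
        if self.intent not in INTENTS:
            raise ValueError(f"unknown intent label: {self.intent!r}")
        if self.speaker not in SPEAKERS:
            raise ValueError(f"unknown speaker label: {self.speaker!r}")
```

Checking labels at load time makes misspelled or out-of-vocabulary annotations fail early instead of silently creating extra classes.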

Appendix C Data Statements for Natural Language Processing

We state the MC-EIU dataset from the following aspects:

  • The full name of the MC-EIU dataset is the Multimodal Conversational Emotion and Intent Joint Understanding dataset. This dataset addresses the requirements of annotation diversity, modality diversity, language diversity, and accessibility, making it well-suited for the task of joint emotion and intent recognition. The copyright belongs to the S2 Lab of the College of Computer Science of Inner Mongolia University.

  • The dataset is mainly composed of three parts of files (namely .CSV, .MP4, .WAV files), which record text, video, and audio information respectively. To ensure standardized experimentation and evaluation, we currently release only the feature files for all the data, along with a .CSV file containing the text information and annotations.

  • The data were mainly collected by manual crawling and then processed with a simple Python script that we designed to segment all the videos into clips. Subsequently, a manual screening and annotation process was conducted to identify and label the relevant data. Please refer to Section 3 in the main paper for details.

  • The MC-EIU dataset follows the usage rules and restrictions described in its license; commercial use, reorganization, and conversion are not allowed. If any organization or individual wants to expand the dataset, permission must be obtained from the copyright owner.

Appendix D Data Accessibility

Appendix E Accountability frameworks

E.1 Data collection and processing specifications

We declare that all of our data are collected from publicly accessible links and that we follow the rules below in data processing.

  • Data collection and processing should comply with relevant regulations and ethical requirements, and use appropriate technical tools and methods to ensure data quality and integrity.

  • Data collection should clarify key information such as data type, source, time, and location.

  • Data processing should establish a clear data cleaning, conversion, and integration process, and carry out data verification and deduplication.

  • Data sampling and sample selection should fully consider the impact of research design and sampling error, and carry out statistical inference and reliability analysis.

E.2 Dataset usage and evaluation mechanisms

Dataset usage and evaluation mechanisms should adhere to the following principles:

  • The use of data sets should follow relevant laws and regulations, respect data privacy and intellectual property rights, and prevent abuse and discriminatory results.

  • Dataset results should be interpreted and applied within a reasonable margin of error, with a full explanation of their limitations and applicability of inferences.

  • Dataset users should assume the corresponding responsibilities and obligations, including data protection, fair use, and social responsibility.

  • The process of using data sets should be regularly audited and evaluated in order to detect and correct problems in a timely manner and improve the value and credibility of datasets.

Appendix F Author Statement

On behalf of all the authors, we hereby state that we will assume full responsibility for any violations of rights or issues related to data licensing.

Appendix G Hosting, Licensing, and Maintenance Plan for MC-EIU Dataset

Our MC-EIU dataset is a groundbreaking multimodal dataset designed for joint understanding of conversational emotions and intents, which we have created and publicly released. To ensure its accessibility and long-term availability, we have developed a comprehensive plan for hosting, licensing, and maintenance.

Hosting: The MC-EIU dataset will be hosted on a reliable and secure server infrastructure, i.e., Github.com and Baidu Drive (refer to Section D). We will ensure fast and uninterrupted access to the dataset for researchers, developers, and interested parties.

Licensing: We have attached a license (CC BY-NC 4.0) to the MC-EIU dataset to clearly describe how to use the dataset properly and responsibly (refer to Section A.3).

Maintenance: We are committed to the continuous maintenance and improvement of the MC-EIU dataset. Regular updates will be provided to address any identified issues and ensure data quality (refer to Section A.6).

Appendix H Data Annotation

H.1 Data Collection

Please refer to Section A.3.

H.2 Annotation Guidelines and Annotation Website

We presented the relevant details of data annotation in Section 3.2 of the main paper, including the annotation scheme and annotation process. To facilitate the annotation process for the annotators, we provided a unified data annotation platform, as illustrated in Figure 4.

Figure 4: Layout of the annotation platform.

H.3 Data Format

We select some samples from the English and Mandarin datasets of MC-EIU respectively to demonstrate the data format, as shown in Table 10.

| Subtitle | Dia_No | Utt_No | Video_name | #Sea | #Epi | Begin_timestamp | End_timestamp | Emotion | Intent | Speaker |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Turns out this sweater is made for a woman. | 34 | 0 | Friends | 10 | 9 | 00:24:09,900 | 00:24:12,530 | neutral | neutral | 0 |
| So why are you still wearing it? | 34 | 1 | Friends | 10 | 9 | 00:24:14,910 | 00:24:16,910 | neutral | questioning | 1 |
| 怎么又是这个呀? | 519 | 0 | Parental Love | - | 4 | 00:34:49,034 | 00:34:50,894 | sad | questioning | 0 |
| 怎么了?这不好啊。 | 519 | 1 | Parental Love | - | 4 | 00:34:51,154 | 00:34:52,854 | neutral | questioning | 1 |

Table 10: The dataset format of MC-EIU. Dia_No represents the dialogue number to which the current sentence belongs, while Utt_No indicates the sentence’s position within that dialogue. Video_name, #Sea, and #Epi specify the TV series, season, and episode from which the sentence is taken. Begin_timestamp and End_timestamp are used to pinpoint the exact location of the sentence within the episode.

In the MC-EIU dataset, each utterance is uniquely identified by a Dia_No and an Utt_No. All of the utterances are stored in a .CSV file. In addition, the corresponding video and audio clips for each utterance are stored as .MP4 and .WAV files, respectively. These files follow a specific naming format: "dia_{Dia_No}_utt_{Utt_No}.mp4" or "dia_{Dia_No}_utt_{Utt_No}.wav". The samples of audio and video files in MC-EIU are shown in Figure 3.
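As a small illustration of the format, the sketch below derives the clip file names from Dia_No and Utt_No and computes a clip's duration from its timestamps. The helper names are hypothetical, and the timestamp layout "HH:MM:SS,mmm" is inferred from the samples in Table 10.

```python
from datetime import timedelta


def parse_timestamp(ts):
    """Parse 'HH:MM:SS,mmm' (the layout seen in Table 10) into a timedelta."""
    clock, millis = ts.split(",")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return timedelta(
        hours=hours, minutes=minutes, seconds=seconds, milliseconds=int(millis)
    )


def clip_duration_seconds(begin, end):
    """Duration of one utterance clip in seconds."""
    return (parse_timestamp(end) - parse_timestamp(begin)).total_seconds()


def clip_names(dia_no, utt_no):
    """Return the (video, audio) file names for one utterance."""
    stem = f"dia_{dia_no}_utt_{utt_no}"
    return f"{stem}.mp4", f"{stem}.wav"
```

For the first row of Table 10, `clip_names(34, 0)` yields the names of the .mp4 and .wav clips, and `clip_duration_seconds("00:24:09,900", "00:24:12,530")` gives the length of that utterance.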

H.4 Feature Extraction

We utilize several state-of-the-art pre-trained models to extract the raw features from our datasets.

Figure 5: The structure of the feature set.

Textual Features: To extract word-level textual features in English and Chinese, we employ separate RoBERTa Yu et al. [2020] models (English: https://huggingface.co/roberta-base; Chinese: https://huggingface.co/hfl/chinese-roberta-wwm-ext) that have been pre-trained on data from each respective language. The embedding size of the textual features for both languages is 768 dimensions.

Acoustic Features: We extract frame-level acoustic features using the Wav2Vec Schneider et al. [2019] model (https://github.com/pytorch/fairseq/tree/master/examples/wav2vec) pre-trained on large-scale Chinese and English audio data. The embedding size of the audio features is 512 dimensions.

Visual Features: We employ the OpenCV tool to extract scene pictures from each video clip, capturing frames at a 10-frame interval. Subsequently, we utilize the ResNet-50 He et al. [2016] model (https://huggingface.co/microsoft/resnet-50) to generate frame-level features for the extracted scene pictures in the video clips. The embedding size of the video features is 342 dimensions.

All the features are named in the format "dia_{Dia_No}_utt_{Utt_No}.npy". To differentiate between the modality features, we create a separate folder for each modality, named after the corresponding pre-trained model, and store all the features within their respective modality folders. For instance, if we extracted text features using the RoBERTa model, the folder corresponding to the text modality would be named "RoBERTa", and all the text features would be stored in that folder. The structure of the feature set is illustrated in Figure 5. To ensure standardized experimentation and evaluation, we currently release only the feature files for all the data, along with a .CSV file containing the text information and annotations.
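The folder layout described above can be sketched as a small path helper. Only the "RoBERTa" folder name is stated in the text; the folder names used here for the audio and visual modalities, and the `feature_path` helper itself, are assumptions for illustration.

```python
from pathlib import PurePosixPath

# "RoBERTa" comes from the text; "Wav2Vec" and "ResNet50" are assumed names.
MODALITY_FOLDERS = {
    "text": "RoBERTa",
    "audio": "Wav2Vec",
    "visual": "ResNet50",
}


def feature_path(root, modality, dia_no, utt_no):
    """Build the path of one utterance's feature file for a given modality."""
    folder = MODALITY_FOLDERS[modality]
    return PurePosixPath(root) / folder / f"dia_{dia_no}_utt_{utt_no}.npy"
```

A path built this way could then be read with `numpy.load`, which is the usual way to open .npy files.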

H.5 Category Distribution

Figure 6: The data distribution of the MC-EIU dataset: (a) emotion distribution of Mandarin (left) and English (right); (b) intent distribution of Mandarin (left) and English (right).

We show the distribution of emotion and intent categories for the MC-EIU dataset in Figure 6. Similar to M3ED, MELD, and IEMOCAP, our dataset also exhibits category imbalance. In terms of emotion labels, the top three categories in our dataset are neutral, happy, and anger, while disgust and fear are relatively less represented, similar to the distribution in MELD. Regarding intent labels, the most abundant categories are neutral, questioning, and suggesting, while sympathizing is the least represented, aligning with the OSED dataset. We present a detailed distribution of the data across the train, validation, and test sets in Table 11.

The presence of category imbalance in our dataset can be attributed to the following reasons:

  • Deliberate maintenance of category imbalance: Although researchers tend to use balanced datasets in experiments to explore model performance, category imbalance is a more common and challenging scenario in the real world Japkowicz and Stephen [2002], He and Garcia [2009], Fernández et al. [2018], Johnson and Khoshgoftaar [2019]. For instance, in work environments, individuals may conceal their emotions during interactions. In daily life, a stable and harmonious social environment promotes the expression of positive emotions (such as happiness and joy) while reducing the expression of negative emotions (such as anger and sadness). Therefore, intentionally maintaining this imbalance helps better simulate and address real-world problems, enhancing the applicability and generalization of the models.

  • Bias During Data Collection: As described in Section A.3, we deliberately excluded TV shows such as thrillers and crime dramas to avoid aggressive or unethical content. However, in genres like family dramas, comedies, and soap operas, conflicts and contradictions among characters are less common. Consequently, it becomes challenging to obtain data related to certain negative emotions, such as disgust and fear. Additionally, the categories of intent can be influenced by the emotion categories. For instance, when one party in a conversation displays fear, the other party typically expresses sympathy and encouragement. If there is a relative scarcity of fear-related data, there will also be a corresponding scarcity of data related to sympathy and encouragement. This bias in data collection contributes to the category imbalance observed in the dataset.

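| Annotation | Category | English Train | English Validation | English Test | Mandarin Train | Mandarin Validation | Mandarin Test |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Emotion | happy | 8810 | 1286 | 2526 | 1447 | 189 | 400 |
| Emotion | surprise | 1242 | 158 | 332 | 89 | 21 | 27 |
| Emotion | sad | 1739 | 251 | 660 | 243 | 31 | 87 |
| Emotion | disgust | 607 | 108 | 129 | 116 | 10 | 28 |
| Emotion | anger | 3146 | 418 | 912 | 1608 | 242 | 453 |
| Emotion | fear | 869 | 114 | 273 | 184 | 20 | 18 |
| Emotion | neutral | 15038 | 2174 | 4217 | 3956 | 635 | 1199 |
| Intent | questioning | 7133 | 963 | 2035 | 2804 | 385 | 844 |
| Intent | agreeing | 2102 | 293 | 603 | 199 | 37 | 54 |
| Intent | acknowledging | 410 | 50 | 124 | 43 | 10 | 15 |
| Intent | sympathizing | 188 | 28 | 49 | 5 | 1 | 1 |
| Intent | encouraging | 1227 | 177 | 388 | 85 | 23 | 22 |
| Intent | consoling | 2841 | 383 | 785 | 535 | 78 | 147 |
| Intent | suggesting | 3270 | 516 | 962 | 519 | 82 | 128 |
| Intent | wishing | 1881 | 297 | 536 | 236 | 25 | 60 |
| Intent | neutral | 12399 | 1802 | 3567 | 3217 | 507 | 941 |

Table 11: The statistics of the data distribution across the train, validation, and test sets.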
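The degree of imbalance in Table 11 can be quantified with a few lines of arithmetic. The sketch below uses the English training-set emotion counts from the table; the function names are hypothetical, and the max-to-min ratio is just one common way to summarize imbalance, not a measure used in the paper.

```python
# English training-set emotion counts copied from Table 11.
EN_TRAIN_EMOTION = {
    "happy": 8810,
    "surprise": 1242,
    "sad": 1739,
    "disgust": 607,
    "anger": 3146,
    "fear": 869,
    "neutral": 15038,
}


def class_shares(counts):
    """Fraction of samples that fall into each category."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def imbalance_ratio(counts):
    """Size of the largest category divided by the size of the smallest."""
    return max(counts.values()) / min(counts.values())
```

On these counts, neutral accounts for a little under half of the English training utterances, and the largest category is roughly 25 times the size of the smallest (disgust).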

H.6 Ethical Considerations

In data selection, we ensured that our resources came from public platforms to avoid copyright disputes. Additionally, we adopted the CC BY-NC 4.0 license to restrict the dataset’s usage to non-commercial purposes only. Furthermore, to avoid incorporating offensive or unethical content, we imposed limitations on selecting dataset resources, excluding TV shows of genres such as crime and thriller. Please refer to Section A.3 for more details.

In terms of employee compensation, our pricing model is based on the quantity of annotated data, with a payment of 0.1 CNY per annotated utterance. As shown in Table 7, the average duration per utterance of our data is approximately 3.7 seconds. Assuming each utterance requires watching the video twice for accurate annotation, annotating one utterance would take a maximum of 10 seconds. Therefore, an annotator can annotate 360 utterances per hour, equivalent to an hourly wage of 36 CNY. This wage level is significantly higher than that of typical part-time jobs in our city. To ensure accuracy, we use a majority voting method for final annotations, meaning each utterance undergoes three rounds of annotation. Thus, the total annotation cost is calculated as follows: 56012 (total number of utterances) * 3 * 0.1 CNY (cost per utterance) = 16803.6 CNY.
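The arithmetic above, together with a three-annotator majority vote, can be reproduced in a short sketch. The constants are taken from the text; the `majority_vote` helper is hypothetical, and returning `None` when all three annotators disagree is an assumption, since the handling of such cases is not described here.

```python
from collections import Counter
from decimal import Decimal

PRICE_PER_ANNOTATION = Decimal("0.1")  # CNY per annotated utterance
SECONDS_PER_UTTERANCE = 10             # upper bound stated in the text
ROUNDS = 3                             # each utterance is annotated three times
TOTAL_UTTERANCES = 56012

utterances_per_hour = 3600 // SECONDS_PER_UTTERANCE
hourly_wage = utterances_per_hour * PRICE_PER_ANNOTATION
total_cost = TOTAL_UTTERANCES * ROUNDS * PRICE_PER_ANNOTATION


def majority_vote(labels):
    """Return the label chosen by more than half of the annotators, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count * 2 > len(labels) else None
```

`Decimal` is used instead of floating point so that the monetary amounts match the figures in the text exactly.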

Figure 7: Confusion matrices of emotion and intent recognition results for the EmoInt-Trans and EI2 models on the MC-EIU English dataset. Panels (a) EmoInt-Trans and (b) EI2 show the confusion matrices for emotion recognition, whereas panels (c) EmoInt-Trans and (d) EI2 show those for intent recognition. Each cell represents the ratio of samples from each true category that are predicted as the corresponding category.

In summary, the dataset studied in this paper does not involve any ethical concerns.

Appendix I Extra Experiment Results and Analysis

I.1 Implementation Details

Our proposed framework is implemented using PyTorch. The hidden size of $F^{v}$, $F^{a}$, and $F^{t}$ is 128. We set the number of attention heads for the Transformer Network, Cross Attention, and Fusion Attention to 4. We select the Adam optimizer Kingma and Ba [2015] and initialize the learning rate to 0.0002. We dynamically update the learning rate using the Lambda LR Wu et al. [2020] approach. The batch size is 32, and the number of epochs for both the pre-training and training phases is set to 60. To ensure the objectivity of the experimental results, we conducted all experiments three times and reported the average results. All experiments were performed on a single NVIDIA A100 graphics card.
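A lambda-style schedule multiplies the initial learning rate by a user-defined function of the epoch. The sketch below shows that rule in plain Python; the linear decay function is purely an assumption, as the multiplier actually used is not specified here.

```python
BASE_LR = 0.0002  # initial learning rate from the text
EPOCHS = 60       # number of epochs from the text


def linear_decay(epoch, total_epochs=EPOCHS):
    """Hypothetical multiplier: 1.0 at epoch 0, decreasing linearly towards 0."""
    return 1.0 - epoch / total_epochs


def lr_at(epoch, lr_lambda=linear_decay, base_lr=BASE_LR):
    """Lambda-style scheduling: initial rate times lr_lambda(epoch)."""
    return base_lr * lr_lambda(epoch)
```

In PyTorch, the same rule is provided by `torch.optim.lr_scheduler.LambdaLR`, which takes the optimizer (for example `torch.optim.Adam`) and an `lr_lambda` function of this kind.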

I.2 Category Imbalance Issue

As mentioned in Appendix H.5, our MC-EIU dataset, like many other datasets, suffers from category imbalance. Taking EmoInt-Trans, an advanced model for the MC-EIU task, as the baseline, we compare its predictions with those of our model at the level of individual categories. For the English data, Figure 7 shows the per-category prediction results for all emotion and intent categories as confusion matrices.

In Figure 7(a) and (b), both models have difficulty recognizing fear, sad, and disgust, which can be attributed to the insufficient number of samples for these categories. Similarly, Figure 7(c) and (d) show that both models have difficulty recognizing sympathizing and encouraging. This is further supported by Figure 6, which illustrates that these two categories have the lowest number of samples.
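According to the caption of Figure 7, each cell holds the ratio of samples of a true category that are predicted as a given category. A minimal sketch of that row normalization follows; the function name and the toy labels in the usage note are illustrative only and are not results from the paper.

```python
def row_normalized_confusion(y_true, y_pred, labels):
    """Confusion matrix whose rows (true categories) each sum to 1.

    Entry [i][j] is the fraction of samples with true label labels[i]
    that were predicted as labels[j]; the diagonal is per-category recall.
    """
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1

    matrix = []
    for row in counts:
        total = sum(row)
        # A category with no true samples yields an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```

With row normalization, a rare category such as fear is judged by the share of its own samples that are recovered, so a low diagonal value signals weak recall regardless of how few samples the category has.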

These observations serve as evidence that the category imbalance issue poses a significant challenge for both models. Moving forward, we will actively explore effective solutions to tackle this challenge. Specifically, this includes the following aspects:

(1) Introducing effective constraints: In the design of our framework, we have already incorporated Focal Loss to mitigate the issue of data imbalance (as discussed in Section 4.3 of the main paper). Ablation experiments have shown that introducing Focal Loss improves the model's performance, indicating that specific constraint methods can alleviate category imbalance. However, Focal Loss is a relatively basic method and may not be highly effective when there are significant differences in the number of instances per category. Therefore, we will further explore more suitable constraints to alleviate the category imbalance issue.

(2) Data augmentation: As mentioned in Section A.3, we avoided TV series of genres such as thrillers, horror, and crime when selecting the data. However, these genres often contain richer negative emotions and corresponding intents compared to genres like romantic or family dramas. By avoiding these types of TV shows, we inadvertently created a scarcity of data related to negative emotions. To address the category imbalance issue, we will supplement the dataset with new data during the maintenance process. It is important to note that this does not mean we aim to transform the dataset into a perfectly balanced one. We intend to maintain the presence of the category imbalance in the dataset while controlling the disparities between different categories.
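For reference, the Focal Loss mentioned in item (1) is commonly written as FL(p_t) = -α (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The sketch below evaluates this standard form for a single sample; the default values of γ and α are assumptions, and the exact variant and hyperparameters used in the framework are those given in Section 4.3 of the main paper, not the ones shown here.

```python
import math


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample.

    p_true: predicted probability of the sample's true class, in (0, 1].
    gamma:  focusing parameter; gamma = 0 recovers plain cross-entropy.
    alpha:  class weighting factor (assumed constant in this sketch).
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)
```

Because the factor (1 - p_t)^γ shrinks towards zero for confidently classified samples, the many easy examples of frequent categories such as neutral contribute less to the total loss than hard examples from rare categories.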

I.3 Case Study

In this section, we continue to use the EmoInt-Trans system as the baseline for comparison. To further validate the effectiveness of our EI2 method, we present the case study.

Figure 8: A conversational sample consisting of 3 utterances from the MC-EIU dataset. The red text means the emotion label, while the blue text means the intent label.

Figure 8 presents a conversational sample consisting of 3 utterances from the MC-EIU dataset. We adopt the EmoInt-Trans and EI2 models to predict the emotion and intent labels. The ground-truth labels and the prediction results are shown in Table 12.

We can observe that the predictions by EI2 are consistently correct, whereas the results of EmoInt-Trans are not. Taking utterance (a) as an example, EmoInt-Trans fails to make accurate predictions due to its limited ability to model deep interactions between emotion and intent. As shown in Figure 1 in the main paper, the proportion of "Fea-Wis" is smaller than that of "Neu-Wis", indicating a weaker correlation between fear and wishing. Therefore, in the joint recognition process, the impact of the interaction between fear and wishing on the final prediction is relatively low. EmoInt-Trans cannot dynamically adjust the impact of interaction information on the final prediction, leading to incorrect predictions. Unlike EmoInt-Trans, our model incorporates the Emotion-Intent Interaction Encoder, which models deep interactions between emotion and intent while controlling the weight of interaction information during the final prediction. As a result, our model accurately predicts the emotion and intent labels of this example.

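| Sample ID | Ground Truth Emo | Ground Truth Int | EmoInt-Trans Emo | EmoInt-Trans Int | EI2 Emo | EI2 Int |
| --- | --- | --- | --- | --- | --- | --- |
| (a) | Fear | Wishing | Neutral | Wishing | Fear | Wishing |
| (b) | Anger | Wishing | Anger | Neutral | Anger | Wishing |
| (c) | Happy | Suggesting | Neutral | Suggesting | Happy | Suggesting |
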
Table 12: The prediction performance of EmoInt-Trans and EI2 for the sample shown in Figure 8. The red text indicates the wrong results, and the green text means the right results.

As for the incorrect prediction of utterance (c) by EmoInt-Trans, it is due to its limited ability to model contextual information beyond adjacent sentences. Since the request “please don’t" in utterance (a) is fulfilled in utterance (b) (“Fine, whatever. I get it."), utterance (c) exhibits a happy emotion. However, EmoInt-Trans fails to capture the distant historical information, resulting in the inability to recognize the reason for the emotional transition and leading to incorrect predictions.

Appendix J Impact

Positive Impacts:

  • Innovation: As the first open-source dataset in the emotion and intent joint understanding field, the MC-EIU dataset provides a new resource that can spur innovation and development in emotion and intent recognition. Furthermore, this dataset enables researchers to explore the intricate interactions between emotion and intent, ultimately advancing the development of emotion recognition tasks.

  • Collaboration: The availability of the MC-EIU dataset encourages collaboration among researchers across different institutions and industries, fostering a more unified and rapid advancement in the field.

  • Improved Systems: The dataset can help improve the accuracy and reliability of systems that rely on emotion and intent recognition, such as virtual assistants, customer service bots, and interactive entertainment systems.

Negative Impacts:

  • Category Imbalance: Due to the category imbalance in the MC-EIU dataset, models designed for balanced data may recognize minority categories inaccurately, leading to misleading assessments of model performance. Therefore, we encourage more researchers to pay attention to the category imbalance issue.

Appendix K Limitation

(1) Our proposed dataset is not balanced. As described in Appendices H.5 and I.2, our dataset exhibits category imbalance. This is a common issue in real-world scenarios and a challenge that researchers need to address. In our future work, we will actively explore methods to tackle the category imbalance issue and encourage future researchers to pay attention to such issues as well.

(2) The framework we established is a basic model. The main focus of this paper is the dataset release, with the baseline model serving as an accompanying system. The baseline system has emphasized the interaction between emotion and intent, which is crucial for multimodal emotion-intent joint recognition, even though the ideas behind the baseline may seem simple. In the future, we will further explore new methods to delve deeper into the complex relationship between emotion and intent.

(3) We do not explore the performance of large language models (LLMs) on the emotion and intent joint understanding task. LLMs Touvron et al. [2023] model long-range dependencies in conversations effectively, a characteristic that plays a crucial role in understanding the intricate interaction between emotions and intents, both within individuals and between speakers Deng et al. [2023]. We have already discussed the possibility of using LLMs for emotion and intent joint understanding in Section 6 of the main paper. While we attempted to fine-tune large-scale multimodal models for this task, we faced challenges during training due to limited computational resources. In the future, we will seek opportunities to collaborate with other laboratories to conduct joint research on using LLMs for emotion and intent recognition. Additionally, we hope this direction becomes a new research hotspot, attracting more researchers to contribute to this field.