Joint Multimodal Transformer for Emotion Recognition in the Wild

Waligora, Paul; Aslam, Haseeb; Zeeshan, Osama; Belharbi, Soufiane; Koerich, Alessandro Lameiras; Pedersoli, Marco; Bacon, Simon; Granger, Eric

arXiv:2403.10488v2 (cs)

[Submitted on 15 Mar 2024 (v1), revised 2 Apr 2024 (this version, v2), latest version 20 Apr 2024 (v3)]

Abstract: Systems for multimodal emotion recognition (MMER) can typically outperform unimodal systems by leveraging the inter- and intra-modal relationships among, e.g., visual, textual, physiological, and auditory modalities. In this paper, an MMER method is proposed that relies on a joint multimodal transformer for fusion with key-based cross-attention. This framework aims to exploit the diverse and complementary nature of different modalities to improve predictive accuracy. Separate backbones capture intra-modal spatiotemporal dependencies within each modality over video sequences. Subsequently, a joint multimodal transformer fusion architecture integrates the individual modality embeddings, allowing the model to capture inter-modal and intra-modal relationships effectively. Extensive experiments on two challenging expression recognition tasks, (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice) and (2) pain estimation on the Biovid dataset (with face and biosensors), indicate that the proposed method can work effectively with different modalities. Empirical results show that MMER systems using the proposed fusion method outperform relevant baseline and state-of-the-art methods.
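
Illustrative sketch (not from the paper): the abstract does not give the details of the fusion module, so the following is only a generic example of cross-attention between two modality embeddings, written in plain Python. The function names (`cross_attention`, `fuse`), the use of raw embeddings as queries, keys, and values without learned projections, and the concatenation step are all assumptions made for illustration; the authors' key-based cross-attention and joint transformer may differ substantially.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows.

    queries: list of T_q vectors of size d
    keys:    list of T_k vectors of size d
    values:  list of T_k vectors of size d_v
    Returns a list of T_q vectors of size d_v.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def fuse(visual, audio):
    """Hypothetical fusion: attend in both directions, then concatenate.

    Assumes both modalities have the same number of time steps and the
    same embedding size; no learned projections are applied.
    """
    visual_attends_audio = cross_attention(visual, audio, audio)
    audio_attends_visual = cross_attention(audio, visual, visual)
    return [v + a for v, a in zip(visual_attends_audio, audio_attends_visual)]


# Toy embeddings: 2 time steps, 2 features per modality.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.5, 0.5], [0.5, 0.5]]
fused = fuse(visual, audio)
print(len(fused), len(fused[0]))
```

In a trained system the queries, keys, and values would come from learned linear projections of the backbone outputs, and the fused sequence would feed a prediction head; both are omitted here to keep the sketch self-contained.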

Comments:	10 pages, 4 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG); Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2403.10488 [cs.CV]
	(or arXiv:2403.10488v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.10488

Submission history

From: Soufiane Belharbi
[v1] Fri, 15 Mar 2024 17:23:38 UTC (627 KB)
[v2] Tue, 2 Apr 2024 15:34:04 UTC (5,682 KB)
[v3] Sat, 20 Apr 2024 16:24:44 UTC (788 KB)
