Search | arXiv e-print repository

Joint Multimodal Transformer for Emotion Recognition in the Wild

Authors: Paul Waligora, Haseeb Aslam, Osama Zeeshan, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger

Abstract: Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging the inter- and intra-modal relationships between, e.g., visual, textual, physiological, and auditory modalities. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention. This framework can exploit the complementary nature of dive… ▽ More Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging the inter- and intra-modal relationships between, e.g., visual, textual, physiological, and auditory modalities. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention. This framework can exploit the complementary nature of diverse modalities to improve predictive accuracy. Separate backbones capture intra-modal spatiotemporal dependencies within each modality over video sequences. Subsequently, our JMT fusion architecture integrates the individual modality embeddings, allowing the model to effectively capture inter- and intra-modal relationships. Extensive experiments on two challenging expression recognition tasks -- (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice) and (2) pain estimation on the Biovid dataset (with face and biosensors) -- indicate that our JMT fusion can provide a cost-effective solution for MMER. Empirical results show that MMER systems with our proposed fusion allow us to outperform relevant baseline and state-of-the-art methods. △ Less

Submitted 20 April, 2024; v1 submitted 15 March, 2024; originally announced March 2024.

Comments: 10 pages, 4 figures, 6 tables, CVPRw 2024

arXiv:2401.15489 [pdf, other]

Distilling Privileged Multimodal Information for Expression Recognition using Optimal Transport

Authors: Muhammad Haseeb Aslam, Muhammad Osama Zeeshan, Soufiane Belharbi, Marco Pedersoli, Alessandro Koerich, Simon Bacon, Eric Granger

Abstract: Deep learning models for multimodal expression recognition have reached remarkable performance in controlled laboratory environments because of their ability to learn complementary and redundant semantic information. However, these models struggle in the wild, mainly because of the unavailability and quality of modalities used for training. In practice, only a subset of the training-time modalitie… ▽ More Deep learning models for multimodal expression recognition have reached remarkable performance in controlled laboratory environments because of their ability to learn complementary and redundant semantic information. However, these models struggle in the wild, mainly because of the unavailability and quality of modalities used for training. In practice, only a subset of the training-time modalities may be available at test time. Learning with privileged information enables models to exploit data from additional modalities that are only available during training. State-of-the-art knowledge distillation (KD) methods have been proposed to distill information from multiple teacher models (each trained on a modality) to a common student model. These privileged KD methods typically utilize point-to-point matching, yet have no explicit mechanism to capture the structural information in the teacher representation space formed by introducing the privileged modality. Experiments were performed on two challenging problems - pain estimation on the Biovid dataset (ordinal classification) and arousal-valance prediction on the Affwild2 dataset (regression). Results show that our proposed method can outperform state-of-the-art privileged KD methods on these problems. The diversity among modalities and fusion architectures indicates that PKDOT is modality- and model-agnostic. △ Less

Submitted 28 April, 2024; v1 submitted 27 January, 2024; originally announced January 2024.

arXiv:2401.08947 [pdf]

AntiPhishStack: LSTM-based Stacked Generalization Model for Optimized Phishing URL Detection

Authors: Saba Aslam, Hafsa Aslam, Arslan Manzoor, Chen Hui, Abdur Rasool

Abstract: The escalating reliance on revolutionary online web services has introduced heightened security risks, with persistent challenges posed by phishing despite extensive security measures. Traditional phishing systems, reliant on machine learning and manual features, struggle with evolving tactics. Recent advances in deep learning offer promising avenues for tackling novel phishing challenges and mali… ▽ More The escalating reliance on revolutionary online web services has introduced heightened security risks, with persistent challenges posed by phishing despite extensive security measures. Traditional phishing systems, reliant on machine learning and manual features, struggle with evolving tactics. Recent advances in deep learning offer promising avenues for tackling novel phishing challenges and malicious URLs. This paper introduces a two-phase stack generalized model named AntiPhishStack, designed to detect phishing sites. The model leverages the learning of URLs and character-level TF-IDF features symmetrically, enhancing its ability to combat emerging phishing threats. In Phase I, features are trained on a base machine learning classifier, employing K-fold cross-validation for robust mean prediction. Phase II employs a two-layered stacked-based LSTM network with five adaptive optimizers for dynamic compilation, ensuring premier prediction on these features. Additionally, the symmetrical predictions from both phases are optimized and integrated to train a meta-XGBoost classifier, contributing to a final robust prediction. The significance of this work lies in advancing phishing detection with AntiPhishStack, operating without prior phishing-specific feature knowledge. Experimental validation on two benchmark datasets, comprising benign and phishing or malicious URLs, demonstrates the model's exceptional performance, achieving a notable 96.04% accuracy compared to existing studies. This research adds value to the ongoing discourse on symmetry and asymmetry in information security and provides a forward-thinking solution for enhancing network security in the face of evolving cyber threats. △ Less

Submitted 21 January, 2024; v1 submitted 16 January, 2024; originally announced January 2024.

arXiv:2312.05632 [pdf, other]

Subject-Based Domain Adaptation for Facial Expression Recognition

Authors: Muhammad Osama Zeeshan, Muhammad Haseeb Aslam, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger

Abstract: Adapting a deep learning model to a specific target individual is a challenging facial expression recognition (FER) task that may be achieved using unsupervised domain adaptation (UDA) methods. Although several UDA methods have been proposed to adapt deep FER models across source and target data sets, multiple subject-specific source domains are needed to accurately represent the intra- and inter-… ▽ More Adapting a deep learning model to a specific target individual is a challenging facial expression recognition (FER) task that may be achieved using unsupervised domain adaptation (UDA) methods. Although several UDA methods have been proposed to adapt deep FER models across source and target data sets, multiple subject-specific source domains are needed to accurately represent the intra- and inter-person variability in subject-based adaption. This paper considers the setting where domains correspond to individuals, not entire datasets. Unlike UDA, multi-source domain adaptation (MSDA) methods can leverage multiple source datasets to improve the accuracy and robustness of the target model. However, previous methods for MSDA adapt image classification models across datasets and do not scale well to a more significant number of source domains. This paper introduces a new MSDA method for subject-based domain adaptation in FER. It efficiently leverages information from multiple source subjects (labeled source domain data) to adapt a deep FER model to a single target individual (unlabeled target domain data). During adaptation, our subject-based MSDA first computes a between-source discrepancy loss to mitigate the domain shift among data from several source subjects. Then, a new strategy is employed to generate augmented confident pseudo-labels for the target subject, allowing a reduction in the domain shift between source and target subjects. Experiments performed on the challenging BioVid heat and pain dataset with 87 subjects and the UNBC-McMaster shoulder pain dataset with 25 subjects show that our subject-based MSDA can outperform state-of-the-art methods yet scale well to multiple subject-based source domains. △ Less

Submitted 26 April, 2024; v1 submitted 9 December, 2023; originally announced December 2023.

arXiv:2203.14779 [pdf, other]

A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition

Authors: R. Gnana Praveen, Wheidima Carneiro de Melo, Nasib Ullah, Haseeb Aslam, Osama Zeeshan, Théo Denorme, Marco Pedersoli, Alessandro Koerich, Simon Bacon, Patrick Cardinal, Eric Granger

Abstract: Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary relationships over multiple modalities (e.g., audio, visual, biosignals, etc.), and can provide some robustness to noisy modalities. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively lever… ▽ More Multimodal emotion recognition has recently gained much attention since it can leverage diverse and complementary relationships over multiple modalities (e.g., audio, visual, biosignals, etc.), and can provide some robustness to noisy modalities. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. Specifically, we propose a joint cross-attention model that relies on the complementary relationships to extract the salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. The proposed fusion model efficiently leverages the inter-modal relationships, while reducing the heterogeneity between the features. In particular, it computes the cross-attention weights based on correlation between the combined feature representation and individual modalities. By deploying the combined A-V feature representation into the cross-attention module, the performance of our fusion module improves significantly over the vanilla cross-attention module. Experimental results on validation-set videos from the AffWild2 dataset indicate that our proposed A-V fusion model provides a cost-effective solution that can outperform state-of-the-art approaches. The code is available on GitHub: https://github.com/praveena2j/JointCrossAttentional-AV-Fusion. △ Less

Submitted 6 July, 2024; v1 submitted 28 March, 2022; originally announced March 2022.

Comments: arXiv admin note: text overlap with arXiv:2111.05222

arXiv:2004.08774 [pdf, other]

Code Review in the Classroom

Authors: Victor Rivera, Hamna Aslam, Alexandr Naumchev, Daniel de Carvalho, Mansur Khazeev, Manuel Mazzara

Abstract: This paper presents a case study to examine the affinity of the code review process among young developers in an academic setting. Code review is indispensable considering the positive outcomes it generates. However, it is not an individual activity and requires substantial interaction among stakeholders, deliverance, and acceptance of feedback, timely actions upon feedback as well as the ability… ▽ More This paper presents a case study to examine the affinity of the code review process among young developers in an academic setting. Code review is indispensable considering the positive outcomes it generates. However, it is not an individual activity and requires substantial interaction among stakeholders, deliverance, and acceptance of feedback, timely actions upon feedback as well as the ability to agree on a solution in the wake of diverse viewpoints. Young developers in a classroom setting provide a clear picture of the potential favourable and problematic areas of the code review process. Their feedback suggests that the process has been well received with some points to better the process. This paper can be used as guidelines to perform code reviews in the classroom. △ Less

Submitted 19 April, 2020; originally announced April 2020.

arXiv:1906.01430 [pdf, ps, other]

Towards A Broader Acceptance Of Formal Verification Tools: The Role Of Education

Authors: Mansur Khazeev, Manuel Mazzara, Daniel De Carvalho, Hamna Aslam

Abstract: Formal methods yet advantageous, face challenges towards wide acceptance and adoption in software development practices. The major reason being presumed complexity. The issue can be addressed by academia with a thoughtful plan of teaching and practise. The user study detailed in this paper is examining AutoProof tool with the motivation to identify complexities attributed to formal methods. Partic… ▽ More Formal methods yet advantageous, face challenges towards wide acceptance and adoption in software development practices. The major reason being presumed complexity. The issue can be addressed by academia with a thoughtful plan of teaching and practise. The user study detailed in this paper is examining AutoProof tool with the motivation to identify complexities attributed to formal methods. Participants' (students of Masters program in Computer Science) performance and feedback on the experience with formal methods assisted us in extracting specific problem areas that effect tool usability. The study results infer, along with improvements in verification tool functionalities, teaching program must be modified to include pre-requisite courses to make formal methods easily adapted by students and promoting their usage in software development process. △ Less

Submitted 4 June, 2019; originally announced June 2019.

arXiv:1404.3079 [pdf, other]

Jessen's Inequality and Exponential Convexity for Positive C0-Semigroups

Authors: Gul I hina Aslam, Matloob Anwar

Abstract: In this paper the Jessen's type inequality for normalized positive $C_0$-semigroups is obtained. An adjoint of Jessen's type inequality has also been derived for the corresponding adjoint-semigroup, which does not give the analogous results but the behavior is still interesting. Moreover, it is followed by some results regarding positive definiteness and exponential convexity of complex structures… ▽ More In this paper the Jessen's type inequality for normalized positive $C_0$-semigroups is obtained. An adjoint of Jessen's type inequality has also been derived for the corresponding adjoint-semigroup, which does not give the analogous results but the behavior is still interesting. Moreover, it is followed by some results regarding positive definiteness and exponential convexity of complex structures involving operators from a semigroup. △ Less

Submitted 7 April, 2015; v1 submitted 11 April, 2014; originally announced April 2014.

Comments: 8 pages, 2 figures

Showing 1–8 of 8 results for author: Aslam, H