Multi-Modal Emotion Recognition by Text, Speech and Video Using Pretrained Transformers

Authors: Minoo Shayaninasab, Bagher Babaali

Abstract: Due to the complex nature of human emotions and the diversity of emotion representation methods in humans, emotion recognition is a challenging field. In this research, three input modalities, namely text, audio (speech), and video, are employed to generate multimodal feature vectors. For generating features for each of these modalities, pre-trained Transformer models with fine-tuning are utilized. In each modality, a Transformer model is used with transfer learning to extract features and emotional structure. These features are then fused together, and emotion recognition is performed using a classifier. To select an appropriate fusion method and classifier, various feature-level and decision-level fusion techniques have been experimented with, and ultimately, the best model, which combines feature-level fusion by concatenating feature vectors and classification using a Support Vector Machine on the IEMOCAP multimodal dataset, achieves an accuracy of 75.42%. Keywords: Multimodal Emotion Recognition, IEMOCAP, Self-Supervised Learning, Transfer Learning, Transformer.

Submitted 11 February, 2024; originally announced February 2024.
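
The two fusion families named in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy feature values, and the probability-averaging rule for decision-level fusion are assumptions chosen for illustration, and the classifier itself (a Support Vector Machine in the paper) is left out.

```python
from typing import Dict, List, Sequence


def feature_level_fusion(text: Sequence[float],
                         audio: Sequence[float],
                         video: Sequence[float]) -> List[float]:
    """Concatenate per-modality feature vectors into one fused vector."""
    return list(text) + list(audio) + list(video)


def decision_level_fusion(per_modality_probs: Sequence[Dict[str, float]]) -> str:
    """Average class probabilities across modalities and return the top class.

    Averaging is only one possible decision-level rule (an assumption here).
    """
    labels = per_modality_probs[0].keys()
    n = len(per_modality_probs)
    averaged = {lab: sum(p[lab] for p in per_modality_probs) / n for lab in labels}
    return max(averaged, key=averaged.get)


# Hypothetical toy features; real ones would come from fine-tuned Transformers.
fused = feature_level_fusion([0.1, 0.9], [0.4, 0.2, 0.7], [0.3])
print(len(fused))  # 6: the fused vector is what a single classifier would consume

# Hypothetical per-modality class probabilities.
probs = [
    {"happy": 0.6, "sad": 0.4},  # text model
    {"happy": 0.3, "sad": 0.7},  # audio model
    {"happy": 0.7, "sad": 0.3},  # video model
]
print(decision_level_fusion(probs))  # happy
```

Feature-level fusion trains one classifier on the joint vector, while decision-level fusion combines the outputs of separate per-modality classifiers.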

arXiv:2402.07326 [pdf]

Persian Speech Emotion Recognition by Fine-Tuning Transformers

Authors: Minoo Shayaninasab, Bagher Babaali

Abstract: Given the significance of speech emotion recognition, numerous methods have been developed in recent years to create effective and efficient systems in this domain. One of these methods involves the use of pretrained transformers, fine-tuned to address this specific problem, resulting in high accuracy. Despite extensive discussions and global-scale efforts to enhance these systems, the application of this innovative and effective approach has received less attention in the context of Persian speech emotion recognition. In this article, we review the field of speech emotion recognition and its background, with an emphasis on the importance of employing transformers in this context. We present two models, one based on spectrograms and the other on the audio itself, fine-tuned using the shEMO dataset. These models significantly enhance the accuracy of previous systems, increasing it from approximately 65% to 80% on the mentioned dataset. Subsequently, to investigate the effect of multilinguality on the fine-tuning process, these same models are fine-tuned twice. First, they are fine-tuned using the English IEMOCAP dataset, and then they are fine-tuned with the Persian shEMO dataset. This results in an improved accuracy of 82% for the Persian emotion recognition system. Keywords: Persian Speech Emotion Recognition, shEMO, Self-Supervised Learning

Submitted 11 February, 2024; originally announced February 2024.

arXiv:2310.11640 [pdf, other]

Free-text Keystroke Authentication using Transformers: A Comparative Study of Architectures and Loss Functions

Authors: Saleh Momeni, Bagher BabaAli

Abstract: Keystroke biometrics is a promising approach for user identification and verification, leveraging the unique patterns in individuals' typing behavior. In this paper, we propose a Transformer-based network that employs self-attention to extract informative features from keystroke sequences, surpassing the performance of traditional Recurrent Neural Networks. We explore two distinct architectures, namely bi-encoder and cross-encoder, and compare their effectiveness in keystroke authentication. Furthermore, we investigate different loss functions, including triplet, batch-all triplet, and WDCL loss, along with various distance metrics such as Euclidean, Manhattan, and cosine distances. These experiments allow us to optimize the training process and enhance the performance of our model. To evaluate our proposed model, we employ the Aalto desktop keystroke dataset. The results demonstrate that the bi-encoder architecture with batch-all triplet loss and cosine distance achieves the best performance, yielding an exceptional Equal Error Rate of 0.0186%. Furthermore, alternative algorithms for calculating similarity scores are explored to enhance accuracy. Notably, the utilization of a one-class Support Vector Machine reduces the Equal Error Rate to an impressive 0.0163%. The outcomes of this study indicate that our model surpasses the previous state-of-the-art in free-text keystroke authentication. These findings contribute to advancing the field of keystroke authentication and offer practical implications for secure user verification systems.

Submitted 17 October, 2023; originally announced October 2023.
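
As a rough illustration of the best-performing combination named in the abstract above (batch-all triplet loss with cosine distance), the following pure-Python sketch enumerates every valid (anchor, positive, negative) triplet in a batch. It is not the authors' code: the margin value, the toy embeddings, and the choice to average only over triplets with non-zero loss are assumptions.

```python
import math
from typing import List, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def batch_all_triplet_loss(embeddings: Sequence[Sequence[float]],
                           labels: Sequence[str],
                           margin: float = 0.5) -> float:
    """Hinge loss over all valid triplets, averaged over the active ones."""
    losses: List[float] = []
    n = len(embeddings)
    for a in range(n):
        for p in range(n):
            # A positive must be a different sample from the same user.
            if p == a or labels[p] != labels[a]:
                continue
            d_ap = cosine_distance(embeddings[a], embeddings[p])
            for neg in range(n):
                # A negative must come from a different user.
                if labels[neg] == labels[a]:
                    continue
                d_an = cosine_distance(embeddings[a], embeddings[neg])
                losses.append(max(0.0, d_ap - d_an + margin))
    active = [x for x in losses if x > 0.0]
    return sum(active) / len(active) if active else 0.0


# Hypothetical 2-D embeddings of keystroke sequences from two users.
well_separated = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mixed_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
users = ["A", "A", "B", "B"]

print(batch_all_triplet_loss(well_separated, users))  # 0.0
print(batch_all_triplet_loss(mixed_up, users))        # 1.0
```

The loss is zero once every same-user pair is closer than every different-user pair by at least the margin, which is the property a verification threshold later relies on.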

arXiv:2310.06645 [pdf, other]

Self-Supervised Representation Learning for Online Handwriting Text Classification

Authors: Pouya Mehralian, Bagher BabaAli, Ashena Gorgan Mohammadi

Abstract: Self-supervised learning offers an efficient way of extracting rich representations from various types of unlabeled data while avoiding the cost of annotating large-scale datasets. This is achievable by designing a pretext task to form pseudo labels with respect to the modality and domain of the data. Given the evolving applications of online handwritten texts, in this study, we propose the novel Part of Stroke Masking (POSM) as a pretext task for pretraining models to extract informative representations from the online handwriting of individuals in English and Chinese languages, along with two suggested pipelines for fine-tuning the pretrained models. To evaluate the quality of the extracted representations, we use both intrinsic and extrinsic evaluation methods. The pretrained models are fine-tuned to achieve state-of-the-art results in tasks such as writer identification, gender classification, and handedness classification, also highlighting the superiority of utilizing the pretrained models over the models trained from scratch.

Submitted 10 October, 2023; originally announced October 2023.

arXiv:2307.15045 [pdf, other]

A Transformer-based Approach for Arabic Offline Handwritten Text Recognition

Authors: Saleh Momeni, Bagher BabaAli

Abstract: Handwriting recognition is a challenging and critical problem in the fields of pattern recognition and machine learning, with applications spanning a wide range of domains. In this paper, we focus on the specific issue of recognizing offline Arabic handwritten text. Existing approaches typically utilize a combination of convolutional neural networks for image feature extraction and recurrent neural networks for temporal modeling, with connectionist temporal classification used for text generation. However, these methods suffer from a lack of parallelization due to the sequential nature of recurrent neural networks. Furthermore, these models cannot account for linguistic rules, necessitating the use of an external language model in the post-processing stage to boost accuracy. To overcome these issues, we introduce two alternative architectures, namely the Transformer Transducer and the standard sequence-to-sequence Transformer, and compare their performance in terms of accuracy and speed. Our approach can model language dependencies and relies only on the attention mechanism, thereby making it more parallelizable and less complex. We employ pre-trained Transformers for both image understanding and language modeling. Our evaluation on the Arabic KHATT dataset demonstrates that our proposed method outperforms the current state-of-the-art approaches for recognizing offline Arabic handwritten text.

Submitted 27 July, 2023; originally announced July 2023.

arXiv:2012.06186 [pdf, other]

doi 10.1049/bme2.12039

Writer Identification and Writer Retrieval Based on NetVLAD with Re-ranking

Authors: Shervin Rasoulzadeh, Bagher Babaali

Abstract: This paper addresses writer identification and writer retrieval, which are considered challenging problems in the document analysis and recognition field. In this work, a novel pipeline is proposed for the problem at hand by employing a unified neural network architecture consisting of the ResNet-20 as a feature extractor and an integrated NetVLAD layer, inspired by the vector of locally aggregated descriptors (VLAD), in the head of the latter part. Having defined this architecture, the triplet semi-hard loss function is used to directly learn an embedding for individual input image patches. Subsequently, the generalized max-pooling technique is employed for the aggregation of embedded descriptors of each handwritten image. Also, a novel re-ranking strategy is introduced for the task of identification and retrieval based on $k$-reciprocal nearest neighbors, and it is shown that the pipeline can benefit tremendously from this step. Experimental evaluation has been done on three publicly available datasets: the ICDAR 2013, CVL, and KHATT datasets. Results indicate that while we perform comparably to the state-of-the-art on the KHATT dataset, our writer identification and writer retrieval pipeline achieves superior performance on the ICDAR 2013 and CVL datasets in terms of mAP.

Submitted 22 February, 2021; v1 submitted 11 December, 2020; originally announced December 2020.

Comments: 22 pages, 12 figures
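
The $k$-reciprocal nearest-neighbor relation that the re-ranking step in the abstract above builds on can be sketched in a few lines of Python. The sketch only computes the neighbor sets; it does not reproduce the paper's re-ranking strategy, and the distance matrix and function names are hypothetical.

```python
from typing import List, Sequence, Set


def k_nearest(dist: Sequence[Sequence[float]], i: int, k: int) -> List[int]:
    """Indices of the k items closest to item i, excluding i itself."""
    others = [j for j in range(len(dist)) if j != i]
    return sorted(others, key=lambda j: dist[i][j])[:k]


def k_reciprocal(dist: Sequence[Sequence[float]], i: int, k: int) -> Set[int]:
    """Items j such that j is in i's top-k AND i is in j's top-k."""
    return {j for j in k_nearest(dist, i, k) if i in k_nearest(dist, j, k)}


# Hypothetical pairwise distances between four handwriting descriptors.
dist = [
    [0.0, 0.1, 0.5, 0.9],
    [0.1, 0.0, 0.6, 0.8],
    [0.5, 0.6, 0.0, 0.2],
    [0.9, 0.8, 0.2, 0.0],
]

print(k_nearest(dist, 3, 2))     # [2, 1]
print(k_reciprocal(dist, 3, 2))  # {2}
```

Item 1 is among the two nearest neighbors of item 3, but item 3 is not among the two nearest neighbors of item 1, so only item 2 survives the reciprocity check. Requiring the relation to hold in both directions is what makes such neighbor sets a stricter signal than a plain top-k list.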

arXiv:1910.00330 [pdf, other]

A Multi-Modal Feature Embedding Approach to Diagnose Alzheimer Disease from Spoken Language

Authors: S. Soroush Haj Zargarbashi, Bagher Babaali

Abstract: Introduction: Alzheimer's disease is a type of dementia in which early diagnosis plays a major role in the quality of treatment. Many recent works on the diagnosis of Alzheimer's disease analyze the voice stream acoustically, syntactically, or both. The tools most often used to perform these analyses include machine learning techniques. Objective: Designing an automatic machine-learning-based diagnosis system will help in the procedure of early detection. Systems that use noninvasive data are also preferable. Methods: We used a classification system based on spoken language. We use three (statistical and neural) approaches to classify audio signals from spoken language into two classes of dementia and control. Result: This work designs a multi-modal feature embedding on the spoken language audio signal using three approaches: N-gram, i-vector, and x-vector. The evaluation of the system is done on the cookie picture description task from the Pitt Corpus of DementiaBank, with an accuracy of 83.6%.

Submitted 1 October, 2019; originally announced October 2019.

Comments: 14 pages, 4 figures

arXiv:1904.11914 [pdf, other]

doi 10.2478/jee-2019-0056

Statistical feature embedding for heart sound classification

Authors: Mohammad Adiban, Bagher BabaAli, Saeedreza Shehnepoor

Abstract: Cardiovascular Disease (CVD) is considered one of the principal causes of death in the world. In recent years, researchers have investigated heart sound patterns for disease diagnostics. In this study, an approach is proposed for normal/abnormal heart sound classification on the PhysioNet Challenge 2016 dataset. For the first time, a fixed-length feature vector, called an i-vector, is extracted from each heart sound using Mel Frequency Cepstral Coefficient (MFCC) features. Afterwards, a Principal Component Analysis (PCA) transform and a Variational Autoencoder (VAE) are applied to the i-vector to achieve dimension reduction. Eventually, the reduced-size vector is fed to Gaussian Mixture Models (GMMs) and a Support Vector Machine (SVM) for classification. Experimental results demonstrate that the proposed method achieves a performance improvement of 16% in Modified Accuracy (MAcc) compared with the baseline system on the PhysioNet dataset.

Submitted 9 November, 2020; v1 submitted 26 April, 2019; originally announced April 2019.

Journal ref: Journal of Electrical Engineering, 70(4), 259-272 (2019)

arXiv:1712.02781 [pdf, other]

doi 10.1007/s00521-018-3844-z

On Usage of Autoencoders and Siamese Networks for Online Handwritten Signature Verification

Authors: Kian Ahrabian, Bagher Babaali

Abstract: In this paper, we propose a novel writer-independent global feature extraction framework for the task of automatic signature verification which aims to make robust systems for automatically distinguishing negative and positive samples. Our method consists of an autoencoder for modeling the sample space into a fixed-length latent space and a Siamese Network for classifying the fixed-length samples obtained from the autoencoder, based on the reference samples of a subject, as being "Genuine" or "Forged." During our experiments, usage of an Attention Mechanism and applying Downsampling significantly improved the accuracy of the proposed framework. We evaluated our proposed framework using the SigWiComp2013 Japanese and GPDSsyntheticOnLineOffLineSignature datasets. On the SigWiComp2013 Japanese dataset, we achieved an EER of 8.65%, a 1.2% relative improvement over the best-reported result. Furthermore, on the GPDSsyntheticOnLineOffLineSignature dataset, we achieved average EERs of 0.13%, 0.12%, 0.21% and 0.25% for 150, 300, 1000 and 2000 test subjects respectively, which correspond to relative EER improvements over the best-reported results of 95.67%, 95.26%, 92.9% and 91.52% respectively. Apart from the accuracy gain, because our proposed framework is based on neural networks and is consequently as simple as some consecutive matrix multiplications, it has less computational cost than conventional methods such as DTW and could be used concurrently on devices such as GPU, TPU, etc.

Submitted 29 December, 2017; v1 submitted 7 December, 2017; originally announced December 2017.

Comments: 13 pages, 10 figures, Submitted to Neural Computing and Applications journal
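
The Equal Error Rate (EER) and the relative improvement figures quoted in the abstract above can be made concrete with a small Python sketch. It assumes similarity scores where higher means "more likely genuine", approximates the EER at the threshold where the false acceptance and false rejection rates are closest, and uses made-up scores; none of the names or numbers come from the paper.

```python
from typing import Sequence


def equal_error_rate(genuine: Sequence[float], forged: Sequence[float]) -> float:
    """Approximate EER: the error rate where FAR and FRR are closest."""
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(genuine) | set(forged)):
        # FAR: forged samples accepted at threshold t.
        far = sum(s >= t for s in forged) / len(forged)
        # FRR: genuine samples rejected at threshold t.
        frr = sum(s < t for s in genuine) / len(genuine)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


def relative_improvement(baseline_eer: float, new_eer: float) -> float:
    """Fraction by which the new EER is lower than the baseline EER."""
    return (baseline_eer - new_eer) / baseline_eer


# Hypothetical verification scores.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
forged_scores = [0.6, 0.3, 0.2, 0.1]

print(equal_error_rate(genuine_scores, forged_scores))  # 0.25
print(round(relative_improvement(0.10, 0.08), 3))       # 0.2
```

A relative improvement compares against the previous error rate rather than subtracting percentage points, which is why a small absolute change in an already low EER can appear as a large relative gain.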

Showing 1–9 of 9 results for author: Babaali, B