-
ExHuBERT: Enhancing HuBERT Through Block Extension and Fine-Tuning on 37 Emotion Datasets
Authors:
Shahin Amiriparian,
Filip Packań,
Maurice Gerczuk,
Björn W. Schuller
Abstract:
Foundation models have shown great promise in speech emotion recognition (SER) by leveraging their pre-trained representations to capture emotion patterns in speech signals. To further enhance SER performance across various languages and domains, we propose a novel twofold approach. First, we gather EmoSet++, a comprehensive multi-lingual, multi-cultural speech emotion corpus with 37 datasets, 150,907 samples, and a total duration of 119.5 hours. Second, we introduce ExHuBERT, an enhanced version of HuBERT achieved by backbone extension and fine-tuning on EmoSet++. We duplicate each encoder layer and its weights, then freeze the first duplicate, integrating an extra zero-initialized linear layer and skip connections to preserve functionality and ensure its adaptability for subsequent fine-tuning. Our evaluation on unseen datasets shows the efficacy of ExHuBERT, setting a new benchmark for various SER tasks. Model and details on EmoSet++: https://huggingface.co/amiriparian/ExHuBERT.
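The block-extension idea can be illustrated with a toy sketch. The wiring below (frozen copy first, then a trainable copy whose output passes through a zero-initialised linear layer and is added back via a skip connection) is an assumption made for illustration, not the paper's confirmed architecture; `make_extended_block` and `toy_layer` are hypothetical names. It only shows why zero initialisation leaves the pre-trained output unchanged before fine-tuning.

```python
from typing import Callable, List

Vector = List[float]


def linear(weights: List[List[float]], x: Vector) -> Vector:
    """Plain matrix-vector product, standing in for a bias-free linear layer."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def make_extended_block(layer: Callable[[Vector], Vector], dim: int):
    """Wrap one pre-trained layer into a frozen copy plus a trainable copy.

    Assumed wiring (illustrative only, not taken from the paper):
        h   = frozen_copy(x)
        out = h + W @ trainable_copy(h), with W initialised to zeros.
    """
    frozen_copy = layer      # weights would stay fixed during fine-tuning
    trainable_copy = layer   # same initial weights, updated during fine-tuning
    zero_proj = [[0.0] * dim for _ in range(dim)]  # zero-initialised linear layer

    def block(x: Vector) -> Vector:
        h = frozen_copy(x)
        update = linear(zero_proj, trainable_copy(h))
        return [a + b for a, b in zip(h, update)]  # skip connection

    return block, zero_proj


def toy_layer(x: Vector) -> Vector:
    """Hypothetical stand-in for a pre-trained encoder layer."""
    return [2.0 * v + 1.0 for v in x]


block, proj = make_extended_block(toy_layer, dim=3)
x = [0.5, -1.0, 2.0]
original_out = toy_layer(x)
extended_out = block(x)  # identical to original_out while proj is all zeros
```

Once training moves the projection weights away from zero, the trainable copy starts to contribute, while the frozen copy keeps the original behaviour available through the skip path.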
Submitted 11 June, 2024;
originally announced June 2024.
-
The MuSe 2024 Multimodal Sentiment Analysis Challenge: Social Perception and Humor Recognition
Authors:
Shahin Amiriparian,
Lukas Christ,
Alexander Kathan,
Maurice Gerczuk,
Niklas Müller,
Steffen Klug,
Lukas Stappen,
Andreas König,
Erik Cambria,
Björn Schuller,
Simone Eulitz
Abstract:
The Multimodal Sentiment Analysis Challenge (MuSe) 2024 addresses two contemporary multimodal affect and sentiment analysis problems: In the Social Perception Sub-Challenge (MuSe-Perception), participants will predict 16 different social attributes of individuals such as assertiveness, dominance, likability, and sincerity based on the provided audio-visual data. The Cross-Cultural Humor Detection Sub-Challenge (MuSe-Humor) dataset expands upon the Passau Spontaneous Football Coach Humor (Passau-SFCH) dataset, focusing on the detection of spontaneous humor in a cross-lingual and cross-cultural setting. The main objective of MuSe 2024 is to unite a broad audience from various research domains, including multimodal sentiment analysis, audio-visual affective computing, continuous signal processing, and natural language processing. By fostering collaboration and exchange among experts in these fields, MuSe 2024 endeavors to advance the understanding and application of sentiment analysis and affective computing across multiple modalities. This baseline paper provides details on each sub-challenge and its corresponding dataset, describes the features extracted from each data modality, and discusses the challenge baselines. For our baseline system, we make use of a range of Transformers and expert-designed features and train Gated Recurrent Unit (GRU)-Recurrent Neural Network (RNN) models on them, resulting in a competitive baseline system. On the unseen test datasets of the respective sub-challenges, it achieves a mean Pearson's Correlation Coefficient ($ρ$) of 0.3573 for MuSe-Perception and an Area Under the Curve (AUC) value of 0.8682 for MuSe-Humor.
Submitted 11 June, 2024;
originally announced June 2024.
-
Sustained Vowels for Pre- vs Post-Treatment COPD Classification
Authors:
Andreas Triantafyllopoulos,
Anton Batliner,
Wolfgang Mayr,
Markus Fendler,
Florian Pokorny,
Maurice Gerczuk,
Shahin Amiriparian,
Thomas Berghaus,
Björn Schuller
Abstract:
Chronic obstructive pulmonary disease (COPD) is a serious inflammatory lung disease affecting millions of people around the world. Due to an obstructed airflow from the lungs, it also becomes manifest in patients' vocal behaviour. Of particular importance is the detection of an exacerbation episode, which marks an acute phase and often requires hospitalisation and treatment. Previous work has shown that it is possible to distinguish between a pre- and a post-treatment state using automatic analysis of read speech. In this contribution, we examine whether sustained vowels can provide a complementary lens for telling apart these two states. Using a cohort of 50 patients, we show that the inclusion of sustained vowels can raise unweighted average recall from a 71\% baseline using read speech to up to 79\%. We further identify and interpret the most important acoustic features that characterise the manifestation of COPD in sustained vowels.
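Unweighted average recall (UAR) is the mean of the per-class recalls, so each class counts equally regardless of how many samples it has. The sketch below uses made-up labels (not data from the study) to show how UAR differs from plain accuracy on an imbalanced set.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def unweighted_average_recall(y_true: Sequence[Hashable],
                              y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall; every class has the same weight."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Illustrative, imbalanced toy labels (8 "pre" vs 2 "post").
y_true = ["pre"] * 8 + ["post"] * 2
y_pred = ["pre"] * 8 + ["pre", "post"]

uar = unweighted_average_recall(y_true, y_pred)   # (8/8 + 1/2) / 2
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 9/10
```

Here accuracy looks high because the majority class dominates, while UAR exposes the weaker recall on the minority class.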
Submitted 10 June, 2024;
originally announced June 2024.
-
Enhancing Suicide Risk Assessment: A Speech-Based Automated Approach in Emergency Medicine
Authors:
Shahin Amiriparian,
Maurice Gerczuk,
Justina Lutz,
Wolfgang Strube,
Irina Papazova,
Alkomiet Hasan,
Alexander Kathan,
Björn W. Schuller
Abstract:
The delayed access to specialized psychiatric assessments and care for patients at risk of suicidal tendencies in emergency departments creates a notable gap in timely intervention, hindering the provision of adequate mental health support during critical situations. To address this, we present a non-invasive, speech-based approach for automatic suicide risk assessment. For our study, we have collected a novel dataset of speech recordings from $20$ patients from which we extract three sets of features, including wav2vec, interpretable speech and acoustic features, and deep learning-based spectral representations. We proceed by conducting a binary classification to assess suicide risk in a leave-one-subject-out fashion. Our most effective speech model achieves a balanced accuracy of $66.2\,\%$. Moreover, we show that integrating our speech model with a series of patients' metadata, such as the history of suicide attempts or access to firearms, improves the overall result. The metadata integration yields a balanced accuracy of $94.4\,\%$, marking an absolute improvement of $28.2$ percentage points and demonstrating the efficacy of our proposed approaches for automatic suicide risk assessment in emergency medicine.
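Leave-one-subject-out evaluation holds out all recordings of one patient per fold, so no speaker appears in both training and test data. A minimal sketch of the split, with hypothetical subject IDs and a hypothetical helper name `leave_one_subject_out`, could look like this:

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def leave_one_subject_out(
    subject_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    by_subject: Dict[str, List[int]] = {}
    for idx, sid in enumerate(subject_ids):
        by_subject.setdefault(sid, []).append(idx)
    for held_out, test_idx in by_subject.items():
        train_idx = [i for i, sid in enumerate(subject_ids) if sid != held_out]
        yield held_out, train_idx, test_idx


# Illustrative recording-to-subject assignment (not the study's data).
subjects = ["p1", "p1", "p2", "p3", "p3", "p3"]
folds = list(leave_one_subject_out(subjects))
```

With 20 patients this procedure yields 20 folds; the per-fold predictions are typically pooled before a metric such as balanced accuracy is computed.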
Submitted 18 April, 2024;
originally announced April 2024.
-
The ACM Multimedia 2023 Computational Paralinguistics Challenge: Emotion Share & Requests
Authors:
Björn W. Schuller,
Anton Batliner,
Shahin Amiriparian,
Alexander Barnhill,
Maurice Gerczuk,
Andreas Triantafyllopoulos,
Alice Baird,
Panagiotis Tzirakis,
Chris Gagne,
Alan S. Cowen,
Nikola Lackovic,
Marie-José Caraty,
Claude Montacié
Abstract:
The ACM Multimedia 2023 Computational Paralinguistics Challenge addresses two different problems for the first time in a research competition under well-defined conditions: In the Emotion Share Sub-Challenge, a regression on speech has to be made; and in the Requests Sub-Challenge, requests and complaints need to be detected. We describe the Sub-Challenges, baseline feature extraction, and classifiers based on the usual ComParE features, the auDeep toolkit, and deep feature extraction from pre-trained CNNs using the DeepSpectrum toolkit; in addition, wav2vec2 models are used.
Submitted 1 May, 2023; v1 submitted 28 April, 2023;
originally announced April 2023.
-
HEAR4Health: A blueprint for making computer audition a staple of modern healthcare
Authors:
Andreas Triantafyllopoulos,
Alexander Kathan,
Alice Baird,
Lukas Christ,
Alexander Gebhard,
Maurice Gerczuk,
Vincent Karas,
Tobias Hübner,
Xin Jing,
Shuo Liu,
Adria Mallol-Ragolta,
Manuel Milling,
Sandra Ottl,
Anastasia Semertzidou,
Srividya Tirunellai Rajamani,
Tianhao Yan,
Zijiang Yang,
Judith Dineley,
Shahin Amiriparian,
Katrin D. Bartl-Pokorny,
Anton Batliner,
Florian B. Pokorny,
Björn W. Schuller
Abstract:
Recent years have seen a rapid increase in digital medicine research in an attempt to transform traditional healthcare systems to their modern, intelligent, and versatile equivalents that are adequately equipped to tackle contemporary challenges. This has led to a wave of applications that utilise AI technologies; first and foremost in the fields of medical imaging, but also in the use of wearables and other intelligent sensors. In comparison, computer audition can be seen to be lagging behind, at least in terms of commercial interest. Yet, audition has long been a staple assistant for medical practitioners, with the stethoscope being the quintessential sign of doctors around the world. Transforming this traditional technology with the use of AI entails a set of unique challenges. We categorise the advances needed in four key pillars: Hear, corresponding to the cornerstone technologies needed to analyse auditory signals in real-life conditions; Earlier, for the advances needed in computational and data efficiency; Attentively, for accounting for individual differences and handling the longitudinal nature of medical data; and, finally, Responsibly, for ensuring compliance with the ethical standards accorded to the field of medicine.
Submitted 25 January, 2023;
originally announced January 2023.
-
Distinguishing between pre- and post-treatment in the speech of patients with chronic obstructive pulmonary disease
Authors:
Andreas Triantafyllopoulos,
Markus Fendler,
Anton Batliner,
Maurice Gerczuk,
Shahin Amiriparian,
Thomas M. Berghaus,
Björn W. Schuller
Abstract:
Chronic obstructive pulmonary disease (COPD) causes lung inflammation and airflow blockage leading to a variety of respiratory symptoms; it is also a leading cause of death and affects millions of individuals around the world. Patients often require treatment and hospitalisation, while no cure is currently available. As COPD predominantly affects the respiratory system, speech and non-linguistic vocalisations present a major avenue for measuring the effect of treatment. In this work, we present results on a new COPD dataset of 20 patients, showing that, by employing personalisation through speaker-level feature normalisation, we can distinguish between pre- and post-treatment speech with an unweighted average recall (UAR) of up to 82\,\% in (nested) leave-one-speaker-out cross-validation. We further identify the most important features and link them to pathological voice properties, thus enabling an auditory interpretation of treatment effects. Monitoring tools based on such approaches may help objectivise the clinical status of COPD patients and facilitate personalised treatment plans.
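Speaker-level feature normalisation can be sketched as a per-speaker z-score: each feature value is standardised using only the mean and standard deviation of that speaker's own recordings. The function name and the toy pitch-like values below are illustrative assumptions, and the exact normalisation used in the paper may differ.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def normalise_per_speaker(features: Sequence[float],
                          speakers: Sequence[str]) -> List[float]:
    """Z-score each value with the statistics of its own speaker."""
    grouped: Dict[str, List[float]] = {}
    for value, spk in zip(features, speakers):
        grouped.setdefault(spk, []).append(value)
    stats: Dict[str, Tuple[float, float]] = {
        spk: (mean(vals), pstdev(vals)) for spk, vals in grouped.items()
    }
    normalised = []
    for value, spk in zip(features, speakers):
        mu, sigma = stats[spk]
        # A constant feature carries no within-speaker information.
        normalised.append((value - mu) / sigma if sigma > 0 else 0.0)
    return normalised


# Two hypothetical speakers with very different baseline values.
values = [100.0, 120.0, 200.0, 240.0]
speaker_ids = ["a", "a", "b", "b"]
normed = normalise_per_speaker(values, speaker_ids)
```

After normalisation both speakers share the same scale, so a classifier sees the change relative to each person's own baseline rather than absolute differences between people.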
Submitted 26 July, 2022;
originally announced July 2022.
-
The ACM Multimedia 2022 Computational Paralinguistics Challenge: Vocalisations, Stuttering, Activity, & Mosquitoes
Authors:
Björn W. Schuller,
Anton Batliner,
Shahin Amiriparian,
Christian Bergler,
Maurice Gerczuk,
Natalie Holz,
Pauline Larrouy-Maestri,
Sebastian P. Bayerl,
Korbinian Riedhammer,
Adria Mallol-Ragolta,
Maria Pateraki,
Harry Coppock,
Ivan Kiskin,
Marianne Sinka,
Stephen Roberts
Abstract:
The ACM Multimedia 2022 Computational Paralinguistics Challenge addresses four different problems for the first time in a research competition under well-defined conditions: In the Vocalisations and Stuttering Sub-Challenges, a classification on human non-verbal vocalisations and speech has to be made; the Activity Sub-Challenge aims at beyond-audio human activity recognition from smartwatch sensor data; and in the Mosquitoes Sub-Challenge, mosquitoes need to be detected. We describe the Sub-Challenges, baseline feature extraction, and classifiers based on the usual ComParE and BoAW features, the auDeep toolkit, and deep feature extraction from pre-trained CNNs using the DeepSpectrum toolkit; in addition, we add end-to-end sequential modelling, and a log-mel-128-BNN.
Submitted 13 May, 2022;
originally announced May 2022.
-
Fatigue Prediction in Outdoor Running Conditions using Audio Data
Authors:
Andreas Triantafyllopoulos,
Sandra Ottl,
Alexander Gebhard,
Esther Rituerto-González,
Mirko Jaumann,
Steffen Hüttner,
Valerie Dieter,
Patrick Schneeweiß,
Inga Krauß,
Maurice Gerczuk,
Shahin Amiriparian,
Björn W. Schuller
Abstract:
Although running is a common leisure activity and a core training regimen for many athletes, between $29\%$ and $79\%$ of runners sustain an overuse injury each year. These injuries are linked to excessive fatigue, which alters how someone runs. In this work, we explore the feasibility of modelling the Borg rating of perceived exertion (RPE) scale (range: $[6, 20]$), a well-validated subjective measure of fatigue, using audio data captured in realistic outdoor environments via smartphones attached to the runners' arms. Using convolutional neural networks (CNNs) on log-Mel spectrograms, we obtain a mean absolute error of $2.35$ in subject-dependent experiments, demonstrating that audio can be effectively used to model fatigue, while being more easily and non-invasively acquired than signals from other sensors.
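For reference, the mean absolute error is the average magnitude of the difference between predicted and annotated ratings, expressed in the same units as the scale. The ratings below are invented for illustration and are not results from the study.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute deviation between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical exertion ratings on the 6 to 20 scale.
rpe_true = [11, 13, 15, 17]
rpe_pred = [12.0, 13.0, 13.0, 18.0]
mae = mean_absolute_error(rpe_true, rpe_pred)  # (1 + 0 + 2 + 1) / 4
```

An error of about 2.35 on a scale spanning 14 points therefore means predictions are, on average, a little over two scale steps away from the runner's own rating.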
Submitted 9 May, 2022;
originally announced May 2022.
-
A Summary of the ComParE COVID-19 Challenges
Authors:
Harry Coppock,
Alican Akman,
Christian Bergler,
Maurice Gerczuk,
Chloë Brown,
Jagmohan Chauhan,
Andreas Grammenos,
Apinan Hasthanasombat,
Dimitris Spathis,
Tong Xia,
Pietro Cicuta,
Jing Han,
Shahin Amiriparian,
Alice Baird,
Lukas Stappen,
Sandra Ottl,
Panagiotis Tzirakis,
Anton Batliner,
Cecilia Mascolo,
Björn W. Schuller
Abstract:
The COVID-19 pandemic has caused massive humanitarian and economic damage. Teams of scientists from a broad range of disciplines have searched for methods to help governments and communities combat the disease. One avenue from the machine learning field which has been explored is the prospect of a digital mass test which can detect COVID-19 from infected individuals' respiratory sounds. We present a summary of the results from the INTERSPEECH 2021 Computational Paralinguistics Challenges: COVID-19 Cough (CCS) and COVID-19 Speech (CSS).
Submitted 17 February, 2022;
originally announced February 2022.
-
A Machine Learning Framework for Automatic Prediction of Human Semen Motility
Authors:
Sandra Ottl,
Shahin Amiriparian,
Maurice Gerczuk,
Björn Schuller
Abstract:
In this paper, human semen samples from the VISEM dataset collected by the Simula Research Laboratory are automatically assessed with machine learning methods for their quality with respect to sperm motility. Several regression models are trained to automatically predict the percentage (0 to 100) of progressive, non-progressive, and immotile spermatozoa in a given sample. Three different feature extraction methods are applied to the video samples: custom movement statistics, displacement features, and motility-specific statistics. Furthermore, four machine learning models, namely a linear Support Vector Regressor (SVR), a Multilayer Perceptron (MLP), a Convolutional Neural Network (CNN), and a Recurrent Neural Network (RNN), have been trained on the extracted features for the task of automatic motility prediction. Best results for predicting motility are achieved by using the Crocker-Grier algorithm to track sperm cells in an unsupervised way and extracting individual mean squared displacement features for each detected track. These features are then aggregated into a histogram representation applying a Bag-of-Words approach. Finally, a linear SVR is trained on this feature representation. Compared to the best submission of the Medico Multimedia for Medicine challenge, which used the same dataset and splits, the Mean Absolute Error (MAE) could be reduced from 8.83 to 7.31. For the sake of reproducibility, we provide the source code for our experiments on GitHub.
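The best-performing pipeline rests on per-track mean squared displacement (MSD) values that are pooled into a histogram. The sketch below computes the MSD of a 2-D track at a given frame lag and bins the resulting values with fixed edges. The fixed-edge binning is a simplification chosen for illustration; a Bag-of-Words approach normally learns a codebook from the data, and the tracks shown are made up.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def mean_squared_displacement(track: Sequence[Point], lag: int) -> float:
    """Average squared distance between positions `lag` frames apart."""
    if lag < 1 or lag >= len(track):
        raise ValueError("lag must satisfy 1 <= lag < len(track)")
    squared = [
        (x2 - x1) ** 2 + (y2 - y1) ** 2
        for (x1, y1), (x2, y2) in zip(track, track[lag:])
    ]
    return sum(squared) / len(squared)


def normalised_histogram(values: Sequence[float], edges: Sequence[float]) -> List[float]:
    """Share of values per half-open bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for value in values:
        for i in range(len(counts)):
            if edges[i] <= value < edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Hypothetical tracks: one moving steadily, one stationary.
moving = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
still = [(5.0, 5.0)] * 4

msd_values = [mean_squared_displacement(t, lag=1) for t in (moving, still)]
sample_descriptor = normalised_histogram(msd_values, edges=[0.0, 0.5, 2.0])
```

The histogram gives every video sample a fixed-length descriptor regardless of how many tracks were detected, which is what allows a linear regressor to be trained on it.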
Submitted 17 September, 2021; v1 submitted 16 September, 2021;
originally announced September 2021.
-
DeepSpectrumLite: A Power-Efficient Transfer Learning Framework for Embedded Speech and Audio Processing from Decentralised Data
Authors:
Shahin Amiriparian,
Tobias Hübner,
Maurice Gerczuk,
Sandra Ottl,
Björn W. Schuller
Abstract:
Deep neural speech and audio processing systems have a large number of trainable parameters, a relatively complex architecture, and require a vast amount of training data and computational power. These constraints make it more challenging to integrate such systems into embedded devices and utilise them for real-time, real-world applications. We tackle these limitations by introducing DeepSpectrumLite, an open-source, lightweight transfer learning framework for on-device speech and audio recognition using pre-trained image convolutional neural networks (CNNs). The framework creates and augments Mel-spectrogram plots on-the-fly from raw audio signals which are then used to finetune specific pre-trained CNNs for the target classification task. Subsequently, the whole pipeline can be run in real-time with a mean inference lag of 242.0 ms when a DenseNet121 model is used on a consumer-grade Motorola moto e7 plus smartphone. DeepSpectrumLite operates decentralised, eliminating the need for data upload for further processing. By obtaining state-of-the-art results on a set of paralinguistics tasks, we demonstrate the suitability of the proposed transfer learning approach for embedded audio signal processing, even when data is scarce. We provide an extensive command-line interface for users and developers which is comprehensively documented and publicly available at https://github.com/DeepSpectrum/DeepSpectrumLite.
Submitted 23 April, 2021;
originally announced April 2021.
-
On the Impact of Word Error Rate on Acoustic-Linguistic Speech Emotion Recognition: An Update for the Deep Learning Era
Authors:
Shahin Amiriparian,
Artem Sokolov,
Ilhan Aslan,
Lukas Christ,
Maurice Gerczuk,
Tobias Hübner,
Dmitry Lamanov,
Manuel Milling,
Sandra Ottl,
Ilya Poduremennykh,
Evgeniy Shuranov,
Björn W. Schuller
Abstract:
Text encodings from automatic speech recognition (ASR) transcripts and audio representations have long shown promise in speech emotion recognition (SER). Yet, it is challenging to explain the effect of each information stream on the SER systems. Further, more clarification is required for analysing the impact of ASR's word error rate (WER) on linguistic emotion recognition per se and in the context of fusion with acoustic information in the age of deep ASR systems. In order to tackle the above issues, we create transcripts from the original speech by applying three modern ASR systems, including an end-to-end model trained with recurrent neural network-transducer loss, a model with connectionist temporal classification loss, and a wav2vec framework for self-supervised learning. Afterwards, we use pre-trained textual models to extract text representations from the ASR outputs and the gold standard. For extraction and learning of acoustic speech features, we utilise openSMILE, openXBoW, DeepSpectrum, and auDeep. Finally, we conduct decision-level fusion on both information streams -- acoustics and linguistics. Using the best development configuration, we achieve state-of-the-art unweighted average recall values of $73.6\,\%$ and $73.8\,\%$ on the speaker-independent development and test partitions of IEMOCAP, respectively.
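Decision-level (late) fusion combines the outputs of separately trained models rather than their features. The abstract does not state the fusion rule, so the weighted average of class probabilities below is an assumed, common choice shown only to make the idea concrete; the class order and probability values are invented.

```python
from typing import List, Sequence, Tuple


def fuse_decisions(acoustic: Sequence[float],
                   linguistic: Sequence[float],
                   weight: float = 0.5) -> Tuple[int, List[float]]:
    """Weighted average of two class-probability vectors, then argmax.

    `weight` is the share given to the acoustic stream (assumed parameter).
    """
    if len(acoustic) != len(linguistic):
        raise ValueError("both streams must score the same classes")
    fused = [weight * a + (1.0 - weight) * l for a, l in zip(acoustic, linguistic)]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused


# Hypothetical posteriors over three emotion classes for one utterance.
acoustic_probs = [0.6, 0.3, 0.1]
linguistic_probs = [0.2, 0.7, 0.1]
label, fused = fuse_decisions(acoustic_probs, linguistic_probs)
```

In this toy case the acoustic model alone would pick class 0, but the more confident linguistic model shifts the fused decision to class 1.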
Submitted 20 April, 2021;
originally announced April 2021.
-
EmoNet: A Transfer Learning Framework for Multi-Corpus Speech Emotion Recognition
Authors:
Maurice Gerczuk,
Shahin Amiriparian,
Sandra Ottl,
Björn Schuller
Abstract:
In this manuscript, the topic of multi-corpus Speech Emotion Recognition (SER) is approached from a deep transfer learning perspective. A large corpus of emotional speech data, EmoSet, is assembled from a number of existing SER corpora. In total, EmoSet contains 84181 audio recordings from 26 SER corpora with a total duration of over 65 hours. The corpus is then utilised to create a novel framework for multi-corpus speech emotion recognition, namely EmoNet. A combination of a deep ResNet architecture and residual adapters is transferred from the field of multi-domain visual recognition to multi-corpus SER on EmoSet. Compared against two suitable baselines and more traditional training and transfer settings for the ResNet, the residual adapter approach enables parameter efficient training of a multi-domain SER model on all 26 corpora. A shared model with only $3.5$ times the number of parameters of a model trained on a single database leads to increased performance for 21 of the 26 corpora in EmoSet. Measured by McNemar's test, these improvements are further significant for ten datasets at $p<0.05$ while there are just two corpora that see only significant decreases across the residual adapter transfer experiments. Finally, we make our EmoNet framework publicly available for users and developers at https://github.com/EIHW/EmoNet. EmoNet provides an extensive command line interface which is comprehensively documented and can be used in a variety of multi-corpus transfer learning settings.
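McNemar's test compares two classifiers evaluated on the same samples by looking only at the cases where exactly one of them is correct. A minimal sketch of the chi-squared form follows; the continuity correction is optional, the discordant counts are invented, and whether the paper used this exact variant is not stated in the abstract.

```python
import math
from typing import Tuple


def mcnemar(b: int, c: int, continuity: bool = True) -> Tuple[float, float]:
    """Chi-squared McNemar statistic and p-value (1 degree of freedom).

    b: samples model A classifies correctly and model B incorrectly.
    c: samples model B classifies correctly and model A incorrectly.
    """
    if b + c == 0:
        return 0.0, 1.0
    diff = abs(b - c) - (1 if continuity else 0)
    diff = max(diff, 0)
    statistic = diff ** 2 / (b + c)
    # Survival function of chi-squared with 1 dof: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical discordant counts between a baseline and an adapted model.
statistic, p_value = mcnemar(b=25, c=10)
significant = p_value < 0.05
```

Because only discordant pairs enter the statistic, two models with similar overall accuracy can still differ significantly if their errors fall on different samples.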
Submitted 10 March, 2021;
originally announced March 2021.
-
The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates
Authors:
Björn W. Schuller,
Anton Batliner,
Christian Bergler,
Cecilia Mascolo,
Jing Han,
Iulia Lefter,
Heysem Kaya,
Shahin Amiriparian,
Alice Baird,
Lukas Stappen,
Sandra Ottl,
Maurice Gerczuk,
Panagiotis Tzirakis,
Chloë Brown,
Jagmohan Chauhan,
Andreas Grammenos,
Apinan Hasthanasombat,
Dimitris Spathis,
Tong Xia,
Pietro Cicuta,
Leon J. M. Rothkrantz,
Joeri Zwerts,
Jelle Treep,
Casper Kaandorp
Abstract:
The INTERSPEECH 2021 Computational Paralinguistics Challenge addresses four different problems for the first time in a research competition under well-defined conditions: In the COVID-19 Cough and COVID-19 Speech Sub-Challenges, a binary classification on COVID-19 infection has to be made based on coughing sounds and speech; in the Escalation Sub-Challenge, a three-way assessment of the level of escalation in a dialogue is featured; and in the Primates Sub-Challenge, four species vs background need to be classified. We describe the Sub-Challenges, baseline feature extraction, and classifiers based on the 'usual' ComParE and BoAW features as well as deep unsupervised representation learning using the auDeep toolkit, and deep feature extraction from pre-trained CNNs using the DeepSpectrum toolkit; in addition, we add deep end-to-end sequential modelling, and partially linguistic analysis.
Submitted 24 February, 2021;
originally announced February 2021.
-
A Novel Fusion of Attention and Sequence to Sequence Autoencoders to Predict Sleepiness From Speech
Authors:
Shahin Amiriparian,
Pawel Winokurow,
Vincent Karas,
Sandra Ottl,
Maurice Gerczuk,
Björn W. Schuller
Abstract:
Motivated by the attention mechanism of the human visual system and recent developments in the field of machine translation, we introduce our attention-based and recurrent sequence to sequence autoencoders for fully unsupervised representation learning from audio files. In particular, we test the efficacy of our novel approach on the task of speech-based sleepiness recognition. We evaluate the learnt representations from both autoencoders, and then conduct an early fusion to ascertain possible complementarity between them. In our frameworks, we first extract Mel-spectrograms from raw audio files. Second, we train recurrent autoencoders on these spectrograms which are considered as time-dependent frequency vectors. Afterwards, we extract the activations of specific fully connected layers of the autoencoders which represent the learnt features of spectrograms for the corresponding audio instances. Finally, we train support vector regressors on these representations to obtain the predictions. On the development partition of the data, we achieve Spearman's correlation coefficients of .324, .283, and .320 with the targets on the Karolinska Sleepiness Scale by utilising attention and non-attention autoencoders, and the fusion of both autoencoders' representations, respectively. In the same order, we achieve .311, .359, and .367 Spearman's correlation coefficients on the test data, indicating the suitability of our proposed fusion strategy.
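Spearman's correlation coefficient, the metric reported above, is the Pearson correlation computed on ranks, which suits an ordinal target such as a sleepiness scale. The sketch below assigns average ranks to ties; the example values are invented and the helper names are illustrative.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical sleepiness ratings and model predictions.
targets = [3, 5, 5, 8]
predictions = [2.1, 4.0, 6.5, 7.2]
rho = spearman(targets, predictions)
```

Only the ordering of the predictions matters here, so a regressor that is consistently offset from the true ratings can still obtain a high coefficient.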
Submitted 19 May, 2020; v1 submitted 15 May, 2020;
originally announced May 2020.