-
Are you sure? Analysing Uncertainty Quantification Approaches for Real-world Speech Emotion Recognition
Authors:
Oliver Schrüfer,
Manuel Milling,
Felix Burkhardt,
Florian Eyben,
Björn Schuller
Abstract:
Uncertainty Quantification (UQ) is an important building block for the reliable use of neural networks in real-world scenarios, as it can be a useful tool in identifying faulty predictions. Speech emotion recognition (SER) models can suffer from particularly many sources of uncertainty, such as the ambiguity of emotions, Out-of-Distribution (OOD) data or, in general, poor recording conditions. Reliable UQ methods are thus of particular interest as in many SER applications no prediction is better than a faulty prediction. While the effects of label ambiguity on uncertainty are well documented in the literature, we focus our work on an evaluation of UQ methods for SER under common challenges in real-world application, such as corrupted signals, and the absence of speech. We show that simple UQ methods can already give an indication of the uncertainty of a prediction and that training with additional OOD data can greatly improve the identification of such signals.
Submitted 1 July, 2024;
originally announced July 2024.
-
Testing Speech Emotion Recognition Machine Learning Models
Authors:
Anna Derington,
Hagen Wierstorf,
Ali Özkil,
Florian Eyben,
Felix Burkhardt,
Björn W. Schuller
Abstract:
Machine learning models for speech emotion recognition (SER) can be trained for different tasks and are usually evaluated on the basis of a few available datasets per task. Tasks could include arousal, valence, dominance, emotional categories, or tone of voice. Those models are mainly evaluated in terms of correlation or recall, and always show some errors in their predictions. The errors manifest themselves in model behaviour, which can be very different along different dimensions even if the same recall or correlation is achieved by the model. This paper investigates the behaviour of speech emotion recognition models with a testing framework which requires models to fulfill conditions in terms of correctness, fairness, and robustness.
Submitted 11 December, 2023;
originally announced December 2023.
-
Going Retro: Astonishingly Simple Yet Effective Rule-based Prosody Modelling for Speech Synthesis Simulating Emotion Dimensions
Authors:
Felix Burkhardt,
Uwe Reichel,
Florian Eyben,
Björn Schuller
Abstract:
We introduce two rule-based models to modify the prosody of speech synthesis in order to modulate the emotion to be expressed. The prosody modulation is based on speech synthesis markup language (SSML) and can be used with any commercial speech synthesizer. The models as well as the optimization result are evaluated against human emotion annotations. Results indicate that with a very simple method both dimensions arousal (.76 UAR) and valence (.43 UAR) can be simulated.
Submitted 5 July, 2023;
originally announced July 2023.
-
Speech-based Age and Gender Prediction with Transformers
Authors:
Felix Burkhardt,
Johannes Wagner,
Hagen Wierstorf,
Florian Eyben,
Björn Schuller
Abstract:
We report on the curation of several publicly available datasets for age and gender prediction. Furthermore, we present experiments to predict age and gender with models based on a pre-trained wav2vec 2.0. Depending on the dataset, we achieve an MAE between 7.1 years and 10.8 years for age, and at least 91.1% ACC for gender (female, male, child). Compared to a modelling approach built on handcrafted features, our proposed system shows an improvement of 9% UAR for age and 4% UAR for gender. To make our findings reproducible, we release the best performing model to the community as well as the sample lists of the data splits.
Submitted 29 June, 2023;
originally announced June 2023.
-
Happy or Evil Laughter? Analysing a Database of Natural Audio Samples
Authors:
Aljoscha Düsterhöft,
Felix Burkhardt,
Björn W. Schuller
Abstract:
We conducted a data collection on the basis of the Google AudioSet database by selecting a subset of the samples annotated with "laughter". The selection criterion was the presence of a communicative act with a clear connotation of being either positive (laughing with) or negative (being laughed at). On the basis of this annotated data, we performed two experiments: on the one hand, we manually extracted and analyzed phonetic features. On the other hand, we conducted several machine learning experiments by systematically combining several automatically extracted acoustic feature sets with machine learning algorithms. This shows that the best performing models can achieve an unweighted average recall of .7.
Submitted 23 May, 2023;
originally announced May 2023.
-
audb -- Sharing and Versioning of Audio and Annotation Data in Python
Authors:
Hagen Wierstorf,
Johannes Wagner,
Florian Eyben,
Felix Burkhardt,
Björn W. Schuller
Abstract:
Driven by the need for larger and more diverse datasets to pre-train and fine-tune increasingly complex machine learning models, the number of datasets is rapidly growing. audb is an open-source Python library that supports versioning and documentation of audio datasets. It aims to provide a standardized and simple user-interface to publish, maintain, and access the annotations and audio files of a dataset. To efficiently store the data on a server, audb automatically resolves dependencies between versions of a dataset and only uploads newly added or altered files when a new version is published. The library supports partial loading of a dataset and local caching for fast access. audb is a lightweight library and can be interfaced from any machine learning library. It supports the management of datasets on a single PC, within a university or company, or within a whole research community.
Submitted 10 May, 2023; v1 submitted 1 March, 2023;
originally announced March 2023.
-
Dawn of the transformer era in speech emotion recognition: closing the valence gap
Authors:
Johannes Wagner,
Andreas Triantafyllopoulos,
Hagen Wierstorf,
Maximilian Schmitt,
Felix Burkhardt,
Florian Eyben,
Björn W. Schuller
Abstract:
Recent advances in transformer-based architectures which are pre-trained in a self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have shown limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT that we fine-tuned on the dimensions arousal, dominance, and valence of MSP-Podcast, while additionally using IEMOCAP and MOSI to test cross-corpus generalisation. To the best of our knowledge, we obtain the top performance for valence prediction without use of explicit linguistic information, with a concordance correlation coefficient (CCC) of .638 on MSP-Podcast. Furthermore, our investigations reveal that transformer-based architectures are more robust to small perturbations compared to a CNN-based baseline and fair with respect to biological sex groups, but not towards individual speakers. Finally, we are the first to show that their extraordinary success on valence is based on implicit linguistic information learnt during fine-tuning of the transformer layers, which explains why they perform on par with recent multimodal approaches that explicitly utilise textual information. Our findings collectively paint the following picture: transformer-based architectures constitute the new state-of-the-art in SER, but further advances are needed to mitigate remaining robustness and individual speaker issues. To make our findings reproducible, we release the best performing model to the community.
Submitted 7 September, 2023; v1 submitted 14 March, 2022;
originally announced March 2022.
-
White Paper on Critical and Massive Machine Type Communication Towards 6G
Authors:
Nurul Huda Mahmood,
Stefan Böcker,
Andrea Munari,
Federico Clazzer,
Ingrid Moerman,
Konstantin Mikhaylov,
Onel Lopez,
Ok-Sun Park,
Eric Mercier,
Hannes Bartz,
Riku Jäntti,
Ravikumar Pragada,
Yihua Ma,
Elina Annanperä,
Christian Wietfeld,
Martin Andraud,
Gianluigi Liva,
Yan Chen,
Eduardo Garro,
Frank Burkhardt,
Hirley Alves,
Chen-Feng Liu,
Yalcin Sadi,
Jean-Baptiste Dore,
Eunah Kim,
et al. (6 additional authors not shown)
Abstract:
The society as a whole, and many vertical sectors in particular, is becoming increasingly digitalized. Machine Type Communication (MTC), encompassing its massive and critical aspects, and ubiquitous wireless connectivity are among the main enablers of such digitization at large. The recently introduced 5G New Radio is natively designed to support both aspects of MTC to promote the digital transformation of the society. However, it is evident that some of the more demanding requirements cannot be fully supported by 5G networks. Alongside, further development of the society towards 2030 will give rise to new and more stringent requirements on wireless connectivity in general, and MTC in particular. Driven by the societal trends towards 2030, the next generation (6G) will be an agile and efficient convergent network serving a set of diverse service classes and a wide range of key performance indicators (KPI). This white paper explores the main drivers and requirements of an MTC-optimized 6G network, and discusses the following six key research questions:
- Will the main KPIs of 5G continue to be the dominant KPIs in 6G; or will there emerge new key metrics?
- How to deliver different E2E service mandates with different KPI requirements considering joint-optimization at the physical up to the application layer?
- What are the key enablers towards designing ultra-low power receivers and highly efficient sleep modes?
- How to tackle a disruptive rather than incremental joint design of a massively scalable waveform and medium access policy for global MTC connectivity?
- How to support new service classes characterizing mission-critical and dependable MTC in 6G?
- What are the potential enablers of long term, lightweight and flexible privacy and security schemes considering MTC device requirements?
Submitted 4 May, 2020; v1 submitted 29 April, 2020;
originally announced April 2020.
-
A Spatially Consistent Geometric D2D Small-Scale Fading Model for Multiple Frequencies
Authors:
Stephan Jaeckel,
Leszek Raschkowski,
Frank Burkhardt,
Lars Thiele
Abstract:
The 3GPP new radio (NR) channel model introduced spatial consistency and a correlation model for multiple frequencies. Future extensions of this model will incorporate mobility at both ends of the link. These features are essential for many emerging wireless technologies in the 5G era. However, the existing small-scale-fading (SSF) model does not integrate these features coherently. To solve this problem, we propose a new SSF model that seamlessly integrates with the remaining NR model and allows the simultaneous simulation of all three features. We demonstrate this integration by showing that the output of the new SSF model agrees well with large-scale fading (LSF) parameter distributions provided by 3GPP. This enables the simulation of new wireless technology proposals that were difficult to realize with existing geometry-based stochastic channel models (GSCMs).
Submitted 28 June, 2019;
originally announced June 2019.
-
Industrial Indoor Measurements from 2-6 GHz for the 3GPP-NR and QuaDRiGa Channel Model
Authors:
Stephan Jaeckel,
Nick Turay,
Leszek Raschkowski,
Lars Thiele,
Risto Vuohtoniemi,
Marko Sonkki,
Veikko Hovinen,
Frank Burkhardt,
Prasanth Karunakaran,
Thomas Heyn
Abstract:
Providing reliable low latency wireless links for advanced manufacturing and processing systems is a vision of Industry 4.0. Developing, testing and rating requires accurate models of the radio propagation channel. The current 3GPP-NR model as well as the QuaDRiGa model lack the propagation parameters for the industrial indoor scenario. To close this gap, measurements were conducted at 2.37 GHz and 5.4 GHz at operational Siemens premises in Nuremberg, Germany. Furthermore, the campaign was planned to allow the test and parameterization of new features of the QuaDRiGa channel model such as support for device-to-device (D2D) radio links and spatial consistency. A total of 5.9 km measurement track was used to extract the statistical model parameters for line of sight (LOS) and Non-LOS propagation conditions. It was found that the metallic walls and objects in the halls create a rich scattering environment, where a large number of multipath components arrive at the receiver from all directions. This leads to a robust communication link, provided that the transceivers can handle the interference. The extracted parameters can be used in geometric-stochastic channel models such as QuaDRiGa to support simulation studies, both on link and system level.
Submitted 28 June, 2019;
originally announced June 2019.
-
Efficient Sum-of-Sinusoids based Spatial Consistency for the 3GPP New-Radio Channel Model
Authors:
Stephan Jaeckel,
Leszek Raschkowski,
Frank Burkhardt,
Lars Thiele
Abstract:
Spatial consistency was proposed in the 3GPP TR 38.901 channel model to ensure that closely spaced mobile terminals have similar channels. Future extensions of this model might incorporate mobility at both ends of the link. This requires that all random variables in the model must be correlated in 3 (single-mobility) and up to 6 spatial dimensions (dual-mobility). Existing filtering methods cannot be used due to the large requirements of memory and computing time. The sum-of-sinusoids model promises to be an efficient solution. To use it in the 3GPP channel model, we extended the existing model to a higher number of spatial dimensions and propose a new method to calculate the sinusoid coefficients in order to control the shape of the autocorrelation function. The proposed method shows good results for 2, 3, and 6 dimensions and achieves a four times better approximation accuracy compared to the existing model. This provides a very efficient implementation of the 3GPP proposal and enables the simulation of many communication scenarios that were thought to be impossible to realize with geometry-based stochastic channel models.
Submitted 14 August, 2018;
originally announced August 2018.