Search | arXiv e-print repository

On Crowdsourcing-design with Comparison Category Rating for Evaluating Speech Enhancement Algorithms

Authors: Angélica S. Z. Suárez, Clément Laroche, Line H. Clemmensen, Sneha Das

Abstract: Speech enhancement techniques improve the quality or the intelligibility of an audio signal by removing unwanted noise. It is used as preprocessing in numerous applications such as speech recognition, hearing aids, broadcasting and telephony. The evaluation of such algorithms often relies on reference-based objective metrics that are shown to correlate poorly with human perception. In order to eva… ▽ More Speech enhancement techniques improve the quality or the intelligibility of an audio signal by removing unwanted noise. It is used as preprocessing in numerous applications such as speech recognition, hearing aids, broadcasting and telephony. The evaluation of such algorithms often relies on reference-based objective metrics that are shown to correlate poorly with human perception. In order to evaluate audio quality as perceived by human observers it is thus fundamental to resort to subjective quality assessment. In this paper, a user evaluation based on crowdsourcing (subjective) and the Comparison Category Rating (CCR) method is compared against the DNSMOS, ViSQOL and 3QUEST (objective) metrics. The overall quality scores of three speech enhancement algorithms from real time communications (RTC) are used in the comparison using the P.808 toolkit. Results indicate that while the CCR scale allows participants to identify differences between processed and unprocessed audio samples, two groups of preferences emerge: some users rate positively by focusing on noise suppression processing, while others rate negatively by focusing mainly on speech quality. We further present results on the parameters, size considerations and speaker variations that are critical and should be considered when designing the CCR-based crowdsourcing evaluation. △ Less

Submitted 2 June, 2023; originally announced June 2023.

Comments: Published at ICASSP 2023

arXiv:2204.11550 [pdf, other]

Speech Detection For Child-Clinician Conversations In Danish For Low-Resource In-The-Wild Conditions: A Case Study

Authors: Sneha Das, Nicole Nadine Lønfeldt, Anne Katrine Pagsberg, Line. H. Clemmensen

Abstract: Use of speech models for automatic speech processing tasks can improve efficiency in the screening, analysis, diagnosis and treatment in medicine and psychiatry. However, the performance of pre-processing speech tasks like segmentation and diarization can drop considerably on in-the-wild clinical data, specifically when the target dataset comprises of atypical speech. In this paper we study the pe… ▽ More Use of speech models for automatic speech processing tasks can improve efficiency in the screening, analysis, diagnosis and treatment in medicine and psychiatry. However, the performance of pre-processing speech tasks like segmentation and diarization can drop considerably on in-the-wild clinical data, specifically when the target dataset comprises of atypical speech. In this paper we study the performance of a pre-trained speech model on a dataset comprising of child-clinician conversations in Danish with respect to the classification threshold. Since we do not have access to sufficient labelled data, we propose few-instance threshold adaptation, wherein we employ the first minutes of the speech conversation to obtain the optimum classification threshold. Through our work in this paper, we learned that the model with default classification threshold performs worse on children from the patient group. Furthermore, the error rates of the model is directly correlated to the severity of diagnosis in the patients. Lastly, our study on few-instance adaptation shows that three-minutes of clinician-child conversation is sufficient to obtain the optimum classification threshold. △ Less

Submitted 25 April, 2022; originally announced April 2022.

Comments: 5 pages. Submitted to Interspeech 2022

arXiv:2203.14867 [pdf, other]

Continuous Metric Learning For Transferable Speech Emotion Recognition and Embedding Across Low-resource Languages

Authors: Sneha Das, Nicklas Leander Lund, Nicole Nadine Lønfeldt, Anne Katrine Pagsberg, Line H. Clemmensen

Abstract: Speech emotion recognition~(SER) refers to the technique of inferring the emotional state of an individual from speech signals. SERs continue to garner interest due to their wide applicability. Although the domain is mainly founded on signal processing, machine learning, and deep learning, generalizing over languages continues to remain a challenge. However, develo** generalizable and transferab… ▽ More Speech emotion recognition~(SER) refers to the technique of inferring the emotional state of an individual from speech signals. SERs continue to garner interest due to their wide applicability. Although the domain is mainly founded on signal processing, machine learning, and deep learning, generalizing over languages continues to remain a challenge. However, develo** generalizable and transferable models are critical due to a lack of sufficient resources in terms of data and labels for languages beyond the most commonly spoken ones. To improve performance over languages, we propose a denoising autoencoder with semi-supervision using a continuous metric loss based on either activation or valence. The novelty of this work lies in our proposal of continuous metric learning, which is among the first proposals on the topic to the best of our knowledge. Furthermore, to address the lack of activation and valence labels in the transfer datasets, we annotate the signal samples with activation and valence levels corresponding to a dimensional model of emotions, which were then used to evaluate the quality of the embedding over the transfer datasets. We show that the proposed semi-supervised model consistently outperforms the baseline unsupervised method, which is a conventional denoising autoencoder, in terms of emotion classification accuracy as well as correlation with respect to the dimensional variables. Further evaluation of classification accuracy with respect to the reference, a BERT based speech representation model, shows that the proposed method is comparable to the reference method in classifying specific emotion classes at a much lower complexity. △ Less

Submitted 28 March, 2022; originally announced March 2022.

Comments: Preprint of paper accepted to be presented at the Northern Lights Deep Learning Conference (NLDL), 2022. The labels are available at: https://bit.ly/3rg6VsA

arXiv:2203.14865 [pdf, other]

Towards Transferable Speech Emotion Representation: On loss functions for cross-lingual latent representations

Authors: Sneha Das, Nicole Nadine Lønfeldt, Anne Katrine Pagsberg, Line H. Clemmensen

Abstract: In recent years, speech emotion recognition (SER) has been used in wide ranging applications, from healthcare to the commercial sector. In addition to signal processing approaches, methods for SER now also use deep learning techniques which provide transfer learning possibilities. However, generalizing over languages, corpora and recording conditions is still an open challenge. In this work we add… ▽ More In recent years, speech emotion recognition (SER) has been used in wide ranging applications, from healthcare to the commercial sector. In addition to signal processing approaches, methods for SER now also use deep learning techniques which provide transfer learning possibilities. However, generalizing over languages, corpora and recording conditions is still an open challenge. In this work we address this gap by exploring loss functions that aid in transferability, specifically to non-tonal languages. We propose a variational autoencoder (VAE) with KL annealing and a semi-supervised VAE to obtain more consistent latent embedding distributions across data sets. To ensure transferability, the distribution of the latent embedding should be similar across non-tonal languages (data sets). We start by presenting a low-complexity SER based on a denoising-autoencoder, which achieves an unweighted classification accuracy of over 52.09% for four-class emotion classification. This performance is comparable to that of similar baseline methods. Following this, we employ a VAE, the semi-supervised VAE and the VAE with KL annealing to obtain a more regularized latent space. We show that while the DAE has the highest classification accuracy among the methods, the semi-supervised VAE has a comparable classification accuracy and a more consistent latent embedding distribution over data sets. △ Less

Submitted 28 March, 2022; originally announced March 2022.

Comments: Preprint of paper accepted to be presented at the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022. Source code at https://bit.ly/34CgkSZ. arXiv admin note: text overlap with arXiv:2105.02055

arXiv:2203.07033 [pdf, other]

doi 10.7557/18.6282

Compressing CNN Kernels for Videos Using Tucker Decompositions: Towards Lightweight CNN Applications

Authors: Tobias Engelhardt Rasmussen, Line H Clemmensen, Andreas Baum

Abstract: Convolutional Neural Networks (CNN) are the state-of-the-art in the field of visual computing. However, a major problem with CNNs is the large number of floating point operations (FLOPs) required to perform convolutions for large inputs. When considering the application of CNNs to video data, convolutional filters become even more complex due to the extra temporal dimension. This leads to problems… ▽ More Convolutional Neural Networks (CNN) are the state-of-the-art in the field of visual computing. However, a major problem with CNNs is the large number of floating point operations (FLOPs) required to perform convolutions for large inputs. When considering the application of CNNs to video data, convolutional filters become even more complex due to the extra temporal dimension. This leads to problems when respective applications are to be deployed on mobile devices, such as smart phones, tablets, micro-controllers or similar, indicating less computational power. Kim et al. (2016) proposed using a Tucker-decomposition to compress the convolutional kernel of a pre-trained network for images in order to reduce the complexity of the network, i.e. the number of FLOPs. In this paper, we generalize the aforementioned method for application to videos (and other 3D signals) and evaluate the proposed method on a modified version of the THETIS data set, which contains videos of individuals performing tennis shots. We show that the compressed network reaches comparable accuracy, while indicating a memory compression by a factor of 51. However, the actual computational speed-up (factor 1.4) does not meet our theoretically derived expectation (factor 6). △ Less

Submitted 10 March, 2022; originally announced March 2022.

Comments: Presented at the Northern Lights Deep Learning Conference 2022 in Tromsø, Norway

arXiv:2203.04706 [pdf, other]

Data Representativity for Machine Learning and AI Systems

Authors: Line H. Clemmensen, Rune D. Kjærsgaard

Abstract: Data representativity is crucial when drawing inference from data through machine learning models. Scholars have increased focus on unraveling the bias and fairness in models, also in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper reviews definitions and notions of a repre… ▽ More Data representativity is crucial when drawing inference from data through machine learning models. Scholars have increased focus on unraveling the bias and fairness in models, also in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper reviews definitions and notions of a representative sample and surveys their use in scientific AI literature. We introduce three measurable concepts to help focus the notions and evaluate different data samples. Furthermore, we demonstrate that the contrast between a representative sample in the sense of coverage of the input space, versus a representative sample mimicking the distribution of the target population is of particular relevance when building AI systems. Through empirical demonstrations on US Census data, we evaluate the opposing inherent qualities of these concepts. Finally, we propose a framework of questions for creating and documenting data with data representativity in mind, as an addition to existing dataset documentation templates. △ Less

Submitted 3 February, 2023; v1 submitted 9 March, 2022; originally announced March 2022.

arXiv:2105.02055 [pdf, other]

Towards Interpretable and Transferable Speech Emotion Recognition: Latent Representation Based Analysis of Features, Methods and Corpora

Authors: Sneha Das, Nicole Nadine Lønfeldt, Anne Katrine Pagsberg, Line H. Clemmensen

Abstract: In recent years, speech emotion recognition (SER) has been used in wide ranging applications, from healthcare to the commercial sector. In addition to signal processing approaches, methods for SER now also use deep learning techniques. However, generalizing over languages, corpora and recording conditions is still an open challenge in the field. Furthermore, due to the black-box nature of deep lea… ▽ More In recent years, speech emotion recognition (SER) has been used in wide ranging applications, from healthcare to the commercial sector. In addition to signal processing approaches, methods for SER now also use deep learning techniques. However, generalizing over languages, corpora and recording conditions is still an open challenge in the field. Furthermore, due to the black-box nature of deep learning algorithms, a newer challenge is the lack of interpretation and transparency in the models and the decision making process. This is critical when the SER systems are deployed in applications that influence human lives. In this work we address this gap by providing an in-depth analysis of the decision making process of the proposed SER system. Towards that end, we present low-complexity SER based on undercomplete- and denoising- autoencoders that achieve an average classification accuracy of over 55\% for four-class emotion classification. Following this, we investigate the clustering of emotions in the latent space to understand the influence of the corpora on the model behavior and to obtain a physical interpretation of the latent embedding. Lastly, we explore the role of each input feature towards the performance of the SER. △ Less

Submitted 5 May, 2021; originally announced May 2021.

arXiv:2006.01671 [pdf, other]

A generalized linear joint trained framework for semi-supervised learning of sparse features

Authors: Juan C. Laria, Line H. Clemmensen, Bjarne K. Ersbøll

Abstract: The elastic-net is among the most widely used types of regularization algorithms, commonly associated with the problem of supervised generalized linear model estimation via penalized maximum likelihood. Its nice properties originate from a combination of $\ell_1$ and $\ell_2$ norms, which endow this method with the ability to select variables taking into account the correlations between them. In t… ▽ More The elastic-net is among the most widely used types of regularization algorithms, commonly associated with the problem of supervised generalized linear model estimation via penalized maximum likelihood. Its nice properties originate from a combination of $\ell_1$ and $\ell_2$ norms, which endow this method with the ability to select variables taking into account the correlations between them. In the last few years, semi-supervised approaches, that use both labeled and unlabeled data, have become an important component in the statistical research. Despite this interest, however, few researches have investigated semi-supervised elastic-net extensions. This paper introduces a novel solution for semi-supervised learning of sparse features in the context of generalized linear model estimation: the generalized semi-supervised elastic-net (s2net), which extends the supervised elastic-net method, with a general mathematical formulation that covers, but is not limited to, both regression and classification problems. We develop a flexible and fast implementation for s2net in R, and its advantages are illustrated using both real and synthetic data sets. △ Less

Submitted 2 October, 2020; v1 submitted 2 June, 2020; originally announced June 2020.

arXiv:1605.09196 [pdf, other]

Forest Floor Visualizations of Random Forests

Authors: Soeren H. Welling, Hanne H. F. Refsgaard, Per B. Brockhoff, Line H. Clemmensen

Abstract: We propose a novel methodology, forest floor, to visualize and interpret random forest (RF) models. RF is a popular and useful tool for non-linear multi-variate classification and regression, which yields a good trade-off between robustness (low variance) and adaptiveness (low bias). Direct interpretation of a RF model is difficult, as the explicit ensemble model of hundreds of deep trees is compl… ▽ More We propose a novel methodology, forest floor, to visualize and interpret random forest (RF) models. RF is a popular and useful tool for non-linear multi-variate classification and regression, which yields a good trade-off between robustness (low variance) and adaptiveness (low bias). Direct interpretation of a RF model is difficult, as the explicit ensemble model of hundreds of deep trees is complex. Nonetheless, it is possible to visualize a RF model fit by its map** from feature space to prediction space. Hereby the user is first presented with the overall geometrical shape of the model structure, and when needed one can zoom in on local details. Dimensional reduction by projection is used to visualize high dimensional shapes. The traditional method to visualize RF model structure, partial dependence plots, achieve this by averaging multiple parallel projections. We suggest to first use feature contributions, a method to decompose trees by splitting features, and then subsequently perform projections. The advantages of forest floor over partial dependence plots is that interactions are not masked by averaging. As a consequence, it is possible to locate interactions, which are not visualized in a given projection. Furthermore, we introduce: a goodness-of-visualization measure, use of colour gradients to identify interactions and an out-of-bag cross validated variant of feature contributions. △ Less

Submitted 4 July, 2016; v1 submitted 30 May, 2016; originally announced May 2016.

Comments: 25 pages, 12 figures, supplementary materials. v2->v3: minor proofing, moderated comments on ICE-plots, replaced ψ-operator with the subset named H in equation 13 and 14 to improve simplicity

Showing 1–9 of 9 results for author: Clemmensen, L H