-
Large Language Models for Dysfluency Detection in Stuttered Speech
Authors:
Dominik Wagner,
Sebastian P. Bayerl,
Ilja Baumann,
Korbinian Riedhammer,
Elmar Nöth,
Tobias Bocklet
Abstract:
Accurately detecting dysfluencies in spoken language can help to improve the performance of automatic speech and language processing components and support the development of more inclusive speech and language technologies. Inspired by the recent trend towards the deployment of large language models (LLMs) as universal learners and processors of non-lexical inputs, such as audio and video, we approach the task of multi-label dysfluency detection as a language modeling problem. We present hypothesis candidates generated with an automatic speech recognition system and acoustic representations extracted from an audio encoder model to an LLM, and finetune the system to predict dysfluency labels on three datasets containing English and German stuttered speech. The experimental results show that our system effectively combines acoustic and lexical information and achieves competitive results on the multi-label stuttering detection task.
Submitted 16 June, 2024;
originally announced June 2024.
-
Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation Models
Authors:
Dominik Wagner,
Ilja Baumann,
Korbinian Riedhammer,
Tobias Bocklet
Abstract:
This paper explores the improvement of post-training quantization (PTQ) after knowledge distillation in the Whisper speech foundation model family. We address the challenge of outliers in weights and activation tensors, known to impede quantization quality in transformer-based language and vision models. Extending this observation to Whisper, we demonstrate that these outliers are also present when transformer-based models are trained to perform automatic speech recognition, necessitating mitigation strategies for PTQ. We show that outliers can be reduced by a recently proposed gating mechanism in the attention blocks of the student model, enabling effective 8-bit quantization, and lower word error rates compared to student models without the gating mechanism in place.
Submitted 16 June, 2024;
originally announced June 2024.
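The outlier problem named in the abstract above can be illustrated with a minimal sketch. It assumes plain symmetric per-tensor "absmax" 8-bit quantization, which is a common textbook scheme and not necessarily the one used in the paper; the function names, the synthetic activations, and the outlier value of 200 are all hypothetical.

```python
import numpy as np

def quantize_absmax_int8(x):
    """Symmetric per-tensor 8-bit quantization (assumed scheme, not the paper's)."""
    scale = np.max(np.abs(x)) / 127.0          # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the int8 codes back to floating point."""
    return q.astype(np.float32) * scale

def mean_abs_error(x):
    """Average reconstruction error after a quantize/dequantize round trip."""
    q, scale = quantize_absmax_int8(x)
    return float(np.mean(np.abs(x - dequantize(q, scale))))

rng = np.random.default_rng(0)
activations = rng.normal(0.0, 1.0, size=4096).astype(np.float32)  # synthetic data

with_outlier = activations.copy()
with_outlier[0] = 200.0                        # a single hypothetical outlier

err_clean = mean_abs_error(activations)
err_outlier = mean_abs_error(with_outlier)
print(f"clean: {err_clean:.4f}  with outlier: {err_outlier:.4f}")
```

Because the scale is set by the largest magnitude, one extreme value stretches the quantization grid and most ordinary values collapse onto a handful of levels. Reducing outliers, as the abstract describes, keeps the grid fine enough for 8-bit storage.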
-
Optimized Speculative Sampling for GPU Hardware Accelerators
Authors:
Dominik Wagner,
Seanie Lee,
Ilja Baumann,
Philipp Seeberger,
Korbinian Riedhammer,
Tobias Bocklet
Abstract:
In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. Additionally, we use fast on-chip memory to store intermediate results, thereby minimizing the frequency of slow read and write operations across different types of memory. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a slight decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.
Submitted 16 June, 2024;
originally announced June 2024.
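For orientation, the accept/reject step at the core of speculative sampling can be sketched as follows. This is the standard rule from the speculative sampling literature written in NumPy, not the GPU kernel or the sigmoid approximation described in the abstract above; the function name `speculative_step` and its arguments are hypothetical. The element-wise ratio and the clipped difference `max(p - q, 0)` are examples of intermediate quantities that can be computed independently per vocabulary entry.

```python
import numpy as np

def speculative_step(p_target, q_draft, draft_token, rng):
    """Accept or replace one token proposed by the draft model.

    p_target, q_draft: probability vectors over the vocabulary.
    draft_token: index sampled from q_draft (so q_draft[draft_token] > 0).
    Returns (token, accepted).
    """
    p = np.asarray(p_target, dtype=np.float64)
    q = np.asarray(q_draft, dtype=np.float64)

    # Accept the drafted token with probability min(1, p(x) / q(x)).
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return draft_token, True

    # On rejection, resample from the normalized residual max(p - q, 0).
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return int(rng.choice(len(p), p=residual)), False

rng = np.random.default_rng(0)
token, accepted = speculative_step([0.7, 0.2, 0.1], [0.5, 0.3, 0.2], 0, rng)
```

In the usage line the target model assigns the drafted token a higher probability than the draft model does, so the token is always accepted.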
-
A Survey of Music Generation in the Context of Interaction
Authors:
Ismael Agchar,
Ilja Baumann,
Franziska Braun,
Paula Andrea Perez-Toro,
Korbinian Riedhammer,
Sebastian Trump,
Martin Ullrich
Abstract:
In recent years, machine learning, and in particular generative adversarial neural networks (GANs) and attention-based neural networks (transformers), have been successfully used to compose and generate music, both melodies and polyphonic pieces. Current research focuses foremost on style replication (e.g., generating a Bach-style chorale) or style transfer (e.g., classical to jazz) based on large amounts of recorded or transcribed music, which in turn also allows for fairly straightforward "performance" evaluation. However, most of these models are not suitable for human-machine co-creation through live interaction, nor is it clear how such models and the resulting creations would be evaluated. This article presents a thorough review of music representation, feature analysis, heuristic algorithms, statistical and parametric modelling, and human and automatic evaluation measures, along with a discussion of which approaches and models seem most suitable for live interaction.
Submitted 23 February, 2024;
originally announced February 2024.
-
Vocoder-Free Non-Parallel Conversion of Whispered Speech With Masked Cycle-Consistent Generative Adversarial Networks
Authors:
Dominik Wagner,
Ilja Baumann,
Tobias Bocklet
Abstract:
Cycle-consistent generative adversarial networks have been widely used in non-parallel voice conversion (VC). Their ability to learn mappings between source and target features without relying on parallel training data eliminates the need for temporal alignments. However, most methods decouple the conversion of acoustic features from synthesizing the audio signal by using separate models for conversion and waveform synthesis. This work unifies conversion and synthesis into a single model, thereby eliminating the need for a separate vocoder. By leveraging cycle-consistent training and a self-supervised auxiliary training task, our model is able to efficiently generate converted high-quality raw audio waveforms. Subjective listening tests show that our method outperforms the baseline in whispered speech conversion (up to 6.7% relative improvement), and mean opinion score predictions yield competitive results in conventional VC (between 0.5% and 2.4% relative improvement).
Submitted 10 June, 2023;
originally announced June 2023.
-
A Stutter Seldom Comes Alone -- Cross-Corpus Stuttering Detection as a Multi-label Problem
Authors:
Sebastian P. Bayerl,
Dominik Wagner,
Ilja Baumann,
Florian Hönig,
Tobias Bocklet,
Elmar Nöth,
Korbinian Riedhammer
Abstract:
Most stuttering detection and classification research has viewed stuttering as a multi-class classification problem or a binary detection task for each dysfluency type; however, this does not match the nature of stuttering, in which one dysfluency seldom comes alone but rather co-occurs with others. This paper explores multi-language and cross-corpus end-to-end stuttering detection as a multi-label problem using a modified wav2vec 2.0 system with an attention-based classification head and multi-task learning. We evaluate the method using combinations of three datasets containing English and German stuttered speech, one containing speech modified by fluency shaping. The experimental results and an error analysis show that multi-label stuttering detection systems trained on cross-corpus and multi-language data achieve competitive results, but performance on samples with multiple labels stays below overall detection results.
Submitted 30 May, 2023;
originally announced May 2023.
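The difference between the multi-class and multi-label views contrasted in the abstract above can be sketched at the level of the output decision. This is a generic illustration only: the label names, the logits, and the 0.5 threshold are hypothetical, and the paper's wav2vec 2.0 system with its attention-based head is not reproduced here.

```python
import numpy as np

# Hypothetical label set, chosen for illustration only.
LABELS = ["block", "prolongation", "sound_repetition",
          "word_repetition", "interjection"]

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def multi_label_decision(logits, threshold=0.5):
    """Score every label independently, so several can fire at once."""
    probs = sigmoid(np.asarray(logits, dtype=np.float64))
    return [label for label, p in zip(LABELS, probs) if p >= threshold]

def multi_class_decision(logits):
    """Pick exactly one label, which hides co-occurring dysfluencies."""
    return LABELS[int(np.argmax(logits))]

logits = [2.0, -1.5, 1.2, -3.0, -0.4]   # made-up scores for one audio clip
print(multi_label_decision(logits))      # ['block', 'sound_repetition']
print(multi_class_decision(logits))      # block
```

With these made-up scores the multi-label decision reports both a block and a sound repetition, while the multi-class decision keeps only the highest-scoring label.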
-
Speaker Adaptation for End-To-End Speech Recognition Systems in Noisy Environments
Authors:
Dominik Wagner,
Ilja Baumann,
Sebastian P. Bayerl,
Korbinian Riedhammer,
Tobias Bocklet
Abstract:
We analyze the impact of speaker adaptation in end-to-end automatic speech recognition models based on transformers and wav2vec 2.0 under different noise conditions. By including speaker embeddings obtained from x-vector and ECAPA-TDNN systems, as well as i-vectors, we achieve relative word error rate improvements of up to 16.3% on LibriSpeech and up to 14.5% on Switchboard. We show that the proven method of concatenating speaker vectors to the acoustic features and supplying them as auxiliary model inputs remains a viable option to increase the robustness of end-to-end architectures. The effect on transformer models is stronger when more noise is added to the input speech. The most substantial benefits for systems based on wav2vec 2.0 are achieved under moderate or no noise conditions. Both x-vectors and ECAPA-TDNN embeddings outperform i-vectors as speaker representations. The optimal embedding size depends on the dataset and also varies with the noise condition.
Submitted 7 December, 2023; v1 submitted 16 November, 2022;
originally announced November 2022.
-
Influence of Utterance and Speaker Characteristics on the Classification of Children with Cleft Lip and Palate
Authors:
Ilja Baumann,
Dominik Wagner,
Franziska Braun,
Sebastian P. Bayerl,
Elmar Nöth,
Korbinian Riedhammer,
Tobias Bocklet
Abstract:
Recent findings show that pre-trained wav2vec 2.0 models are reliable feature extractors for various speaker characteristics classification tasks. We show that latent representations extracted at different layers of a pre-trained wav2vec 2.0 system can be used as features for binary classification to distinguish between children with Cleft Lip and Palate (CLP) and a healthy control group. The results indicate that the distinction between CLP and healthy voices, especially with latent representations from the lower and middle encoder layers, reaches an accuracy of 100%. We test the classifier to find influencing factors for classification using unseen out-of-domain healthy and pathologic corpora with varying characteristics: age, spoken content, and acoustic conditions. Cross-pathology and cross-healthy tests reveal that the trained classifiers are unreliable if there is a mismatch between training and out-of-domain test data in, e.g., age, spoken content, or acoustic conditions.
Submitted 1 August, 2023; v1 submitted 28 October, 2022;
originally announced October 2022.
-
Multi-class Detection of Pathological Speech with Latent Features: How does it perform on unseen data?
Authors:
Dominik Wagner,
Ilja Baumann,
Franziska Braun,
Sebastian P. Bayerl,
Elmar Nöth,
Korbinian Riedhammer,
Tobias Bocklet
Abstract:
The detection of pathologies from speech features is usually defined as a binary classification task with one class representing a specific pathology and the other class representing healthy speech. In this work, we train neural networks, large margin classifiers, and tree boosting machines to distinguish between four pathologies: Parkinson's disease, laryngeal cancer, cleft lip and palate, and oral squamous cell carcinoma. We show that latent representations extracted at different layers of a pre-trained wav2vec 2.0 system can be effectively used to classify these types of pathological voices. We evaluate the robustness of our classifiers by adding room impulse responses to the test data and by applying them to unseen speech corpora. Our approach achieves unweighted average F1-Scores between 74.1% and 97.0%, depending on the model and the noise conditions used. The systems generalize and perform well on unseen data of healthy speakers sampled from a variety of different sources.
Submitted 1 August, 2023; v1 submitted 27 October, 2022;
originally announced October 2022.
-
Nonwords Pronunciation Classification in Language Development Tests for Preschool Children
Authors:
Ilja Baumann,
Dominik Wagner,
Sebastian Bayerl,
Tobias Bocklet
Abstract:
This work aims to automatically evaluate whether the language development of children is age-appropriate. Validated speech and language tests are used for this purpose to test the auditory memory. In this work, the task is to determine whether spoken nonwords have been uttered correctly. We compare different approaches that are motivated to model specific language structures: Low-level features (FFT), speaker embeddings (ECAPA-TDNN), grapheme-motivated embeddings (wav2vec 2.0), and phonetic embeddings in the form of senones (ASR acoustic model). Each of the approaches provides input for VGG-like 5-layer CNN classifiers. We also examine the adaptation per nonword. The evaluation of the proposed systems was performed using recordings from different kindergartens of spoken nonwords. ECAPA-TDNN and low-level FFT features do not explicitly model phonetic information; wav2vec 2.0 is trained on grapheme labels, and our ASR acoustic model features contain (sub-)phonetic information. We found that the more granular the phonetic modeling is, the higher the achieved recognition rates. The best system trained on ASR acoustic model features with VTLN achieved an accuracy of 89.4% and an area under the ROC (Receiver Operating Characteristic) curve (AUC) of 0.923. This corresponds to an improvement in accuracy of 20.2% and AUC of 0.309 relative to the FFT baseline.
Submitted 17 June, 2022; v1 submitted 16 June, 2022;
originally announced June 2022.
-
Detecting Vocal Fatigue with Neural Embeddings
Authors:
Sebastian P. Bayerl,
Dominik Wagner,
Ilja Baumann,
Korbinian Riedhammer,
Tobias Bocklet
Abstract:
Vocal fatigue refers to the feeling of tiredness and weakness of voice due to extended utilization. This paper investigates the effectiveness of neural embeddings for the detection of vocal fatigue. We compare x-vectors, ECAPA-TDNN, and wav2vec 2.0 embeddings on a corpus of academic spoken English. Low-dimensional mappings of the data reveal that neural embeddings capture information about the change in vocal characteristics of a speaker during prolonged voice usage. We show that vocal fatigue can be reliably predicted using all three kinds of neural embeddings after only 50 minutes of continuous speaking when temporal smoothing and normalization are applied to the extracted embeddings. We employ support vector machines for classification and achieve accuracy scores of 81% using x-vectors, 85% using ECAPA-TDNN embeddings, and 82% using wav2vec 2.0 embeddings as input features. We obtain an accuracy score of 76% when the trained system is applied to a different speaker and recording environment without any adaptation.
Submitted 17 January, 2023; v1 submitted 7 April, 2022;
originally announced April 2022.
-
On the size distribution of sunspot groups in the Greenwich sunspot record 1874-1976
Authors:
I. Baumann,
S. K. Solanki
Abstract:
We investigate the size distribution of the maximum areas and the instantaneous distribution of areas of sunspot groups using the Greenwich sunspot group record spanning the interval 1874-1976. Both distributions are found to be well described by log-normal functions. Using a simple model we can transform the maximum area distribution into the instantaneous area distribution if the sunspot area decay rates are also distributed log-normally. For single-valued decay rates the resulting snapshot distribution is incompatible with the observations. The current analysis therefore supports the results of Howard (1992) and Martinez Pillet (1993). With the employed data set, however, it is not possible to distinguish between a linear and a quadratic decay law.
Submitted 18 October, 2005;
originally announced October 2005.
-
A necessary extension of the surface flux transport model
Authors:
I. Baumann,
D. Schmitt,
M. Schuessler
Abstract:
Customary two-dimensional flux transport models for the evolution of the magnetic field at the solar surface do not account for the radial structure and the volume diffusion of the magnetic field. When considering the long-term evolution of magnetic flux, this omission can lead to an unrealistic long-term memory of the system and to the suppression of polar field reversals. In order to avoid such effects, we propose an extension of the flux transport model by a linear decay term derived consistently on the basis of the eigenmodes of the diffusion operator in a spherical shell. A decay rate for each eigenmode of the system is determined and applied to the corresponding surface part of the mode evolved in the flux transport model. The value of the volume diffusivity associated with this decay term can be estimated to be in the range 50-100 km$^2$/s by considering the reversals of the polar fields in comparison of flux transport simulations with observations. We show that the decay term prohibits a secular drift of the polar field in the case of cycles of varying strength, like those exhibited by the historical sunspot record.
Submitted 11 October, 2005;
originally announced October 2005.
-
Pressure dependence of the Mg $3s4s^3S_1 \to 3s3p^3P_{0,1,2}$ transition in superfluid $^4$He
Authors:
I. Baumann,
A. Breidenassel,
C. Zühlke,
A. Kasimov,
G. zu Putlitz,
I. Reinhard,
K. Jungmann
Abstract:
The pressure shifts of the $3s4s^3S_1 \to 3s3p^3P_{0,1,2}$ transition of magnesium atoms immersed in superfluid helium have been measured at $(1.3\pm0.1)$ K between saturated vapour pressure and 24 bar. The wavelength is blue shifted linearly by $(0.07\pm0.01)$ nm/bar. This value can be satisfactorily described in the framework of the standard bubble model.
Submitted 28 February, 2000;
originally announced February 2000.