Large Language Models for Dysfluency Detection in Stuttered Speech

Authors: Dominik Wagner, Sebastian P. Bayerl, Ilja Baumann, Korbinian Riedhammer, Elmar Nöth, Tobias Bocklet

Abstract: Accurately detecting dysfluencies in spoken language can help to improve the performance of automatic speech and language processing components and support the development of more inclusive speech and language technologies. Inspired by the recent trend towards the deployment of large language models (LLMs) as universal learners and processors of non-lexical inputs, such as audio and video, we approach the task of multi-label dysfluency detection as a language modeling problem. We present hypothesis candidates generated with an automatic speech recognition system and acoustic representations extracted from an audio encoder model to an LLM, and finetune the system to predict dysfluency labels on three datasets containing English and German stuttered speech. The experimental results show that our system effectively combines acoustic and lexical information and achieves competitive results on the multi-label stuttering detection task.

Submitted 16 June, 2024; originally announced June 2024.

Comments: Accepted at Interspeech 2024

arXiv:2406.11022 [pdf, other]

Outlier Reduction with Gated Attention for Improved Post-training Quantization in Large Sequence-to-sequence Speech Foundation Models

Authors: Dominik Wagner, Ilja Baumann, Korbinian Riedhammer, Tobias Bocklet

Abstract: This paper explores the improvement of post-training quantization (PTQ) after knowledge distillation in the Whisper speech foundation model family. We address the challenge of outliers in weights and activation tensors, known to impede quantization quality in transformer-based language and vision models. Extending this observation to Whisper, we demonstrate that these outliers are also present when transformer-based models are trained to perform automatic speech recognition, necessitating mitigation strategies for PTQ. We show that outliers can be reduced by a recently proposed gating mechanism in the attention blocks of the student model, enabling effective 8-bit quantization, and lower word error rates compared to student models without the gating mechanism in place.

Submitted 16 June, 2024; originally announced June 2024.

Comments: Accepted at Interspeech 2024

arXiv:2406.11016 [pdf, other]

Optimized Speculative Sampling for GPU Hardware Accelerators

Authors: Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet

Abstract: In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. Additionally, we use fast on-chip memory to store intermediate results, thereby minimizing the frequency of slow read and write operations across different types of memory. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a slight decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.

Submitted 16 June, 2024; originally announced June 2024.
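
For readers unfamiliar with the technique named in the abstract above, the following is a minimal sketch of one accept/reject step of standard speculative sampling. It is not the authors' GPU implementation: the function name `speculative_accept` and the list-based representation of the distributions are assumptions made for illustration. The element-wise ratio and residual computed here are independent per vocabulary entry, which is plausibly the kind of work the abstract describes as computable concurrently.

```python
import random

def speculative_accept(p, q, draft_token, rng):
    """One accept/reject step of standard speculative sampling.

    p: target-model probabilities over the vocabulary
    q: draft-model probabilities over the vocabulary
    draft_token: index that was sampled from q (so q[draft_token] > 0)
    rng: a random.Random instance
    (All names are hypothetical; this is an illustrative sketch.)
    """
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token, True
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    weights = [r / total for r in residual]
    return rng.choices(range(len(p)), weights=weights)[0], False
```

When the draft and target distributions agree, every drafted token is accepted; the more they differ, the more often the residual distribution is used instead.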

arXiv:2402.15294 [pdf, other]

A Survey of Music Generation in the Context of Interaction

Authors: Ismael Agchar, Ilja Baumann, Franziska Braun, Paula Andrea Perez-Toro, Korbinian Riedhammer, Sebastian Trump, Martin Ullrich

Abstract: In recent years, machine learning, and in particular generative adversarial neural networks (GANs) and attention-based neural networks (transformers), have been successfully used to compose and generate music, both melodies and polyphonic pieces. Current research focuses foremost on style replication (e.g., generating a Bach-style chorale) or style transfer (e.g., classical to jazz) based on large amounts of recorded or transcribed music, which in turn also allows for fairly straightforward "performance" evaluation. However, most of these models are not suitable for human-machine co-creation through live interaction, nor is it clear how such models and the resulting creations would be evaluated. This article presents a thorough review of music representation, feature analysis, heuristic algorithms, statistical and parametric modelling, and human and automatic evaluation measures, along with a discussion of which approaches and models seem most suitable for live interaction.

Submitted 23 February, 2024; originally announced February 2024.

arXiv:2306.06514 [pdf, other]

Vocoder-Free Non-Parallel Conversion of Whispered Speech With Masked Cycle-Consistent Generative Adversarial Networks

Authors: Dominik Wagner, Ilja Baumann, Tobias Bocklet

Abstract: Cycle-consistent generative adversarial networks have been widely used in non-parallel voice conversion (VC). Their ability to learn mappings between source and target features without relying on parallel training data eliminates the need for temporal alignments. However, most methods decouple the conversion of acoustic features from synthesizing the audio signal by using separate models for conversion and waveform synthesis. This work unifies conversion and synthesis into a single model, thereby eliminating the need for a separate vocoder. By leveraging cycle-consistent training and a self-supervised auxiliary training task, our model is able to efficiently generate converted high-quality raw audio waveforms. Subjective listening tests show that our method outperforms the baseline in whispered speech conversion (up to 6.7% relative improvement), and mean opinion score predictions yield competitive results in conventional VC (between 0.5% and 2.4% relative improvement).

Submitted 10 June, 2023; originally announced June 2023.

arXiv:2305.19255 [pdf, other]

A Stutter Seldom Comes Alone -- Cross-Corpus Stuttering Detection as a Multi-label Problem

Authors: Sebastian P. Bayerl, Dominik Wagner, Ilja Baumann, Florian Hönig, Tobias Bocklet, Elmar Nöth, Korbinian Riedhammer

Abstract: Most stuttering detection and classification research has viewed stuttering as a multi-class classification problem or a binary detection task for each dysfluency type; however, this does not match the nature of stuttering, in which one dysfluency seldom comes alone but rather co-occurs with others. This paper explores multi-language and cross-corpus end-to-end stuttering detection as a multi-label problem using a modified wav2vec 2.0 system with an attention-based classification head and multi-task learning. We evaluate the method using combinations of three datasets containing English and German stuttered speech, one containing speech modified by fluency shaping. The experimental results and an error analysis show that multi-label stuttering detection systems trained on cross-corpus and multi-language data achieve competitive results, but performance on samples with multiple labels stays below overall detection results.

Submitted 30 May, 2023; originally announced May 2023.

Comments: Accepted for presentation at Interspeech 2023. arXiv admin note: substantial text overlap with arXiv:2210.15982
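
The distinction between multi-class and multi-label decisions drawn in the abstract above can be illustrated with a small sketch. It is not the paper's model: the label names, the logits, and the 0.5 threshold are assumptions chosen only to show that a multi-class decision returns exactly one label, while a multi-label decision can return several co-occurring ones.

```python
import math

# Illustrative label set; the actual label inventory of the paper is not assumed here.
LABELS = ["block", "prolongation", "sound_repetition"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multi_class_decision(logits):
    # Multi-class: pick the single highest-scoring label.
    best = max(range(len(logits)), key=lambda i: logits[i])
    return [LABELS[best]]

def multi_label_decision(logits, threshold=0.5):
    # Multi-label: score each label independently and keep all above the threshold.
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) >= threshold]

logits = [2.0, 1.5, -3.0]  # hypothetical classifier outputs for one audio clip
single = multi_class_decision(logits)   # one label only
several = multi_label_decision(logits)  # every label that clears the threshold
```

With these hypothetical logits, the multi-class decision keeps only the top label, whereas the multi-label decision keeps both labels whose independent probabilities exceed the threshold.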

arXiv:2211.08774 [pdf, other]

Speaker Adaptation for End-To-End Speech Recognition Systems in Noisy Environments

Authors: Dominik Wagner, Ilja Baumann, Sebastian P. Bayerl, Korbinian Riedhammer, Tobias Bocklet

Abstract: We analyze the impact of speaker adaptation in end-to-end automatic speech recognition models based on transformers and wav2vec 2.0 under different noise conditions. By including speaker embeddings obtained from x-vector and ECAPA-TDNN systems, as well as i-vectors, we achieve relative word error rate improvements of up to 16.3% on LibriSpeech and up to 14.5% on Switchboard. We show that the proven method of concatenating speaker vectors to the acoustic features and supplying them as auxiliary model inputs remains a viable option to increase the robustness of end-to-end architectures. The effect on transformer models is stronger when more noise is added to the input speech. The most substantial benefits for systems based on wav2vec 2.0 are achieved under moderate or no noise conditions. Both x-vectors and ECAPA-TDNN embeddings outperform i-vectors as speaker representations. The optimal embedding size depends on the dataset and also varies with the noise condition.

Submitted 7 December, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

Comments: Accepted at ASRU 2023

arXiv:2210.15941 [pdf, other]

Influence of Utterance and Speaker Characteristics on the Classification of Children with Cleft Lip and Palate

Authors: Ilja Baumann, Dominik Wagner, Franziska Braun, Sebastian P. Bayerl, Elmar Nöth, Korbinian Riedhammer, Tobias Bocklet

Abstract: Recent findings show that pre-trained wav2vec 2.0 models are reliable feature extractors for various speaker characteristics classification tasks. We show that latent representations extracted at different layers of a pre-trained wav2vec 2.0 system can be used as features for binary classification to distinguish between children with Cleft Lip and Palate (CLP) and a healthy control group. The results indicate that the distinction between CLP and healthy voices, especially with latent representations from the lower and middle encoder layers, reaches an accuracy of 100%. We test the classifier to find influencing factors for classification using unseen out-of-domain healthy and pathologic corpora with varying characteristics: age, spoken content, and acoustic conditions. Cross-pathology and cross-healthy tests reveal that the trained classifiers are unreliable if there is a mismatch between training and out-of-domain test data in, e.g., age, spoken content, or acoustic conditions.

Submitted 1 August, 2023; v1 submitted 28 October, 2022; originally announced October 2022.

Comments: INTERSPEECH 2023

arXiv:2210.15336 [pdf, ps, other]

Multi-class Detection of Pathological Speech with Latent Features: How does it perform on unseen data?

Authors: Dominik Wagner, Ilja Baumann, Franziska Braun, Sebastian P. Bayerl, Elmar Nöth, Korbinian Riedhammer, Tobias Bocklet

Abstract: The detection of pathologies from speech features is usually defined as a binary classification task with one class representing a specific pathology and the other class representing healthy speech. In this work, we train neural networks, large margin classifiers, and tree boosting machines to distinguish between four pathologies: Parkinson's disease, laryngeal cancer, cleft lip and palate, and oral squamous cell carcinoma. We show that latent representations extracted at different layers of a pre-trained wav2vec 2.0 system can be effectively used to classify these types of pathological voices. We evaluate the robustness of our classifiers by adding room impulse responses to the test data and by applying them to unseen speech corpora. Our approach achieves unweighted average F1-Scores between 74.1% and 97.0%, depending on the model and the noise conditions used. The systems generalize and perform well on unseen data of healthy speakers sampled from a variety of different sources.

Submitted 1 August, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

Comments: INTERSPEECH 2023

arXiv:2206.08058 [pdf, other]

Nonwords Pronunciation Classification in Language Development Tests for Preschool Children

Authors: Ilja Baumann, Dominik Wagner, Sebastian Bayerl, Tobias Bocklet

Abstract: This work aims to automatically evaluate whether the language development of children is age-appropriate. Validated speech and language tests are used for this purpose to test the auditory memory. In this work, the task is to determine whether spoken nonwords have been uttered correctly. We compare different approaches that are motivated to model specific language structures: low-level features (FFT), speaker embeddings (ECAPA-TDNN), grapheme-motivated embeddings (wav2vec 2.0), and phonetic embeddings in the form of senones (ASR acoustic model). Each of the approaches provides input for VGG-like 5-layer CNN classifiers. We also examine the adaptation per nonword. The evaluation of the proposed systems was performed using recordings of spoken nonwords from different kindergartens. ECAPA-TDNN and low-level FFT features do not explicitly model phonetic information; wav2vec 2.0 is trained on grapheme labels, and our ASR acoustic model features contain (sub-)phonetic information. We found that the more granular the phonetic modeling is, the higher the achieved recognition rates. The best system trained on ASR acoustic model features with VTLN achieved an accuracy of 89.4% and an area under the ROC (Receiver Operating Characteristic) curve (AUC) of 0.923. This corresponds to an improvement in accuracy of 20.2% and in AUC of 0.309 relative to the FFT baseline.

Submitted 17 June, 2022; v1 submitted 16 June, 2022; originally announced June 2022.

Comments: Accepted at Interspeech 2022

arXiv:2204.03428 [pdf, other]

Detecting Vocal Fatigue with Neural Embeddings

Authors: Sebastian P. Bayerl, Dominik Wagner, Ilja Baumann, Korbinian Riedhammer, Tobias Bocklet

Abstract: Vocal fatigue refers to the feeling of tiredness and weakness of voice due to extended utilization. This paper investigates the effectiveness of neural embeddings for the detection of vocal fatigue. We compare x-vectors, ECAPA-TDNN, and wav2vec 2.0 embeddings on a corpus of academic spoken English. Low-dimensional mappings of the data reveal that neural embeddings capture information about the change in vocal characteristics of a speaker during prolonged voice usage. We show that vocal fatigue can be reliably predicted using all three kinds of neural embeddings after only 50 minutes of continuous speaking when temporal smoothing and normalization are applied to the extracted embeddings. We employ support vector machines for classification and achieve accuracy scores of 81% using x-vectors, 85% using ECAPA-TDNN embeddings, and 82% using wav2vec 2.0 embeddings as input features. We obtain an accuracy score of 76% when the trained system is applied to a different speaker and recording environment without any adaptation.

Submitted 17 January, 2023; v1 submitted 7 April, 2022; originally announced April 2022.

Comments: Accepted for Publication in the Journal of Voice

arXiv:astro-ph/0510516 [pdf, ps, other]

doi 10.1051/0004-6361:20053415

On the size distribution of sunspot groups in the Greenwich sunspot record 1874-1976

Authors: I. Baumann, S. K. Solanki

Abstract: We investigate the size distribution of the maximum areas and the instantaneous distribution of areas of sunspot groups using the Greenwich sunspot group record spanning the interval 1874-1976. Both distributions are found to be well described by log-normal functions. Using a simple model we can transform the maximum area distribution into the instantaneous area distribution if the sunspot area decay rates are also distributed log-normally. For single-valued decay rates the resulting snapshot distribution is incompatible with the observations. The current analysis therefore supports the results of Howard (1992) and Martinez Pillet (1993). With the employed data set, however, it is not possible to distinguish between a linear and a quadratic decay law.

Submitted 18 October, 2005; originally announced October 2005.

Comments: accepted by A&A

arXiv:astro-ph/0510322 [pdf, ps, other]

A necessary extension of the surface flux transport model

Authors: I. Baumann, D. Schmitt, M. Schuessler

Abstract: Customary two-dimensional flux transport models for the evolution of the magnetic field at the solar surface do not account for the radial structure and the volume diffusion of the magnetic field. When considering the long-term evolution of magnetic flux, this omission can lead to an unrealistic long-term memory of the system and to the suppression of polar field reversals. In order to avoid such effects, we propose an extension of the flux transport model by a linear decay term derived consistently on the basis of the eigenmodes of the diffusion operator in a spherical shell. A decay rate for each eigenmode of the system is determined and applied to the corresponding surface part of the mode evolved in the flux transport model. The value of the volume diffusivity associated with this decay term can be estimated to be in the range 50–100 km^2/s by considering the reversals of the polar fields in comparison of flux transport simulations with observations. We show that the decay term prohibits a secular drift of the polar field in the case of cycles of varying strength, like those exhibited by the historical sunspot record.

Submitted 11 October, 2005; originally announced October 2005.

Comments: for further information visit: http://solweb.oma.be/users/baumann/

arXiv:quant-ph/0002086 [pdf, ps, other]

doi 10.1007/s100530070046

Pressure dependence of the Mg $3s4s^3S_1 \to 3s3p^3P_{0,1,2}$ transition in superfluid $^4$He

Authors: I. Baumann, A. Breidenassel, C. Zühlke, A. Kasimov, G. zu Putlitz, I. Reinhard, K. Jungmann

Abstract: The pressure shifts of the $3s4s^3S_1 \to 3s3p^3P_{0,1,2}$ transition of magnesium atoms immersed in superfluid helium have been measured at $(1.3\pm0.1)$ K between saturated vapour pressure and $24$ bar. The wavelength is blue-shifted linearly by $(0.07\pm0.01)\,\mathrm{nm/bar}$. This value can be satisfactorily described in the framework of the standard bubble model.

Submitted 28 February, 2000; originally announced February 2000.

Comments: submitted to EPJD

Journal ref: Eur. Phys. J. D 12, 117-122 (2000)

Showing 1–14 of 14 results for author: Baumann, I