Search | arXiv e-print repository

On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition

Authors: Nick Rossenbach, Benedikt Hilmes, Ralf Schlüter

Abstract: Synthetic data generated by text-to-speech (TTS) systems can be used to improve automatic speech recognition (ASR) systems in low-resource or domain mismatch tasks. It has been shown that TTS-generated outputs still do not have the same qualities as real data. In this work we focus on the temporal structure of synthetic data and its relation to ASR training. By using a novel oracle setup we show h… ▽ More Synthetic data generated by text-to-speech (TTS) systems can be used to improve automatic speech recognition (ASR) systems in low-resource or domain mismatch tasks. It has been shown that TTS-generated outputs still do not have the same qualities as real data. In this work we focus on the temporal structure of synthetic data and its relation to ASR training. By using a novel oracle setup we show how much the degradation of synthetic data quality is influenced by duration modeling in non-autoregressive (NAR) TTS. To get reference phoneme durations we use two common alignment methods, a hidden Markov Gaussian-mixture model (HMM-GMM) aligner and a neural connectionist temporal classification (CTC) aligner. Using a simple algorithm based on random walks we shift phoneme duration distributions of the TTS system closer to real durations, resulting in an improvement of an ASR system using synthetic data in a semi-supervised setting. △ Less

Submitted 12 October, 2023; originally announced October 2023.

Comments: To appear at ASRU 2023

arXiv:2306.03557 [pdf, ps, other]

Take the Hint: Improving Arabic Diacritization with Partially-Diacritized Text

Authors: Parnia Bahar, Mattia Di Gangi, Nick Rossenbach, Mohammad Zeineldeen

Abstract: Automatic Arabic diacritization is useful in many applications, ranging from reading support for language learners to accurate pronunciation predictor for downstream tasks like speech synthesis. While most of the previous works focused on models that operate on raw non-diacritized text, production systems can gain accuracy by first letting humans partly annotate ambiguous words. In this paper, we… ▽ More Automatic Arabic diacritization is useful in many applications, ranging from reading support for language learners to accurate pronunciation predictor for downstream tasks like speech synthesis. While most of the previous works focused on models that operate on raw non-diacritized text, production systems can gain accuracy by first letting humans partly annotate ambiguous words. In this paper, we propose 2SDiac, a multi-source model that can effectively support optional diacritics in input to inform all predictions. We also introduce Guided Learning, a training scheme to leverage given diacritics in input with different levels of random masking. We show that the provided hints during test affect more output positions than those annotated. Moreover, experiments on two common benchmarks show that our approach i) greatly outperforms the baseline also when evaluated on non-diacritized text; and ii) achieves state-of-the-art results while reducing the parameter count by over 60%. △ Less

Submitted 31 July, 2023; v1 submitted 6 June, 2023; originally announced June 2023.

Comments: Keywords: Arabic text diacritization, partially-diacritized text, Arabic natural language processing, Accepted at INTERSPEECH 2023

arXiv:2104.05379 [pdf, other]

Comparing the Benefit of Synthetic Training Data for Various Automatic Speech Recognition Architectures

Authors: Nick Rossenbach, Mohammad Zeineldeen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney

Abstract: Recent publications on automatic-speech-recognition (ASR) have a strong focus on attention encoder-decoder (AED) architectures which tend to suffer from over-fitting in low resource scenarios. One solution to tackle this issue is to generate synthetic data with a trained text-to-speech system (TTS) if additional text is available. This was successfully applied in many publications with AED systems… ▽ More Recent publications on automatic-speech-recognition (ASR) have a strong focus on attention encoder-decoder (AED) architectures which tend to suffer from over-fitting in low resource scenarios. One solution to tackle this issue is to generate synthetic data with a trained text-to-speech system (TTS) if additional text is available. This was successfully applied in many publications with AED systems, but only very limited in the context of other ASR architectures. We investigate the effect of varying pre-processing, the speaker embedding and input encoding of the TTS system w.r.t. the effectiveness of the synthesized data for AED-ASR training. Additionally, we also consider internal language model subtraction for the first time, resulting in up to 38% relative improvement. We compare the AED results to a state-of-the-art hybrid ASR system, a monophone based system using connectionist-temporal-classification (CTC) and a monotonic transducer based system. We show that for the later systems the addition of synthetic data has no relevant effect, but they still outperform the AED systems on LibriSpeech-100h. We achieve a final word-error-rate of 3.3%/10.0% with a hybrid system on the clean/noisy test-sets, surpassing any previous state-of-the-art systems on Librispeech-100h that do not include unlabeled audio data. △ Less

Submitted 13 July, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

Comments: Submitted to ASRU 2021

arXiv:1912.09257 [pdf, other]

Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems

Authors: Nick Rossenbach, Albert Zeyer, Ralf Schlüter, Hermann Ney

Abstract: Recent advances in text-to-speech (TTS) led to the development of flexible multi-speaker end-to-end TTS systems. We extend state-of-the-art attention-based automatic speech recognition (ASR) systems with synthetic audio generated by a TTS system trained only on the ASR corpora itself. ASR and TTS systems are built separately to show that text-only data can be used to enhance existing end-to-end AS… ▽ More Recent advances in text-to-speech (TTS) led to the development of flexible multi-speaker end-to-end TTS systems. We extend state-of-the-art attention-based automatic speech recognition (ASR) systems with synthetic audio generated by a TTS system trained only on the ASR corpora itself. ASR and TTS systems are built separately to show that text-only data can be used to enhance existing end-to-end ASR systems without the necessity of parameter or architecture changes. We compare our method with language model integration of the same text data and with simple data augmentation methods like SpecAugment and show that performance improvements are mostly independent. We achieve improvements of up to 33% relative in word-error-rate (WER) over a strong baseline with data-augmentation in a low-resource environment (LibriSpeech-100h), closing the gap to a comparable oracle experiment by more than 50\%. We also show improvements of up to 5% relative WER over our most recent ASR baseline on LibriSpeech-960h. △ Less

Submitted 17 February, 2020; v1 submitted 19 December, 2019; originally announced December 2019.

Comments: Accepted to ICASSP 2020

arXiv:1906.01942 [pdf, other]

Learning Bilingual Sentence Embeddings via Autoencoding and Computing Similarities with a Multilayer Perceptron

Authors: Yunsu Kim, Hendrik Rosendahl, Nick Rossenbach, Jan Rosendahl, Shahram Khadivi, Hermann Ney

Abstract: We propose a novel model architecture and training algorithm to learn bilingual sentence embeddings from a combination of parallel and monolingual data. Our method connects autoencoding and neural machine translation to force the source and target sentence embeddings to share the same space without the help of a pivot language or an additional transformation. We train a multilayer perceptron on to… ▽ More We propose a novel model architecture and training algorithm to learn bilingual sentence embeddings from a combination of parallel and monolingual data. Our method connects autoencoding and neural machine translation to force the source and target sentence embeddings to share the same space without the help of a pivot language or an additional transformation. We train a multilayer perceptron on top of the sentence embeddings to extract good bilingual sentence pairs from nonparallel or noisy parallel data. Our approach shows promising performance on sentence alignment recovery and the WMT 2018 parallel corpus filtering tasks with only a single model. △ Less

Submitted 5 June, 2019; originally announced June 2019.

Comments: ACL 2019 Repl4NLP camera-ready

Showing 1–5 of 5 results for author: Rossenbach, N