Showing 1–50 of 61 results for author: Schlüter, R

  1. arXiv:2310.08132  [pdf, other]

    cs.CL cs.SD eess.AS

    On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition

    Authors: Nick Rossenbach, Benedikt Hilmes, Ralf Schlüter

    Abstract: Synthetic data generated by text-to-speech (TTS) systems can be used to improve automatic speech recognition (ASR) systems in low-resource or domain mismatch tasks. It has been shown that TTS-generated outputs still do not have the same qualities as real data. In this work we focus on the temporal structure of synthetic data and its relation to ASR training. By using a novel oracle setup we show h… ▽ More

    Submitted 12 October, 2023; originally announced October 2023.

    Comments: To appear at ASRU 2023

  2. arXiv:2310.07345  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Investigating the Effect of Language Models in Sequence Discriminative Training for Neural Transducers

    Authors: Zijian Yang, Wei Zhou, Ralf Schlüter, Hermann Ney

    Abstract: In this work, we investigate the effect of language models (LMs) with different context lengths and label units (phoneme vs. word) used in sequence discriminative training for phoneme-based neural transducers. Both lattice-free and N-best-list approaches are examined. For lattice-free methods with phoneme-level LMs, we propose a method to approximate the context history to employ LMs with full-con… ▽ More

    Submitted 11 October, 2023; originally announced October 2023.

    Comments: accepted at ASRU 2023

  3. arXiv:2310.02724  [pdf, other]

    cs.LG cs.SD eess.AS

    End-to-End Training of a Neural HMM with Label and Transition Probabilities

    Authors: Daniel Mann, Tina Raissi, Wilfried Michel, Ralf Schlüter, Hermann Ney

    Abstract: We investigate a novel modeling approach for end-to-end neural network training using hidden Markov models (HMM) where the transition probabilities between hidden states are modeled and learned explicitly. Most contemporary sequence-to-sequence models allow for from-scratch training by summing over all possible label segmentations in a given topology. In our approach there are explicit, learnable… ▽ More

    Submitted 9 October, 2023; v1 submitted 4 October, 2023; originally announced October 2023.

    Comments: Accepted for Presentation at ASRU2023

  4. arXiv:2309.14130  [pdf, ps, other]

    cs.SD cs.CL cs.LG eess.AS

    On the Relation between Internal Language Model and Sequence Discriminative Training for Neural Transducers

    Authors: Zijian Yang, Wei Zhou, Ralf Schlüter, Hermann Ney

    Abstract: Internal language model (ILM) subtraction has been widely applied to improve the performance of the RNN-Transducer with external language model (LM) fusion for speech recognition. In this work, we show that sequence discriminative training has a strong correlation with ILM subtraction from both theoretical and empirical points of view. Theoretically, we derive that the global optimum of maximum mu… ▽ More

    Submitted 13 April, 2024; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: accepted at ICASSP 2024

  5. arXiv:2309.08454  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition

    Authors: Peter Vieting, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach

    Abstract: Many real-life applications of automatic speech recognition (ASR) require processing of overlapped speech. A common method involves first separating the speech into overlap-free streams and then performing ASR on the resulting signals. Recently, the inclusion of a mixture encoder in the ASR model has been proposed. This mixture encoder leverages the original overlapped speech to mitigate the effect… ▽ More

    Submitted 15 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  6. arXiv:2309.08436  [pdf, other]

    eess.AS cs.SD stat.ML

    Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition

    Authors: Mohammad Zeineldeen, Albert Zeyer, Ralf Schlüter, Hermann Ney

    Abstract: We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks. A special end-of-chunk (EOC) symbol advances from one chunk to the next chunk, effectively replacing the conventional end-of-sequence symbol. This modification, while minor, situates our model as equivalent to a transduc… ▽ More

    Submitted 17 January, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024

  7. arXiv:2308.04286  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Comparative Analysis of the wav2vec 2.0 Feature Extractor

    Authors: Peter Vieting, Ralf Schlüter, Hermann Ney

    Abstract: Automatic speech recognition (ASR) systems typically use handcrafted feature extraction pipelines. To avoid their inherent information loss and to achieve more consistent modeling from speech to transcribed text, neural raw waveform feature extractors (FEs) are an appealing approach. Also the wav2vec 2.0 model, which has recently gained large popularity, uses a convolutional FE which operates dire… ▽ More

    Submitted 8 August, 2023; originally announced August 2023.

    Comments: Accepted at ITG 2023

  8. arXiv:2306.12173  [pdf, other]

    cs.CL cs.LG

    Mixture Encoder for Joint Speech Separation and Recognition

    Authors: Simon Berger, Peter Vieting, Christoph Boeddeker, Ralf Schlüter, Reinhold Haeb-Umbach

    Abstract: Multi-speaker automatic speech recognition (ASR) is crucial for many real-world applications, but it requires dedicated modeling techniques. Existing approaches can be divided into modular and end-to-end methods. Modular approaches separate speakers and recognize each of them with a single-speaker ASR system. End-to-end models process overlapped speech directly in a single, powerful neural network… ▽ More

    Submitted 21 June, 2023; originally announced June 2023.

    Comments: Accepted at Interspeech 2023

  9. arXiv:2306.09517  [pdf, ps, other]

    cs.SD eess.AS

    Competitive and Resource Efficient Factored Hybrid HMM Systems are Simpler Than You Think

    Authors: Tina Raissi, Christoph Lüscher, Moritz Gunz, Ralf Schlüter, Hermann Ney

    Abstract: Building competitive hybrid hidden Markov model (HMM) systems for automatic speech recognition (ASR) requires a complex multi-stage pipeline consisting of several training criteria. The recent sequence-to-sequence models offer the advantage of having simpler pipelines that can start from-scratch. We propose a purely neural based single-stage from-scratch pipeline for a context-dependent hybrid HMM… ▽ More

    Submitted 15 June, 2023; originally announced June 2023.

    Comments: Accepted for presentation at InterSpeech 2023

  10. RASR2: The RWTH ASR Toolkit for Generic Sequence-to-sequence Speech Recognition

    Authors: Wei Zhou, Eugen Beck, Simon Berger, Ralf Schlüter, Hermann Ney

    Abstract: Modern public ASR tools usually provide rich support for training various sequence-to-sequence (S2S) models, but rather simple support for decoding open-vocabulary scenarios only. For closed-vocabulary scenarios, public tools supporting lexical-constrained decoding are usually only for classical ASR, or do not support all S2S models. To eliminate this restriction on research possibilities such as… ▽ More

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: accepted at Interspeech 2023

  11. arXiv:2303.03329  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Speech Recognition: A Survey

    Authors: Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter, Shinji Watanabe

    Abstract: In the last decade of automatic speech recognition (ASR) research, the introduction of deep learning brought considerable reductions in word error rate of more than 50% relative, compared to modeling without deep learning. In the wake of this transition, a number of all-neural ASR architectures were introduced. These so-called end-to-end (E2E) models provide highly integrated, completely neural AS… ▽ More

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language Processing

  12. arXiv:2301.04571  [pdf, other]

    cs.CL eess.AS stat.ML

    Analyzing And Improving Neural Speaker Embeddings for ASR

    Authors: Christoph Lüscher, Jingjing Xu, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney

    Abstract: Neural speaker embeddings encode the speaker's speech characteristics through a DNN model and are prevalent for speaker verification tasks. However, few studies have investigated the usage of neural speaker embeddings for an ASR system. In this work, we present our efforts w.r.t integrating neural speaker embeddings into a conformer based hybrid HMM ASR system. For ASR, our improved embedding extr… ▽ More

    Submitted 20 September, 2023; v1 submitted 11 January, 2023; originally announced January 2023.

    Comments: Accepted at ITG Speech Communications 2023

  13. arXiv:2212.04325  [pdf, ps, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Lattice-Free Sequence Discriminative Training for Phoneme-Based Neural Transducers

    Authors: Zijian Yang, Wei Zhou, Ralf Schlüter, Hermann Ney

    Abstract: Recently, RNN-Transducers have achieved remarkable results on various automatic speech recognition tasks. However, lattice-free sequence discriminative training methods, which obtain superior performance in hybrid models, are rarely investigated in RNN-Transducers. In this work, we propose three lattice-free training objectives, namely lattice-free maximum mutual information, lattice-free segment-… ▽ More

    Submitted 25 May, 2023; v1 submitted 7 December, 2022; originally announced December 2022.

    Comments: accepted at ICASSP 2023

  14. arXiv:2211.06369  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Enhancing and Adversarial: Improve ASR with Speaker Labels

    Authors: Wei Zhou, Haotian Wu, Jingjing Xu, Mohammad Zeineldeen, Christoph Lüscher, Ralf Schlüter, Hermann Ney

    Abstract: ASR can be improved by multi-task learning (MTL) with domain enhancing or domain adversarial training, which are two opposite objectives with the aim to increase/decrease domain variance towards domain-aware/agnostic ASR, respectively. In this work, we study how to best apply these two opposite objectives with speaker labels to improve conformer-based ASR. We also propose a novel adaptive gradient… ▽ More

    Submitted 24 February, 2023; v1 submitted 11 November, 2022; originally announced November 2022.

    Comments: accepted at ICASSP 2023

  15. arXiv:2210.15445  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Efficient Utilization of Large Pre-Trained Models for Low Resource ASR

    Authors: Peter Vieting, Christoph Lüscher, Julian Dierkes, Ralf Schlüter, Hermann Ney

    Abstract: Unsupervised representation learning has recently helped automatic speech recognition (ASR) to tackle tasks with limited labeled data. Following this, hardware limitations and applications give rise to the question how to take advantage of large pre-trained models efficiently and reduce their complexity. In this work, we study a challenging low resource conversational telephony speech corpus from… ▽ More

    Submitted 17 August, 2023; v1 submitted 26 October, 2022; originally announced October 2022.

    Comments: Accepted at ICASSP SASB 2023

  16. arXiv:2210.14742  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Monotonic segmental attention for automatic speech recognition

    Authors: Albert Zeyer, Robin Schmitt, Wei Zhou, Ralf Schlüter, Hermann Ney

    Abstract: We introduce a novel segmental-attention model for automatic speech recognition. We restrict the decoder attention to segments to avoid quadratic runtime of global attention, better generalize to long sequences, and eventually enable streaming. We directly compare global-attention and different segmental-attention modeling variants. We develop and compare two separate time-synchronous decoders, on… ▽ More

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: accepted at SLT: https://slt2022.org/

  17. arXiv:2210.13397  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Development of Hybrid ASR Systems for Low Resource Medical Domain Conversational Telephone Speech

    Authors: Christoph Lüscher, Mohammad Zeineldeen, Zijian Yang, Tina Raissi, Peter Vieting, Khai Le-Duc, Weiyue Wang, Ralf Schlüter, Hermann Ney

    Abstract: Language barriers present a great challenge in our increasingly connected and global world. Especially within the medical domain, e.g. hospital or emergency room, communication difficulties and delays may lead to malpractice and non-optimal patient care. In the HYKIST project, we consider patient-physician communication, more specifically between a German-speaking physician and an Arabic- or Vietn… ▽ More

    Submitted 22 September, 2023; v1 submitted 24 October, 2022; originally announced October 2022.

    Comments: ASR System Paper for HYKIST project

  18. arXiv:2210.09951  [pdf, other]

    cs.SD eess.AS

    HMM vs. CTC for Automatic Speech Recognition: Comparison Based on Full-Sum Training from Scratch

    Authors: Tina Raissi, Wei Zhou, Simon Berger, Ralf Schlüter, Hermann Ney

    Abstract: In this work, we compare from-scratch sequence-level cross-entropy (full-sum) training of Hidden Markov Model (HMM) and Connectionist Temporal Classification (CTC) topologies for automatic speech recognition (ASR). Besides accuracy, we further analyze their capability for generating high-quality time alignment between the speech signal and the transcription, which can be crucial for many subsequen… ▽ More

    Submitted 18 October, 2022; originally announced October 2022.

    Comments: Accepted for Presentation at IEEE SLT 2022

  19. arXiv:2206.12955  [pdf, other]

    cs.CL eess.AS stat.ML

    Improving the Training Recipe for a Robust Conformer-based Hybrid Model

    Authors: Mohammad Zeineldeen, Jingjing Xu, Christoph Lüscher, Ralf Schlüter, Hermann Ney

    Abstract: Speaker adaptation is important to build robust automatic speech recognition (ASR) systems. In this work, we investigate various methods for speaker adaptive training (SAT) based on feature-space approaches for a conformer-based acoustic model (AM) on the Switchboard 300h dataset. We propose a method, called Weighted-Simple-Add, which adds weighted speaker information vectors to the input of the m… ▽ More

    Submitted 26 June, 2022; originally announced June 2022.

    Comments: Accepted at INTERSPEECH 2022

  20. Efficient Training of Neural Transducer for Speech Recognition

    Authors: Wei Zhou, Wilfried Michel, Ralf Schlüter, Hermann Ney

    Abstract: As one of the most popular sequence-to-sequence modeling approaches for speech recognition, the RNN-Transducer has achieved evolving performance with more and more sophisticated neural network models of growing size and increasing training epochs. While strong computation resources seem to be the prerequisite of training superior models, we try to overcome it by carefully designing a more efficien… ▽ More

    Submitted 8 August, 2022; v1 submitted 22 April, 2022; originally announced April 2022.

    Comments: accepted at Interspeech 2022

  21. arXiv:2201.09692  [pdf, ps, other]

    cs.SD eess.AS

    Improving Factored Hybrid HMM Acoustic Modeling without State Tying

    Authors: Tina Raissi, Eugen Beck, Ralf Schlüter, Hermann Ney

    Abstract: In this work, we show that a factored hybrid hidden Markov model (FH-HMM) which is defined without any phonetic state-tying outperforms a state-of-the-art hybrid HMM. The factored hybrid HMM provides a link to transducer models in the way it models phonetic (label) context while preserving the strict separation of acoustic and language model of the hybrid HMM approach. Furthermore, we show that th… ▽ More

    Submitted 24 January, 2022; originally announced January 2022.

    Comments: Accepted for presentation at IEEE ICASSP 2022

    MSC Class: 68T10 ACM Class: I.2.7

  22. arXiv:2111.07130  [pdf, other]

    cs.CL

    Prediction of Listener Perception of Argumentative Speech in a Crowdsourced Dataset Using (Psycho-)Linguistic and Fluency Features

    Authors: Yu Qiao, Sourabh Zanwar, Rishab Bhattacharyya, Daniel Wiechmann, Wei Zhou, Elma Kerz, Ralf Schlüter

    Abstract: One of the key communicative competencies is the ability to maintain fluency in monologic speech and the ability to produce sophisticated language to argue a position convincingly. In this paper we aim to predict TED talk-style affective ratings in a crowdsourced dataset of argumentative speech consisting of 7 hours of speech from 110 individuals. The speech samples were elicited through task prom… ▽ More

    Submitted 30 November, 2021; v1 submitted 13 November, 2021; originally announced November 2021.

  23. arXiv:2111.06310  [pdf, other]

    cs.CL cs.SD eess.AS

    Self-Normalized Importance Sampling for Neural Language Modeling

    Authors: Zijian Yang, Yingbo Gao, Alexander Gerstenberger, Jintao Jiang, Ralf Schlüter, Hermann Ney

    Abstract: To mitigate the problem of having to traverse over the full vocabulary in the softmax normalization of a neural language model, sampling-based training criteria are proposed and investigated in the context of large vocabulary word-based neural language models. These training criteria typically enjoy the benefit of faster training and testing, at a cost of slightly degraded performance in terms of… ▽ More

    Submitted 17 June, 2022; v1 submitted 11 November, 2021; originally announced November 2021.

    Comments: Accepted at INTERSPEECH 2022

  24. arXiv:2111.03442  [pdf, other]

    cs.CL eess.AS stat.ML

    Conformer-based Hybrid ASR System for Switchboard Dataset

    Authors: Mohammad Zeineldeen, Jingjing Xu, Christoph Lüscher, Wilfried Michel, Alexander Gerstenberger, Ralf Schlüter, Hermann Ney

    Abstract: The recently proposed conformer architecture has been successfully used for end-to-end automatic speech recognition (ASR) architectures achieving state-of-the-art performance on different datasets. To our best knowledge, the impact of using conformer acoustic model for hybrid ASR is not investigated. In this paper, we present and evaluate a competitive conformer-based hybrid model training recipe.… ▽ More

    Submitted 19 February, 2022; v1 submitted 5 November, 2021; originally announced November 2021.

    Comments: Accepted at ICASSP 2022

  25. arXiv:2110.09324  [pdf, other]

    cs.CL cs.SD eess.AS

    Automatic Learning of Subword Dependent Model Scales

    Authors: Felix Meyer, Wilfried Michel, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney

    Abstract: To improve the performance of state-of-the-art automatic speech recognition systems it is common practice to include external knowledge sources such as language models or prior corrections. This is usually done via log-linear model combination using separate scaling parameters for each model. Typically these parameters are manually optimized on some held-out data. In this work we propose to opti… ▽ More

    Submitted 18 October, 2021; originally announced October 2021.

    Comments: submitted to ICASSP 2022

  26. arXiv:2110.09245  [pdf, other]

    cs.CL cs.SD eess.AS

    Efficient Sequence Training of Attention Models using Approximative Recombination

    Authors: Nils-Philipp Wynands, Wilfried Michel, Jan Rosendahl, Ralf Schlüter, Hermann Ney

    Abstract: Sequence discriminative training is a great tool to improve the performance of an automatic speech recognition system. It does, however, necessitate a sum over all possible word sequences, which is intractable to compute in practice. Current state-of-the-art systems with unlimited label context circumvent this problem by limiting the summation to an n-best list of relevant competing hypotheses obt… ▽ More

    Submitted 21 April, 2022; v1 submitted 18 October, 2021; originally announced October 2021.

  27. arXiv:2110.06841  [pdf, ps, other]

    cs.CL eess.AS

    On Language Model Integration for RNN Transducer based Speech Recognition

    Authors: Wei Zhou, Zuoyun Zheng, Ralf Schlüter, Hermann Ney

    Abstract: The mismatch between an external language model (LM) and the implicitly learned internal LM (ILM) of RNN-Transducer (RNN-T) can limit the performance of LM integration such as simple shallow fusion. A Bayesian interpretation suggests to remove this sequence prior as ILM correction. In this work, we study various ILM correction-based LM integration methods formulated in a common RNN-T framework. We… ▽ More

    Submitted 16 February, 2022; v1 submitted 13 October, 2021; originally announced October 2021.

    Comments: accepted at ICASSP2022

  28. arXiv:2105.14849  [pdf, other]

    cs.LG cs.AI cs.CL cs.NE cs.SD eess.AS math.ST

    Why does CTC result in peaky behavior?

    Authors: Albert Zeyer, Ralf Schlüter, Hermann Ney

    Abstract: The peaky behavior of CTC models is well known experimentally. However, an understanding about why peaky behavior occurs is missing, and whether this is a good property. We provide a formal analysis of the peaky behavior and gradient descent convergence properties of the CTC loss and related training criteria. Our analysis provides a deep understanding why peaky behavior occurs and when it is subo… ▽ More

    Submitted 3 June, 2021; v1 submitted 31 May, 2021; originally announced May 2021.

  29. arXiv:2104.10507  [pdf, ps, other]

    cs.CL cs.SD eess.AS stat.ML

    On Sampling-Based Training Criteria for Neural Language Modeling

    Authors: Yingbo Gao, David Thulke, Alexander Gerstenberger, Khoa Viet Tran, Ralf Schlüter, Hermann Ney

    Abstract: As the vocabulary size of modern word-based language models becomes ever larger, many sampling-based training criteria are proposed and investigated. The essence of these sampling methods is that the softmax-related traversal over the entire vocabulary can be simplified, giving speedups compared to the baseline. A problem we notice about the current landscape of such sampling methods is the lack o… ▽ More

    Submitted 17 June, 2021; v1 submitted 21 April, 2021; originally announced April 2021.

    Comments: Accepted at INTERSPEECH 2021

  30. Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition

    Authors: Wei Zhou, Mohammad Zeineldeen, Zuoyun Zheng, Ralf Schlüter, Hermann Ney

    Abstract: Subword units are commonly used for end-to-end automatic speech recognition (ASR), while a fully acoustic-oriented subword modeling approach is somewhat missing. We propose an acoustic data-driven subword modeling (ADSM) approach that adapts the advantages of several text-based and acoustic-based subword methods into one pipeline. With a fully acoustic-oriented label design and learning process, A… ▽ More

    Submitted 27 August, 2021; v1 submitted 19 April, 2021; originally announced April 2021.

    Comments: accepted at Interspeech2021

  31. arXiv:2104.08529  [pdf, other]

    cs.CL

    The Impact of ASR on the Automatic Analysis of Linguistic Complexity and Sophistication in Spontaneous L2 Speech

    Authors: Yu Qiao, Wei Zhou, Elma Kerz, Ralf Schlüter

    Abstract: In recent years, automated approaches to assessing linguistic complexity in second language (L2) writing have made significant progress in gauging learner performance, predicting human ratings of the quality of learner productions, and benchmarking L2 development. In contrast, there is comparatively little work in the area of speaking, particularly with respect to fully automated approaches to ass… ▽ More

    Submitted 16 June, 2021; v1 submitted 17 April, 2021; originally announced April 2021.

    Comments: accepted at Interspeech2021

  32. Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept

    Authors: Wei Zhou, Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney

    Abstract: With the advent of direct models in automatic speech recognition (ASR), the formerly prevalent frame-wise acoustic modeling based on hidden Markov models (HMM) diversified into a number of modeling architectures like encoder-decoder attention models, transducer models and segmental models (direct HMM). While transducer models stay with a frame-level model definition, segmental models are defined o… ▽ More

    Submitted 15 June, 2021; v1 submitted 13 April, 2021; originally announced April 2021.

    Comments: accepted at Interspeech2021

  33. arXiv:2104.05544  [pdf, ps, other]

    cs.CL cs.SD eess.AS stat.ML

    Investigating Methods to Improve Language Model Integration for Attention-based Encoder-Decoder ASR Models

    Authors: Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, Hermann Ney

    Abstract: Attention-based encoder-decoder (AED) models learn an implicit internal language model (ILM) from the training transcriptions. The integration with an external LM trained on much more unpaired text usually leads to better performance. A Bayesian interpretation as in the hybrid autoregressive transducer (HAT) suggests dividing by the prior of the discriminative acoustic model, which corresponds to… ▽ More

    Submitted 17 June, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

    Comments: accepted to Interspeech 2021

  34. arXiv:2104.05379  [pdf, other]

    cs.CL cs.LG

    Comparing the Benefit of Synthetic Training Data for Various Automatic Speech Recognition Architectures

    Authors: Nick Rossenbach, Mohammad Zeineldeen, Benedikt Hilmes, Ralf Schlüter, Hermann Ney

    Abstract: Recent publications on automatic-speech-recognition (ASR) have a strong focus on attention encoder-decoder (AED) architectures which tend to suffer from over-fitting in low resource scenarios. One solution to tackle this issue is to generate synthetic data with a trained text-to-speech system (TTS) if additional text is available. This was successfully applied in many publications with AED systems… ▽ More

    Submitted 13 July, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

    Comments: Submitted to ASRU 2021

  35. arXiv:2104.04298  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    On Architectures and Training for Raw Waveform Feature Extraction in ASR

    Authors: Peter Vieting, Christoph Lüscher, Wilfried Michel, Ralf Schlüter, Hermann Ney

    Abstract: With the success of neural network based modeling in automatic speech recognition (ASR), many studies investigated acoustic modeling and learning of feature extractors directly based on the raw waveform. Recently, one line of research has focused on unsupervised pre-training of feature extractors on audio-only data to improve downstream ASR performance. In this work, we investigate the usefulness… ▽ More

    Submitted 5 October, 2021; v1 submitted 9 April, 2021; originally announced April 2021.

    Comments: Accepted for ASRU 2021

  36. arXiv:2104.03006  [pdf, other]

    cs.CL cs.AI stat.ML

    Librispeech Transducer Model with Internal Language Model Prior Correction

    Authors: Albert Zeyer, André Merboldt, Wilfried Michel, Ralf Schlüter, Hermann Ney

    Abstract: We present our transducer model on Librispeech. We study variants to include an external language model (LM) with shallow fusion and subtract an estimated internal LM. This is justified by a Bayesian interpretation where the transducer model prior is given by the estimated internal LM. The subtraction of the internal LM gives us over 14% relative improvement over normal shallow fusion. Our transdu… ▽ More

    Submitted 12 June, 2021; v1 submitted 7 April, 2021; originally announced April 2021.

    Comments: accepted at Interspeech 2021

  37. arXiv:2104.02387  [pdf, other]

    cs.SD eess.AS

    Towards Consistent Hybrid HMM Acoustic Modeling

    Authors: Tina Raissi, Eugen Beck, Ralf Schlüter, Hermann Ney

    Abstract: High-performance hybrid automatic speech recognition (ASR) systems are often trained with clustered triphone outputs, and thus require a complex training pipeline to generate the clustering. The same complex pipeline is often utilized in order to generate an alignment for use in frame-wise cross-entropy training. In this work, we propose a flat-start factored hybrid model trained by modeling the f… ▽ More

    Submitted 12 October, 2021; v1 submitted 6 April, 2021; originally announced April 2021.

    MSC Class: 68T10 ACM Class: I.2.7

  38. arXiv:2103.16710  [pdf, other]

    cs.CL cs.AI cs.CV

    A study of latent monotonic attention variants

    Authors: Albert Zeyer, Ralf Schlüter, Hermann Ney

    Abstract: End-to-end models reach state-of-the-art performance for speech recognition, but global soft attention is not monotonic, which might lead to convergence problems, to instability, to bad generalisation, cannot be used for online streaming, and is also inefficient in calculation. Monotonicity can potentially fix all of this. There are several ad-hoc solutions or heuristics to introduce monotonicity,… ▽ More

    Submitted 30 March, 2021; originally announced March 2021.

  39. arXiv:2011.12167  [pdf, other]

    cs.CL cs.LG

    Tight Integrated End-to-End Training for Cascaded Speech Translation

    Authors: Parnia Bahar, Tobias Bieschke, Ralf Schlüter, Hermann Ney

    Abstract: A cascaded speech translation model relies on discrete and non-differentiable transcription, which provides a supervision signal from the source side and helps the transformation between source speech and target text. Such modeling suffers from error propagation between ASR and MT models. Direct speech translation is an alternative method to avoid error propagation; however, its performance is oft… ▽ More

    Submitted 24 November, 2020; originally announced November 2020.

    Comments: 8 pages, accepted at SLT2021

  40. arXiv:2010.16368  [pdf, other]

    cs.CL eess.AS

    Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition

    Authors: Wei Zhou, Simon Berger, Ralf Schlüter, Hermann Ney

    Abstract: To join the advantages of classical and end-to-end approaches for speech recognition, we present a simple, novel and competitive approach for phoneme-based neural transducer modeling. Different alignment label topologies are compared and word-end-based phoneme label augmentation is proposed to improve performance. Utilizing the local dependency of phonemes, we adopt a simplified neural network str… ▽ More

    Submitted 20 April, 2021; v1 submitted 30 October, 2020; originally announced October 2020.

    Comments: accepted at ICASSP2021

  41. arXiv:2005.10089  [pdf, other]

    eess.AS cs.CL cs.SD

    Investigation of Large-Margin Softmax in Neural Language Modeling

    Authors: Jingjing Huo, Yingbo Gao, Weiyue Wang, Ralf Schlüter, Hermann Ney

    Abstract: To encourage intra-class compactness and inter-class separability among trainable feature vectors, large-margin softmax methods are developed and widely applied in the face recognition community. The introduction of the large-margin concept into the softmax is reported to have good properties such as enhanced discriminative power, less overfitting and well-defined geometric intuitions. Nowadays, l… ▽ More

    Submitted 21 April, 2021; v1 submitted 20 May, 2020; originally announced May 2020.

    Comments: Proceedings of INTERSPEECH 2020

  42. arXiv:2005.10049  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.SD stat.ML

    Early Stage LM Integration Using Local and Global Log-Linear Combination

    Authors: Wilfried Michel, Ralf Schlüter, Hermann Ney

    Abstract: Sequence-to-sequence models with an implicit alignment mechanism (e.g. attention) are closing the performance gap towards traditional hybrid hidden Markov models (HMM) for the task of automatic speech recognition. One important factor to improve word error rate in both cases is the use of an external language model (LM) trained on large text-only corpora. Language model integration is straightforw…

    Submitted 20 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  43. arXiv:2005.09336  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.NE

    A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models

    Authors: Mohammad Zeineldeen, Albert Zeyer, Wei Zhou, Thomas Ng, Ralf Schlüter, Hermann Ney

    Abstract: Following the rationale of end-to-end modeling, CTC, RNN-T or encoder-decoder-attention models for automatic speech recognition (ASR) use graphemes or grapheme-based subword units based on e.g. byte-pair encoding (BPE). The mapping from pronunciation to spelling is learned completely from data. In contrast to this, classical approaches to ASR employ secondary knowledge sources in the form of phone…

    Submitted 15 April, 2021; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: 5 pages, 6 tables

  44. arXiv:2005.09319  [pdf, other]

    eess.AS cs.LG cs.NE stat.ML

    A New Training Pipeline for an Improved Neural Transducer

    Authors: Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney

    Abstract: The RNN transducer is a promising end-to-end model candidate. We compare the original training criterion with the full marginalization over all alignments, to the commonly used maximum approximation, which simplifies, improves and speeds up our training. We also generalize from the original neural network model and study more powerful models, made possible due to the maximum approximation. We furt…

    Submitted 18 November, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: published at Interspeech 2020

  45. Context-Dependent Acoustic Modeling without Explicit Phone Clustering

    Authors: Tina Raissi, Eugen Beck, Ralf Schlüter, Hermann Ney

    Abstract: Phoneme-based acoustic modeling of large vocabulary automatic speech recognition takes advantage of phoneme context. The large number of context-dependent (CD) phonemes and their highly varying statistics require tying or smoothing to enable robust training. Usually, classification and regression trees are used for phonetic clustering, which is standard in hidden Markov model (HMM)-based systems.…

    Submitted 7 April, 2021; v1 submitted 15 May, 2020; originally announced May 2020.

    Comments: Proceedings of Interspeech 2020

    MSC Class: 68T10 ACM Class: I.2.7

  46. arXiv:2004.00967  [pdf, other]

    eess.AS cs.SD

    Full-Sum Decoding for Hybrid HMM based Speech Recognition using LSTM Language Model

    Authors: Wei Zhou, Ralf Schlüter, Hermann Ney

    Abstract: In hybrid HMM based speech recognition, LSTM language models have been widely applied and achieved large improvements. The theoretical capability of modeling any unlimited context suggests that no recombination should be applied in decoding. This motivates to reconsider full summation over the HMM-state sequences instead of Viterbi approximation in decoding. We explore the potential gain from more…

    Submitted 2 April, 2020; originally announced April 2020.

    Comments: accepted at ICASSP 2020

  47. arXiv:2004.00960  [pdf, other]

    eess.AS cs.SD

    The RWTH ASR System for TED-LIUM Release 2: Improving Hybrid HMM with SpecAugment

    Authors: Wei Zhou, Wilfried Michel, Kazuki Irie, Markus Kitza, Ralf Schlüter, Hermann Ney

    Abstract: We present a complete training pipeline to build a state-of-the-art hybrid HMM-based ASR system on the 2nd release of the TED-LIUM corpus. Data augmentation using SpecAugment is successfully applied to improve performance on top of our best SAT model using i-vectors. By investigating the effect of different maskings, we achieve improvements from SpecAugment on hybrid HMM models without increasing…

    Submitted 2 April, 2020; originally announced April 2020.

    Comments: accepted at ICASSP 2020

  48. arXiv:1912.09257  [pdf, other]

    cs.CL cs.LG eess.AS

    Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems

    Authors: Nick Rossenbach, Albert Zeyer, Ralf Schlüter, Hermann Ney

    Abstract: Recent advances in text-to-speech (TTS) led to the development of flexible multi-speaker end-to-end TTS systems. We extend state-of-the-art attention-based automatic speech recognition (ASR) systems with synthetic audio generated by a TTS system trained only on the ASR corpora itself. ASR and TTS systems are built separately to show that text-only data can be used to enhance existing end-to-end AS…

    Submitted 17 February, 2020; v1 submitted 19 December, 2019; originally announced December 2019.

    Comments: Accepted to ICASSP 2020

  49. arXiv:1911.08888  [pdf, other]

    cs.CL cs.LG eess.AS

    On using 2D sequence-to-sequence models for speech recognition

    Authors: Parnia Bahar, Albert Zeyer, Ralf Schlüter, Hermann Ney

    Abstract: Attention-based sequence-to-sequence models have shown promising results in automatic speech recognition. Using these architectures, one-dimensional input and output sequences are related by an attention approach, thereby replacing more explicit alignment processes, like in classical HMM-based modeling. In contrast, here we apply a novel two-dimensional long short-term memory (2DLSTM) architecture…

    Submitted 20 November, 2019; originally announced November 2019.

    Comments: 5 pages, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK, May 2019

  50. arXiv:1911.08876  [pdf, other]

    cs.CL cs.LG eess.AS

    On Using SpecAugment for End-to-End Speech Translation

    Authors: Parnia Bahar, Albert Zeyer, Ralf Schlüter, Hermann Ney

    Abstract: This work investigates a simple data augmentation technique, SpecAugment, for end-to-end speech translation. SpecAugment is a low-cost implementation method applied directly to the audio input features and it consists of masking blocks of frequency channels, and/or time steps. We apply SpecAugment on end-to-end speech translation tasks and achieve up to +2.2% BLEU on LibriSpeech Audiobooks En->F…

    Submitted 20 November, 2019; originally announced November 2019.

    Comments: 8 pages, International Workshop on Spoken Language Translation (IWSLT), Hong Kong, China, November 2019