Showing 1–11 of 11 results for author: Zeyer, A

  1. arXiv:2309.08436  [pdf, other]

    eess.AS cs.SD stat.ML

    Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition

    Authors: Mohammad Zeineldeen, Albert Zeyer, Ralf Schlüter, Hermann Ney

    Abstract: We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks. A special end-of-chunk (EOC) symbol advances from one chunk to the next chunk, effectively replacing the conventional end-of-sequence symbol. This modification, while minor, situates our model as equivalent to a transduc…

    Submitted 17 January, 2024; v1 submitted 15 September, 2023; originally announced September 2023.

    Comments: Accepted at ICASSP 2024

  2. arXiv:2210.14742  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Monotonic segmental attention for automatic speech recognition

    Authors: Albert Zeyer, Robin Schmitt, Wei Zhou, Ralf Schlüter, Hermann Ney

    Abstract: We introduce a novel segmental-attention model for automatic speech recognition. We restrict the decoder attention to segments to avoid quadratic runtime of global attention, better generalize to long sequences, and eventually enable streaming. We directly compare global-attention and different segmental-attention modeling variants. We develop and compare two separate time-synchronous decoders, on…

    Submitted 26 October, 2022; originally announced October 2022.

    Comments: accepted at SLT: https://slt2022.org/

  3. arXiv:2105.14849  [pdf, other]

    cs.LG cs.AI cs.CL cs.NE cs.SD eess.AS math.ST

    Why does CTC result in peaky behavior?

    Authors: Albert Zeyer, Ralf Schlüter, Hermann Ney

    Abstract: The peaky behavior of CTC models is well known experimentally. However, an understanding about why peaky behavior occurs is missing, and whether this is a good property. We provide a formal analysis of the peaky behavior and gradient descent convergence properties of the CTC loss and related training criteria. Our analysis provides a deep understanding why peaky behavior occurs and when it is subo…

    Submitted 3 June, 2021; v1 submitted 31 May, 2021; originally announced May 2021.

  4. Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept

    Authors: Wei Zhou, Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney

    Abstract: With the advent of direct models in automatic speech recognition (ASR), the formerly prevalent frame-wise acoustic modeling based on hidden Markov models (HMM) diversified into a number of modeling architectures like encoder-decoder attention models, transducer models and segmental models (direct HMM). While transducer models stay with a frame-level model definition, segmental models are defined o…

    Submitted 15 June, 2021; v1 submitted 13 April, 2021; originally announced April 2021.

    Comments: accepted at Interspeech2021

  5. arXiv:2104.05544  [pdf, ps, other]

    cs.CL cs.SD eess.AS stat.ML

    Investigating Methods to Improve Language Model Integration for Attention-based Encoder-Decoder ASR Models

    Authors: Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, Hermann Ney

    Abstract: Attention-based encoder-decoder (AED) models learn an implicit internal language model (ILM) from the training transcriptions. The integration with an external LM trained on much more unpaired text usually leads to better performance. A Bayesian interpretation as in the hybrid autoregressive transducer (HAT) suggests dividing by the prior of the discriminative acoustic model, which corresponds to…

    Submitted 17 June, 2021; v1 submitted 12 April, 2021; originally announced April 2021.

    Comments: accepted to Interspeech 2021

  6. arXiv:2005.09336  [pdf, ps, other]

    eess.AS cs.CL cs.LG cs.NE

    A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models

    Authors: Mohammad Zeineldeen, Albert Zeyer, Wei Zhou, Thomas Ng, Ralf Schlüter, Hermann Ney

    Abstract: Following the rationale of end-to-end modeling, CTC, RNN-T or encoder-decoder-attention models for automatic speech recognition (ASR) use graphemes or grapheme-based subword units based on e.g. byte-pair encoding (BPE). The mapping from pronunciation to spelling is learned completely from data. In contrast to this, classical approaches to ASR employ secondary knowledge sources in the form of phone…

    Submitted 15 April, 2021; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: 5 pages, 6 tables

  7. arXiv:2005.09319  [pdf, other]

    eess.AS cs.LG cs.NE stat.ML

    A New Training Pipeline for an Improved Neural Transducer

    Authors: Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney

    Abstract: The RNN transducer is a promising end-to-end model candidate. We compare the original training criterion with the full marginalization over all alignments, to the commonly used maximum approximation, which simplifies, improves and speeds up our training. We also generalize from the original neural network model and study more powerful models, made possible due to the maximum approximation. We furt…

    Submitted 18 November, 2020; v1 submitted 19 May, 2020; originally announced May 2020.

    Comments: published at Interspeech 2020

  8. arXiv:1912.09257  [pdf, other]

    cs.CL cs.LG eess.AS

    Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems

    Authors: Nick Rossenbach, Albert Zeyer, Ralf Schlüter, Hermann Ney

    Abstract: Recent advances in text-to-speech (TTS) led to the development of flexible multi-speaker end-to-end TTS systems. We extend state-of-the-art attention-based automatic speech recognition (ASR) systems with synthetic audio generated by a TTS system trained only on the ASR corpora itself. ASR and TTS systems are built separately to show that text-only data can be used to enhance existing end-to-end AS…

    Submitted 17 February, 2020; v1 submitted 19 December, 2019; originally announced December 2019.

    Comments: Accepted to ICASSP 2020

  9. arXiv:1911.08888  [pdf, other]

    cs.CL cs.LG eess.AS

    On using 2D sequence-to-sequence models for speech recognition

    Authors: Parnia Bahar, Albert Zeyer, Ralf Schlüter, Hermann Ney

    Abstract: Attention-based sequence-to-sequence models have shown promising results in automatic speech recognition. Using these architectures, one-dimensional input and output sequences are related by an attention approach, thereby replacing more explicit alignment processes, like in classical HMM-based modeling. In contrast, here we apply a novel two-dimensional long short-term memory (2DLSTM) architecture…

    Submitted 20 November, 2019; originally announced November 2019.

    Comments: 5 pages, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Brighton, UK, May 2019

  10. arXiv:1911.08876  [pdf, other]

    cs.CL cs.LG eess.AS

    On Using SpecAugment for End-to-End Speech Translation

    Authors: Parnia Bahar, Albert Zeyer, Ralf Schlüter, Hermann Ney

    Abstract: This work investigates a simple data augmentation technique, SpecAugment, for end-to-end speech translation. SpecAugment is a low-cost implementation method applied directly to the audio input features and it consists of masking blocks of frequency channels, and/or time steps. We apply SpecAugment on end-to-end speech translation tasks and achieve up to +2.2% BLEU on LibriSpeech Audiobooks En->F…

    Submitted 20 November, 2019; originally announced November 2019.

    Comments: 8 pages, International Workshop on Spoken Language Translation (IWSLT), Hong Kong, China, November 2019

  11. RWTH ASR Systems for LibriSpeech: Hybrid vs Attention -- w/o Data Augmentation

    Authors: Christoph Lüscher, Eugen Beck, Kazuki Irie, Markus Kitza, Wilfried Michel, Albert Zeyer, Ralf Schlüter, Hermann Ney

    Abstract: We present state-of-the-art automatic speech recognition (ASR) systems employing a standard hybrid DNN/HMM architecture compared to an attention-based encoder-decoder design for the LibriSpeech task. Detailed descriptions of the system development, including model design, pretraining schemes, training schedules, and optimization approaches are provided for both system architectures. Both hybrid DN…

    Submitted 25 July, 2019; v1 submitted 8 May, 2019; originally announced May 2019.

    Comments: Proceedings of INTERSPEECH 2019