Showing 1–13 of 13 results for author: Stueker, S

Searching in archive eess.
  1. arXiv:2310.11532  [pdf, other]

    cs.CL eess.AS

    Multi-stage Large Language Model Correction for Speech Recognition

    Authors: Jie Pu, Thai-Son Nguyen, Sebastian Stüker

    Abstract: In this paper, we investigate the usage of large language models (LLMs) to improve the performance of competitive speech recognition systems. Different from previous LLM-based ASR error correction methods, we propose a novel multi-stage approach that utilizes uncertainty estimation of ASR outputs and reasoning capability of LLMs. Specifically, the proposed approach has two stages: the first stage…

    Submitted 17 June, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

  2. arXiv:2105.03010  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Efficient Weight factorization for Multilingual Speech Recognition

    Authors: Ngoc-Quan Pham, Tuan-Nam Nguyen, Sebastian Stueker, Alexander Waibel

    Abstract: End-to-end multilingual speech recognition involves using a single model training on a compositional speech corpus including many languages, resulting in a single neural network to handle transcribing different languages. Due to the fact that each language in the training data has different characteristics, the shared network may struggle to optimize for all various languages simultaneously. In th…

    Submitted 6 May, 2021; originally announced May 2021.

    Comments: Submitted to Interspeech 2021

  3. arXiv:2005.09940  [pdf, other]

    eess.AS cs.CL cs.SD

    Relative Positional Encoding for Speech Recognition and Direct Translation

    Authors: Ngoc-Quan Pham, Thanh-Le Ha, Tuan-Nam Nguyen, Thai-Son Nguyen, Elizabeth Salesky, Sebastian Stueker, Jan Niehues, Alexander Waibel

    Abstract: Transformer models are powerful sequence-to-sequence architectures that are capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text modeling, and thus is less ideal for acoustic inputs. In this work, we adapt the relative position encoding scheme to the Speech Transformer, where the key addition…

    Submitted 20 May, 2020; originally announced May 2020.

    Comments: Submitted to Interspeech 2020

  4. arXiv:2003.10022  [pdf, other]

    eess.AS cs.CL cs.SD

    High Performance Sequence-to-Sequence Model for Streaming Speech Recognition

    Authors: Thai-Son Nguyen, Ngoc-Quan Pham, Sebastian Stueker, Alex Waibel

    Abstract: Recently sequence-to-sequence models have started to achieve state-of-the-art performance on standard speech recognition tasks when processing audio data in batch mode, i.e., the complete audio data is available when starting processing. However, when it comes to performing run-on recognition on an input stream of audio data while producing recognition results in real-time and with low word-based…

    Submitted 26 July, 2020; v1 submitted 22 March, 2020; originally announced March 2020.

    Comments: To appear in Interspeech 2020

  5. arXiv:2003.09891  [pdf, other]

    eess.AS cs.CL cs.SD

    Low Latency ASR for Simultaneous Speech Translation

    Authors: Thai Son Nguyen, Jan Niehues, Eunah Cho, Thanh-Le Ha, Kevin Kilgour, Markus Muller, Matthias Sperber, Sebastian Stueker, Alex Waibel

    Abstract: User studies have shown that reducing the latency of our simultaneous lecture translation system should be the most important goal. We therefore have worked on several techniques for reducing the latency for both components, the automatic speech recognition and the speech translation module. Since the commonly used commitment latency is not appropriate in our case of continuous stream decoding, we…

    Submitted 22 March, 2020; originally announced March 2020.

  6. arXiv:2003.04194  [pdf, ps, other]

    eess.AS cs.CV cs.LG cs.SD

    Toward Cross-Domain Speech Recognition with End-to-End Models

    Authors: Thai-Son Nguyen, Sebastian Stüker, Alex Waibel

    Abstract: In the area of multi-domain speech recognition, research in the past focused on hybrid acoustic models to build cross-domain and domain-invariant speech recognition systems. In this paper, we empirically examine the difference in behavior between hybrid acoustic models and neural end-to-end systems when mixing acoustic training data from several domains. For these experiments we composed a multi-d…

    Submitted 9 March, 2020; originally announced March 2020.

    Comments: Presented in Life-Long Learning for Spoken Language Systems Workshop - ASRU 2019

  7. arXiv:1910.13296  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD

    Improving sequence-to-sequence speech recognition training with on-the-fly data augmentation

    Authors: Thai-Son Nguyen, Sebastian Stueker, Jan Niehues, Alex Waibel

    Abstract: Sequence-to-Sequence (S2S) models recently started to show state-of-the-art performance for automatic speech recognition (ASR). With these large and deep models overfitting remains the largest problem, outweighing performance improvements that can be obtained from better architectures. One solution to the overfitting problem is increasing the amount of available training data and the variety exhib…

    Submitted 3 February, 2020; v1 submitted 29 October, 2019; originally announced October 2019.

    Comments: To appear in ICASSP 2020

  8. arXiv:1904.13377  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Very Deep Self-Attention Networks for End-to-End Speech Recognition

    Authors: Ngoc-Quan Pham, Thai-Son Nguyen, Jan Niehues, Markus Müller, Sebastian Stüker, Alexander Waibel

    Abstract: Recently, end-to-end sequence-to-sequence models for speech recognition have gained significant interest in the research community. While previous architecture choices revolve around time-delay neural networks (TDNN) and long short-term memory (LSTM) recurrent neural networks, we propose to use self-attention via the Transformer architecture as an alternative. Our analysis shows that deep Transfor…

    Submitted 3 May, 2019; v1 submitted 30 April, 2019; originally announced April 2019.

    Comments: Submitted to INTERSPEECH 2019

  9. arXiv:1904.02147  [pdf, other]

    eess.AS cs.LG cs.SD

    Learning Shared Encoding Representation for End-to-End Speech Recognition Models

    Authors: Thai-Son Nguyen, Sebastian Stueker, Alex Waibel

    Abstract: In this work, we learn a shared encoding representation for a multi-task neural network model optimized with connectionist temporal classification (CTC) and conventional framewise cross-entropy training criteria. Our experiments show that the multi-task training not only tackles the complexity of optimizing CTC models such as acoustic-to-word but also results in significant improvement compared to…

    Submitted 31 March, 2019; originally announced April 2019.

    Comments: arXiv admin note: substantial text overlap with arXiv:1902.01951

  10. arXiv:1902.01951   

    eess.AS cs.CL cs.LG cs.SD

    Using multi-task learning to improve the performance of acoustic-to-word and conventional hybrid models

    Authors: Thai-Son Nguyen, Sebastian Stueker, Alex Waibel

    Abstract: Acoustic-to-word (A2W) models that allow direct mapping from acoustic signals to word sequences are an appealing approach to end-to-end automatic speech recognition due to their simplicity. However, prior works have shown that modelling A2W typically encounters issues of data sparsity that prevent training such a model directly. So far, pre-training initialization is the only approach proposed to…

    Submitted 15 May, 2019; v1 submitted 2 February, 2019; originally announced February 2019.

    Comments: submitted newer work which includes this paper results

  11. arXiv:1807.01956  [pdf, ps, other]

    cs.CL cs.LG cs.SD eess.AS

    Neural Language Codes for Multilingual Acoustic Models

    Authors: Markus Müller, Sebastian Stüker, Alex Waibel

    Abstract: Multilingual Speech Recognition is one of the most costly AI problems, because each language (7,000+) and even different accents require their own acoustic models to obtain best recognition performance. Even though they all use the same phoneme symbols, each language and accent imposes its own coloring or "twang". Many adaptive approaches have been proposed, but they require further training, addi…

    Submitted 5 July, 2018; originally announced July 2018.

    Comments: 5 pages, 3 figures, accepted at Interspeech 2018

  12. arXiv:1711.04569  [pdf, ps, other]

    eess.AS cs.AI cs.CL

    Multilingual Adaptation of RNN Based ASR Systems

    Authors: Markus Müller, Sebastian Stüker, Alex Waibel

    Abstract: In this work, we focus on multilingual systems based on recurrent neural networks (RNNs), trained using the Connectionist Temporal Classification (CTC) loss function. Using a multilingual set of acoustic units poses difficulties. To address this issue, we proposed Language Feature Vectors (LFVs) to train language adaptive multilingual systems. Language adaptation, in contrast to speaker adaptation…

    Submitted 27 February, 2018; v1 submitted 13 November, 2017; originally announced November 2017.

    Comments: 5 pages, 1 figure, to appear in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018)

  13. arXiv:1711.04564  [pdf, ps, other]

    eess.AS cs.AI cs.CL

    Phonemic and Graphemic Multilingual CTC Based Speech Recognition

    Authors: Markus Müller, Sebastian Stüker, Alex Waibel

    Abstract: Training automatic speech recognition (ASR) systems requires large amounts of data in the target language in order to achieve good performance. Whereas large training corpora are readily available for languages like English, there exists a long tail of languages which do suffer from a lack of resources. One method to handle data sparsity is to use data from additional source languages and build a…

    Submitted 13 November, 2017; originally announced November 2017.