Skip to main content

Showing 1–11 of 11 results for author: Sudo, Y

Searching in archive eess. Search in all archives.
.
  1. arXiv:2406.16120  [pdf, other

    eess.AS cs.CL cs.SD

    Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

    Authors: Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe

    Abstract: Contextualized end-to-end automatic speech recognition has been an active research area, with recent efforts focusing on the implicit learning of contextual phrases based on the final loss objective. However, these approaches ignore the useful contextual knowledge encoded in the intermediate layers. We hypothesize that employing explicit biasing loss as an auxiliary task in the encoder intermediat… ▽ More

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH 2024

  2. arXiv:2406.02950  [pdf, other

    eess.AS cs.CL cs.SD

    4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

    Authors: Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, Shinji Watanabe

    Abstract: End-to-end automatic speech recognition (E2E-ASR) can be classified into several network architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and mask-predict models. Each network architecture has advantages and disadvantages, leading practitioners to switch between these different models depending on appl… ▽ More

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: submitted to IEEE/ACM Transactions on Audio Speech and Language Processing

  3. arXiv:2405.13514  [pdf, other

    eess.AS cs.CL cs.SD

    Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

    Authors: Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes: streaming and non-streaming, each with its pros and cons. Streaming ASR processes the speech frames in real-time as it is being received, while non-streaming ASR waits for the entire speech utterance; thus, professionals may have to operate in either mode to satisfy their application. In this work, we present joint optim… ▽ More

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: Accepted to IEEE ICASSP 2024 workshop Hands-free Speech Communication and Microphone Arrays (HSCMA 2024)

  4. arXiv:2405.13344  [pdf, other

    eess.AS cs.CL cs.SD

    Contextualized Automatic Speech Recognition with Dynamic Vocabulary

    Authors: Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, Shinji Watanabe

    Abstract: Deep biasing (DB) improves the performance of end-to-end automatic speech recognition (E2E-ASR) for rare words or contextual phrases using a bias list. However, most existing methods treat bias phrases as sequences of subwords in a predefined static vocabulary, which can result in ineffective learning of the dependencies between subwords. More advanced techniques address this problem by incorporat… ▽ More

    Submitted 22 May, 2024; originally announced May 2024.

  5. arXiv:2402.12654  [pdf, other

    cs.CL cs.SD eess.AS

    OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification

    Authors: Yifan Peng, Yui Sudo, Muhammad Shakeel, Shinji Watanabe

    Abstract: There has been an increasing interest in large speech models that can perform multiple tasks in a single model. Such models usually adopt an encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also have potential risks of hallucination. Thou… ▽ More

    Submitted 16 June, 2024; v1 submitted 19 February, 2024; originally announced February 2024.

    Comments: Accepted at ACL 2024 main conference

  6. arXiv:2401.16658  [pdf, ps, other

    cs.CL eess.AS

    OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

    Authors: Yifan Peng, **chuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe

    Abstract: Recent studies have highlighted the importance of fully open foundation models. The Open Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper using public data and open-source toolkits. However, previous versions of OWSM (v1 to v3) are still based on standard Transformer, which might lead to inferior performance compared to state-of-the-art speech encoder archite… ▽ More

    Submitted 16 June, 2024; v1 submitted 29 January, 2024; originally announced January 2024.

    Comments: Accepted at INTERSPEECH 2024. Webpage: https://www.wavlab.org/activities/2024/owsm/

  7. arXiv:2401.10449  [pdf, other

    eess.AS cs.CL cs.SD

    Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search

    Authors: Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Yifan Peng, Shinji Watanabe

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, since the performance of such methods is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easily contextualized by the user or de… ▽ More

    Submitted 18 January, 2024; originally announced January 2024.

    Comments: accepted by ICASSP20224

  8. arXiv:2309.13876  [pdf, other

    cs.CL cs.SD eess.AS

    Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data

    Authors: Yifan Peng, **chuan Tian, Brian Yan, Dan Berrebbi, Xuankai Chang, Xinjian Li, Jiatong Shi, Siddhant Arora, William Chen, Roshan Sharma, Wangyou Zhang, Yui Sudo, Muhammad Shakeel, Jee-weon Jung, Soumi Maiti, Shinji Watanabe

    Abstract: Pre-training speech models on large volumes of data has achieved remarkable success. OpenAI Whisper is a multilingual multitask model trained on 680k hours of supervised speech data. It generalizes well to various speech recognition and translation benchmarks even in a zero-shot setup. However, the full pipeline for develo** such models (from data collection to training) is not publicly accessib… ▽ More

    Submitted 24 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

    Comments: Accepted at ASRU 2023

  9. arXiv:2305.17846  [pdf, other

    cs.SD cs.CL eess.AS

    Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation

    Authors: Yui Sudo, Kazuya Hata, Kazuhiro Nakadai

    Abstract: End-to-end automatic speech recognition (E2E-ASR) has the potential to improve performance, but a specific issue that needs to be addressed is the difficulty it has in handling enharmonic words: named entities (NEs) with the same pronunciation and part of speech that are spelled differently. This often occurs with Japanese personal names that have the same pronunciation but different Kanji charact… ▽ More

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: accepted by INTERSPEECH2023

  10. arXiv:2305.17651  [pdf, other

    cs.CL cs.SD eess.AS

    DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models

    Authors: Yifan Peng, Yui Sudo, Shakeel Muhammad, Shinji Watanabe

    Abstract: Self-supervised learning (SSL) has achieved notable success in many speech processing tasks, but the large model size and heavy computational cost hinder the deployment. Knowledge distillation trains a small student model to mimic the behavior of a large teacher model. However, the student architecture usually needs to be manually designed and will remain fixed during training, which requires prio… ▽ More

    Submitted 28 May, 2023; originally announced May 2023.

    Comments: Accepted at INTERSPEECH 2023. Code will be available at: https://github.com/pyf98/DPHuBERT

  11. arXiv:2212.10818  [pdf, other

    cs.SD cs.CL eess.AS

    4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

    Authors: Yui Sudo, Muhammad Shakeel, Brian Yan, Jiatong Shi, Shinji Watanabe

    Abstract: The network architecture of end-to-end (E2E) automatic speech recognition (ASR) can be classified into several models, including connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention mechanism, and non-autoregressive mask-predict models. Since each of these network architectures has pros and cons, a typical use case is to switch these separate models d… ▽ More

    Submitted 29 May, 2023; v1 submitted 21 December, 2022; originally announced December 2022.

    Comments: Accepted by INTERRSPEECH2023