Showing 1–22 of 22 results for author: Higuchi, Y

  1. Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference

    Authors: Masao Someki, Nicholas Eng, Yosuke Higuchi, Shinji Watanabe

    Abstract: Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy. However, they often suffer from slow inference. This is primarily attributed to the incremental calculation of the decoder. This work proposes a partially AR framework, which employs segment-level vectorized beam sea…

    Submitted 30 September, 2023; v1 submitted 26 September, 2023; originally announced September 2023.

    Comments: Accepted at ASRU 2023

    Journal ref: IEEE Automatic Speech Recognition and Understanding Workshop 2023

  2. arXiv:2309.10524  [pdf, other]

    eess.AS cs.CL cs.SD

    Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition

    Authors: Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi

    Abstract: We present a novel integration of an instruction-tuned large language model (LLM) and end-to-end automatic speech recognition (ASR). Modern LLMs can perform a wide range of linguistic tasks within zero-shot learning when provided with a precise instruction or a prompt to guide the text generation process towards the desired task. We explore using this zero-shot capability of LLMs to extract lingui…

    Submitted 19 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP2024

  3. arXiv:2309.04654  [pdf, other]

    cs.SD eess.AS

    Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition

    Authors: Huaibo Zhao, Yosuke Higuchi, Yusuke Kida, Tetsuji Ogawa, Tetsunori Kobayashi

    Abstract: Achieving high accuracy with low latency has always been a challenge in streaming end-to-end automatic speech recognition (ASR) systems. By attending to more future contexts, a streaming ASR model achieves higher accuracy but results in larger latency, which hurts the streaming performance. In the Mask-CTC framework, an encoder network is trained to learn the feature representation that anticipate…

    Submitted 8 September, 2023; originally announced September 2023.

    Comments: Accepted to EUSIPCO 2023

  4. arXiv:2211.05869  [pdf, other]

    cs.CL cs.SD eess.AS

    A Study on the Integration of Pre-trained SSL, ASR, LM and SLU Models for Spoken Language Understanding

    Authors: Yifan Peng, Siddhant Arora, Yosuke Higuchi, Yushi Ueda, Sujay Kumar, Karthik Ganesan, Siddharth Dalmia, Xuankai Chang, Shinji Watanabe

    Abstract: Collecting sufficient labeled data for spoken language understanding (SLU) is expensive and time-consuming. Recent studies achieved promising results by using pre-trained models in low-resource scenarios. Inspired by this, we aim to ask: which (if any) pre-training strategies can improve performance across SLU benchmarks? To answer this question, we employ four types of pre-trained models and thei…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: Accepted at SLT 2022

  5. arXiv:2211.00795  [pdf, other]

    eess.AS cs.CL cs.SD

    InterMPL: Momentum Pseudo-Labeling with Intermediate CTC Loss

    Authors: Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

    Abstract: This paper presents InterMPL, a semi-supervised learning method of end-to-end automatic speech recognition (ASR) that performs pseudo-labeling (PL) with intermediate supervision. Momentum PL (MPL) trains a connectionist temporal classification (CTC)-based model on unlabeled data by continuously generating pseudo-labels on the fly and improving their quality. In contrast to autoregressive formulati…

    Submitted 16 March, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP2023

  6. arXiv:2211.00792  [pdf, other]

    eess.AS cs.CL cs.SD

    BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder

    Authors: Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

    Abstract: We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic speech recognition (E2E-ASR) model formulated by the transducer with a BERT-enhanced encoder. Integrating a large-scale pre-trained language model (LM) into E2E-ASR has been actively studied, aiming to utilize versatile linguistic knowledge for generating accurate text. One crucial factor that makes this integration challenging…

    Submitted 16 March, 2023; v1 submitted 1 November, 2022; originally announced November 2022.

    Comments: Accepted to ICASSP2023

  7. arXiv:2210.16663  [pdf, other]

    eess.AS cs.CL

    BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

    Authors: Yosuke Higuchi, Brian Yan, Siddhant Arora, Tetsuji Ogawa, Tetsunori Kobayashi, Shinji Watanabe

    Abstract: This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the conditional independence assumptions used in conventional CTC and incorporates linguistic knowledge through the explicit output dependency obtained by BERT contextual embedding. BERT-CTC attends to the full contexts of the…

    Submitted 19 April, 2023; v1 submitted 29 October, 2022; originally announced October 2022.

    Comments: v1: Accepted to Findings of EMNLP2022, v2: Minor corrections and clearer derivation of Eq. (21)

  8. arXiv:2210.05200  [pdf, other]

    cs.CL cs.SD eess.AS

    CTC Alignments Improve Autoregressive Translation

    Authors: Brian Yan, Siddharth Dalmia, Yosuke Higuchi, Graham Neubig, Florian Metze, Alan W Black, Shinji Watanabe

    Abstract: Connectionist Temporal Classification (CTC) is a widely used approach for automatic speech recognition (ASR) that performs conditionally independent monotonic alignment. However for translation, CTC exhibits clear limitations due to the contextual and non-monotonic nature of the task and thus lags behind attentional decoder approaches in terms of translation quality. In this work, we argue that CT…

    Submitted 11 October, 2022; originally announced October 2022.

  9. arXiv:2209.09756  [pdf, other]

    eess.AS

    ESPnet-ONNX: Bridging a Gap Between Research and Production

    Authors: Masao Someki, Yosuke Higuchi, Tomoki Hayashi, Shinji Watanabe

    Abstract: In the field of deep learning, researchers often focus on inventing novel neural network models and improving benchmarks. In contrast, application developers are interested in making models suitable for actual products, which involves optimizing a model for faster inference and adapting a model to various platforms (e.g., C++ and Python). In this work, to fill the gap between the two, we establish…

    Submitted 14 November, 2022; v1 submitted 20 September, 2022; originally announced September 2022.

    Comments: Accepted to APSIPA ASC 2022

  10. arXiv:2201.10103  [pdf, other]

    eess.AS cs.SD

    Improving non-autoregressive end-to-end speech recognition with pre-trained acoustic and language models

    Authors: Keqi Deng, Zehui Yang, Shinji Watanabe, Yosuke Higuchi, Gaofeng Cheng, Pengyuan Zhang

    Abstract: While Transformers have achieved promising results in end-to-end (E2E) automatic speech recognition (ASR), their autoregressive (AR) structure becomes a bottleneck for speeding up the decoding process. For real-world deployment, ASR systems are desired to be highly accurate while achieving fast inference. Non-autoregressive (NAR) models have become a popular alternative due to their fast inference…

    Submitted 26 January, 2022; v1 submitted 25 January, 2022; originally announced January 2022.

    Comments: Accepted by ICASSP2022

  11. arXiv:2110.10402  [pdf, other]

    cs.SD cs.LG eess.AS

    An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR

    Authors: Huaibo Zhao, Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi

    Abstract: In the present paper, an attempt is made to combine Mask-CTC and the triggered attention mechanism to construct a streaming end-to-end automatic speech recognition (ASR) system that provides high performance with low latency. The triggered attention mechanism, which performs autoregressive decoding triggered by the CTC spike, has shown to be effective in streaming ASR. However, in order to maintai…

    Submitted 20 October, 2021; originally announced October 2021.

    Comments: Accepted to APSIPA 2021

  12. arXiv:2110.05249  [pdf, other]

    eess.AS cs.CL cs.SD

    A Comparative Study on Non-Autoregressive Modelings for Speech-to-Text Generation

    Authors: Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, Shinji Watanabe

    Abstract: Non-autoregressive (NAR) models simultaneously generate multiple outputs in a sequence, which significantly reduces the inference speed at the cost of accuracy drop compared to autoregressive baselines. Showing great potential for real-time applications, an increasing number of NAR models have been explored in different fields to mitigate the performance gap against AR models. In this work, we con…

    Submitted 11 October, 2021; originally announced October 2021.

    Comments: Accepted to ASRU2021

  13. arXiv:2110.04948  [pdf, other]

    eess.AS cs.SD

    Advancing Momentum Pseudo-Labeling with Conformer and Initialization Strategy

    Authors: Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori

    Abstract: Pseudo-labeling (PL), a semi-supervised learning (SSL) method where a seed model performs self-training using pseudo-labels generated from untranscribed speech, has been shown to enhance the performance of end-to-end automatic speech recognition (ASR). Our prior work proposed momentum pseudo-labeling (MPL), which performs PL-based SSL via an interaction between online and offline models, inspired…

    Submitted 10 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP2022

  14. arXiv:2110.04109  [pdf, other]

    eess.AS cs.CL

    Hierarchical Conditional End-to-End ASR with CTC and Multi-Granular Subword Units

    Authors: Yosuke Higuchi, Keita Karube, Tetsuji Ogawa, Tetsunori Kobayashi

    Abstract: In end-to-end automatic speech recognition (ASR), a model is expected to implicitly learn representations suitable for recognizing a word-level sequence. However, the huge abstraction gap between input acoustic signals and output linguistic tokens makes it challenging for a model to learn the representations. In this work, to promote the word-level representation learning in end-to-end ASR, we pro…

    Submitted 8 February, 2022; v1 submitted 8 October, 2021; originally announced October 2021.

    Comments: Accepted to ICASSP2022

  15. arXiv:2109.04411  [pdf, other]

    eess.AS cs.CL cs.SD

    Non-autoregressive End-to-end Speech Translation with Parallel Autoregressive Rescoring

    Authors: Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

    Abstract: This article describes an efficient end-to-end speech translation (E2E-ST) framework based on non-autoregressive (NAR) models. End-to-end speech translation models have several advantages over traditional cascade systems such as inference latency reduction. However, conventional AR decoding methods are not fast enough because each token is generated incrementally. NAR models, however, can accelera…

    Submitted 9 September, 2021; originally announced September 2021.

  16. arXiv:2106.08922  [pdf, other]

    eess.AS cs.LG cs.SD

    Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition

    Authors: Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori

    Abstract: Pseudo-labeling (PL) has been shown to be effective in semi-supervised automatic speech recognition (ASR), where a base model is self-trained with pseudo-labels generated from unlabeled data. While PL can be further improved by iteratively updating pseudo-labels as the model evolves, most of the previous approaches involve inefficient retraining of the model or intricate control of the label updat…

    Submitted 16 June, 2021; originally announced June 2021.

    Comments: Accepted to Interspeech 2021

  17. arXiv:2012.13006  [pdf, other]

    eess.AS cs.SD

    The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans

    Authors: Shinji Watanabe, Florian Boyer, Xuankai Chang, Pengcheng Guo, Tomoki Hayashi, Yosuke Higuchi, Takaaki Hori, Wen-Chin Huang, Hirofumi Inaguma, Naoyuki Kamo, Shigeki Karita, Chenda Li, Jing Shi, Aswin Shanmugam Subramanian, Wangyou Zhang

    Abstract: This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. This project was initiated in December 2017 to mainly deal with end-to-end speech recognition experiments based on sequence-to-sequence modeling. The project has grown rapidly and now covers a wide range of speech processing applications. Now ESPnet also includes text…

    Submitted 23 December, 2020; originally announced December 2020.

  18. arXiv:2011.00174  [pdf, other]

    eess.IV cs.CV

    Dense Pixel-wise Micro-motion Estimation of Object Surface by using Low Dimensional Embedding of Laser Speckle Pattern

    Authors: Ryusuke Sagawa, Yusuke Higuchi, Hiroshi Kawasaki, Ryo Furukawa, Takahiro Ito

    Abstract: This paper proposes a method of estimating micro-motion of an object at each pixel that is too small to detect under a common setup of camera and illumination. The method introduces an active-lighting approach to make the motion visually detectable. The approach is based on speckle pattern, which is produced by the mutual interference of laser light on object's surface and continuously changes its…

    Submitted 30 October, 2020; originally announced November 2020.

    Comments: to be published in ACCV2020

  19. arXiv:2010.13956  [pdf, other]

    eess.AS cs.SD

    Recent Developments on ESPnet Toolkit Boosted by Conformer

    Authors: Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, Yuekai Zhang

    Abstract: In this study, we present recent developments on ESPnet: End-to-End Speech Processing toolkit, which mainly involves a recently proposed architecture called Conformer, Convolution-augmented Transformer. This paper shows the results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translations (ST), speech separation (SS) and text-to-…

    Submitted 29 October, 2020; v1 submitted 26 October, 2020; originally announced October 2020.

  20. arXiv:2010.13270  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Improved Mask-CTC for Non-Autoregressive End-to-End ASR

    Authors: Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, Tetsunori Kobayashi

    Abstract: For real-world deployment of automatic speech recognition (ASR), the system is desired to be capable of fast inference while relieving the requirement of computational resources. The recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC), Mask-CTC, fulfills this demand by generating tokens in a non-autoregressive fashion. While Mask-CTC achie…

    Submitted 16 February, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

    Comments: Accepted to ICASSP2021

  21. arXiv:2010.13047  [pdf, other]

    cs.CL cs.SD eess.AS

    Orthros: Non-autoregressive End-to-end Speech Translation with Dual-decoder

    Authors: Hirofumi Inaguma, Yosuke Higuchi, Kevin Duh, Tatsuya Kawahara, Shinji Watanabe

    Abstract: Fast inference speed is an important goal towards real-world deployment of speech translation (ST) systems. End-to-end (E2E) models based on the encoder-decoder architecture are more suitable for this goal than traditional cascaded systems, but their effectiveness regarding decoding speed has not been explored so far. Inspired by recent progress in non-autoregressive (NAR) methods in text-based tr…

    Submitted 18 February, 2021; v1 submitted 25 October, 2020; originally announced October 2020.

    Comments: Accepted at IEEE ICASSP 2021

  22. arXiv:2005.08700  [pdf, other]

    eess.AS cs.SD

    Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict

    Authors: Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, Tetsunori Kobayashi

    Abstract: We present Mask CTC, a novel non-autoregressive end-to-end automatic speech recognition (ASR) framework, which generates a sequence by refining outputs of the connectionist temporal classification (CTC). Neural sequence-to-sequence models are usually autoregressive: each output token is generated by conditioning on previously generated tokens, at the cost of requiring as many iterations a…

    Submitted 17 August, 2020; v1 submitted 18 May, 2020; originally announced May 2020.

    Comments: Accepted to INTERSPEECH2020