Showing 1–50 of 55 results for author: Kanda, N

  1. arXiv:2406.18009  [pdf, other]

    eess.AS cs.SD

    E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

    Authors: Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda

    Abstract: This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the…

    Submitted 25 June, 2024; originally announced June 2024.

  2. arXiv:2406.05699  [pdf, ps, other]

    eess.AS cs.AI eess.SP

    An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS

    Authors: Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Yufei Xia, Jinzhu Li, Sheng Zhao, Jinyu Li, Naoyuki Kanda

    Abstract: Recently, zero-shot text-to-speech (TTS) systems, capable of synthesizing any speaker's voice from a short audio prompt, have made rapid advancements. However, the quality of the generated speech significantly deteriorates when the audio prompt contains noise, and limited research has been conducted to address this issue. In this paper, we explored various strategies to enhance the quality of audi…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH2024

  3. arXiv:2406.04281  [pdf, other]

    eess.AS

    Total-Duration-Aware Duration Modeling for Text-to-Speech Systems

    Authors: Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Jinyu Li, Sheng Zhao, Naoyuki Kanda

    Abstract: Accurate control of the total duration of generated speech by adjusting the speech rate is crucial for various text-to-speech (TTS) applications. However, the impact of adjusting the speech rate on speech quality, such as intelligibility and speaker characteristics, has been underexplored. In this work, we propose a novel total-duration-aware (TDA) duration model for TTS, where phoneme durations a…

    Submitted 6 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  4. arXiv:2402.07383  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like

    Authors: Naoyuki Kanda, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Hemin Yang, Zirun Zhu, Min Tang, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Yufei Xia, Jinzhu Li, Yanqing Liu, Sheng Zhao, Michael Zeng

    Abstract: Laughter is one of the most expressive and natural aspects of human speech, conveying emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the ability to produce realistic and appropriate laughter sounds, limiting their applications and user experience. While there have been prior works to generate natural laughter, they fell short in terms of controlling the timing an…

    Submitted 4 March, 2024; v1 submitted 11 February, 2024; originally announced February 2024.

    Comments: See https://aka.ms/elate/ for demo samples, v2: subjective evaluation has been added

  5. arXiv:2401.08887  [pdf, ps, other]

    cs.SD cs.AI cs.CL eess.AS

    NOTSOFAR-1 Challenge: New Datasets, Baseline, and Tasks for Distant Meeting Transcription

    Authors: Alon Vinnikov, Amir Ivry, Aviv Hurvitz, Igor Abramovski, Sharon Koubi, Ilya Gurvich, Shai Pe'er, Xiong Xiao, Benjamin Martinez Elizalde, Naoyuki Kanda, Xiaofei Wang, Shalev Shaer, Stav Yagev, Yossi Asher, Sunit Sivasankaran, Yifan Gong, Min Tang, Huaming Wang, Eyal Krupka

    Abstract: We introduce the first Natural Office Talkers in Settings of Far-field Audio Recordings ("NOTSOFAR-1") Challenge alongside datasets and baseline system. The challenge focuses on distant speaker diarization and automatic speech recognition (DASR) in far-field meeting scenarios, with single-channel and known-geometry multi-channel tracks, and serves as a launch platform for two new datasets: First…

    Submitted 16 January, 2024; originally announced January 2024.

    Comments: preprint

  6. arXiv:2310.14806  [pdf, ps, other]

    cs.CL cs.SD eess.AS

    Leveraging Timestamp Information for Serialized Joint Streaming Recognition and Translation

    Authors: Sara Papi, Peidong Wang, Junkun Chen, Jian Xue, Naoyuki Kanda, Jinyu Li, Yashesh Gaur

    Abstract: The growing need for instant spoken language transcription and translation is driven by increased global communication and cross-lingual interactions. This has made offering translations in multiple languages essential for user applications. Traditional approaches to automatic speech recognition (ASR) and speech translation (ST) have often relied on separate systems, leading to inefficiencies in c…

    Submitted 23 October, 2023; originally announced October 2023.

    Comments: © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

  7. arXiv:2309.12521  [pdf, other]

    cs.SD eess.AS

    Profile-Error-Tolerant Target-Speaker Voice Activity Detection

    Authors: Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Midia Yousefi, Takuya Yoshioka, Jian Wu

    Abstract: Target-Speaker Voice Activity Detection (TS-VAD) utilizes a set of speaker profiles alongside an input audio signal to perform speaker diarization. While its superiority over conventional methods has been demonstrated, the method can suffer from errors in speaker profiles, as those profiles are typically obtained by running a traditional clustering-based diarization method over the input signal. T…

    Submitted 3 April, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

    Comments: Submission for ICASSP 2024

  8. arXiv:2309.08131  [pdf, other]

    eess.AS cs.SD

    t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability

    Authors: Jian Wu, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao, Zhuo Chen, Jinyu Li

    Abstract: Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR). T-SOT effectively handles overlapped speech by representing multi-talker transcriptions as a single token stream with $\langle \text{cc}\rangle$ symbols interspersed. However, the use of a naive neural transducer architecture significantly cons…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: 5 pages, 2 figures, submitted to ICASSP2024

  9. arXiv:2309.08007  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    DiariST: Streaming Speech Translation with Speaker Diarization

    Authors: Mu Yang, Naoyuki Kanda, Xiaofei Wang, Junkun Chen, Peidong Wang, Jian Xue, Jinyu Li, Takuya Yoshioka

    Abstract: End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges such as speaker diarization (SD) without accurate word time stamps and handling of overlapping speech in a streaming fashion. In this work, we propose DiariST, the first streaming ST and SD solution. It is built upon a neural transducer-based streaming ST system and integrates token-level seri…

    Submitted 22 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024

  10. arXiv:2308.06873  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

    Authors: Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka

    Abstract: Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile…

    Submitted 25 June, 2024; v1 submitted 13 August, 2023; originally announced August 2023.

    Comments: To appear in TASLP. See https://aka.ms/speechx for demo samples

  11. arXiv:2305.18747  [pdf, other]

    eess.AS cs.CL

    Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

    Authors: Chenda Li, Yao Qian, Zhuo Chen, Naoyuki Kanda, Dongmei Wang, Takuya Yoshioka, Yanmin Qian, Michael Zeng

    Abstract: State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages. However, it remains a challenge for these models to recognize overlapped speech, which is often seen in meeting conversations. We propose an approach to adapt USMs for multi-talker ASR. We first develop an enhanced version of serialized out…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted by Interspeech 2023

  12. arXiv:2305.12311  [pdf, other]

    cs.CL cs.AI cs.CV cs.LG eess.AS

    i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

    Authors: Ziyi Yang, Mahmoud Khademi, Yichong Xu, Reid Pryzant, Yuwei Fang, Chenguang Zhu, Dongdong Chen, Yao Qian, Mei Gao, Yi-Ling Chen, Robert Gmyr, Naoyuki Kanda, Noel Codella, Bin Xiao, Yu Shi, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang

    Abstract: The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence, however the current Vision-Language-Speech landscape is dominated by encoder-only models which lack generative abilities. We propose closing this gap with i-Code V2, the first model capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 is a…

    Submitted 20 May, 2023; originally announced May 2023.

  13. arXiv:2302.12369  [pdf, other]

    eess.AS cs.CL cs.SD

    Factual Consistency Oriented Speech Recognition

    Authors: Naoyuki Kanda, Takuya Yoshioka, Yang Liu

    Abstract: This paper presents a novel optimization framework for automatic speech recognition (ASR) with the aim of reducing hallucinations produced by an ASR model. The proposed framework optimizes the ASR model to maximize an expected factual consistency score between ASR hypotheses and ground-truth transcriptions, where the factual consistency score is computed by a separately trained estimator. Experime…

    Submitted 23 February, 2023; originally announced February 2023.

    Comments: 5 pages, 1 figure, 3 tables

  14. arXiv:2211.06493  [pdf, other]

    eess.AS cs.SD eess.SP

    Handling Trade-Offs in Speech Separation with Sparsely-Gated Mixture of Experts

    Authors: Xiaofei Wang, Zhuo Chen, Yu Shi, Jian Wu, Naoyuki Kanda, Takuya Yoshioka

    Abstract: Employing a monaural speech separation (SS) model as a front-end for automatic speech recognition (ASR) involves balancing two kinds of trade-offs. First, while a larger model improves the SS performance, it also requires a higher computational cost. Second, an SS model that is more optimized for handling overlapped speech is likely to introduce more processing artifacts in non-overlapped-speech r…

    Submitted 30 May, 2023; v1 submitted 11 November, 2022; originally announced November 2022.

  15. arXiv:2211.05564  [pdf, other]

    eess.AS cs.SD

    Self-supervised learning with bi-label masked speech prediction for streaming multi-talker speech recognition

    Authors: Zili Huang, Zhuo Chen, Naoyuki Kanda, Jian Wu, Yiming Wang, Jinyu Li, Takuya Yoshioka, Xiaofei Wang, Peidong Wang

    Abstract: Self-supervised learning (SSL), which utilizes the input data itself for representation learning, has achieved state-of-the-art results for various downstream speech tasks. However, most of the previous studies focused on offline single-talker applications, with limited investigations in multi-talker cases, especially for streaming scenarios. In this paper, we investigate SSL for streaming multi-t…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: submitted to ICASSP 2023

  16. arXiv:2211.05172  [pdf, other]

    eess.AS cs.CL cs.SD

    Speech separation with large-scale self-supervised learning

    Authors: Zhuo Chen, Naoyuki Kanda, Jian Wu, Yu Wu, Xiaofei Wang, Takuya Yoshioka, Jinyu Li, Sunit Sivasankaran, Sefik Emre Eskimez

    Abstract: Self-supervised learning (SSL) methods such as WavLM have shown promising speech separation (SS) results in small-scale simulation-based experiments. In this work, we extend the exploration of the SSL-based SS by massively scaling up both the pre-training data (more than 300K hours) and fine-tuning data (10K hours). We also investigate various techniques to efficiently integrate the pre-trained mo…

    Submitted 25 November, 2022; v1 submitted 9 November, 2022; originally announced November 2022.

  17. arXiv:2210.15715  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Simulating realistic speech overlaps improves multi-talker ASR

    Authors: Muqiao Yang, Naoyuki Kanda, Xiaofei Wang, Jian Wu, Sunit Sivasankaran, Zhuo Chen, Jinyu Li, Takuya Yoshioka

    Abstract: Multi-talker automatic speech recognition (ASR) has been studied to generate transcriptions of natural conversation including overlapping speech of multiple speakers. Due to the difficulty in acquiring real conversation data with high-quality human transcriptions, a naïve simulation of multi-talker speech by randomly mixing multiple utterances was conventionally used for model training. In this wo…

    Submitted 17 November, 2022; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: v2: fix minor typo

  18. arXiv:2209.04974  [pdf, other]

    eess.AS cs.CL cs.SD

    VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition

    Authors: Naoyuki Kanda, Jian Wu, Xiaofei Wang, Zhuo Chen, Jinyu Li, Takuya Yoshioka

    Abstract: This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on independently developed two recent technologies; array-geometry-agnostic continuous speech separation, or VarArray, and streaming multi-talker ASR based on token-level…

    Submitted 3 October, 2022; v1 submitted 11 September, 2022; originally announced September 2022.

    Comments: 6 pages, 2 figure, 3 tables, v2: Appendix A has been added

  19. arXiv:2208.13085  [pdf, other]

    eess.AS cs.CL cs.SD

    Target Speaker Voice Activity Detection with Transformers and Its Integration with End-to-End Neural Diarization

    Authors: Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Takuya Yoshioka, Jian Wu

    Abstract: This paper describes a speaker diarization model based on target speaker voice activity detection (TS-VAD) using transformers. To overcome the original TS-VAD model's drawback of being unable to handle an arbitrary number of speakers, we investigate model architectures that use input tensors with variable-length time and speaker dimensions. Transformer layers are applied to the speaker axis to mak…

    Submitted 25 September, 2022; v1 submitted 27 August, 2022; originally announced August 2022.

  20. arXiv:2207.06774  [pdf, ps, other]

    eess.SP physics.flu-dyn

    Proof-of-concept Study of Sparse Processing Particle Image Velocimetry for Real Time Flow Observation

    Authors: Naoki Kanda, Chihaya Abe, Shintaro Goto, Keigo Yamada, Kumi Nakai, Yuji Saito, Keisuke Asai, Taku Nonomura

    Abstract: In this paper, we overview, evaluate, and demonstrate the sparse processing particle image velocimetry (SPPIV) as a real-time flow field estimation method using the particle image velocimetry (PIV), whereas SPPIV was previously proposed with its feasibility study and its real-time demonstration is conducted for the first time in this study. In the wind tunnel test, the PIV measurement and real-tim…

    Submitted 29 August, 2022; v1 submitted 14 July, 2022; originally announced July 2022.

    Comments: Accepted manuscript for publication in Experiments in Fluids

    Journal ref: Experiments in Fluids 63, Article number: 143 (2022)

  21. arXiv:2205.01818  [pdf, other]

    cs.LG cs.AI cs.CL cs.CV eess.AS

    i-Code: An Integrative and Composable Multimodal Learning Framework

    Authors: Ziyi Yang, Yuwei Fang, Chenguang Zhu, Reid Pryzant, Dongdong Chen, Yu Shi, Yichong Xu, Yao Qian, Mei Gao, Yi-Ling Chen, Liyang Lu, Yujia Xie, Robert Gmyr, Noel Codella, Naoyuki Kanda, Bin Xiao, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang

    Abstract: Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations. I…

    Submitted 5 May, 2022; v1 submitted 3 May, 2022; originally announced May 2022.

  22. arXiv:2204.03232  [pdf, other]

    eess.AS cs.AI eess.SP

    Leveraging Real Conversational Data for Multi-Channel Continuous Speech Separation

    Authors: Xiaofei Wang, Dongmei Wang, Naoyuki Kanda, Sefik Emre Eskimez, Takuya Yoshioka

    Abstract: Existing multi-channel continuous speech separation (CSS) models are heavily dependent on supervised data - either simulated data which causes data mismatch between the training and real-data testing, or the real transcribed overlapping data, which is difficult to be acquired, hindering further improvements in the conversational/meeting transcription tasks. In this paper, we propose a three-stage…

    Submitted 7 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  23. arXiv:2203.16685  [pdf, other]

    eess.AS cs.CL cs.SD

    Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

    Authors: Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

    Abstract: This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what" with low latency even when multiple people are speaking simultaneously. Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion. To further recognize speaker identities,…

    Submitted 14 July, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted for presentation at Interspeech 2022

  24. arXiv:2202.00842  [pdf, other]

    eess.AS cs.CL cs.SD

    Streaming Multi-Talker ASR with Token-Level Serialized Output Training

    Authors: Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

    Abstract: This paper proposes a token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models using multiple output branches, the t-SOT model has only a single output branch that generates recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their em…

    Submitted 14 July, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

    Comments: 6 pages, 1 figure, 7 tables, v2: minor fixes, v3: Appendix D has been added, v4: citation to [27] has been added, v5: citations to [28][29][30] have been added with minor fixes, short version accepted for presentation at Interspeech 2022

  25. arXiv:2110.14142  [pdf, other]

    eess.AS cs.SD

    Separating Long-Form Speech with Group-Wise Permutation Invariant Training

    Authors: Wangyou Zhang, Zhuo Chen, Naoyuki Kanda, Shujie Liu, Jinyu Li, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei

    Abstract: Multi-talker conversational speech processing has drawn many interests for various applications such as meeting transcription. Speech separation is often required to handle overlapped speech that is commonly observed in conversation. Although the original utterance-level permutation invariant training-based continuous speech separation approach has proven to be effective in various conditions, it l…

    Submitted 17 November, 2021; v1 submitted 26 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures, 3 tables, submitted to IEEE ICASSP 2022

  26. arXiv:2110.13900  [pdf, other]

    cs.CL cs.SD eess.AS

    WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

    Authors: Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei

    Abstract: Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained…

    Submitted 17 June, 2022; v1 submitted 26 October, 2021; originally announced October 2021.

    Comments: Submitted to the Journal of Selected Topics in Signal Processing (JSTSP)

  27. arXiv:2110.06428  [pdf, other]

    eess.AS cs.SD

    All-neural beamformer for continuous speech separation

    Authors: Zhuohuang Zhang, Takuya Yoshioka, Naoyuki Kanda, Zhuo Chen, Xiaofei Wang, Dongmei Wang, Sefik Emre Eskimez

    Abstract: Continuous speech separation (CSS) aims to separate overlapping voices from a continuous influx of conversational audio containing an unknown number of utterances spoken by an unknown number of speakers. A common application scenario is transcribing a meeting conversation recorded by a microphone array. Prior studies explored various deep learning models for time-frequency mask estimation, followe…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures, 2 tables

  28. arXiv:2110.05745  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    VarArray: Array-Geometry-Agnostic Continuous Speech Separation

    Authors: Takuya Yoshioka, Xiaofei Wang, Dongmei Wang, Min Tang, Zirun Zhu, Zhuo Chen, Naoyuki Kanda

    Abstract: Continuous speech separation using a microphone array was shown to be promising in dealing with the speech overlap problem in natural conversation transcription. This paper proposes VarArray, an array-geometry-agnostic speech separation neural network model. The proposed model is applicable to any number of microphones without retraining while leveraging the nonlinear correlation between the input…

    Submitted 26 October, 2021; v1 submitted 12 October, 2021; originally announced October 2021.

    Comments: 5 pages, 1 figure, 3 tables, submitted to ICASSP 2022; updated reference information of [33]

  29. arXiv:2110.05354  [pdf, ps, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition

    Authors: Zhong Meng, Yashesh Gaur, Naoyuki Kanda, Jinyu Li, Xie Chen, Yu Wu, Yifan Gong

    Abstract: Text-only adaptation of an end-to-end (E2E) model remains a challenging task for automatic speech recognition (ASR). Language model (LM) fusion-based approaches require an additional external LM during inference, significantly increasing the computation cost. To overcome this, we propose an internal LM adaptation (ILMA) of the E2E model using text-only data. Trained with audio-transcript pairs, an…

    Submitted 26 June, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: 5 pages, in Interspeech 2022

    Journal ref: Interspeech 2022, Incheon, Korea

  30. arXiv:2110.03151  [pdf, other]

    eess.AS cs.CL cs.SD

    Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR

    Authors: Naoyuki Kanda, Xiong Xiao, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

    Abstract: This paper presents Transcribe-to-Diarize, a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR). The E2E SA-ASR is a joint model that was recently proposed for speaker counting, multi-talker speech recognition, and speaker identification from monaural audio that contains overlapping speech. Although the E2E SA-ASR mode…

    Submitted 21 January, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: To appear in ICASSP 2022; System labels (SC and VBx) in Table 1 have been fixed

  31. arXiv:2107.02852  [pdf, other]

    eess.AS cs.CL cs.SD

    A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio

    Authors: Naoyuki Kanda, Xiong Xiao, Jian Wu, Tianyan Zhou, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

    Abstract: Speaker-attributed automatic speech recognition (SA-ASR) is a task to recognize "who spoke what" from multi-talker recordings. An SA-ASR system usually consists of multiple modules such as speech separation, speaker diarization and ASR. On the other hand, considering the joint optimization, an end-to-end (E2E) SA-ASR model has recently been proposed with promising results on simulation data. In th…

    Submitted 17 September, 2021; v1 submitted 6 July, 2021; originally announced July 2021.

    Comments: To appear in ASRU 2021

  32. arXiv:2107.01922  [pdf, ps, other]

    eess.AS cs.SD

    Investigation of Practical Aspects of Single Channel Speech Separation for ASR

    Authors: Jian Wu, Zhuo Chen, Sanyuan Chen, Yu Wu, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu, Jinyu Li

    Abstract: Speech separation has been successfully applied as a frontend processing module of conversation transcription systems thanks to its ability to handle overlapped speech and its flexibility to combine with downstream tasks such as automatic speech recognition (ASR). However, a speech separation model often introduces target speech distortion, resulting in a sub-optimum word error rate (WER). In this…

    Submitted 5 July, 2021; originally announced July 2021.

    Comments: Accepted by Interspeech 2021

  33. arXiv:2106.02302  [pdf, ps, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition

    Authors: Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric Sun, Jinyu Li, Yifan Gong

    Abstract: Integrating external language models (LMs) into end-to-end (E2E) models remains a challenging task for domain-adaptive speech recognition. Recently, internal language model estimation (ILME)-based LM fusion has shown significant word error rate (WER) reduction from Shallow Fusion by subtracting a weighted internal LM score from an interpolation of E2E model and external LM scores during beam searc…

    Submitted 4 June, 2021; originally announced June 2021.

    Comments: 5 pages, Interspeech 2021

    Journal ref: Interspeech 2021, Brno, Czech Republic

  34. arXiv:2104.02128  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Speaker-Attributed ASR with Transformer

    Authors: Naoyuki Kanda, Guoli Ye, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

    Abstract: This paper presents our recent effort on end-to-end speaker-attributed automatic speech recognition, which jointly performs speaker counting, speech recognition and speaker identification for monaural multi-talker audio. Firstly, we thoroughly update the model architecture that was previously designed based on a long short-term memory (LSTM)-based attention encoder decoder by applying transformer…

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: Submitted to INTERSPEECH 2021

  35. arXiv:2103.16776  [pdf, other]

    eess.AS cs.CL cs.SD

    Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone

    Authors: Naoyuki Kanda, Guoli Ye, Yu Wu, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

    Abstract: Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR). While various approaches have been proposed, all previous studies on the monaural overlapped speech recognition problem were based on either simulation data or small-scale real data. In this paper, we extensively invest…

    Submitted 12 April, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

    Comments: Submitted to INTERSPEECH 2021

  36. arXiv:2102.06283  [pdf, other]

    cs.CL cs.SD eess.AS

    Speech-language Pre-training for End-to-end Spoken Language Understanding

    Authors: Yao Qian, Ximo Bian, Yu Shi, Naoyuki Kanda, Leo Shen, Zhen Xiao, Michael Zeng

    Abstract: End-to-end (E2E) spoken language understanding (SLU) can infer semantics directly from speech signal without cascading an automatic speech recognizer (ASR) with a natural language understanding (NLU) module. However, paired utterance recordings and corresponding semantics may not always be available or sufficient to train an E2E SLU model in a real production environment. In this paper, we propose…

    Submitted 11 February, 2021; originally announced February 2021.

  37. arXiv:2102.01380  [pdf, ps, other]

    eess.AS cs.AI cs.CL cs.LG cs.SD

    Internal Language Model Training for Domain-Adaptive End-to-End Speech Recognition

    Authors: Zhong Meng, Naoyuki Kanda, Yashesh Gaur, Sarangarajan Parthasarathy, Eric Sun, Liang Lu, Xie Chen, Jinyu Li, Yifan Gong

    Abstract: The efficacy of external language model (LM) integration with existing end-to-end (E2E) automatic speech recognition (ASR) systems can be improved significantly using the internal language model estimation (ILME) method. In this method, the internal LM score is subtracted from the score obtained by interpolating the E2E score with the external LM score, during inference. To improve the ILME-based…

    Submitted 22 April, 2021; v1 submitted 2 February, 2021; originally announced February 2021.

    Comments: 5 pages, ICASSP 2021

    Journal ref: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, Canada

  38. arXiv:2101.09624  [pdf, other]

    eess.AS cs.CL cs.SD

    A Review of Speaker Diarization: Recent Advances with Deep Learning

    Authors: Tae Jin Park, Naoyuki Kanda, Dimitrios Dimitriadis, Kyu J. Han, Shinji Watanabe, Shrikanth Narayanan

    Abstract: Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity, or in short, a task to identify "who spoke when". In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker adaptive processing. These algorithms also gained their own value as a standalone application o… ▽ More

    Submitted 26 November, 2021; v1 submitted 23 January, 2021; originally announced January 2021.

    Comments: This article is a preprint version of the article published in Computer Speech & Language, Volume 72, March 2022, 101317

  39. arXiv:2101.01853  [pdf, ps, other

    cs.SD cs.CL eess.AS

    Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings

    Authors: Xuankai Chang, Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

    Abstract: An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification. The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers. However, the E2E modeling approach is susceptible to the mismatch bet… ▽ More

    Submitted 5 January, 2021; originally announced January 2021.

    Comments: Submitted to ICASSP 2021

  40. arXiv:2011.13148  [pdf, ps, other

    cs.SD cs.CL eess.AS

    Streaming end-to-end multi-talker speech recognition

    Authors: Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong

    Abstract: End-to-end multi-talker speech recognition is an emerging research trend in the speech community due to its vast potential in applications such as conversation and meeting transcriptions. To the best of our knowledge, all existing research works are constrained to the offline scenario. In this work, we propose the Streaming Unmixing and Recognition Transducer (SURT) for end-to-end multi-talker spe… ▽ More

    Submitted 12 March, 2021; v1 submitted 26 November, 2020; originally announced November 2020.

    Comments: 5 pages, 3 figures. Accepted to IEEE Signal Processing Letters 2021

  41. arXiv:2011.03110  [pdf, other

    eess.AS cs.SD

    Exploring End-to-End Multi-channel ASR with Bias Information for Meeting Transcription

    Authors: Xiaofei Wang, Naoyuki Kanda, Yashesh Gaur, Zhuo Chen, Zhong Meng, Takuya Yoshioka

    Abstract: Joint optimization of multi-channel front-end and automatic speech recognition (ASR) has attracted much interest. While promising results have been reported for various tasks, past studies on its meeting transcription application were limited to small scale experiments. It is still unclear whether such a joint framework can be beneficial for a more practical setup where a massive amount of single… ▽ More

    Submitted 25 November, 2020; v1 submitted 5 November, 2020; originally announced November 2020.

    Comments: Accepted to SLT2021

  42. arXiv:2011.02921  [pdf, ps, other

    eess.AS cs.CL cs.SD

    Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR

    Authors: Naoyuki Kanda, Zhong Meng, Liang Lu, Yashesh Gaur, Xiaofei Wang, Zhuo Chen, Takuya Yoshioka

    Abstract: Recently, an end-to-end speaker-attributed automatic speech recognition (E2E SA-ASR) model was proposed as a joint model of speaker counting, speech recognition and speaker identification for monaural overlapped speech. In the previous study, the model parameters were trained based on the speaker-attributed maximum mutual information (SA-MMI) criterion, with which the joint posterior probability f… ▽ More

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Submitted to ICASSP 2021. arXiv admin note: text overlap with arXiv:2006.10930, arXiv:2008.04546

  43. arXiv:2011.02014  [pdf, other

    eess.AS cs.SD

    Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis

    Authors: Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, Jinyu Li, Scott Wisdom, John R. Hershey

    Abstract: Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this pape… ▽ More

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Accepted to IEEE SLT 2021

  44. arXiv:2011.01991  [pdf, other

    eess.AS cs.CL cs.LG cs.SD

    Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

    Authors: Zhong Meng, Sarangarajan Parthasarathy, Eric Sun, Yashesh Gaur, Naoyuki Kanda, Liang Lu, Xie Chen, Rui Zhao, Jinyu Li, Yifan Gong

    Abstract: External language model (LM) integration remains a challenging task for end-to-end (E2E) automatic speech recognition (ASR), which has no clear division between acoustic and language models. In this work, we propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models with no additional model training, including… ▽ More

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: 8 pages, 2 figures, SLT 2021

    Journal ref: 2021 IEEE Spoken Language Technology Workshop (SLT)

  45. arXiv:2010.12673  [pdf, other

    cs.CL eess.AS

    On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer

    Authors: Liang Lu, Zhong Meng, Naoyuki Kanda, Jinyu Li, Yifan Gong

    Abstract: Hybrid Autoregressive Transducer (HAT) is a recently proposed end-to-end acoustic model that extends the standard Recurrent Neural Network Transducer (RNN-T) for the purpose of the external language model (LM) fusion. In HAT, the blank probability and the label probability are estimated using two separate probability distributions, which provides a more accurate solution for internal LM score esti… ▽ More

    Submitted 26 March, 2021; v1 submitted 23 October, 2020; originally announced October 2020.

    Comments: 5 pages, 1 figure. Accepted to ICASSP 2021, but withdrawn due to a bug in the code. We updated the results after the bug fix and submitted the paper to Interspeech 2021

  46. arXiv:2010.11458  [pdf, other

    eess.AS cs.SD

    Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020

    Authors: Xiong Xiao, Naoyuki Kanda, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka, Sanyuan Chen, Yong Zhao, Gang Liu, Yu Wu, Jian Wu, Shujie Liu, Jinyu Li, Yifan Gong

    Abstract: This paper describes the Microsoft speaker diarization system for monaural multi-talker recordings in the wild, evaluated at the diarization track of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. We will first explain our system design to address issues in handling real multi-talker recordings. We then present the details of the components, which include Res2Net-based speaker embedding… ▽ More

    Submitted 22 October, 2020; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: 5 pages, 3 figures, 2 tables

  47. arXiv:2008.04546  [pdf, other

    eess.AS cs.CL cs.SD

    Investigation of End-To-End Speaker-Attributed ASR for Continuous Multi-Talker Recordings

    Authors: Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

    Abstract: Recently, an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed as a joint model of speaker counting, speech recognition and speaker identification for monaural overlapped speech. It showed promising results for simulated speech mixtures consisting of various numbers of speakers. However, the model required prior knowledge of speaker profiles to perform sp… ▽ More

    Submitted 11 August, 2020; originally announced August 2020.

  48. arXiv:2006.10930  [pdf, other

    eess.AS cs.CL cs.SD

    Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers

    Authors: Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka

    Abstract: We propose an end-to-end speaker-attributed automatic speech recognition model that unifies speaker counting, speech recognition, and speaker identification on monaural overlapped speech. Our model is built on serialized output training (SOT) with attention-based encoder-decoder, a recently proposed method for recognizing overlapped speech comprising an arbitrary number of speakers. We extend SOT… ▽ More

    Submitted 8 August, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

    Comments: Accepted to INTERSPEECH 2020

  49. arXiv:2004.09249  [pdf, other

    cs.SD cs.CL eess.AS

    CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings

    Authors: Shinji Watanabe, Michael Mandel, Jon Barker, Emmanuel Vincent, Ashish Arora, Xuankai Chang, Sanjeev Khudanpur, Vimal Manohar, Daniel Povey, Desh Raj, David Snyder, Aswin Shanmugam Subramanian, Jan Trmal, Bar Ben Yair, Christoph Boeddeker, Zhaoheng Ni, Yusuke Fujita, Shota Horiguchi, Naoyuki Kanda, Takuya Yoshioka, Neville Ryant

    Abstract: Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous C… ▽ More

    Submitted 2 May, 2020; v1 submitted 20 April, 2020; originally announced April 2020.

  50. arXiv:2003.12687  [pdf, other

    cs.CL cs.SD eess.AS

    Serialized Output Training for End-to-End Overlapped Speech Recognition

    Authors: Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

    Abstract: This paper proposes serialized output training (SOT), a novel framework for multi-speaker overlapped speech recognition based on an attention-based encoder-decoder approach. Instead of having multiple output layers as with the permutation invariant training (PIT), SOT uses a model with only one output layer that generates the transcriptions of multiple speakers one after another. The attention and… ▽ More

    Submitted 8 August, 2020; v1 submitted 27 March, 2020; originally announced March 2020.

    Comments: Accepted to INTERSPEECH 2020