Showing 1–50 of 65 results for author: Yoshioka, T

Searching in archive eess.
  1. arXiv:2405.06289 [pdf, other]

    cs.SD cs.AI eess.AS

    Look Once to Hear: Target Speech Hearing with Noisy Examples

    Authors: Bandhav Veluri, Malek Itani, Tuochao Chen, Takuya Yoshioka, Shyamnath Gollakota

    Abstract: In crowded settings, the human brain can focus on speech from a target speaker, given prior knowledge of how they sound. We introduce a novel intelligent hearable system that achieves this capability, enabling target speech hearing to ignore all interfering speech and noise, but the target speaker. A naive approach is to require a clean speech example to enroll the target speaker. This is however…

    Submitted 29 May, 2024; v1 submitted 10 May, 2024; originally announced May 2024.

    Comments: Best paper honorable mention at CHI 2024

  2. arXiv:2404.09841 [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Anatomy of Industrial Scale Multilingual ASR

    Authors: Francis McCann Ramirez, Luka Chkhetiani, Andrew Ehrenberg, Robert McHardy, Rami Botros, Yash Khare, Andrea Vanzo, Taufiquzzaman Peyash, Gabriel Oexle, Michael Liang, Ilya Sklyar, Enver Fakhan, Ahmed Etefy, Daniel McCrystal, Sam Flamini, Domenic Donato, Takuya Yoshioka

    Abstract: This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed descriptio…

    Submitted 16 April, 2024; v1 submitted 15 April, 2024; originally announced April 2024.

  3. arXiv:2311.00320 [pdf, other]

    cs.SD cs.LG eess.AS

    Semantic Hearing: Programming Acoustic Scenes with Binaural Hearables

    Authors: Bandhav Veluri, Malek Itani, Justin Chan, Takuya Yoshioka, Shyamnath Gollakota

    Abstract: Imagine being able to listen to the birds chirping in a park without hearing the chatter from other hikers, or being able to block out traffic noise on a busy street while still being able to hear emergency sirens and car honks. We introduce semantic hearing, a novel capability for hearable devices that enables them to, in real-time, focus on, or ignore, specific sounds from real-world environment…

    Submitted 1 November, 2023; originally announced November 2023.

  4. arXiv:2309.12521 [pdf, other]

    cs.SD eess.AS

    Profile-Error-Tolerant Target-Speaker Voice Activity Detection

    Authors: Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Midia Yousefi, Takuya Yoshioka, Jian Wu

    Abstract: Target-Speaker Voice Activity Detection (TS-VAD) utilizes a set of speaker profiles alongside an input audio signal to perform speaker diarization. While its superiority over conventional methods has been demonstrated, the method can suffer from errors in speaker profiles, as those profiles are typically obtained by running a traditional clustering-based diarization method over the input signal. T…

    Submitted 3 April, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

    Comments: Submission for ICASSP 2024

  5. arXiv:2309.08131 [pdf, other]

    eess.AS cs.SD

    t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability

    Authors: Jian Wu, Naoyuki Kanda, Takuya Yoshioka, Rui Zhao, Zhuo Chen, Jinyu Li

    Abstract: Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR). T-SOT effectively handles overlapped speech by representing multi-talker transcriptions as a single token stream with $\langle \text{cc}\rangle$ symbols interspersed. However, the use of a naive neural transducer architecture significantly cons…

    Submitted 14 September, 2023; originally announced September 2023.

    Comments: 5 pages, 2 figures, submitted to ICASSP2024

  6. arXiv:2309.08007 [pdf, ps, other]

    eess.AS cs.CL cs.SD

    DiariST: Streaming Speech Translation with Speaker Diarization

    Authors: Mu Yang, Naoyuki Kanda, Xiaofei Wang, Junkun Chen, Peidong Wang, Jian Xue, Jinyu Li, Takuya Yoshioka

    Abstract: End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges such as speaker diarization (SD) without accurate word time stamps and handling of overlapping speech in a streaming fashion. In this work, we propose DiariST, the first streaming ST and SD solution. It is built upon a neural transducer-based streaming ST system and integrates token-level seri…

    Submitted 22 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Accepted to ICASSP 2024

  7. arXiv:2308.06873 [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    SpeechX: Neural Codec Language Model as a Versatile Speech Transformer

    Authors: Xiaofei Wang, Manthan Thakker, Zhuo Chen, Naoyuki Kanda, Sefik Emre Eskimez, Sanyuan Chen, Min Tang, Shujie Liu, Jinyu Li, Takuya Yoshioka

    Abstract: Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile…

    Submitted 25 June, 2024; v1 submitted 13 August, 2023; originally announced August 2023.

    Comments: To appear in TASLP. See https://aka.ms/speechx for demo samples

  8. arXiv:2305.18747 [pdf, other]

    eess.AS cs.CL

    Adapting Multi-Lingual ASR Models for Handling Multiple Talkers

    Authors: Chenda Li, Yao Qian, Zhuo Chen, Naoyuki Kanda, Dongmei Wang, Takuya Yoshioka, Yanmin Qian, Michael Zeng

    Abstract: State-of-the-art large-scale universal speech models (USMs) show a decent automatic speech recognition (ASR) performance across multiple domains and languages. However, it remains a challenge for these models to recognize overlapped speech, which is often seen in meeting conversations. We propose an approach to adapt USMs for multi-talker ASR. We first develop an enhanced version of serialized out…

    Submitted 30 May, 2023; originally announced May 2023.

    Comments: Accepted by Interspeech 2023

  9. arXiv:2305.12311 [pdf, other]

    cs.CL cs.AI cs.CV cs.LG eess.AS

    i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data

    Authors: Ziyi Yang, Mahmoud Khademi, Yichong Xu, Reid Pryzant, Yuwei Fang, Chenguang Zhu, Dongdong Chen, Yao Qian, Mei Gao, Yi-Ling Chen, Robert Gmyr, Naoyuki Kanda, Noel Codella, Bin Xiao, Yu Shi, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang

    Abstract: The convergence of text, visual, and audio data is a key step towards human-like artificial intelligence, however the current Vision-Language-Speech landscape is dominated by encoder-only models which lack generative abilities. We propose closing this gap with i-Code V2, the first model capable of generating natural language from any combination of Vision, Language, and Speech data. i-Code V2 is a…

    Submitted 20 May, 2023; originally announced May 2023.

  10. arXiv:2303.08372 [pdf, other]

    eess.AS cs.SD

    Target Sound Extraction with Variable Cross-modality Clues

    Authors: Chenda Li, Yao Qian, Zhuo Chen, Dongmei Wang, Takuya Yoshioka, Shujie Liu, Yanmin Qian, Michael Zeng

    Abstract: Automatic target sound extraction (TSE) is a machine learning approach to mimic the human auditory perception capability of attending to a sound source of interest from a mixture of sources. It often uses a model conditioned on a fixed form of target sound clues, such as a sound class label, which limits the ways in which users can interact with the model to specify the target sounds. To leverage…

    Submitted 15 March, 2023; originally announced March 2023.

    Comments: Accepted by ICASSP 2023

  11. arXiv:2302.12369 [pdf, other]

    eess.AS cs.CL cs.SD

    Factual Consistency Oriented Speech Recognition

    Authors: Naoyuki Kanda, Takuya Yoshioka, Yang Liu

    Abstract: This paper presents a novel optimization framework for automatic speech recognition (ASR) with the aim of reducing hallucinations produced by an ASR model. The proposed framework optimizes the ASR model to maximize an expected factual consistency score between ASR hypotheses and ground-truth transcriptions, where the factual consistency score is computed by a separately trained estimator. Experime…

    Submitted 23 February, 2023; originally announced February 2023.

    Comments: 5 pages, 1 figure, 3 tables

  12. arXiv:2211.09988 [pdf, ps, other]

    eess.AS cs.SD

    Exploring WavLM on Speech Enhancement

    Authors: Hyungchan Song, Sanyuan Chen, Zhuo Chen, Yu Wu, Takuya Yoshioka, Min Tang, Jong Won Shin, Shujie Liu

    Abstract: There is a surge in interest in self-supervised learning approaches for end-to-end speech encoding in recent years as they have achieved great success. Especially, WavLM showed state-of-the-art performance on various speech processing tasks. To better understand the efficacy of self-supervised learning models for speech enhancement, in this work, we design and conduct a series of experiments with…

    Submitted 17 November, 2022; originally announced November 2022.

    Comments: Accepted by IEEE SLT 2022

  13. arXiv:2211.06493 [pdf, other]

    eess.AS cs.SD eess.SP

    Handling Trade-Offs in Speech Separation with Sparsely-Gated Mixture of Experts

    Authors: Xiaofei Wang, Zhuo Chen, Yu Shi, Jian Wu, Naoyuki Kanda, Takuya Yoshioka

    Abstract: Employing a monaural speech separation (SS) model as a front-end for automatic speech recognition (ASR) involves balancing two kinds of trade-offs. First, while a larger model improves the SS performance, it also requires a higher computational cost. Second, an SS model that is more optimized for handling overlapped speech is likely to introduce more processing artifacts in non-overlapped-speech r…

    Submitted 30 May, 2023; v1 submitted 11 November, 2022; originally announced November 2022.

  14. arXiv:2211.05564 [pdf, other]

    eess.AS cs.SD

    Self-supervised learning with bi-label masked speech prediction for streaming multi-talker speech recognition

    Authors: Zili Huang, Zhuo Chen, Naoyuki Kanda, Jian Wu, Yiming Wang, Jinyu Li, Takuya Yoshioka, Xiaofei Wang, Peidong Wang

    Abstract: Self-supervised learning (SSL), which utilizes the input data itself for representation learning, has achieved state-of-the-art results for various downstream speech tasks. However, most of the previous studies focused on offline single-talker applications, with limited investigations in multi-talker cases, especially for streaming scenarios. In this paper, we investigate SSL for streaming multi-t…

    Submitted 10 November, 2022; originally announced November 2022.

    Comments: submitted to ICASSP 2023

  15. arXiv:2211.05172 [pdf, other]

    eess.AS cs.CL cs.SD

    Speech separation with large-scale self-supervised learning

    Authors: Zhuo Chen, Naoyuki Kanda, Jian Wu, Yu Wu, Xiaofei Wang, Takuya Yoshioka, Jinyu Li, Sunit Sivasankaran, Sefik Emre Eskimez

    Abstract: Self-supervised learning (SSL) methods such as WavLM have shown promising speech separation (SS) results in small-scale simulation-based experiments. In this work, we extend the exploration of the SSL-based SS by massively scaling up both the pre-training data (more than 300K hours) and fine-tuning data (10K hours). We also investigate various techniques to efficiently integrate the pre-trained mo…

    Submitted 25 November, 2022; v1 submitted 9 November, 2022; originally announced November 2022.

  16. arXiv:2211.02944 [pdf, other]

    eess.AS cs.SD

    Breaking the trade-off in personalized speech enhancement with cross-task knowledge distillation

    Authors: Hassan Taherian, Sefik Emre Eskimez, Takuya Yoshioka

    Abstract: Personalized speech enhancement (PSE) models achieve promising results compared with unconditional speech enhancement models due to their ability to remove interfering speech in addition to background noise. Unlike unconditional speech enhancement, causal PSE models may occasionally remove the target speech by mistake. The PSE models also tend to leak interfering speech when the target speaker is…

    Submitted 5 November, 2022; originally announced November 2022.

    Comments: Submitted to ICASSP 2023

  17. arXiv:2211.02773 [pdf, other]

    eess.AS cs.SD

    Real-Time Joint Personalized Speech Enhancement and Acoustic Echo Cancellation

    Authors: Sefik Emre Eskimez, Takuya Yoshioka, Alex Ju, Min Tang, Tanel Parnamaa, Huaming Wang

    Abstract: Personalized speech enhancement (PSE) is a real-time SE approach utilizing a speaker embedding of a target person to remove background noise, reverberation, and interfering voices. To deploy a PSE model for full duplex communications, the model must be combined with acoustic echo cancellation (AEC), although such a combination has been less explored. This paper proposes a series of methods that ar…

    Submitted 25 May, 2023; v1 submitted 4 November, 2022; originally announced November 2022.

    Comments: Accepted to Interspeech 2023

  18. arXiv:2211.02250 [pdf, other]

    cs.SD cs.LG eess.AS

    Real-Time Target Sound Extraction

    Authors: Bandhav Veluri, Justin Chan, Malek Itani, Tuochao Chen, Takuya Yoshioka, Shyamnath Gollakota

    Abstract: We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions for processing large receptive fields in a computat…

    Submitted 19 April, 2023; v1 submitted 3 November, 2022; originally announced November 2022.

    Comments: ICASSP 2023 camera-ready

  19. arXiv:2210.15715 [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Simulating realistic speech overlaps improves multi-talker ASR

    Authors: Muqiao Yang, Naoyuki Kanda, Xiaofei Wang, Jian Wu, Sunit Sivasankaran, Zhuo Chen, Jinyu Li, Takuya Yoshioka

    Abstract: Multi-talker automatic speech recognition (ASR) has been studied to generate transcriptions of natural conversation including overlapping speech of multiple speakers. Due to the difficulty in acquiring real conversation data with high-quality human transcriptions, a naïve simulation of multi-talker speech by randomly mixing multiple utterances was conventionally used for model training. In this wo…

    Submitted 17 November, 2022; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: v2: fix minor typo

  20. arXiv:2209.04974 [pdf, other]

    eess.AS cs.CL cs.SD

    VarArray Meets t-SOT: Advancing the State of the Art of Streaming Distant Conversational Speech Recognition

    Authors: Naoyuki Kanda, Jian Wu, Xiaofei Wang, Zhuo Chen, Jinyu Li, Takuya Yoshioka

    Abstract: This paper presents a novel streaming automatic speech recognition (ASR) framework for multi-talker overlapping speech captured by a distant microphone array with an arbitrary geometry. Our framework, named t-SOT-VA, capitalizes on independently developed two recent technologies; array-geometry-agnostic continuous speech separation, or VarArray, and streaming multi-talker ASR based on token-level…

    Submitted 3 October, 2022; v1 submitted 11 September, 2022; originally announced September 2022.

    Comments: 6 pages, 2 figure, 3 tables, v2: Appendix A has been added

  21. arXiv:2208.13085 [pdf, other]

    eess.AS cs.CL cs.SD

    Target Speaker Voice Activity Detection with Transformers and Its Integration with End-to-End Neural Diarization

    Authors: Dongmei Wang, Xiong Xiao, Naoyuki Kanda, Takuya Yoshioka, Jian Wu

    Abstract: This paper describes a speaker diarization model based on target speaker voice activity detection (TS-VAD) using transformers. To overcome the original TS-VAD model's drawback of being unable to handle an arbitrary number of speakers, we investigate model architectures that use input tensors with variable-length time and speaker dimensions. Transformer layers are applied to the speaker axis to mak…

    Submitted 25 September, 2022; v1 submitted 27 August, 2022; originally announced August 2022.

  22. arXiv:2205.01818 [pdf, other]

    cs.LG cs.AI cs.CL cs.CV eess.AS

    i-Code: An Integrative and Composable Multimodal Learning Framework

    Authors: Ziyi Yang, Yuwei Fang, Chenguang Zhu, Reid Pryzant, Dongdong Chen, Yu Shi, Yichong Xu, Yao Qian, Mei Gao, Yi-Ling Chen, Liyang Lu, Yujia Xie, Robert Gmyr, Noel Codella, Naoyuki Kanda, Bin Xiao, Lu Yuan, Takuya Yoshioka, Michael Zeng, Xuedong Huang

    Abstract: Human intelligence is multimodal; we integrate visual, linguistic, and acoustic signals to maintain a holistic worldview. Most current pretraining methods, however, are limited to one or two modalities. We present i-Code, a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations. I…

    Submitted 5 May, 2022; v1 submitted 3 May, 2022; originally announced May 2022.

  23. Ultra Fast Speech Separation Model with Teacher Student Learning

    Authors: Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu

    Abstract: Transformer has been successfully applied to speech separation recently with its strong long-dependency modeling capacity using a self-attention mechanism. However, Transformer tends to have heavy run-time costs due to the deep encoder layers, which hinders its deployment on edge devices. A small Transformer model with fewer encoder layers is preferred for computational efficiency, but it is prone…

    Submitted 27 April, 2022; originally announced April 2022.

    Comments: Accepted by interspeech 2021

  24. arXiv:2204.03232 [pdf, other]

    eess.AS cs.AI eess.SP

    Leveraging Real Conversational Data for Multi-Channel Continuous Speech Separation

    Authors: Xiaofei Wang, Dongmei Wang, Naoyuki Kanda, Sefik Emre Eskimez, Takuya Yoshioka

    Abstract: Existing multi-channel continuous speech separation (CSS) models are heavily dependent on supervised data - either simulated data which causes data mismatch between the training and real-data testing, or the real transcribed overlapping data, which is difficult to be acquired, hindering further improvements in the conversational/meeting transcription tasks. In this paper, we propose a three-stage…

    Submitted 7 April, 2022; originally announced April 2022.

    Comments: Submitted to INTERSPEECH 2022

  25. arXiv:2204.00771 [pdf, other]

    eess.AS cs.SD eess.SP

    Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation

    Authors: Manthan Thakker, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang

    Abstract: This paper investigates how to improve the runtime speed of personalized speech enhancement (PSE) networks while maintaining the model quality. Our approach includes two aspects: architecture and knowledge distillation (KD). We propose an end-to-end enhancement (E3Net) model architecture, which is $3\times$ faster than a baseline STFT-based model. Besides, we use KD techniques to develop compresse…

    Submitted 2 April, 2022; originally announced April 2022.

    Comments: Submitted to Interspeech conference 2022 https://interspeech2022.org/

  26. arXiv:2203.16685 [pdf, other]

    eess.AS cs.CL cs.SD

    Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings

    Authors: Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

    Abstract: This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what" with low latency even when multiple people are speaking simultaneously. Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion. To further recognize speaker identities,…

    Submitted 14 July, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

    Comments: Accepted for presentation at Interspeech 2022

  27. arXiv:2202.13288 [pdf, other]

    eess.AS cs.SD

    ICASSP 2022 Deep Noise Suppression Challenge

    Authors: Harishchandra Dubey, Vishak Gopal, Ross Cutler, Ashkan Aazami, Sergiy Matusevych, Sebastian Braun, Sefik Emre Eskimez, Manthan Thakker, Takuya Yoshioka, Hannes Gamper, Robert Aichner

    Abstract: The Deep Noise Suppression (DNS) challenge is designed to foster innovation in the area of noise suppression to achieve superior perceptual speech quality. This is the 4th DNS challenge, with the previous editions held at INTERSPEECH 2020, ICASSP 2021, and INTERSPEECH 2021. We open-source datasets and test sets for researchers to train their deep noise suppression models, as well as a subjective e…

    Submitted 26 February, 2022; originally announced February 2022.

  28. arXiv:2202.00842 [pdf, other]

    eess.AS cs.CL cs.SD

    Streaming Multi-Talker ASR with Token-Level Serialized Output Training

    Authors: Naoyuki Kanda, Jian Wu, Yu Wu, Xiong Xiao, Zhong Meng, Xiaofei Wang, Yashesh Gaur, Zhuo Chen, Jinyu Li, Takuya Yoshioka

    Abstract: This paper proposes a token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models using multiple output branches, the t-SOT model has only a single output branch that generates recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their em…

    Submitted 14 July, 2022; v1 submitted 1 February, 2022; originally announced February 2022.

    Comments: 6 pages, 1 figure, 7 tables, v2: minor fixes, v3: Appendix D has been added, v4: citation to [27] has been added, v5: citations to [28][29][30] have been added with minor fixes, short version accepted for presentation at Interspeech 2022

  29. arXiv:2201.09586 [pdf, other]

    eess.AS cs.CL cs.SD

    PickNet: Real-Time Channel Selection for Ad Hoc Microphone Arrays

    Authors: Takuya Yoshioka, Xiaofei Wang, Dongmei Wang

    Abstract: This paper proposes PickNet, a neural network model for real-time channel selection for an ad hoc microphone array consisting of multiple recording devices like cell phones. Assuming at most one person to be vocally active at each time point, PickNet identifies the device that is spatially closest to the active person for each time frame by using a short spectral patch of just hundreds of millisec…

    Submitted 24 January, 2022; originally announced January 2022.

    Comments: 5 pages, 2 figure, 2 tables, accepted for presentation at ICASSP 2022

  30. arXiv:2110.15430 [pdf, other]

    cs.SD cs.CL eess.AS

    Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction

    Authors: Heming Wang, Yao Qian, Xiaofei Wang, Yiming Wang, Chengyi Wang, Shujie Liu, Takuya Yoshioka, Jinyu Li, DeLiang Wang

    Abstract: Noise robustness is essential for deploying automatic speech recognition (ASR) systems in real-world environments. One way to reduce the effect of noise interference is to employ a preprocessing module that conducts speech enhancement, and then feed the enhanced speech to an ASR backend. In this work, instead of suppressing background noise with a conventional cascaded pipeline, we employ a noise-…

    Submitted 28 October, 2021; originally announced October 2021.

    Comments: 5 pages, 1 figure, submitted to ICASSP 2022

  31. arXiv:2110.14838 [pdf, other]

    eess.AS cs.SD

    Continuous Speech Separation with Recurrent Selective Attention Network

    Authors: Yixuan Zhang, Zhuo Chen, Jian Wu, Takuya Yoshioka, Peidong Wang, Zhong Meng, Jinyu Li

    Abstract: While permutation invariant training (PIT) based continuous speech separation (CSS) significantly improves the conversation transcription accuracy, it often suffers from speech leakages and failures in separation at "hot spot" regions because it has a fixed number of output channels. In this paper, we propose to apply recurrent selective attention network (RSAN) to CSS, which generates a variable…

    Submitted 27 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  32. arXiv:2110.14142 [pdf, other]

    eess.AS cs.SD

    Separating Long-Form Speech with Group-Wise Permutation Invariant Training

    Authors: Wangyou Zhang, Zhuo Chen, Naoyuki Kanda, Shujie Liu, Jinyu Li, Sefik Emre Eskimez, Takuya Yoshioka, Xiong Xiao, Zhong Meng, Yanmin Qian, Furu Wei

    Abstract: Multi-talker conversational speech processing has drawn many interests for various applications such as meeting transcription. Speech separation is often required to handle overlapped speech that is commonly observed in conversation. Although the original utterance-level permutation invariant training-based continuous speech separation approach has proven to be effective in various conditions, it l…

    Submitted 17 November, 2021; v1 submitted 26 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures, 3 tables, submitted to IEEE ICASSP 2022

  33. arXiv:2110.13900 [pdf, other]

    cs.CL cs.SD eess.AS

    WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

    Authors: Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, Jian Wu, Long Zhou, Shuo Ren, Yanmin Qian, Yao Qian, Jian Wu, Michael Zeng, Xiangzhan Yu, Furu Wei

    Abstract: Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. To tackle the problem, we propose a new pre-trained…

    Submitted 17 June, 2022; v1 submitted 26 October, 2021; originally announced October 2021.

    Comments: Submitted to the Journal of Selected Topics in Signal Processing (JSTSP)

  34. arXiv:2110.10330 [pdf, other]

    eess.AS cs.SD

    One model to enhance them all: array geometry agnostic multi-channel personalized speech enhancement

    Authors: Hassan Taherian, Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Zhuo Chen, Xuedong Huang

    Abstract: With the recent surge of video conferencing tools usage, providing high-quality speech signals and accurate captions have become essential to conduct day-to-day business or connect with friends and families. Single-channel personalized speech enhancement (PSE) methods show promising results compared with the unconditional speech enhancement (SE) methods in these scenarios due to their ability to r…

    Submitted 19 October, 2021; originally announced October 2021.

    Comments: Submitted to ICASSP 2022

  35. arXiv:2110.09625 [pdf, other]

    eess.AS cs.LG cs.SD

    Personalized Speech Enhancement: New Models and Comprehensive Evaluation

    Authors: Sefik Emre Eskimez, Takuya Yoshioka, Huaming Wang, Xiaofei Wang, Zhuo Chen, Xuedong Huang

    Abstract: Personalized speech enhancement (PSE) models utilize additional cues, such as speaker embeddings like d-vectors, to remove background noise and interfering speech in real-time and thus improve the speech quality of online video conferencing systems for various acoustic scenarios. In this work, we propose two neural networks for PSE that achieve superior performance to the previously proposed Voice…

    Submitted 18 October, 2021; originally announced October 2021.

  36. arXiv:2110.06428 [pdf, other]

    eess.AS cs.SD

    All-neural beamformer for continuous speech separation

    Authors: Zhuohuang Zhang, Takuya Yoshioka, Naoyuki Kanda, Zhuo Chen, Xiaofei Wang, Dongmei Wang, Sefik Emre Eskimez

    Abstract: Continuous speech separation (CSS) aims to separate overlapping voices from a continuous influx of conversational audio containing an unknown number of utterances spoken by an unknown number of speakers. A common application scenario is transcribing a meeting conversation recorded by a microphone array. Prior studies explored various deep learning models for time-frequency mask estimation, followe…

    Submitted 12 October, 2021; originally announced October 2021.

    Comments: 5 pages, 3 figures, 2 tables

  37. arXiv:2110.05745 [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    VarArray: Array-Geometry-Agnostic Continuous Speech Separation

    Authors: Takuya Yoshioka, Xiaofei Wang, Dongmei Wang, Min Tang, Zirun Zhu, Zhuo Chen, Naoyuki Kanda

    Abstract: Continuous speech separation using a microphone array was shown to be promising in dealing with the speech overlap problem in natural conversation transcription. This paper proposes VarArray, an array-geometry-agnostic speech separation neural network model. The proposed model is applicable to any number of microphones without retraining while leveraging the nonlinear correlation between the input…

    Submitted 26 October, 2021; v1 submitted 12 October, 2021; originally announced October 2021.

    Comments: 5 pages, 1 figure, 3 tables, submitted to ICASSP 2022; updated reference information of [33]

  38. arXiv:2110.03151 [pdf, other]

    eess.AS cs.CL cs.SD

    Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR

    Authors: Naoyuki Kanda, Xiong Xiao, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

    Abstract: This paper presents Transcribe-to-Diarize, a new approach for neural speaker diarization that uses an end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR). The E2E SA-ASR is a joint model that was recently proposed for speaker counting, multi-talker speech recognition, and speaker identification from monaural audio that contains overlapping speech. Although the E2E SA-ASR mode…

    Submitted 21 January, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

    Comments: To appear in ICASSP 2022; System labels (SC and VBx) in Table 1 have been fixed

  39. arXiv:2107.02852 [pdf, other]

    eess.AS cs.CL cs.SD

    A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio

    Authors: Naoyuki Kanda, Xiong Xiao, Jian Wu, Tianyan Zhou, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

    Abstract: Speaker-attributed automatic speech recognition (SA-ASR) is a task to recognize "who spoke what" from multi-talker recordings. An SA-ASR system usually consists of multiple modules such as speech separation, speaker diarization and ASR. On the other hand, considering the joint optimization, an end-to-end (E2E) SA-ASR model has recently been proposed with promising results on simulation data. In th…

    Submitted 17 September, 2021; v1 submitted 6 July, 2021; originally announced July 2021.

    Comments: To appear in ASRU 2021

  40. arXiv:2107.01922  [pdf, ps, other]

    eess.AS cs.SD

    Investigation of Practical Aspects of Single Channel Speech Separation for ASR

    Authors: Jian Wu, Zhuo Chen, Sanyuan Chen, Yu Wu, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu, Jinyu Li

    Abstract: Speech separation has been successfully applied as a frontend processing module of conversation transcription systems thanks to its ability to handle overlapped speech and its flexibility to combine with downstream tasks such as automatic speech recognition (ASR). However, a speech separation model often introduces target speech distortion, resulting in a sub-optimum word error rate (WER). In this… ▽ More

    Submitted 5 July, 2021; originally announced July 2021.

    Comments: Accepted by Interspeech 2021

  41. arXiv:2106.02896  [pdf, other]

    eess.AS

    Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement

    Authors: Sefik Emre Eskimez, Xiaofei Wang, Min Tang, Hemin Yang, Zirun Zhu, Zhuo Chen, Huaming Wang, Takuya Yoshioka

    Abstract: With the surge of online meetings, it has become more critical than ever to provide high-quality speech audio and live captioning under various noise conditions. However, most monaural speech enhancement (SE) models introduce processing artifacts and thus degrade the performance of downstream tasks, including automatic speech recognition (ASR). This paper proposes a multi-task training framework t… ▽ More

    Submitted 5 June, 2021; originally announced June 2021.

    Comments: Accepted to INTERSPEECH2021

  42. arXiv:2104.02128  [pdf, other]

    eess.AS cs.CL cs.SD

    End-to-End Speaker-Attributed ASR with Transformer

    Authors: Naoyuki Kanda, Guoli Ye, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

    Abstract: This paper presents our recent effort on end-to-end speaker-attributed automatic speech recognition, which jointly performs speaker counting, speech recognition and speaker identification for monaural multi-talker audio. Firstly, we thoroughly update the model architecture that was previously designed based on a long short-term memory (LSTM)-based attention encoder decoder by applying transformer… ▽ More

    Submitted 5 April, 2021; originally announced April 2021.

    Comments: Submitted to INTERSPEECH 2021

  43. arXiv:2103.16776  [pdf, other]

    eess.AS cs.CL cs.SD

    Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone

    Authors: Naoyuki Kanda, Guoli Ye, Yu Wu, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka

    Abstract: Transcribing meetings containing overlapped speech with only a single distant microphone (SDM) has been one of the most challenging problems for automatic speech recognition (ASR). While various approaches have been proposed, all previous studies on the monaural overlapped speech recognition problem were based on either simulation data or small-scale real data. In this paper, we extensively invest… ▽ More

    Submitted 12 April, 2021; v1 submitted 30 March, 2021; originally announced March 2021.

    Comments: Submitted to INTERSPEECH 2021

  44. arXiv:2103.02378  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS eess.SP

    Continuous Speech Separation with Ad Hoc Microphone Arrays

    Authors: Dongmei Wang, Takuya Yoshioka, Zhuo Chen, Xiaofei Wang, Tianyan Zhou, Zhong Meng

    Abstract: Speech separation has been shown effective for multi-talker speech recognition. Under the ad hoc microphone array setup where the array consists of spatially distributed asynchronous microphones, additional challenges must be overcome as the geometry and number of microphones are unknown beforehand. Prior studies show, with a spatial-temporal interleaving structure, neural networks can efficiently… ▽ More

    Submitted 3 March, 2021; originally announced March 2021.

  45. arXiv:2101.01853  [pdf, ps, other]

    cs.SD cs.CL eess.AS

    Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings

    Authors: Xuankai Chang, Naoyuki Kanda, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka

    Abstract: An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification. The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers. However, the E2E modeling approach is susceptible to the mismatch bet… ▽ More

    Submitted 5 January, 2021; originally announced January 2021.

    Comments: Submitted to ICASSP 2021

  46. arXiv:2011.03110  [pdf, other]

    eess.AS cs.SD

    Exploring End-to-End Multi-channel ASR with Bias Information for Meeting Transcription

    Authors: Xiaofei Wang, Naoyuki Kanda, Yashesh Gaur, Zhuo Chen, Zhong Meng, Takuya Yoshioka

    Abstract: Joint optimization of multi-channel front-end and automatic speech recognition (ASR) has attracted much interest. While promising results have been reported for various tasks, past studies on its meeting transcription application were limited to small scale experiments. It is still unclear whether such a joint framework can be beneficial for a more practical setup where a massive amount of single… ▽ More

    Submitted 25 November, 2020; v1 submitted 5 November, 2020; originally announced November 2020.

    Comments: Accepted to SLT2021

  47. arXiv:2011.02921  [pdf, ps, other]

    eess.AS cs.CL cs.SD

    Minimum Bayes Risk Training for End-to-End Speaker-Attributed ASR

    Authors: Naoyuki Kanda, Zhong Meng, Liang Lu, Yashesh Gaur, Xiaofei Wang, Zhuo Chen, Takuya Yoshioka

    Abstract: Recently, an end-to-end speaker-attributed automatic speech recognition (E2E SA-ASR) model was proposed as a joint model of speaker counting, speech recognition and speaker identification for monaural overlapped speech. In the previous study, the model parameters were trained based on the speaker-attributed maximum mutual information (SA-MMI) criterion, with which the joint posterior probability f… ▽ More

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Submitted to ICASSP 2021. arXiv admin note: text overlap with arXiv:2006.10930, arXiv:2008.04546

  48. arXiv:2011.02014  [pdf, other]

    eess.AS cs.SD

    Integration of speech separation, diarization, and recognition for multi-speaker meetings: System description, comparison, and analysis

    Authors: Desh Raj, Pavel Denisov, Zhuo Chen, Hakan Erdogan, Zili Huang, Maokui He, Shinji Watanabe, Jun Du, Takuya Yoshioka, Yi Luo, Naoyuki Kanda, Jinyu Li, Scott Wisdom, John R. Hershey

    Abstract: Multi-speaker speech recognition of unsegmented recordings has diverse applications such as meeting transcription and automatic subtitle generation. With technical advances in systems dealing with speech separation, speaker diarization, and automatic speech recognition (ASR) in the last decade, it has become possible to build pipelines that achieve reasonable error rates on this task. In this pape… ▽ More

    Submitted 3 November, 2020; originally announced November 2020.

    Comments: Accepted to IEEE SLT 2021

  49. arXiv:2010.12180  [pdf, other]

    cs.SD cs.CL eess.AS

    Don't shoot butterfly with rifles: Multi-channel Continuous Speech Separation with Early Exit Transformer

    Authors: Sanyuan Chen, Yu Wu, Zhuo Chen, Takuya Yoshioka, Shujie Liu, Jinyu Li

    Abstract: With its strong modeling capacity that comes from a multi-head and multi-layer structure, Transformer is a very powerful model for learning a sequential representation and has been successfully applied to speech separation recently. However, multi-channel speech separation sometimes does not necessarily need such a heavy structure for all time frames especially when the cross-talker challenge happ… ▽ More

    Submitted 23 October, 2020; originally announced October 2020.

  50. arXiv:2010.11458  [pdf, other]

    eess.AS cs.SD

    Microsoft Speaker Diarization System for the VoxCeleb Speaker Recognition Challenge 2020

    Authors: Xiong Xiao, Naoyuki Kanda, Zhuo Chen, Tianyan Zhou, Takuya Yoshioka, Sanyuan Chen, Yong Zhao, Gang Liu, Yu Wu, Jian Wu, Shujie Liu, Jinyu Li, Yifan Gong

    Abstract: This paper describes the Microsoft speaker diarization system for monaural multi-talker recordings in the wild, evaluated at the diarization track of the VoxCeleb Speaker Recognition Challenge (VoxSRC) 2020. We will first explain our system design to address issues in handling real multi-talker recordings. We then present the details of the components, which include Res2Net-based speaker embedding… ▽ More

    Submitted 22 October, 2020; v1 submitted 22 October, 2020; originally announced October 2020.

    Comments: 5 pages, 3 figures, 2 tables